hepcat72 / vcfSampleCompare

Filter and rank variant call files (VCF) based on comparative evidence ratios between groups of samples.
GNU General Public License v3.0
2 stars 1 forks source link

VcfSampleCompare - empty output with warnings #26

Open nevosmic opened 3 years ago

nevosmic commented 3 years ago

Hello to all,

I have 10 vcf files - 5 female fish and 5 male fish, I have merged all 10 fish to one vcf file.(all_fish.vcf) I performed the VcfSampleCompare analysis on 'all_fish.vcf' , My command (in my bash file):

merged_fish=/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/Merged_vcf/all_fish.vcf

vcfSampleCompare.pl --sample-group '/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F1.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F2.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F4.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F6.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F8.sorted.bam.gz' --sample-group '/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M1.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M3.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M5.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M7.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M9.sorted.bam.gz' $merged_fish

And this is my all_fish.vcf:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M1.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M3.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M5.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M7.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M9.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F1.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F2.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F4.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F6.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F8.sorted.bam.gz

NC_048323.1 53 . C T 26.2715 . VDB=0.0249187;SGB=-0.511536;RPBZ=-1.41421;MQBZ=1.41421;BQBZ=1.41421;SCBZ=0.57735;FS=0;MQ0F=0;MQ=18;DP=4;DP4=1,0,3,0;AN=2;AC=2 GT:PL ./.:. ./.:. ./.:. ./.:. ./.:. 1/1:53,5,0 ./.:. ./.:. ./.:. ./.:. NC_048323.1 129 . C CA 28.7159 . INDEL;IDV=3;IMF=0.6;VDB=0.444886;SGB=-0.511536;RPBZ=0.57735;MQBZ=0.888523;FS=0;MQ0F=0.2;MQ=24;DP=5;DP4=2,0,3,0;AN=2;AC=1 GT:PL ./.:. 0/1:62,0,16 ./.:. ./.:. ./.:. ./.:../.:. ./.:. ./.:. ./.:. .....................

But this is my error file:

WARNING1:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in numeric gt (>) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2149.]. WARNING2:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in addition (+) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2149.]. WARNING3:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in numeric gt (>) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2149.]. WARNING3:vcfSampleCompare.pl: NOTE: Further warnings of this type will be suppressed. WARNING3:vcfSampleCompare.pl: Set --error-limit to 0 to turn off error suppression WARNING11:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in numeric gt (>) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2154.]. WARNING12:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in addition (+) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2154.]. WARNING13:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in numeric gt (>) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2154.]. WARNING13:vcfSampleCompare.pl: NOTE: Further warnings of this type will be suppressed. WARNING13:vcfSampleCompare.pl: Set --error-limit to 0 to turn off error suppression

Done. STATUS: [EXIT-CODE: 0 ERRORS: 0 WARNINGS: 1487319520 TIME: 71590s] SUMMARY: 743659760 WARNINGS LIKE: [WARNING1:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in numeric gt (>) at ...] 743659760 WARNINGS LIKE: [WARNING11:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value $dp in numeric gt (>) at...] Scroll up to inspect full errors/warnings in-place.

And this is the output:

vcfSampleCompare.pl Version 2.013

Directory:

/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/vcfsamplecompare

Command: /home/nevosmic/.conda/envs/mic/bin/perl /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl --sample-group "/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F1.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F2.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F4.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F6.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F8.sorted.bam.gz" --sample-group "/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M1.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M3.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M5.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M7.sorted.bam.gz /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M9.sorted.bam.gz" /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/Merged_vcf/all_fish.vcf

CHROM POS ID REF ALT BEST_PAIR BEST_GT_SCORE BEST_OR_SCORE BEST_DP_SCORE PAIR_ID PAIR_GT_SCORE PAIR_OR_SCORE PAIR_DP_SCORE STATES_USED_GT STATES_USED_OR GROUP1_SAMPLES GROUP1_GTS GROUP1_ORS GROUP2_SAMPLES GROUP2_GTS GROUP2_ORS

hepcat72 commented 3 years ago

Just acknowledging your request for now. When I have time, I will look into it. Thanks for using vcfSampleCompare.

Rob

hepcat72 commented 3 years ago

OK. I do have some quick feedback. Take a look at the list of required input file components in the README under the subheading INPUT FORMAT. Particularly, the part about DP:

DP (the number of reads that map at or over the variant position)

Those are the warnings you're getting - about DP not being defined. Looking at the snippet of the file you pasted, DP isn't present in the FORMAT strings.

Note under the details section:

This script works with VCF files generated by freeBayes (for SNP and small nucleotide variants) and svTyper (for structural variants). However, it will also work with any VCF data produced by other tools as long as the output includes GT and/or (AO and RO), and DP tags in the FORMAT data column.

Your FORMAT strings are all GT:PL, but it is missing DP (required for vcfSampleCompare) and AO & RO. You have GT, but that one is optional.

I have dealt with data that was missing DP before and I recall running tools to calculate it, but it's been a long time and I'd have to consult my notes on those older projects.

You can run in genotype mode which for the most part, ignores DP, but DP is still used for some things like sorting and is thus still required.

Do you think you can add DP to your data or re-run and select options that will result in DP being in the output? If you need help with that, I'd have to dig through my notes. At the very least, I should add a format check to the code that yields an error that directly states DP is required.

nevosmic commented 3 years ago

Hi Robert, I appreciate your answer, I was so lost. I created my files using the following steps: 1) bcftools mpileup which created a BCF file (bcftools mpileup -O b -o $ OUTPUT_BCF -f $ REF $ FISH_SORTED_BAM) 2) bcftools call which created a VCF file (bcftools call --ploidy 1 -m -v -o $ BCF_File $ VCF_output) 3) vcfutils.pl varFilter which created a filtered VCF file (vcfutils.pl varFilter $ F_vcf> $ F_final_variants_vcf)

Do you think I should run FreeBayes to create new VCF files and it will give me different results? Or Should I run the bcftools mpileup with DP tag and so on..? (like mentioned here: https://www.biostars.org/p/9481584/#9481585

hepcat72 commented 3 years ago

Any tool that outputs in VCF format and includes the required elements (header line with sample names and the format string elements mentioned above) should work, though that said, I have not tested bcftools mpileup. At the lowest level, the comparison metrics and heuristics are quite simple, albeit become complex when strung together.

Then again, if the dataset is amenable to freebayes (i.e. tractable - don't take forever to run - which I have dealt with for very deep sequencing runs), then there's no reason you can't run both and compare results.

Let me know how it goes.

PS. My documentation is admittedly detailed and verbose. Any suggestions for changing the doc to make it easier to understand are welcome.

nevosmic commented 3 years ago

Hello Robert, So I have a vcf file with DP tag that looks like that:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M1.sorted.bam   /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M3.sorted.bam   /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M5.sorted.bam  /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M7.sorted.bam      /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M9.sorted.bam   /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F1.sorted.bam   /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F2.sorted.bam   /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F4.sorted.bam   /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F6.sorted.bam     /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F8.sorted.bam
NC_048323.1     53      .       C       T       26.2715 .       VDB=0.0249187;SGB=-0.511536;RPBZ=-1.41421;MQBZ=1.41421;BQBZ=1.41421;SCBZ=0.57735;FS=0;MQ0F=0;MQ=18;DP=4;DP4=1,0,3,0;ADF=1,3;ADR=0,0;AD=1,3;AN=2;AC=2  GT:PL:DP:SP:ADF:ADR:AD  ./.:.:.:.:.:.:.       ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. ./.:.:.:.:.:.:. 1/1:53,5,0:4:0:1,3:0,0:1,3      ./.:.:.:.:.:.:. ./.:.:.:.:.:.:.       ./.:.:.:.:.:.:. ./.:.:.:.:.:.:.

But when I run this command:

vcfSampleCompare.pl -o $output_file --sample-group '/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F1.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F2.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F4.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F6.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/F8.sorted.bam' --sample-group '/storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M1.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M3.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M5.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M7.sorted.bam /storage/users/IsanaRNA/FISH_DATA/MappingToAcipenserRuthenusGenome/results/Bam_sorted/M9.sorted.bam' $merged_fish

I get the error file:

WARNING1:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in division (/) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 1972.].
WARNING2:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in division (/) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 1972.].
WARNING3:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in division (/) at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 1972.].
WARNING3:vcfSampleCompare.pl: NOTE: Further warnings of this type will be suppressed.
WARNING3:vcfSampleCompare.pl: Set --error-limit to 0 to turn off error suppression
WARNING10:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in string at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 1996.].
WARNING11:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in string at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 1996.].
WARNING12:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in string at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 1996.].
WARNING12:vcfSampleCompare.pl: NOTE: Further warnings of this type will be suppressed.
WARNING12:vcfSampleCompare.pl: Set --error-limit to 0 to turn off error suppression
WARNING14:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in string at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2000.].
WARNING15:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in string at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2000.].
WARNING16:vcfSampleCompare.pl: Runtime warning: [Use of uninitialized value in string at /home/nevosmic/.conda/envs/mic/bin/vcfSampleCompare.pl line 2000.].
WARNING16:vcfSampleCompare.pl: NOTE: Further warnings of this type will be suppressed.
WARNING16:vcfSampleCompare.pl: Set --error-limit to 0 to turn off error suppression
hepcat72 commented 3 years ago

I see. You're still missing required input file components, specifically AO and RO in the FORMAT string. AO is the number of observations that support the variant (i.e. number of reads that have a specific nucleotide different from the reference). RO is the number of observations that support the reference.

DP, AO, and RO (and optionally GT) are what vcfSampleCompare uses to determine degree of difference between sample groups. Without them, it doesn't have any way to assess the degrees of difference.

This is another instance where I should probably perform a format check on the input to make sure it generates a nice understandable error to say that the input doesn't meet the requirements. If you'd be willing to share your input data with me, I could use it for testing of these features.

nevosmic commented 3 years ago

It is very large..how can I pass it to you? (33036.94 MB)

hepcat72 commented 3 years ago

I don't actually need the whole thing. Probably the first 1k lines would be more than enough. I only need a few lines past the commented lines at the top. Not sure how many "chromosomes" or contigs you have in your reference, so if it's more than 1k, then I'd need more lines. Just send it to the email in my profile.

Note, my only intent here is to use it to add error messages about missing format requirements.