knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

vcfr, extract.gt, bcftools, vcf format incompatible? #189

Closed wbsimey closed 2 years ago

wbsimey commented 2 years ago

Hello, I have been unable to extract depth 'DP' info from my vcf files (generated with latest bcftools variant calling (SNPs)) after being converted to vcfr for R analyses.

bcftools may break the vcf formatting. I cannot extract depth data in R packages, for e.g., all I get is 'NA' for each depth (DP). I can see the DP value in the vcf, but R packages cannot see it.

VCF format after bcftools (DP values are clearly present - DP=38):

Chr10_nm_RagTag 187 . A T 4.82181 LowQual DP=38;VDB=0.407747;SGB=-0.0966945;RPB=0.0621765;MQB=0.984496;MQSB=0.618783;BQB=0.810528;MQ0F=0.157895;AC=2;AN=26;DP4=7,2,3,3;MQ=9 GT:PL:AD:QS ./.:0,0,0:0,0:0,0 0/0:0,3,4:1,0:4,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,0,0:0,0:0,0 ./.:0,

but the often used function "extract.gt" misses the DP values and only returns items highlighted in BOLD.

dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE) dp[1:4,1:6] MZ195912 MZ195913 MZ195914 MZ195915 MZ195916 MZ195930 Chr10_nm_RagTag_187 NA NA NA NA NA NA Chr10_nm_RagTag_188 NA NA NA NA NA NA Chr10_nm_RagTag_194 NA NA NA NA NA NA Chr10_nm_RagTag_195 NA NA NA NA NA NA

I think this is because bcftools filtering placed DP info in the wrong field and when data are converted to the required R format, the info is lost. If I do a head on the converted data I see no DP info:

head(vcf) [1] " Object of class 'vcfR' " [1] " Meta section " [1] "##fileformat=VCFv4.2" [1] "##FILTER=" [1] "##bcftoolsVersion=1.12+htslib-1.12" [1] "##bcftoolsCommand=mpileup -Ou -I -B -C 50 -a QS -a AD --threads 32 - [Truncated]" [1] "##reference=file:///home/bsimison/Projects/Neotoma/96_LC-genomes/Nbr [Truncated]" [1] "##contig=" [1] "First 6 rows." [1] [1] " Fixed section " CHROM POS ID REF ALT QUAL FILTER [1,] "Chr10_nm_RagTag" "187" NA "A" "T" "4.82181" "LowQual" [2,] "Chr10_nm_RagTag" "188" NA "G" "A" "42.0559" "LowQual" [3,] "Chr10_nm_RagTag" "194" NA "G" "C" "3.74867" "LowQual" [4,] "Chr10_nm_RagTag" "195" NA "A" "C" "5.61943" "LowQual" [5,] "Chr10_nm_RagTag" "488" NA "G" "C" "32.0871" "PASS" [6,] "Chr10_nm_RagTag" "518" NA "T" "C" "157.534" "PASS" [1] [1] " Genotype section " FORMAT MZ195912 MZ195913 [1,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "0/0:0,3,4:1,0:4,0" [2,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [3,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [4,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [5,] "GT:PL:AD:QS" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [6,] "GT:PL:AD:QS" "1/1:29,3,0:0,1:0,29" "./.:0,0,0:0,0:0,0" MZ195914 MZ195915 MZ195916 [1,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [2,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [3,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [4,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [5,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [6,] "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" "./.:0,0,0:0,0:0,0" [1] "First 6 columns only." [1] [1] "Unique GT formats:" [1] "GT:PL:AD:QS" [1]

You can see that only "GT:PL:AD:QS" are captured, but not DP. Anybody know what's going on here?

knausb commented 2 years ago

Hi,

I think we have some confusion here. "DP" can occur in multiple locations in VCF data. The below example shows "DP" in the 'gt' slot as well as in the 'fix' slot. Your example appears to have "DP" in the fix[,"INFO"] but not in the 'gt' region. I feel the behavior you report is therefore expected. If you want the "DP" from your data I think you want extract.info(). However, it's been my experience that the data in the INFO column is a summary over all samples, as I've illustrated below. This makes me wonder if your variant caller reported this and it was some downstream step that removed it?

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")

vcfR_test@fix
#>      CHROM POS       ID          REF   ALT      QUAL FILTER
#> [1,] "20"  "14370"   "rs6054257" "G"   "A"      "29" "PASS"
#> [2,] "20"  "17330"   NA          "T"   "A"      "3"  "q10" 
#> [3,] "20"  "1110696" "rs6040355" "A"   "G,T"    "67" "PASS"
#> [4,] "20"  "1230237" NA          "T"   NA       "47" "PASS"
#> [5,] "20"  "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
#>      INFO                               
#> [1,] "NS=3;DP=14;AF=0.5;DB;H2"          
#> [2,] "NS=3;DP=11;AF=0.017"              
#> [3,] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB"
#> [4,] "NS=3;DP=13;AA=T"                  
#> [5,] "NS=3;DP=9;AA=G"
vcfR_test@gt
#>      FORMAT        NA00001          NA00002          NA00003       
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
#> [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"

dp <- extract.gt(vcfR_test, element = "DP", as.numeric = TRUE)
dp
#>            NA00001 NA00002 NA00003
#> rs6054257        1       8       5
#> 20_17330         3       5       3
#> rs6040355        6       0       4
#> 20_1230237       7       4       2
#> microsat1        4       2       3

Idp <- extract.info(vcfR_test, element = "DP", as.numeric = TRUE)
Idp
#> [1] 14 11 10 13  9

rowSums(dp) == Idp
#>  rs6054257   20_17330  rs6040355 20_1230237  microsat1 
#>       TRUE       TRUE       TRUE       TRUE       TRUE

Created on 2021-08-19 by the reprex package (v2.0.1)

wbsimey commented 2 years ago

Thank you BK! I used the following bcftools variant calling commands, then used bcftools filter options with depth, quality, and maf cutoff values.

"bcftools mpileup -Ou -I -B -C 50 -a QS -a AD --threads $THREADS -f $REF $bam_files"
call_cmd="bcftools call -mv -Oz --threads $THREADS -o $OUTFILE"

I copied the first few lines of the first few SNPs from my vcf file and I can see that each SNP has a unique DP value in the INFO slot, but as you pointed out, not in the 'gt' slot. Do you know of a method to extract data from the INFO slot?

Chr26_nm_RagTag 112     .       C       T       224.913 LowQual DP=201;VDB=0.0640986;SGB=8.14225;RPB=0.314329;MQB=0.000275577;MQSB=0.200908;BQB=0.774805;MQ0F=0;AC=7;AN=122;DP4=167,10,9,3;MQ=47        GT:PL:AD:QS     0/0:0,6,57:2,0:62,0     0/0:0,18,165:6,0:222,0  0/0:0,6,67:2,0:74,0  ...

Chr26_nm_RagTag 144     .       C       T       232.799 LowQual DP=288;VDB=0.975423;SGB=13.5445;RPB=0.996943;MQB=0.00345399;MQSB=0.689649;BQB=0.36863;MQ0F=0;AC=5;AN=132;DP4=208,33,11,0;MQ=47  GT:PL:AD:QS     0/1:28,0,49:2,1:62,37   0/0:0,18,165:6,0:222,0  0/0:0,6,67:2,0:74,0  ...

Chr26_nm_RagTag 212     .       C       G       3.06037 LowQual DP=368;VDB=0.00169601;SGB=-13.8502;RPB=0.000774644;MQB=0.000149028;MQSB=0.91717;BQB=0.945818;MQ0F=0.0108696;AC=6;AN=144;DP4=190,137,6,0;MQ=46   GT:PL:AD:QS     0/0:0,18,165:6,0:222,0  0/0:0,18,213:6,0:261,0  0/0:0,6,74:2,0:74,0  ...
wbsimey commented 2 years ago

Ah, I see you said "extract.info()" - missed that. I will try that.

wbsimey commented 2 years ago

I am rerunning the bcftools mpileup to include -a DP (-a is the bcftools mpileup option for including annotations and is required when using the HW option, but I did not realize it actually suppressed all other annotations from the 'gt' slot). I did a quick bcftools view file on the new output and sure enough DP is included in the 'gt' slot.

Chr10_nm_RagTag 518 . T C 157.534 . DP=20;VDB=0.0513284;SGB=1.27099;RPB=0.222222;MQB=0.111111;MQSB=0.81179;BQB=1;MQ0F=0.2;AC=32;AN=32;DP4=0,2,11,7;MQ=17 GT:PL:DP:AD:QS 1/1:29,3,0:1:0,1:0,29

I think you helped me to a solution, thanks again BK!

JuanitaGil commented 1 year ago

Hello Brian,

I am having a similar issue. I have a VCF (v4.2) file generated by NGSEP v4.1.0 and although I can extract the DP values from the file using extract.gt, when I convert the VCF to an object of class chromR and try to plot it, it says that no read depths and no mapping qualities are found and all it plots is the the QUAL values.

Here is the header of my VCF file:

fileformat=VCFv4.2

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

And an example for the first variant and the first sample: JACYCF010000001.1 929 . AT ATT 200 . NS=1;AN=1;AFS=0,2;OH=0.0;MAF=0.0;CNV=38;TYPE=INDEL;TA=frameshift_variant;TID=rna-gnl|WGS:JACYCF|g1.1_mrna;TGN=RHS01_00001;TCO=98.2;TACH=NH98NS GT:PL:GQ:DP:ADP:ACN

I haven't filetered the VCF, so it contains SNPs, STRs, indels and multiallelic variants.

Do you think it's a problem of the file format or is it something else I am not considering when using the options in vcfR?

Thank you very much for your help! Juanita

knausb commented 1 year ago

Hi @JuanitaGil , The acronym 'DP' may occur in multiple locations in VCF data. I suspect you are finding it in one location but the process of creating a chrome object is looking for DP in a different place.

I typically look for 'DP' in the 'gt' slot where each genotype may have a 'DP' associated with it. This is what you get from extract.gt(). When you use create.chromR() it's looking for 'DP' in the 'INFO' column, which is usually a summary over all samples. I do not see this in the information you've provided and suspect it may be the issue you've encountered. However, because you do appear to have 'DP' in the genotype region, I think we can create a 'fix' for this data, if you'd like.

If you'd like to pursue this I suggest you review the following suggestions.

https://knausb.github.io/vcfR_documentation/reporting_issue.html

If you can create a minimal reproducible example it should help me 'validate' instead of 'suspect'. Even if you can only get your example 'close' to the behavior you're experiencing, we should be able to modify it together to first reproduce the issue you're experiencing and then work towards a solution.