samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
664 stars 240 forks source link

Loss of data from genome VCF may be related to -g 10 #1596

Closed BrandonColbyMD closed 2 years ago

BrandonColbyMD commented 3 years ago

There appears to be a loss of lines of data when -g 10 is used to group blocks together.

Example

The following line is lost when -g 10 is used with bcftools v1.11 and the line is also lost with bcftools 1.13 regardless of whether -g 10 is used.

1 1607341 . C . 275 . MinDP=14;AN=2 GT:DP 0/0:14

WITH -g 10 (bcftools 1.11) bcftools mpileup source.bam --fasta-ref fasta.fa -g 0,100 -a AD,DP -Q 5 --regions 1:1607300-1607400 --output-type u | bcftools call --multiallelic-caller -f 'GQ' -g 10

1 1607300 . T . . . END=1607340;MinDP=14 GT:DP 0/0:14 1 1607341 . CTTTTTTTTTTTTTTTTTTTT CTTTTTTTTTTTTTTTTTT 21.7324 . INDEL;IDV=11;IMF=0.733333;DP=15;VDB=0.369045;SGB=-0.680642;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=3,0,8,4;MQ=60GT:PL:DP:AD:GQ 1/1:49,15,0:15:3,12:18 1 1607342 . T . 93 . MinDP=3;AN=2 GT:DP 0/0:3 1 1607343 . T . 30.0794 . DP=5;SGB=-0.379885;RPB=1;MQB=1;BQB=1;MQ0F=0;AN=2;DP4=3,0,1,0;MQ=60;MinDP=4 GT:DP:AD 0/0:4:3 1 1607344 . T . 46.5868 . END=1607366;MinDP=2;AN=2 GT:DP 0/0:2

WITHOUT -g 10 (bcftools v1.11) bcftools mpileup source.bam --fasta-ref fasta.fa -g 0,100 -a AD,DP -Q 5 --regions 1:1607300-1607400 --output-type u | bcftools call --multiallelic-caller -f 'GQ'

1 1607300 . T . 285 . END=1607304;MinDP=24;AN=2 GT:DP 0/0:24 1 1607305 . C . 29.695 . DP=25;SGB=-0.379885;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;AN=2;DP4=14,10,1,0;MQ=60 GT:DP:AD 0/0:25:24 1 1607306 . A . 285 . END=1607317;MinDP=22;AN=2 GT:DP 0/0:22 1 1607318 . A . 29.6743 . DP=23;SGB=-0.379885;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;AN=2;DP4=14,7,0,1;MQ=60 GT:DP:AD 0/0:22:21 1 1607319 . C . 285 . END=1607320;MinDP=22;AN=2 GT:DP 0/0:22 1 1607321 . G . 29.6951 . DP=23;SGB=-0.379885;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;AN=2;DP4=14,7,0,1;MQ=60 GT:DP:AD 0/0:22:21 1 1607322 . T . 285 . END=1607332;MinDP=20;AN=2 GT:DP 0/0:20 1 1607333 . T . 29.6625 . DP=20;SGB=-0.379885;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0;AN=2;DP4=11,7,1,0;MQ=60 GT:DP:AD 0/0:19:18 1 1607334 . T . 285 . END=1607339;MinDP=15;AN=2 GT:DP 0/0:15 1 1607340 . CCT . 5.26224 . INDEL;IDV=1;IMF=0.0666667;DP=15;VDB=0.357394;SGB=-0.670168;MQSB=1;MQ0F=0;AN=2;DP4=4,1,7,3;MQ=60 GT:DP:AD 0/0:15:5 1 1607341 . C . 275 . MinDP=14;AN=2 GT:DP 0/0:14 1 1607341 . CTTTTTTTTTTTTTTTTTTTT CTTTTTTTTTTTTTTTTTT 21.7324 . INDEL;IDV=11;IMF=0.733333;DP=15;VDB=0.369045;SGB=-0.680642;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=3,0,8,4;MQ=60GT:PL:DP:AD:GQ 1/1:49,15,0:15:3,12:18 1 1607342 . T . 93 . MinDP=3;AN=2 GT:DP 0/0:3 1 1607343 . T . 30.0794 . DP=5;SGB=-0.379885;RPB=1;MQB=1;BQB=1;MQ0F=0;AN=2;DP4=3,0,1,0;MQ=60 GT:DP:AD 0/0:4:3 1 1607344 . T . 46.5868 . END=1607366;MinDP=2;AN=2 GT:DP 0/0:2

--

The above was using bcftools v1.11. We also tested this using bcftools v1.13 and the results are interesting as removing -g 10 does not bring back the missing line as it did with v1.11, so the bug persists in v1.13 even when -g 10 is not used.

WITHOUT -g 10 (bcftools v1.13) bcftools mpileup source.bam --fasta-ref fasta.fa -g 0,100 -a AD,DP -Q 5 --regions 1:1607300-1607400 --output-type u | bcftools call --multiallelic-caller -f 'GQ'

bcftools_callVersion=1.13+htslib-1.13

bcftools_callCommand=call --multiallelic-caller -f GQ; Date=Wed Oct 13 18:18:37 2021

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NG1LZLXIA2

1 1607300 . T . 284.59 . END=1607304;MinDP=24;AN=2 GT:DP 0/0:24 1 1607305 . C . 29.695 . DP=25;SGB=-0.379885;RPBZ=-0.0693909;BQBZ=-1.83812;SCBZ=-0.368383;FS=0;MQ0F=0;AN=2;DP4=14,10,1,0;MQ=60 GT:DP:AD 0/0:25:24 1 1607306 . A . 284.59 . END=1607317;MinDP=22;AN=2 GT:DP 0/0:22 1 1607318 . A . 29.6743 . DP=23;SGB=-0.379885;RPBZ=-1.49995;BQBZ=-1.69926;SCBZ=-0.396098;FS=0;MQ0F=0;AN=2;DP4=14,7,0,1;MQ=60 GT:DP:AD 0/0:22:21 1 1607319 . C . 284.59 . END=1607320;MinDP=22;AN=2 GT:DP 0/0:22 1 1607321 . G . 29.6951 . DP=23;SGB=-0.379885;RPBZ=-1.49995;BQBZ=-1.71986;SCBZ=-0.396098;FS=0;MQ0F=0;AN=2;DP4=14,7,0,1;MQ=60 GT:DP:AD 0/0:22:21 1 1607322 . T . 284.59 . END=1607332;MinDP=20;AN=2 GT:DP 0/0:20 1 1607333 . T . 29.6625 . DP=20;SGB=-0.379885;RPBZ=0.640416;BQBZ=-1.71133;SCBZ=-0.342467;FS=0;MQ0F=0;AN=2;DP4=11,7,1,0;MQ=60 GT:DP:AD 0/0:19:18 1 1607334 . T . 261.589 . END=1607340;MinDP=14;AN=2 GT:DP 0/0:14 1 1607341 . CTTTTTTTTTTTTTTTTTTTT CTTTTTTTTTTTTTTTTTT 225.417 . INDEL;IDV=11;IMF=0.733333;DP=15;VDB=0.274834;SGB=-0.680642;RPBZ=-1.82945;SCBZ=-0.662051;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,8,4;MQ=60 GT:PL:DP:AD:GQ 1/1:255,36,0:12:0,12:127 1 1607342 . T . 92.5872 . MinDP=3;AN=2 GT:DP 0/0:3 1 1607343 . T . 30.0794 . DP=5;SGB=-0.379885;RPBZ=-1.41421;BQBZ=-1.41421;SCBZ=-0.57735;FS=0;MQ0F=0;AN=2;DP4=3,0,1,0;MQ=60 GT:DP:AD 0/0:4:3 1 1607344 . T . 160.588 . END=1607366;MinDP=6;AN=2 GT:DP 0/0:6

daviesrob commented 2 years ago

This looks like a bcftools problem, so I'll transfer the issue over there...

pd3 commented 2 years ago

Hi, thanks for bringing up this interesting issue.

The use of mpileup -g 0,100 combined with call -g 0,10 is not practical. The first creates big blocks (anything with depth 0-100 goes into a single bin), but the second with finer granularity cannot split these into smaller bins.

Next thing, there is a difference between running with -g 10 and -g 0,10. The mpileup documentation describes that the first value in the -g list defines the minimum depth of gVCF blocks and records smaller than that value are printed as is. That explains why the following records

17  3491    T   <*>
17  3492    T   <*>
17  3493    C   <*>
17  3493    C   CA

can give an output like this with -g 0,10

17  3491    T   <*>  END=3492
17  3493    C   CA

but this with -g 10

17  3491    T   <*>  END=3492
17  3493    C   <*>
17  3493    C   CA

Finally, it is a valid question to ask whether the record at 3493 with no variation should be part of the gVCF block or not as technically the indel starts one position after. BCFtools regard this as a site that should be dropped because it looks at the starting position only, not at the start of the variation.

I can see how the current behavior is debatable and that the gVCF block could expand to END=3493. However, since only non-variant sites are dropped this way and SNVs would be preserved, there is no harm. In case you can provide a motivation how this could be harmful on a real life example, I'm happy to reconsider. For now I don't see this as a bug, the program works as intended.

jferlee commented 2 years ago

@pd3 Thank you for the response. I just joined Brandon's team and have another question about the results from these parameters I was hoping you could explain. In all 3 cases above we end up with these lines after the InDel

1 1607342 . T . 93 . MinDP=3;AN=2 GT:DP 0/0:3 1 1607343 . T . 30.0794 . DP=5;SGB=-0.379885;RPB=1;MQB=1;BQB=1;MQ0F=0;AN=2;DP4=3,0,1,0;MQ=60;MinDP=4 GT:DP:AD 0/0:4:3 1 1607344 . T . 46.5868 . END=1607366;MinDP=2;AN=2 GT:DP 0/0:2

The InDel: 1 1607341 . CTTTTTTTTTTTTTTTTTTTT CTTTTTTTTTTTTTTTTTT 21.7324 . INDEL;IDV=11;IMF=0.733333;DP=15;VDB=0.369045;SGB=-0.680642;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=3,0,8,4;MQ=60GT:PL:DP:AD:GQ 1/1:49,15,0:15:3,12:18 1 Is a deletion of 2 T's in a homopolymer region, which if we use the left shifted VCF it would be positions 1607342 & 1607343 that are deleted. Those 2 positions are called out as a separate line in all the returns that Brandon tested but it appears it is being reported as reference. The block starting at position 1607344 includes the rest of the homopolymer as we would expect.

pd3 commented 2 years ago

First how the program works. It walks along the genome site by site and at each position checks the evidence for a SNV or an indel. The SNV is reported by mpileup always (even when there are no mismatches in the reads and in such case outputs ALT=. and GT=0/0), whereas indel is printed only when there is a supporting read. Normally you would pipe the through bcftools call -mv which removes non-variant sites. However, that's not what has been done in this case, therefore all records are printed. In other words, the program operates locally and can sometimes produce mutually conflicting variants emitted from nearby sites.

Another way to look at it, such VCF output can be interpreted as a set of "actions" that should be performed on the reference sequence to get the biological sequence. Reference genotypes (0/0) do nothing, non-reference genotypes replace the REF allele with the ALT allele.

As this is unrelated to the original problem, please open a new issue next time.