samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
669 stars 240 forks source link

bcftools v1.17 no longer finds known variants from v1.8 #1930

Open fd-bnt opened 1 year ago

fd-bnt commented 1 year ago

Hi all,

I inherited a setup that uses bcftools 1.8 for variant calling and decided to update to v1.17, but can no longer reproduce the (known) INDEL calls for some samples no matter how much i relax indel calling parameters.

The test experiment we set up was NGS sequencing of a plasmid library (thus the coverages in the thousands) with different concentrations (1%, 5%, 10%) of spike ins of a slightly modified plasmid containing 3 known INDEL variants at positions 115, 138 and 148.

In the original workflow we could detect all 3 positions using bcftools v1.8 at approximately the correct frequencies. Now after switching bcftools to version 1.17 I can no longer detect any of the 3 INDELs in the 1% or 5% samples while the 10% sample produces the same output for version 1.8 and 1.17.

I tried the experimental –indels-2.0 but unfortunately it did not finish after 6h so we tried to relax all other parameters using:

# Parameters for bcftools mpileup
-a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 --indel-bias 2.0 --open-prob 20

# Parameters for bcftools call
--ploidy 1 -A -m -P 0

For 1.8 and a 1.17 “normal” run I used the same values, except –indel-bias and –open-prob were left as defaults.

I added the variant call results below. Why can I no longer detect the INDELS in the new version? Is this an expected result?

Results of bcftools 1.8 vs 1.17

V1.17 5% not finding any INDELs at expected positions 115, 138, 148: ``` name pos id ref alt qual filter info format data q-cutoff version library spike-in Q0_1.17_L14_relax 115 . A C,G,T 0.669821 lowQual DP=12509;ADF=10439,32,5,7;ADR=2025,0,1,0;AD=12464,32,6,7;VDB=4.60008e-05;SGB=-0.693147;RPBZ=5.81608;MQBZ=-0.254136;MQSBZ=0.034046;BQBZ=-7.58792;SCBZ=0.754134;MQ0F=0;AC=0,0,0;AN=1;DP4=10439,2025,44,1;MQ=59 GT:PL 0:0,255,255,255 Q0 1.17 L14 5% Q0_1.17_L14_relax 138 . G A,T,C 0.670060 lowQual DP=13128;ADF=10675,21,18,22;ADR=2391,0,1,0;AD=13066,21,19,22;VDB=0.927333;SGB=-0.693147;RPBZ=5.84939;MQBZ=0.00208167;MQSBZ=0.0142642;BQBZ=-9.31408;SCBZ=-0.65444;MQ0F=0;AC=0,0,0;AN=1;DP4=10675,2391,61,1;MQ=59 GT:PL 0:0,255,255,255 Q0 1.17 L14 5% Q0_1.17_L14_relax 148 . C T,A 1.249830 lowQual DP=13118;ADF=10387,34,0;ADR=2692,4,1;AD=13079,38,1;VDB=0.00119279;SGB=-0.693144;RPBZ=8.71266;MQBZ=0.00165207;MQSBZ=0.0153911;BQBZ=-8.53877;SCBZ=-0.186703;MQ0F=0;AC=0,0;AN=1;DP4=10387,2692,34,5;MQ=59 GT:PL 0:0,255,255 Q0 1.17 L14 5% ``` V1.8 5% ``` name pos id ref alt qual filter info format data q-cutoff version library spike-in Q0_1.8_L14 115 . A C,G,T 0.669819 lowQual DP=12509;ADF=10439,32,5,7;ADR=2025,0,1,0;AD=12464,32,6,7;VDB=6.93008e-05;SGB=-0.693147;RPB=4.45644e-08;MQB=0.968236;MQSB=0.999421;BQB=1.54514e-13;MQ0F=0;AC=0,0,0;AN=1;DP4=10439,2025,44,1;MQ=59 GT:PL 0:0,255,255,255 Q0 1.8 L14 5% Q0_1.8_L14 115 . AC A 3.036510 PASS INDEL;IDV=591;IMF=0.0472422;DP=12510;ADF=9941,543;ADR=1938,88;AD=11879,631;VDB=0.504203;SGB=-0.693147;MQSB=0.999421;MQ0F=0;AC=0;AN=1;DP4=9941,1938,543,88;MQ=59 GT:PL 0:0,255 Q0 1.8 L14 5% Q0_1.8_L14 138 . G A,T,C 0.670058 lowQual DP=13128;ADF=10675,21,18,22;ADR=2391,0,1,0;AD=13066,21,19,22;VDB=0.93761;SGB=-0.693147;RPB=3.62158e-08;MQB=0.999998;MQSB=0.999898;BQB=1.03409e-19;MQ0F=0;AC=0,0,0;AN=1;DP4=10675,2391,61,1;MQ=59 GT:PL 0:0,255,255,255 Q0 1.8 L14 5% Q0_1.8_L14 138 . GC G 3.025680 PASS INDEL;IDV=632;IMF=0.0481377;DP=13129;ADF=10164,573;ADR=2288,104;AD=12452,677;VDB=0.19033;SGB=-0.693147;MQSB=0.999898;MQ0F=0;AC=0;AN=1;DP4=10164,2288,573,104;MQ=59 GT:PL 0:0,255 Q0 1.8 L14 5% Q0_1.8_L14 148 . C T,A 1.249830 lowQual DP=13118;ADF=10387,34,0;ADR=2692,4,1;AD=13079,38,1;VDB=0.00155996;SGB=-0.693144;RPB=3.66195e-17;MQB=0.999999;MQSB=0.999882;BQB=1.2957e-16;MQ0F=0;AC=0,0;AN=1;DP4=10387,2692,34,5;MQ=59 GT:PL 0:0,255,255 Q0 1.8 L14 5% Q0_1.8_L14 148 . CTTT CTT 3.039690 PASS INDEL;IDV=655;IMF=0.0499314;DP=13118;ADF=9783,638;ADR=2548,149;AD=12331,787;VDB=0.0882275;SGB=-0.693147;MQSB=0.999882;MQ0F=0;AC=0;AN=1;DP4=9783,2548,638,149;MQ=59 GT:PL 0:0,255 Q0 1.8 L14 5% ``` V1.17 10% finding all 3 INDELS ``` name pos id ref alt qual filter info format data q-cutoff version library spike-in Q0_1.17_L15_relax 115 . A C,T,G 0.669860 lowQual DP=16371;ADF=13557,48,9,5;ADR=2751,0,1,0;AD=16308,48,10,5;VDB=0.000196249;SGB=-0.693147;RPBZ=6.05133;MQBZ=-0.431482;MQSBZ=-0.102198;BQBZ=-8.76334;SCBZ=0.873;MQ0F=0;AC=0,0,0;AN=1;DP4=13557,2751,62,1;MQ=59 GT:PL 0:0,255,255,255 Q0 1.17 L15 10% Q0_1.17_L15_relax 115 . AC A 3.297230 PASS INDEL;IDV=1493;IMF=0.0911589;DP=16378;ADF=12397,1229;ADR=2488,264;AD=14885,1493;VDB=0.101671;SGB=-0.693147;RPBZ=0.728506;MQBZ=0.029997;MQSBZ=-0.102198;BQBZ=-8.76334;SCBZ=-1.34247;MQ0F=0;AC=0;AN=1;DP4=12397,2488,1229,264;MQ=59 GT:PL 0:0,152 Q0 1.17 L15 10% Q0_1.17_L15_relax 138 . G A,T,C 0.669957 lowQual DP=17241;ADF=13907,26,22,33;ADR=3253,0,0,0;AD=17160,26,22,33;VDB=0.457625;SGB=-0.693147;RPBZ=7.28345;MQBZ=-0.186605;MQSBZ=-0.0565974;BQBZ=-11.803;SCBZ=0.339357;MQ0F=0;AC=0,0,0;AN=1;DP4=13907,3253,81,0;MQ=59 GT:PL 0:0,255,255,255 Q0 1.17 L15 10% Q0_1.17_L15_relax 138 . GC G 3.289400 PASS INDEL;IDV=1649;IMF=0.0956275;DP=17244;ADF=12664,1327;ADR=2931,322;AD=15595,1649;VDB=0.246366;SGB=-0.693147;RPBZ=-2.14384;MQBZ=0.0300275;MQSBZ=-0.0565974;BQBZ=-11.803;SCBZ=-0.0819827;MQ0F=0;AC=0;AN=1;DP4=12664,2931,1327,322;MQ=59 GT:PL 0:0,167 Q0 1.17 L15 10% Q0_1.17_L15_relax 148 . C T,A,G 0.669713 lowQual DP=17330;ADF=13655,44,4,4;ADR=3621,2,0,0;AD=17276,46,4,4;VDB=0.0780127;SGB=-0.693147;RPBZ=8.97576;MQBZ=0.00735484;MQSBZ=-0.0617568;BQBZ=-10.2599;SCBZ=-0.0513985;MQ0F=0;AC=0,0,0;AN=1;DP4=13655,3621,52,2;MQ=59 GT:PL 0:0,255,255,255 Q0 1.17 L15 10% Q0_1.17_L15_relax 148 . CTTT CTT 3.279090 PASS INDEL;IDV=1650;IMF=0.0952051;DP=17331;ADF=12402,1306;ADR=3279,344;AD=15681,1650;VDB=0.111354;SGB=-0.693147;RPBZ=-1.79498;MQBZ=0.0426755;MQSBZ=-0.0617568;BQBZ=-10.2599;SCBZ=0.799944;MQ0F=0;AC=0;AN=1;DP4=12402,3279,1306,344;MQ=59 GT:PL 0:0,173 Q0 1.17 L15 10% ``` V1.8 10% ``` name pos id ref alt qual filter info format data q-cutoff version library spike-in Q0_1.8_L15 115 . A C,T,G 0.669850 lowQual DP=16371;ADF=13557,48,9,5;ADR=2751,0,1,0;AD=16308,48,10,5;VDB=0.000271031;SGB=-0.693147;RPB=1.23442e-08;MQB=0.911134;MQSB=0.994793;BQB=6.4333e-19;MQ0F=0;AC=0,0,0;AN=1;DP4=13557,2751,62,1;MQ=59 GT:PL 0:0,255,255,255 Q0 1.8 L15 10% Q0_1.8_L15 115 . AC A 3.066710 PASS INDEL;IDV=1493;IMF=0.0911589;DP=16378;ADF=12307,1319;ADR=2476,276;AD=14783,1595;VDB=0.655657;SGB=-0.693147;MQSB=0.994791;MQ0F=0;AC=0;AN=1;DP4=12307,2476,1319,276;MQ=59 GT:PL 0:0,255 Q0 1.8 L15 10% Q0_1.8_L15 138 . G A,T,C 0.669956 lowQual DP=17241;ADF=13907,26,22,33;ADR=3253,0,0,0;AD=17160,26,22,33;VDB=0.491096;SGB=-0.693147;RPB=3.08703e-12;MQB=0.982737;MQSB=0.998399;BQB=3.60759e-31;MQ0F=0;AC=0,0,0;AN=1;DP4=13907,3253,81,0;MQ=59 GT:PL 0:0,255,255,255 Q0 1.8 L15 10% Q0_1.8_L15 138 . GC G 3.041500 PASS INDEL;IDV=1649;IMF=0.0956275;DP=17244;ADF=12608,1383;ADR=2926,327;AD=15534,1710;VDB=0.307998;SGB=-0.693147;MQSB=0.998399;MQ0F=0;AC=0;AN=1;DP4=12608,2926,1383,327;MQ=59 GT:PL 0:0,255 Q0 1.8 L15 10% Q0_1.8_L15 148 . C T,A,G 0.669694 lowQual DP=17330;ADF=13655,44,4,4;ADR=3621,2,0,0;AD=17276,46,4,4;VDB=0.0996844;SGB=-0.693147;RPB=3.41775e-18;MQB=0.999973;MQSB=0.998094;BQB=2.92277e-24;MQ0F=0;AC=0,0,0;AN=1;DP4=13655,3621,52,2;MQ=59 GT:PL 0:0,255,255,255 Q0 1.8 L15 10% Q0_1.8_L15 148 . CTTT CTT 3.064180 PASS INDEL;IDV=1650;IMF=0.0952051;DP=17331;ADF=12215,1493;ADR=3222,401;AD=15437,1894;VDB=0.184847;SGB=-0.693147;MQSB=0.998094;MQ0F=0;AC=0;AN=1;DP4=12215,3222,1493,401;MQ=59 GT:PL 0:0,255 Q0 1.8 L15 10% ```

pd3 commented 1 year ago

Is there any chance you could provide a small test case? I am unable to comment without seeing the data unfortunately.

jkbonfield commented 1 year ago

I produced some test files by starting with a bit of covid-19 data and replicating up one record that had an errant 1bp deletion.

https://github.com/samtools/bcftools/pull/1824 broke the call. In theory this was meant to be guarded behind --indels-2.0, but it also changed the normal behaviour. Ie try builds from (currently) git checkout develop~49 vs git checkout develop~48.

My test data at Sanger is seq4d:/local/scratch01/jkb/3000.sam and 1000.sam. 3000.sam is about 14% af while 3000 is 33%. Oddly I can't get the 1000.sam to call with 1.8 either, but maybe I'm not using the correct options.,

fd-bnt commented 1 year ago

Is there any chance you could provide a small test case? I am unable to comment without seeing the data unfortunately.

Unfortunately I can not share the exact sequences. I may be able to generate an artificial dataset with similar characteristics for testing purposes.

jkbonfield commented 1 year ago

For #1824 bug, we can already reproduce it locally.

# on seq4d
cd /local/scratch01/jkb
bcftools mpileup -B -L 99999 -d 999999 --fasta-ref /nfs/srpipe_references/references/SARS-CoV-2/MN908947.3/all/fasta/MN908947.3.fa -Q 1 -a AD 3000.sam 2>/dev/null |bcftools call -vm - 2>/dev/null|grep -v '^#'

MN908947.3  78  .   TAAAA   TAAA    44.2646 .   INDEL;IDV=3001;IMF=0.330252;DP=9087;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2016;SCBZ=-0.612311;FS=0;MQ0F=0;AC=1;AN=2;DP4=6083,0,3001,0;MQ=60   GT:PL:AD    0/1:77,0,59:6083,3001

That's with b5cbcd5d. One commit later, 9d948cf3, no longer finds this. Comparing the two lines of mpileup (ie not call) I see:

#fails
MN908947.3  78  .   TAAAA   TAAA    0   .   INDEL;IDV=3001;IMF=0.330252;DP=9087;I16=6086,0,3001,0,243440,9.7376e+06,120040,4.8016e+06,365160,2.19096e+07,180060,1.08036e+07,152150,3.80375e+06,75025,1.87562e+06;QS=0.589033,0.410967;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2308;SCBZ=-0.612311;FS=0;MQ0F=0   PL:AD   24,0,27:6086,3001

#works
MN908947.3  78  .   TAAAA   TAAA    0   .   INDEL;IDV=3001;IMF=0.330252;DP=9087;I16=6083,0,3001,0,243320,9.7328e+06,120040,4.8016e+06,364980,2.18988e+07,180060,1.08036e+07,152075,3.80188e+06,75025,1.87562e+06;QS=0.588985,0.411015;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2016;SCBZ=-0.612311;FS=0;MQ0F=0   PL:AD   77,0,59:6083,3001

These are almost identical, so I'm not sure what broke. It's possibly the call side of things.

Neither can cope with the 1000.sam file, which shows this from bcftools mpileup:

MN908947.3  78  .   TAAAA   TAAA    0   .   INDEL;IDV=1001;IMF=0.141245;DP=7087;I16=6083,0,1001,0,243320,9.7328e+06,40040,1.6016e+06,364980,2.18988e+07,60060,3.6036e+06,152075,3.80188e+06,25025,625625;QS=0.811183,0.188817;VDB=0;SGB=-0.693147;RPBZ=-0.333316;MQBZ=0;BQBZ=-2.56641;NMBZ=-14.4851;SCBZ=-0.400981;FS=0;MQ0F=0  PL:AD   0,235,67:6083,1001

I don't have time to delve into this, and I expect @pd3 has a much better understanding of what changed in that large merge.

jkbonfield commented 1 year ago

Ah yes I can confirm using the older mpileup piped into the newer call fails, but piped into older call works. That was my test data anyway. That imlies the change we need to investigate is on the call side and not pileup side.

Can you confirm that you see the same behaviour?

pd3 commented 1 year ago

Unfortunately I can not share the exact sequences. I may be able to generate an artificial dataset with similar characteristics for testing purposes.

@fd-bnt Could you perhaps create an anonymized test with https://github.com/pd3/mpileup-tests/blob/main/misc/create-bam-test?

pd3 commented 1 year ago

I tried the experimental –indels-2.0 but unfortunately it did not finish after 6h

I just pushed a commit that will likely fix the issue of --indels-2.0 never finishing.

pd3 commented 1 year ago
# on seq4d
cd /local/scratch01/jkb
bcftools mpileup -B -L 99999 -d 999999 --fasta-ref /nfs/srpipe_references/references/SARS-CoV-2/MN908947.3/all/fasta/MN908947.3.fa -Q 1 -a AD 3000.sam 2>/dev/null |bcftools call -vm - 2>/dev/null|grep -v '^#'

MN908947.3    78  .   TAAAA   TAAA    44.2646 .   INDEL;IDV=3001;IMF=0.330252;DP=9087;VDB=0;SGB=-0.693147;RPBZ=-0.510721;MQBZ=0;BQBZ=-2.5731;NMBZ=-22.2016;SCBZ=-0.612311;FS=0;MQ0F=0;AC=1;AN=2;DP4=6083,0,3001,0;MQ=60   GT:PL:AD    0/1:77,0,59:6083,3001

I can see the likelihoods changed significantly in between these versions, that should not happen. This is a good test case, thanks for that.

pd3 commented 1 year ago

What I found thus far:

1) The --indels-2.0 calls the indel variant okay, so likely there is a good chance it will help, @fd-bnt.

2) Regarding the difference between https://github.com/samtools/bcftools/commit/b5cbcd5d047a8478ac892d8fec4a5c4ffc4e0a48 and https://github.com/samtools/bcftools/commit/9d948cf3a77b5e13028d87c9c5bf2c80012af7a5, it comes down to just three reads

A00971:186:H75HJDRXY:1:2258:24361:29277
A00971:186:H75HJDRXY:1:2227:4670:21950 
A00971:186:H75HJDRXY:1:2158:3531:14465

which are included by the new version but not by the old version.

The rest seems to be a consequence of random selection of 255 or so reads in errmod_cal, where different reads happen to be sampled, I believe (to be verified). In the end this leads to a ~4000x fold change in PL estimate for the heterozygous genotype. Needless to say, this is incredibly fragile, one certainly expects similar answer given the deep coverage data.

fd-bnt commented 1 year ago

Hey,

So i can confirm a newly built bcftools is finishing with --indels-2.0 when run on my original data sets. Though the INDEL calling is still missing the INDELs detected by version 1.8.

Using mpileup -a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 --indels-2.0 and call --ploidy 1 -A -m -P 0 for version 1.17-52-g0773541c+htslib-1.17-29-g878cff4 i get:

test_name       115     .       A       C,G,T   0.669821        lowQual DP=12509;ADF=10439,32,5,7;ADR=2025,0,1,0;AD=12464,32,6,7;VDB=4.60008e-05;SGB=-0.693147;RPBZ=5.81608;MQBZ=-0.254136;MQSBZ=0.034046;BQBZ=-7.58792;SCBZ=0.754134;MQ0F=0;AC=0,0,0;AN=1;DP4=10439,2025,44,1;MQ=59    GT:PL   0:0,255,255,255
test_name       138     .       G       A,T,C   0.67006         lowQual DP=13128;ADF=10675,21,18,22;ADR=2391,0,1,0;AD=13066,21,19,22;VDB=0.927333;SGB=-0.693147;RPBZ=5.84939;MQBZ=0.00208167;MQSBZ=0.0142642;BQBZ=-9.31408;SCBZ=-0.65444;MQ0F=0;AC=0,0,0;AN=1;DP4=10675,2391,61,1;MQ=59 GT:PL   0:0,255,255,255
test_name       148     .       C       T,A     1.24983         lowQual DP=13118;ADF=10387,34,0;ADR=2692,4,1;AD=13079,38,1;VDB=0.00119279;SGB=-0.693144;RPBZ=8.71266;MQBZ=0.00165207;MQSBZ=0.0153911;BQBZ=-8.53877;SCBZ=-0.186703;MQ0F=0;AC=0,0;AN=1;DP4=10387,2692,34,5;MQ=59  GT:PL   0:0,255,255

I am trying to create an artificial data set I can share to re-create the behaviour, though it's not something i have done before so I am trying to figure out how to best produce this. If you have any best practice or similar for this it would be great.

pd3 commented 1 year ago

I'd suggest to use https://github.com/pd3/mpileup-tests/blob/main/misc/create-bam-test. It creates a slice of the bam file and the reference genome, so the new coordinates are completely nonsensical. It also renames the sample and read group.

This could go a bit further and also randomly shuffle A,C,G,T bases, but of course even that cannot prevent a motivated person from finding the source genome and the position when the sequence is unique enough.

Also read names could be removed.

Yet another level, you could share the data with us offline, we'll delete them after. My email address is on my github profile page.

Let me know if any of these enhancements would make sharing the data possible.

pd3 commented 1 year ago

I just added a new option -a, --anonymize to https://github.com/pd3/mpileup-tests/blob/main/misc/create-bam-test, which should add a sufficient level of anonymization for most data.

fd-bnt commented 1 year ago

I think I managed to create an artificial dataset that reproduces the behaviour as far as i can tell. I took the original mapping and replaced reference sequence and alignments read sequence in the sam-file column 10 according to the cigar string. The reads should now contain the deletions at the exact same place and frequency as the original dataset, while having a 100% random (reference)sequence. Though any mutations are lost, but for the current problem i think that is irrelevant.

I have attached the reference and mapping file as well as the results for mpileup v1.8 and v1.17. synthetic_sequences.zip synth_mpileup_results.zip

Produced with v1.8: bcftools mpileup -a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 -f synth.fa -o synth.bcf8.mpileup.vcf synth.sam (shows INDELs)

and v1.17: bcftools mpileup -a INFO/AD,INFO/ADF,INFO/ADR --max-depth 200000 --max-idepth 200000 -Q 0 --indels-2.0 -f synth.fa -o synth.mpileup.vcf synth.sam (missing INDELs)

I hope this is reproducible.

pd3 commented 1 year ago

@fd-bnt Thank you, got it and can confirm that it is reproducible. The difference happened between 1.12 and 1.13.