samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
659 stars 240 forks source link

bcftools can not call low allele frequency #1707

Open YS-HZAU opened 2 years ago

YS-HZAU commented 2 years ago

Hello there, I'm doing some SNP calling with samtools mpileup -Q 0 -uvf | bcftools -vm -Oz but after running the pipeline, I noticed that all my SNPs are with frequencies above 20% (calculate using (DP4[3]+DP4[2]) /(DP4[3]+DP4[2]+DP4[1]+DP4[0]). I don't know why. Is there a default paremeter for control the frequency filltering ??? How can I get the rest one (<20%)??? Hope for help!!! Thanks!!!

YS-HZAU commented 2 years ago

my bcftools is Version: 1.10.2 (using htslib 1.10.2-3) I'm thinking increase the -P parameter to get the rest sites with AF < 20%, but what value shuold I use? Is increase -P help?

Nazeim commented 2 years ago

I also have the same issue, please could you support to solve it? @pd3

YS-HZAU commented 2 years ago

Well, I tried -P 0.001 (instead of the older 1.1e-3) to work on this problem, still, I got much more SNVs than without -P, none of these sites have a AF less than 20%. So, what's the problem? And another question arose, I want to calculate the AF by DP4 as above, but if I do my calling without -Q 0, all DP4 only show high quality mapped reads, eg: DP=21; DP4: 0,3,5,6; Based on DP4, the AF of this site is (5+6)/(0+3+5+6)=78.6%; However, it can not stand for the real situation because sum of DP4 is not equal to DP. What if the real situation of this site is DP=21; DP4: 7,3,5,6; So the AF becomes (5+6)/(7+3+5+6)=52.4% (7 stands for the reads with low map quality). So I have to use -Q 0 to get all reads that cover this site, which will cause much more SNVs than without -Q 0. But I only want to get the real AF for SNVs I detected without -Q 0. It is really confusing.

jkbonfield commented 2 years ago

I can repeat this using our own test data by artificially stuffing one allele much more frequently than the other:

$ (samtools view -h test/mpileup/mpileup.1.bam;for i in `seq 1 14`;do samtools view test/mpileup/mpileup.1.bam|egrep '\.(484376|216886|9045126|1210272|798432)';done )|samtools sort -o _.bam; samtools index _.bam

$ bcftools mpileup --fasta-ref test/mpileup/mpileup.ref.fa _.bam -r 17:2041 2>/dev/null | egrep -v '^#'
17  2041    .   G   A,<*>   0   .   DP=91;I16=47,33,7,4,2809,100077,407,15639,4800,288000,629,36841,1764,40416,209,4315;QS=0.876717,0.123283,0;VDB=0.768075;SGB=-0.676189;RPBZ=-0.905203;MQBZ=-2.6968;MQSBZ=0.827759;BQBZ=2.24194;SCBZ=2.6968;FS=0;MQ0F=0   PL  46,0,255,255,255,255

$ bcftools mpileup --fasta-ref test/mpileup/mpileup.ref.fa _.bam -r 17:2000-2100 2>/dev/null| bcftools call -vm - 2>/dev/null| egrep -v '^#'
17  2041    .   G   A   10.1985 .   DP=91;VDB=0.768075;SGB=-0.676189;RPBZ=-0.905203;MQBZ=-2.6968;MQSBZ=0.827759;BQBZ=2.24194;SCBZ=2.6968;FS=0;MQ0F=0;AC=1;AN=2;DP4=47,33,7,4;MQ=59  GT:PL   0/1:46,0,255

That's with 14 extra copies of a bunch of ref-matching sequences. If I change the for i in $(seq 1 14) to seq 1 15 then mpileup still reports the SNP as the most probably event in the PL field, but bcftools call fails to call it:

$ (samtools view -h test/mpileup/mpileup.1.bam;for i in `seq 1 15`;do samtools view test/mpileup/mpileup.1.bam|egrep '\.(484376|216886|9045126|1210272|798432)';done )|samtools sort -o _.bam; samtools index _.bam

$ bcftools mpileup --fasta-ref test/mpileup/mpileup.ref.fa _.bam -r 17:2041 2>/dev/null | egrep -v '^#'
17  2041    .   G   A,<*>   0   .   DP=96;I16=50,35,7,4,2984,106288,407,15639,5100,306000,629,36841,1875,42947,209,4315;QS=0.883102,0.116898,0;VDB=0.768075;SGB=-0.676189;RPBZ=-0.920931;MQBZ=-2.7798;MQSBZ=0.82717;BQBZ=2.25885;SCBZ=2.7798;FS=0;MQ0F=0    PL  32,0,255,255,255,255

$ bcftools mpileup --fasta-ref test/mpileup/mpileup.ref.fa _.bam -r 17:2000-2100 2>/dev/null| bcftools call -vm - 2>/dev/null| egrep -v '^#'

So PL is still 32,0,255,.... meaning it things REF/ALT is the correct call, with 1 in 10^3.2 likelihood of it being pure REF (so almost never).

I can rescue this with call -P0.01, but it's a low confidence (6.3). At 16 copies it needs -P0.1 to call it, At 17 copies it refuses to ever call it, even with -P1. This is despite PL still claiming REF/ALT is the most likely outcome.

I don't entirely understand what's going on here, nor why the -P is on the call side and not the mpileup side when it's the mpileup step that computes the PL values and not the caller.

pd3 commented 2 years ago

BCFtools is a germline variant caller and assumes that the reads were sampled from a diploid genome. There are other variables entering the calculation, not just genotype likelihoods. Consider that in the last example there are 85 reference and 11 non-reference reads. The binomial probability of sampling this many or fewer alternate reads is 1.27367e-15.

The mutation rate prior -P does not affect the calculation of genotype likelihoods (obviously), therefore it is a parameter to call, not to mpileup.

If you are interested in a naive determination of genotypes based on the fraction of alternate reads (e.g. call a het if VAF>20% and VAF<80%), a trivial user script can do that and is not what bcftools aims to do.

The math notes that explain the details of the calling algorithm can be found here http://samtools.github.io/bcftools/call-m.pdf.