samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
671 stars 240 forks source link

bcftools mpileup v1.15 (wrongly) does not call SNPs when bcftools mpileup v1.10.2 does #1721

Open alnam3 opened 2 years ago

alnam3 commented 2 years ago

Hi,

I am using bcftools consensus using v1.15 and got SNPs which were weirdly not called, so I went back to the vcf files. I have SNPs which I know exist (I am working on simulated reads) which are called when using mpileup from version 1.10.2 and not when using version 1.15, even if a lot of alternative sequences are found (see below). DP4 changes too. I use the same reference file and the same bam file. I used the options -q 10 -d 15000 in both cases.

For instance: with v1.10.2 gene1 51 . A T 22.268 . DP=17932;VDB=0;SGB=-0.693147;RPB=1.51927e-05;MQB=0;BQB=0.0361916;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=4774,0,3504,0;MQ=49 GT:PL 0/1:55,0,70

with v1.15 gene1 51 . A . 17.2269 . DP=17932;VDB=0;SGB=-0.693147;RPBZ=-6.58337;MQBZ=-88.0103;BQBZ=-2.09656;SCBZ=5.45755;FS=0;MQ0F=0;AN=2;DP4=10424,0,7508,0;MQ=49 GT 0/0

I read manual pages for both versions and cannot find a default parameter which changed. Thanks a lot!

jkbonfield commented 2 years ago

There were a lot of changes to mpileup. See the release notes in https://github.com/samtools/bcftools/releases/tag/1.13.

You may be able to just add --config 1.12 when using 1.15 to get the old behaviour, but note while there will be specific calls that the old version correctly made and the new one does not, overall the changes are a significant improvement in SNP precision vs recall tradeoffs. So I'd only advise using the old parameters if you absolutely need compatibility with old versions.

pd3 commented 2 years ago

It would be interesting to see what the missed SNV looked like. It would be helpful to see an IGV screenshot and the raw mpileup output for that site.

alnam3 commented 2 years ago

Thanks for your answers, I will go and look at the release notes more in depth! @pd3 I'll come back with an IGV screenshot soon. I tested the latest release on simulated nanopore reads, for which I know the mutation, and played a bit with the options. There are much fewer missed SNPs when I increase the -P, but there are still some, even with (unrealistically) high values for P, and in some positions, SNPs aren't called even if there is a really high coverage for the alternative base, so there seems to be an issue.

For instance, using mpileup -q 10 -d 75000 -a AD --config ont and call -mA, I get a problem at the following position

gene1        201     .       A       T,C,G   2.08851 .       DP=19787;VDB=0;SGB=-0.693147;RPBZ=-2.78045;MQBZ=0;BQBZ=0.28492;SCBZ=0.114625;FS=0;MQ0F=0;AC=0,0,0;AN=2;DP4=10934,0,8819,0;MQ=60 GT:PL:AD
        0/0:0,8,29,255,255,74,255,255,78,74:10934,8618,108,93

where the SNP is not called even if 8000 reads support it. I thought it might be for matters of ploidy, but there are position with similar distributions of reference and alternative reads which are properly called, e.g. in the same call:

gene1        192     .       T       A,C,G   37.9403 .       DP=19848;VDB=0;SGB=-0.693147;RPBZ=-28.0788;MQBZ=0;BQBZ=0.55589;SCBZ=0.184087;FS=0;MQ0F=0;AC=1,0,0;AN=2;DP4=10754,0,9060,0;MQ=60 GT:PL:AD
        0/1:41,0,50,255,255,108,255,255,112,108:10754,8870,101,89

Note: I have all the alternative bases called, which is quite unrealistic, but there are a lot of errors in nanopore sequences so I tried to take that into account in my sequence simulation. Thanks again.

pd3 commented 2 years ago

This would require also the BAM and the reference for testing. Showing just the result again is not helpful.

alnam3 commented 2 years ago

Sorry for the delay, I had a cluster issue and lost all my intermediate files, I will provide the files asap.

Iredcg commented 1 year ago

I am using bcftools v1.15to call variants. However, I also found out checking the vcf that SNPs were not called at all. Is there yet a solution to this problem? Thank you so much

pd3 commented 1 year ago

@Iredcg Variant calling is and always will be a compromise between sensitivity and specificity. If you have a specific case you'd like to discuss, please open a new issue and provide a small test case to reproduce the behavior. It is impossible to respond to general queries like this. Note this issue carries the label "requires test case". Sorry that I can't be of more help here.