samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

"bcftools csq" introducing reference-based bias #2094

Closed mmokrejs closed 7 months ago

mmokrejs commented 8 months ago

Hi, I think bcftools csq v1.1[7,9] is terribly broken in its predictions as it infers the remaining two nucleotides (to fill up a codon) from the reference sequence instead of parsing the real nucleotides from the input reads.

Here is an example. The reference contains AAG codon at position 1249-1251 which encodes Lysine (K).

The sample reads contain AAA (Lysine, K) and ATA (Isoleucine, I).

When bcftools csq --local-csq predicts the consequence of A1250T it grabs the assumption from the reference that the adjacent nucleotide 1251 is G and due to that it assumes the aminoacid change in the reads is ATG (Methionine, M). That is wrong. There are no reads evidencing there was ATG, ever in that position. Neither in the reference sequence nor in the sample-based reads.

In the haplotype-aware mode bcftools csq predicts synonymous AAG to AAA (when interpreting the position 1251 ) but completely misses when interpreting position 1250 the fact ATA (Isoleucine, I) is present in the sample in 30%.

Local-only csq caling:

$ bcftools mpileup -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/DPR,INFO/AD,INFO/ADF,INFO/ADR,INFO/DPR -d 200000000 -f /tmp/7-WU-FF1.fasta /tmp/testcases.sam | bcftools call --ploidy 1 --threads 32 -mA -Ov | bcftools csq --local-csq -c CSQ -f /tmp/7-WU-FF1.fasta -g /tmp/7-WU-FF1.gff3 -p m -n 150 | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' 
...
7-WU-FF 1249    .   A   .   284.59  .   DP=30;ADF=30;ADR=0;AD=30;DPR=30;MQ0F=0;AN=1;DP4=30,0,0,0;MQ=60;TYPE=REF GT:DP:ADF:ADR:AD:DPR    0:30:30:0:30:30
7-WU-FF 1250    .   A   T   30.0483 .   DP=30;ADF=20,10;ADR=0,0;AD=20,10;DPR=20,10;VDB=6.99472e-07;SGB=-0.670168;RPBZ=0;MQBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=0;AN=1;DP4=20,0,10,0;MQ=60;CSQ=missense|S|43740568|protein_coding|+|91K>91M|1250A>T;AF=0;MAF=0;TYPE=SNP  GT:PL:DP:ADF:ADR:AD:DPR:VAF:VAF1    0:245,255:30:20,10:0,0:20,10:20,10:0.333333:0.333333
7-WU-FF 1251    .   G   A   225.417 .   DP=30;ADF=0,30;ADR=0,0;AD=0,30;DPR=0,30;VDB=3.84674e-18;SGB=-0.693097;MQ0F=0;AC=1;AN=1;DP4=0,0,30,0;MQ=60;CSQ=synonymous|S|43740568|protein_coding|+|91K|1251G>A;AF=1;MAF=0;TYPE=SNP    GT:PL:DP:ADF:ADR:AD:DPR:CSQ:VAF:VAF11:255,0:30:0,30:0,0:0,30:0,30:1:1:1

Haplotype-aware csq caling:

$ bcftools mpileup -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/DPR,INFO/AD,INFO/ADF,INFO/ADR,INFO/DPR -d 200000000 -f /tmp/7-WU-FF1.fasta /tmp/testcases.sam | bcftools call --ploidy 1 --threads 32 -mA -Ov | bcftools csq -c CSQ -f /tmp/7-WU-FF1.fasta -g /tmp/7-WU-FF1.gff3 -p m -n 150 | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1'
...
7-WU-FF 1249    .   A   .   284.59  .   DP=30;ADF=30;ADR=0;AD=30;DPR=30;MQ0F=0;AN=1;DP4=30,0,0,0;MQ=60;TYPE=REF GT:DP:ADF:ADR:AD:DPR    0:30:30:0:30:30
7-WU-FF 1250    .   A   T   30.0483 .   DP=30;ADF=20,10;ADR=0,0;AD=20,10;DPR=20,10;VDB=6.99472e-07;SGB=-0.670168;RPBZ=0;MQBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=0;AN=1;DP4=20,0,10,0;MQ=60;AF=0;MAF=0;TYPE=SNP   GT:PL:DP:ADF:ADR:AD:DPR:VAF:VAF1    0:245,255:30:20,10:0,0:20,10:20,10:0.333333:0.333333
7-WU-FF 1251    .   G   A   225.417 .   DP=30;ADF=0,30;ADR=0,0;AD=0,30;DPR=0,30;VDB=3.84674e-18;SGB=-0.693097;MQ0F=0;AC=1;AN=1;DP4=0,0,30,0;MQ=60;CSQ=synonymous|S|43740568|protein_coding|+|91K|1251G>A;AF=1;MAF=0;TYPE=SNP    GT:PL:DP:ADF:ADR:AD:DPR:CSQ:VAF:VAF11:255,0:30:0,30:0,0:0,30:0,30:1:1:1

testcases.sam.txt 7-WU-FF1.gff3.txt 7-WU-FF1.fasta.txt

BTW: It would be handy to have an --offset argument to increase the nucleotide positions by certain amount, for example if one works with amplicon sequences it is helpful to obtain nt and aa positions like in the real, full-length gene and not to have always recalculate K91 into its real positions manually by adding to it say 978 (at the DNA level) to get K417 ( 91 + (978/3.0) = 417 ).

The https://github.com/virus-evolution/gofasta tool does the csq calling correctly:

$ cat /tmp/7-WU-FF.fasta > /tmp/testcases.aln
$ ./gofasta sam toMultiAlign -r /tmp/7-WU-FF1.fasta -s /tmp/testcases.sam >> /tmp/testcases.aln
$ ./gofasta variants -a /tmp/7-WU-FF.gff -r 7_WU_FF --msa /tmp/testcases.aln --aggregate --append-snps
mutation,frequency
aa:S:K417I(nuc:A1250T;nuc:G1251A),0.333333333
nuc:G1251A,0.666666667
$

testcases.aln.txt 7-WU-FF.gff.txt

Ideally, the tool(s) would also output the codon frequence in a single sweep.

pd3 commented 7 months ago

I think bcftools csq v1.1[7,9] is terribly broken

The program works as expected. Maybe terribly off is your understanding of how the program works?

First, let's break your pipeline into constituent steps. In variant calling, you requested --ploidy 1, therefore forcing the model to choose between A and T at the position 1250. The allelic depths at that position are 20 reads in support of A and 10 reads in support of T. At the position 1251 there were 0 reads in support of G and 30 in support of A. No wonder the caller did choose the reference allele A for the first and the alternate allele A for the latter.

The consequence caller then made the right decision of working with the AAA, the genotypes being 0,0,1 (ref,ref,alt). In both the local and the haplotype-aware mode it used correctly 1251G>A and never ATG as you are suggesting in your post.

Your report is confusing and incoherent. The very example you are showing demonstrates that the output is as expected.

Maybe you'd like to use call --ploidy 2 .. | csq -p a instead, then you would see consequences called for both haplotypes.

Frequencies might be a nice addition, please open a separate feature request.