"bcftools csq" introducing reference-based bias

Hi, I think bcftools csq v1.1[7,9] is terribly broken in its predictions as it infers the remaining two nucleotides (to fill up a codon) from the reference sequence instead of parsing the real nucleotides from the input reads.

Here is an example. The reference contains AAG codon at position 1249-1251 which encodes Lysine (K).

The sample reads contain AAA (Lysine, K) and ATA (Isoleucine, I).

When bcftools csq --local-csq predicts the consequence of A1250T it grabs the assumption from the reference that the adjacent nucleotide 1251 is G and due to that it assumes the aminoacid change in the reads is ATG (Methionine, M). That is wrong. There are no reads evidencing there was ATG, ever in that position. Neither in the reference sequence nor in the sample-based reads.

In the haplotype-aware mode bcftools csq predicts synonymous AAG to AAA (when interpreting the position 1251 ) but completely misses when interpreting position 1250 the fact ATA (Isoleucine, I) is present in the sample in 30%.

Local-only csq caling:

$ bcftools mpileup -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/DPR,INFO/AD,INFO/ADF,INFO/ADR,INFO/DPR -d 200000000 -f /tmp/7-WU-FF1.fasta /tmp/testcases.sam | bcftools call --ploidy 1 --threads 32 -mA -Ov | bcftools csq --local-csq -c CSQ -f /tmp/7-WU-FF1.fasta -g /tmp/7-WU-FF1.gff3 -p m -n 150 | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1' 
...
7-WU-FF 1249    .   A   .   284.59  .   DP=30;ADF=30;ADR=0;AD=30;DPR=30;MQ0F=0;AN=1;DP4=30,0,0,0;MQ=60;TYPE=REF GT:DP:ADF:ADR:AD:DPR    0:30:30:0:30:30
7-WU-FF 1250    .   A   T   30.0483 .   DP=30;ADF=20,10;ADR=0,0;AD=20,10;DPR=20,10;VDB=6.99472e-07;SGB=-0.670168;RPBZ=0;MQBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=0;AN=1;DP4=20,0,10,0;MQ=60;CSQ=missense|S|43740568|protein_coding|+|91K>91M|1250A>T;AF=0;MAF=0;TYPE=SNP  GT:PL:DP:ADF:ADR:AD:DPR:VAF:VAF1    0:245,255:30:20,10:0,0:20,10:20,10:0.333333:0.333333
7-WU-FF 1251    .   G   A   225.417 .   DP=30;ADF=0,30;ADR=0,0;AD=0,30;DPR=0,30;VDB=3.84674e-18;SGB=-0.693097;MQ0F=0;AC=1;AN=1;DP4=0,0,30,0;MQ=60;CSQ=synonymous|S|43740568|protein_coding|+|91K|1251G>A;AF=1;MAF=0;TYPE=SNP    GT:PL:DP:ADF:ADR:AD:DPR:CSQ:VAF:VAF11:255,0:30:0,30:0,0:0,30:0,30:1:1:1

Haplotype-aware csq caling:

$ bcftools mpileup -a FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/DPR,INFO/AD,INFO/ADF,INFO/ADR,INFO/DPR -d 200000000 -f /tmp/7-WU-FF1.fasta /tmp/testcases.sam | bcftools call --ploidy 1 --threads 32 -mA -Ov | bcftools csq -c CSQ -f /tmp/7-WU-FF1.fasta -g /tmp/7-WU-FF1.gff3 -p m -n 150 | bcftools plugin fill-tags - -- -t 'INFO/AF,INFO/MAF,INFO/TYPE,FORMAT/VAF,FORMAT/VAF1'
...
7-WU-FF 1249    .   A   .   284.59  .   DP=30;ADF=30;ADR=0;AD=30;DPR=30;MQ0F=0;AN=1;DP4=30,0,0,0;MQ=60;TYPE=REF GT:DP:ADF:ADR:AD:DPR    0:30:30:0:30:30
7-WU-FF 1250    .   A   T   30.0483 .   DP=30;ADF=20,10;ADR=0,0;AD=20,10;DPR=20,10;VDB=6.99472e-07;SGB=-0.670168;RPBZ=0;MQBZ=0;BQBZ=0;SCBZ=0;MQ0F=0;AC=0;AN=1;DP4=20,0,10,0;MQ=60;AF=0;MAF=0;TYPE=SNP   GT:PL:DP:ADF:ADR:AD:DPR:VAF:VAF1    0:245,255:30:20,10:0,0:20,10:20,10:0.333333:0.333333
7-WU-FF 1251    .   G   A   225.417 .   DP=30;ADF=0,30;ADR=0,0;AD=0,30;DPR=0,30;VDB=3.84674e-18;SGB=-0.693097;MQ0F=0;AC=1;AN=1;DP4=0,0,30,0;MQ=60;CSQ=synonymous|S|43740568|protein_coding|+|91K|1251G>A;AF=1;MAF=0;TYPE=SNP    GT:PL:DP:ADF:ADR:AD:DPR:CSQ:VAF:VAF11:255,0:30:0,30:0,0:0,30:0,30:1:1:1

testcases.sam.txt 7-WU-FF1.gff3.txt 7-WU-FF1.fasta.txt

BTW: It would be handy to have an --offset argument to increase the nucleotide positions by certain amount, for example if one works with amplicon sequences it is helpful to obtain nt and aa positions like in the real, full-length gene and not to have always recalculate K91 into its real positions manually by adding to it say 978 (at the DNA level) to get K417 ( 91 + (978/3.0) = 417 ).

The https://github.com/virus-evolution/gofasta tool does the csq calling correctly:

$ cat /tmp/7-WU-FF.fasta > /tmp/testcases.aln
$ ./gofasta sam toMultiAlign -r /tmp/7-WU-FF1.fasta -s /tmp/testcases.sam >> /tmp/testcases.aln
$ ./gofasta variants -a /tmp/7-WU-FF.gff -r 7_WU_FF --msa /tmp/testcases.aln --aggregate --append-snps
mutation,frequency
aa:S:K417I(nuc:A1250T;nuc:G1251A),0.333333333
nuc:G1251A,0.666666667
$

testcases.aln.txt 7-WU-FF.gff.txt

Ideally, the tool(s) would also output the codon frequence in a single sweep.

samtools / bcftools

"bcftools csq" introducing reference-based bias #2094