Closed mmokrejs closed 7 months ago
I think bcftools csq v1.1[7,9] is terribly broken
The program works as expected. Maybe terribly off is your understanding of how the program works?
First, let's break your pipeline into its constituent steps. In variant calling, you requested --ploidy 1, thereby forcing the model to choose between A and T at position 1250. The allelic depths at that position are 20 reads in support of A and 10 reads in support of T. At position 1251 there were 0 reads in support of G and 30 in support of A. No wonder the caller chose the reference allele A at the first position and the alternate allele A at the second.
The consequence caller then made the right decision to work with AAA, the genotypes being 0,0,1 (ref,ref,alt). In both the local and the haplotype-aware mode it correctly used 1251G>A and never ATG, as you are suggesting in your post.
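As a rough illustration of the two steps described above, the sketch below picks the best-supported allele at each site and translates the resulting codon. It is a simplification under stated assumptions: a plain majority vote stands in for the caller's actual likelihood model, and the helper names and the four-entry codon table are made up for this example.

```python
# Simplified sketch: majority vote per site, then codon translation.
# This is NOT how bcftools call computes genotypes; it only mirrors the outcome.

# Only the codons discussed in this thread (subset of the standard genetic code).
CODON_TABLE = {"AAG": "K", "AAA": "K", "ATA": "I", "ATG": "M"}

REF_CODON = "AAG"    # reference codon at positions 1249-1251
CODON_START = 1249   # position of the first base of the codon

# Allelic depths quoted above: position -> {allele: number of supporting reads}
allelic_depths = {
    1250: {"A": 20, "T": 10},
    1251: {"G": 0, "A": 30},
}


def haploid_call(depths):
    """Return the allele with the most supporting reads (hypothetical helper)."""
    return max(depths, key=depths.get)


def called_codon(ref_codon, codon_start, depths_by_pos):
    """Substitute the called allele at every covered site of the codon."""
    codon = list(ref_codon)
    for pos, depths in depths_by_pos.items():
        codon[pos - codon_start] = haploid_call(depths)
    return "".join(codon)


codon = called_codon(REF_CODON, CODON_START, allelic_depths)
print(codon, CODON_TABLE[codon])  # AAA K
```

With a single haplotype the minor T allele at 1250 is never called, so the only change left is 1251G>A, which is synonymous (K to K).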
Your report is confusing and incoherent. The very example you are showing demonstrates that the output is as expected.
Maybe you'd like to use call --ploidy 2 .. | csq -p a instead; then you would see consequences called for both haplotypes.
Frequencies might be a nice addition, please open a separate feature request.
Hi, I think bcftools csq v1.1[7,9] is terribly broken in its predictions, as it infers the remaining two nucleotides (to fill up a codon) from the reference sequence instead of parsing the real nucleotides from the input reads.
Here is an example. The reference contains the AAG codon at positions 1249-1251, which encodes Lysine (K). The sample reads contain AAA (Lysine, K) and ATA (Isoleucine, I).
When bcftools csq --local-csq predicts the consequence of A1250T, it takes from the reference the assumption that the adjacent nucleotide 1251 is G, and because of that it assumes the codon in the reads is ATG (Methionine, M). That is wrong. No reads ever showed ATG at that position, and ATG is present neither in the reference sequence nor in the sample reads.
In the haplotype-aware mode, bcftools csq predicts the synonymous change AAG to AAA (when interpreting position 1251) but, when interpreting position 1250, completely misses the fact that ATA (Isoleucine, I) is present in the sample at 30%.
Local-only csq calling:
Haplotype-aware csq calling:
testcases.sam.txt 7-WU-FF1.gff3.txt 7-WU-FF1.fasta.txt
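To make the codons in this report concrete, here is a minimal sketch of what happens when a single substitution is placed on the reference codon versus when both substitutions are combined. The helper apply_snv and the reduced codon table are hypothetical and only serve this illustration; they say nothing about how bcftools csq is implemented.

```python
# Only the codons mentioned in this report (subset of the standard genetic code).
CODON_TABLE = {"AAG": "K", "AAA": "K", "ATA": "I", "ATG": "M"}

CODON_START = 1249  # the codon spans positions 1249-1251
REF_CODON = "AAG"


def apply_snv(codon, codon_start, pos, alt):
    """Return the codon with the base at genomic position `pos` replaced by `alt`."""
    i = pos - codon_start
    return codon[:i] + alt + codon[i + 1:]


# A1250T evaluated on its own, with position 1251 taken from the reference.
only_1250 = apply_snv(REF_CODON, CODON_START, 1250, "T")

# G1251A evaluated on its own.
only_1251 = apply_snv(REF_CODON, CODON_START, 1251, "A")

# Both substitutions on the same molecule.
both = apply_snv(only_1250, CODON_START, 1251, "A")

for label, codon in [("1250 only", only_1250),
                     ("1251 only", only_1251),
                     ("both", both)]:
    print(label, codon, CODON_TABLE[codon])
# 1250 only ATG M
# 1251 only AAA K
# both ATA I
```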
BTW: It would be handy to have an --offset argument to increase the nucleotide positions by a certain amount. For example, if one works with amplicon sequences, it is helpful to obtain nt and aa positions as in the real, full-length gene, and not to always have to recalculate K91 into its real position manually by adding, say, 978 (at the DNA level) to get K417 (91 + 978/3 = 417).
The https://github.com/virus-evolution/gofasta tool does the csq calling correctly:
testcases.aln.txt 7-WU-FF.gff.txt
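The position arithmetic behind the --offset request above can be sketched as follows. This is only an illustration of the conversion: the function name is made up, no such option exists in bcftools csq as far as this thread shows, and the sketch assumes the nucleotide offset is a multiple of three so that the amplicon stays in frame with the full-length gene.

```python
def full_length_aa_position(amplicon_aa_pos, nt_offset):
    """Map an amino-acid position on an amplicon to the full-length gene.

    Hypothetical helper; assumes `nt_offset` (in nucleotides) is a multiple
    of 3, i.e. the amplicon starts on a codon boundary of the gene.
    """
    if nt_offset % 3 != 0:
        raise ValueError("offset must be a multiple of 3 to preserve the reading frame")
    return amplicon_aa_pos + nt_offset // 3


# The example from the request: K91 on the amplicon, offset of 978 nt.
print(full_length_aa_position(91, 978))  # 417
```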
Ideally, the tool(s) would also output the codon frequencies in a single sweep.
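A minimal sketch of what such a codon-frequency tally could look like is given below. It is not an existing bcftools feature. The reads are a toy input reconstructed from the allelic depths quoted in this thread (20 reads with A and 10 with T at position 1250, 30 reads with A at 1251), under the assumption that every read spans the whole codon and aligns without gaps; the function name is hypothetical.

```python
from collections import Counter

CODON_START = 1249  # first base of the codon of interest

# Toy input: (alignment start position, ungapped read sequence).
# Assumed reconstruction from the quoted depths, not the real reads.
reads = [(1249, "AAA")] * 20 + [(1249, "ATA")] * 10


def codon_frequencies(reads, codon_start):
    """Count the codon observed in each read that fully covers the codon."""
    counts = Counter()
    for read_start, seq in reads:
        offset = codon_start - read_start
        # Skip reads that do not cover all three bases of the codon.
        if offset < 0 or offset + 3 > len(seq):
            continue
        counts[seq[offset:offset + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


freqs = codon_frequencies(reads, CODON_START)
for codon, freq in sorted(freqs.items()):
    print(f"{codon}: {freq:.1%}")
# AAA: 66.7%
# ATA: 33.3%
```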