Bugs when using vcf profile on lofreq output

Hi Jody

When running TBProfiler (using --vcfprofile) on data generated by lofreq, the frequencies of the variants are computed incorrectly. If there are two variants on the same positions (see example below), you compute the frequency by dividing (dp4_3 + dp4_4)/sum(DP4) or 18/19 for the first and 153/154 for the second. The DP4 tag however only counts the depth of the ref and alt allele for that variant, meaning that the dept of variant 1 is excluded in 2. In this case they would both get a frequency of >0.90, while the first variant only occurs in .10 of the reads and the second 0.88 as given in the AF field of the info tag. If you want to compute them yourself, you should divide by the DP tag instead of the sum of the DP4.

NC-000962-3-H37Rv 4039946 . T C 49314 PASS DP=172;AF=0.104651;SB=0;DP4=0,1,9,9 NC-000962-3-H37Rv 4039946 . T G 5278 PASS DP=172;AF=0.889535;SB=3;DP4=0,1,89,64

When looking at the resulting JSON file you get the following output for these two variants: genome_pos: 4039946 type: synonymous change: c.759T>C

genome_pos: 4039946 type: non_coding change: r.759t>g

I havent checked whether the first one really is a syn variant (I am also not sure whether it synonymous variants are supposed to show the nucleotide changes), but the second one is marked as non coding and is displayed as an RNA variant (the r. tag in the HGVS notation). I also didnt check whether this would have been a synonymous variant, but I am failry certain it should not be a RNA variant, so it should be p.XXX or c.XXX.

Lastly, also when using lofreq output, when there are two variants in the same codon, they are automatically grouped together and seen as one amino acid change, this however can cause problems when using lofreq output as it lists all variants. Given the two variants below, from the AF it seems clear that these two variants occur in different strains.

NC-000962-3-H37Rv 4039943 . C G 274 PASS DP=170;AF=0.105882;SB=2;DP4=88,64,9,9 NC-000962-3-H37Rv 4039945 . A C 5496 PASS DP=170;AF=0.900000;SB=3;DP4=8,9,89,64

In the JSON output they are however grouped as one amino acid change even though these are likely two different aa changes with different frequencies (see below).

type: missense change: p.Ser254Ala freq: 0.9 nucleotide_change: 4039943C>G+4039945A>C

I understand that untangling the information whether two variants in the same codon belong to the same strain is not easy as there are many edge cases involved, I wanted to make sure you were aware of the issue and was interested to know whether you could come up with a fix

I have added a minimal vcf file containing the two scenarios below. I am using a custom database for now with no resistance associated variants, but am using a watchlist which looks at gene clpC1. Both examples are from the cplC1 gene and should pop up when using a watchlist looking at that gene.

minimal_2.vcf.zip

Kind regards Lennert

jodyphelan / TBProfiler

Bugs when using vcf profile on lofreq output #168