rpetit3 / vcf-annotator

Add biological annotations to variants in a given VCF file.
MIT License
26 stars 7 forks source link

Multiple mutations in a codon #8

Closed joanmarticarreras closed 3 years ago

joanmarticarreras commented 3 years ago

Hi Robert!

First, Congrats with the project! It's pretty cool!

Currently I am using vcf-annotator in another project and I've found a small glitch. I have a codon GCG in which for roughly 95% of my reads C>T changing an A>V. Additionally, the second G (third position) G>A (30%). This change is conservative for both GCA (A) and GTA (V).

In my VCF file I only have GCG>GTG, however VCF-annotator anotates the mutated second position as if it was the third position. Here the line:

X17403  78205   .   C   T   59.694  PASS    RefCodon=GCG;AltCodon=GCA;RefAminoAcid=A;AltAminoAcid=A;CodonPosition=809;SNPCodonPosition=2;AminoAcidChange=A809A;IsSynonymous=1;IsTransition=1;IsGenic=1;IsPseudo=0;LocusTag=.;Gene=HCMVUL54;Note=DNA[space]polymerase[space](8);Inference=.;Product=.;ProteinID=CAA35413.1;Comments=Negative:G->A;VariantType=SNP;FeatureType=CDS    GT:GQ   1:59.694

It correctly marks the changed position as the 2 (SNPCodonPosition=2) and gives the right codon coordinate. However, the RefCodon and AltCodon are wrong, thus calling the SNP as IsSynonymous=1 instead of IsSynonymous=0.

In case you want to have a detailed look, I attach the genbank file and the vcf file (as txt):

annotated_round_1_hap_1_filtered.txt X17403.txt

Cheers

Joan

rpetit3 commented 3 years ago

Hi Joan!

Thanks for giving vcf-annotator a go! I'm looking into this and will update soon!

Would it be ok to post the original VCF as well?

Thanks again! Robert

joanmarticarreras commented 3 years ago

Hi Robert,

Here you go! round_1_hap_1_filtered .txt

I checked with snpEff and it gives the same result: X17403 78205 . C T 59.694 PASS ANN=T|synonymous_variant|LOW|HCMVUL54|Gene_76902_80630|transcript|CAA35413.1|protein_coding|1/1|c.2427G>A|p.Ala809Ala|2427/3729|2427/3729|809/1242||,T|upstream_gene_variant|MODIFIER|HCMVUL50|Gene_72068_73261|transcript|CAA35409.1|protein_coding||c.-4943G>A|||||4943|,T|upstream_gene_variant|MODIFIER|HCMVUL51|Gene_73283_73756|transcript|CAA35410.1|protein_coding||c.-4448G>A|||||4448|,T|downstream_gene_variant|MODIFIER|HCMVUL52|Gene_73795_75801|transcript|CAA35411.1|protein_coding||c.*2403C>T|||||2403|,T|downstream_gene_variant|MODIFIER|HCMVUL53|Gene_75794_76924|transcript|CAA35412.1|protein_coding||c.*1280C>T|||||1280| GT:GQ 1:59.694

The genbank entry does not have any event of alternative splicing or any other unusual event that might alter the annotation. Checking the position 78205 on the original Genbank entry (https://www.ncbi.nlm.nih.gov/nuccore/X17403.1?report=genbank&from=76988&to=79600) it is indeed a C. Might it be an issue with 0- 1-based numbering for the genomic coordinates?

Joan

rpetit3 commented 3 years ago

Hi Joan!

SNPCodonPosition=2 is a bit confusing, because its 0 indexed to match BioPython. (0: first, 1: second, 2: third)

Here's what I've done so far: I took the sequence for the gene in question from here: https://www.ncbi.nlm.nih.gov/nuccore/X17403.1?from=76903&to=80631&report=fasta&strand=2

and made a script (fasta-to-codon.py) to convert the sequence to codons with position info.

The full output is here (https://gist.github.com/rpetit3/fc1fdc61a37cbcebe62056ea4dc2de80#file-output), and for position 78205 I get:

codon_pos   codon   first_pos   second_pos  third_pos
809 GCG 78207   78206   78205

I take this as 78205 is in the third position of codon 809 GCG, which is a G. The gene however is on the reverse strand so the variant caller reports it as a C in the VCF, instead of G. In order to get GCG>GTG, I think a G>A mutation would have to be in position 78206

I wonder if, the variant caller reporting base in context of the chromosome (only forward), and vcf-annotator reporting in the context of the gene (forward or reverse), might be causing the confusion.

I may have missed something, so please take a moment to review what I did, but I think the annotation is correct.

Let me know what you think. Robert

joanmarticarreras commented 3 years ago

Hi Robert,

I checked codon 809 and indeed both the 2nd (index 1) and 3rd (index 2) positions both have the variants you mentioned (looking at the alignments). The VC did not report the change in the 2nd position (index 1), thus the confusion.

Thank you very much for taking the time to verify it.

Joan