monarch-initiative / genophenocorr

Genotype Phenotype Correlation
https://monarch-initiative.github.io/genophenocorr/stable
MIT License

Is there an off-by-one (two) error in the calculation of protein locations? #161

Closed by pnrobinson 1 month ago

pnrobinson commented 1 month ago

For instance, this annotation:

TranscriptAnnotation(gene_id:ANKRD11,
transcript_id:NM_013275.6,
hgvs_cdna:NM_013275.6:c.2412del,
is_preferred:True,
variant_effects:(<VariantEffect.FRAMESHIFT_VARIANT: 'SO:0001589'>,),
overlapping_exons:(9,),
protein_id:NP_037407.4,
protein_effect_location:Region(start=803, end=804))

contains NM_013275.6:c.2412del, for which VariantValidator reports NP_037407.4:p.(Glu805LysfsTer58), but we have

protein_effect_location:Region(start=803, end=804)

ielis commented 1 month ago

Yeah, VEP seems to be returning the AA coordinates in an unexpected fashion. I'll fix this.

ielis commented 1 month ago

So, it seems there is no bug here.

The deletion removes the last base of the 804th codon, AAA, so the codon read at that position becomes AAG. However, both AAA and AAG encode Lys, so the amino acid at position 804 is unchanged.
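The codon arithmetic above can be checked with a short sketch. It assumes that c.2412 is a 1-based position counted from the first base of the start codon (HGVS coding-sequence numbering). The helper name `locate_cds_base` and the two-entry codon table are illustrative and are not part of genophenocorr. The shifted-in base `G` is inferred from the AAG described above.

```python
from typing import Tuple


def locate_cds_base(cds_pos: int) -> Tuple[int, int]:
    """Map a 1-based CDS position to (codon number, position within codon), both 1-based."""
    if cds_pos < 1:
        raise ValueError("CDS positions are 1-based")
    codon_number = (cds_pos - 1) // 3 + 1
    base_in_codon = (cds_pos - 1) % 3 + 1
    return codon_number, base_in_codon


# Only the two codons discussed in the issue; both encode lysine
# in the standard genetic code.
CODON_TO_AA = {"AAA": "Lys", "AAG": "Lys"}

codon_number, base_in_codon = locate_cds_base(2412)
print(codon_number, base_in_codon)  # 804 3

old_codon = "AAA"
next_base = "G"  # assumed first base of codon 805, inferred from the resulting AAG

# Drop the deleted base and pull in the next base from downstream.
new_codon = old_codon[:base_in_codon - 1] + old_codon[base_in_codon:] + next_base
print(new_codon, CODON_TO_AA[old_codon] == CODON_TO_AA[new_codon])  # AAG True
```

Because 2412 is a multiple of 3, the deleted base is the third base of codon 804, which is why the first changed amino acid is at position 805.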

The frameshift then changes the amino acid one codon downstream, which is reflected in the predicted protein effect NP_037407.4:p.(Glu805LysfsTer58).

Therefore, protein_effect_location:Region(start=803, end=804) is correct: it marks the amino acid that overlaps the variant, and we should use these coordinates to plot the variant location on the protein figure.
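To relate the stored region to HGVS residue numbering, here is a minimal sketch that converts a 1-based residue number into an interval. It assumes that `Region` uses 0-based, half-open coordinates, which is consistent with the values in this issue but is not stated in the thread. `SimpleRegion` and `region_for_residue` are hypothetical stand-ins, not genophenocorr's own classes or functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleRegion:
    """Hypothetical stand-in for a 0-based, half-open protein region."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

    def __len__(self) -> int:
        return self.end - self.start

    def contains_residue(self, residue: int) -> bool:
        """True if the 1-based residue number falls inside the region."""
        return self.start < residue <= self.end


def region_for_residue(residue: int) -> SimpleRegion:
    """Build the single-residue region for a 1-based residue number."""
    if residue < 1:
        raise ValueError("Residue numbers are 1-based")
    return SimpleRegion(start=residue - 1, end=residue)


region = region_for_residue(804)
print(region)       # SimpleRegion(start=803, end=804)
print(len(region))  # 1
print(region.contains_residue(804), region.contains_residue(805))  # True False
```

Under that assumption, the region spans exactly one residue, Lys804, the codon that overlaps the deleted base, while the HGVS protein notation names residue 805 because that is the first amino acid that actually changes.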

@pnrobinson I will open a PR with the tests I used to troubleshoot this issue, and then we can close the issue.