openvax / varcode

Library for manipulating genomic variants and predicting their effects
Apache License 2.0
81 stars 25 forks source link

WT and mutation AA incorrect for variant in codon split across exon boundary #236

Closed ijhoskins closed 5 years ago

ijhoskins commented 5 years ago

I noticed a case where the reference and mutation AA is incorrect. It appears to be due to a variant in a codon split between two exons (NM_003345.4 exons 6 and 7).

hg19 test variant: chr16 1370517 T A

The AA change should be p.C138S, but varcode reports p.W138R. This appears to be due to it taking the third base of the codon from the adjacent intron (G), not the adjacent base in the transcript (C).

Based on this behavior, I would expect any variant in a codon that is split between exon boundaries to have incorrect WT and mutant AA.

iskandr commented 5 years ago

Yeesh, that would be quite a problem.

Thanks for letting us know, I'll look into tomorrow.

ijhoskins commented 5 years ago

Thanks @iskandr. Hopefully not too difficult a fix.

iskandr commented 5 years ago

Finally looking into this but not seeing the problem.

In [9]: varcode.Variant("chr16", 1370517, "T", "A", "GRCh37")
Out[9]: Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37')

In [10]: v = _

In [11]: v.effects()
Out[11]:
<EffectCollection with 13 elements>
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-003, transcript_id=ENST00000325437, effect_description=p.C138S)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-002, transcript_id=ENST00000355803, effect_description=p.C138S)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-001, transcript_id=ENST00000397514, effect_description=p.C138S)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-004, transcript_id=ENST00000397515, effect_description=p.C138S)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-006, transcript_id=ENST00000402301, effect_description=p.W138R)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-007, transcript_id=ENST00000403747, effect_description=p.C138S)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-008, transcript_id=ENST00000406620, effect_description=p.C138S)
  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-014, transcript_id=ENST00000566587, effect_description=p.C138S)
  -- IncompleteTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-016, transcript_id=ENST00000567074)
  -- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-010, transcript_id=ENST00000568288)
  -- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-019, transcript_id=ENST00000568989)
  -- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=LA16c-358B7.3-001, transcript_id=ENST00000567829)
  -- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=LA16c-358B7.3-002, transcript_id=ENST00000568106)
iskandr commented 5 years ago

Ah, I see that one of the coding transcripts has a W138R annotation.

  -- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-006, transcript_id=ENST00000402301, effect_description=p.W138R)

All the other ones have C138S

iskandr commented 5 years ago

...but that's the one that gets selected as the top priority effect:

In [13]: es = v.effects()

In [14]: es.top_priority_effect()
Out[14]: Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-006, transcript_id=ENST00000402301, effect_description=p.W138R)
iskandr commented 5 years ago

Looking at all the transcripts and their lengths:

UBE2I-003 protein_coding 2832
UBE2I-002 protein_coding 3253
UBE2I-001 protein_coding 2976
UBE2I-004 protein_coding 3109
UBE2I-006 protein_coding 1708
UBE2I-007 protein_coding 1217
UBE2I-008 protein_coding 1842
UBE2I-014 protein_coding 1098
UBE2I-016 protein_coding 884
LA16c-358B7.3-001 antisense 2926
LA16c-358B7.3-002 antisense 3107
UBE2I-010 processed_transcript 501
UBE2I-019 retained_intron 498

But adding CDS length to this, I see that UBE2I-006 does in fact have the longest CDS:

UBE2I-003 protein_coding 2832 477
UBE2I-002 protein_coding 3253 477
UBE2I-001 protein_coding 2976 477
UBE2I-004 protein_coding 3109 477
UBE2I-006 protein_coding 1708 555
UBE2I-007 protein_coding 1217 477
UBE2I-008 protein_coding 1842 477
UBE2I-014 protein_coding 1098 477

...which explains why this transcript gets chosen, since the "top priority" comes from sorting on (effect_class_priority, cds_length, transcript_length)

ijhoskins commented 5 years ago

Thanks for taking a look @iskandr. So is it that the particular transcript (with highest priority) is spliced differently and the AA change is in fact correct? Looking at some other cases of SNPs at codons split across junctions, they appear correct. So this particular case might be due to the prioritized isoform having a different splice site than the others?

iskandr commented 5 years ago

That's correct @ijhoskins -- I'm significantly expanding the transcript selection logic to avoid picking weird isoforms. Fix coming soon.

ijhoskins commented 5 years ago

I see; apologies, I should have done more sleuthing. Glad to hear about improved selection logic!

ijhoskins commented 5 years ago

I'm not sure if this database would help in selection: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531113/

iskandr commented 5 years ago

Thanks @ijhoskins -- I'm hoping to eventually integrate APPRIS but for now I tweaked the heuristics used in the latest Varcode.

ijhoskins commented 5 years ago

Thank you @iskandr !