Closed ijhoskins closed 5 years ago
Yeesh, that would be quite a problem.
Thanks for letting us know, I'll look into tomorrow.
Thanks @iskandr. Hopefully not too difficult a fix.
Finally looking into this but not seeing the problem.
In [9]: varcode.Variant("chr16", 1370517, "T", "A", "GRCh37")
Out[9]: Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37')
In [10]: v = _
In [11]: v.effects()
Out[11]:
<EffectCollection with 13 elements>
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-003, transcript_id=ENST00000325437, effect_description=p.C138S)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-002, transcript_id=ENST00000355803, effect_description=p.C138S)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-001, transcript_id=ENST00000397514, effect_description=p.C138S)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-004, transcript_id=ENST00000397515, effect_description=p.C138S)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-006, transcript_id=ENST00000402301, effect_description=p.W138R)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-007, transcript_id=ENST00000403747, effect_description=p.C138S)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-008, transcript_id=ENST00000406620, effect_description=p.C138S)
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-014, transcript_id=ENST00000566587, effect_description=p.C138S)
-- IncompleteTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-016, transcript_id=ENST00000567074)
-- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-010, transcript_id=ENST00000568288)
-- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-019, transcript_id=ENST00000568989)
-- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=LA16c-358B7.3-001, transcript_id=ENST00000567829)
-- NoncodingTranscript(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=LA16c-358B7.3-002, transcript_id=ENST00000568106)
Ah, I see that one of the coding transcripts has a W138R annotation.
-- Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-006, transcript_id=ENST00000402301, effect_description=p.W138R)
All the other ones have C138S
...but that's the one that gets selected as the top priority effect:
In [13]: es = v.effects()
In [14]: es.top_priority_effect()
Out[14]: Substitution(variant=Variant(contig='16', start=1370517, ref='T', alt='A', reference_name='GRCh37'), transcript_name=UBE2I-006, transcript_id=ENST00000402301, effect_description=p.W138R)
Looking at all the transcripts and their lengths:
UBE2I-003 protein_coding 2832
UBE2I-002 protein_coding 3253
UBE2I-001 protein_coding 2976
UBE2I-004 protein_coding 3109
UBE2I-006 protein_coding 1708
UBE2I-007 protein_coding 1217
UBE2I-008 protein_coding 1842
UBE2I-014 protein_coding 1098
UBE2I-016 protein_coding 884
LA16c-358B7.3-001 antisense 2926
LA16c-358B7.3-002 antisense 3107
UBE2I-010 processed_transcript 501
UBE2I-019 retained_intron 498
But adding CDS length to this, I see that UBE2I-006 does in fact have the longest CDS:
UBE2I-003 protein_coding 2832 477
UBE2I-002 protein_coding 3253 477
UBE2I-001 protein_coding 2976 477
UBE2I-004 protein_coding 3109 477
UBE2I-006 protein_coding 1708 555
UBE2I-007 protein_coding 1217 477
UBE2I-008 protein_coding 1842 477
UBE2I-014 protein_coding 1098 477
...which explains why this transcript gets chosen, since the "top priority" comes from sorting on (effect_class_priority, cds_length, transcript_length)
Thanks for taking a look @iskandr. So is it that the particular transcript (with highest priority) is spliced differently and the AA change is in fact correct? Looking at some other cases of SNPs at codons split across junctions, they appear correct. So this particular case might be due to the prioritized isoform having a different splice site than the others?
That's correct @ijhoskins -- I'm significantly expanding the transcript selection logic to avoid picking weird isoforms. Fix coming soon.
I see; apologies, I should have done more sleuthing. Glad to hear about improved selection logic!
I'm not sure if this database would help in selection: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531113/
Thanks @ijhoskins -- I'm hoping to eventually integrate APPRIS but for now I tweaked the heuristics used in the latest Varcode.
Thank you @iskandr !
I noticed a case where the reference and mutation AA is incorrect. It appears to be due to a variant in a codon split between two exons (NM_003345.4 exons 6 and 7).
hg19 test variant: chr16 1370517 T A
The AA change should be p.C138S, but varcode reports p.W138R. This appears to be due to it taking the third base of the codon from the adjacent intron (G), not the adjacent base in the transcript (C).
Based on this behavior, I would expect any variant in a codon that is split between exon boundaries to have incorrect WT and mutant AA.