mutalyzer / mutalyzer2

HGVS variant nomenclature checker
https://mutalyzer.nl
Other
98 stars 23 forks source link

Original variant in UCSC shows different base for SHANK3 #470

Closed mbelmadani closed 5 years ago

mbelmadani commented 5 years ago

I was trying to convert the variant NM_033517.1:c.3630dup to genomic coordinates:

https://mutalyzer.nl/name-checker?description=NM_033517.1%28SHANK3_v001%29%3Ac.3630dup

The duplication even looks like this:

CCTGCAGCGGCCAGCTGGCCTCATC - GTTGTGCACGCCACCAGCAACGGGC
CCTGCAGCGGCCAGCTGGCCTCATC C GTTGTGCACGCCACCAGCAACGG

But when I click "View original variant in UCSC Genome Browser, I get an A as the reference position. The rest of the bases prior to that position also do not match the string with the insertion mentionned above.

I also compared using TransVar and I get a duplication of A consistent with UCSC (chr22:g.51159928dupA/c.3630_3630dup/p.G1211Rfs*510)

Am I missing something obvious (or subtle) that would explain why mutalyzer believes it's an insertion of C and not A?

AngieHinrichs commented 5 years ago

Usually, RefSeq transcript sequences align perfectly to the genome, but NM_033517.1 is an exception -- some of its bases cannot be aligned to the genome. If Mutalyzer is using only genomic coordinates instead of the full alignment, then it doesn't have enough information to correctly map variants that fall after the unaligned bases.

There is a way to check for these incomplete alignment cases in the Genome Browser. Here is a Genome Browser session that shows the full alignment of NM_033517.1 to hg19:

http://genome.ucsc.edu/s/AngieHinrichs/hg19%20SHANK3

In the main RefSeq track (blue), there is no sign that anything is wrong, but there is also a subtrack (RefSeq Alignments, black) that shows the detailed sequence alignment and you can see that one of the introns has double lines -- that means that not only is it skipping genomic bases (as it would with an intron), it is skipping bases of NM_033517.1. So the alignment is incomplete, the "intron" is incorrect, and all tools that use only genomic coordinates to make transcript projections will have issues.

If I enter "NM_033517.1:c.3630dup" in the UCSC Genome Browser position/search input for hg19, it maps to chr22:51,159,891 which is a C; I believe that is the correct mapping. Unfortunately the codon numbers of the RefSeq track are derived only from genomic coordinates, so they are incorrect -- instead of 1197 it should say 1210.

mbelmadani commented 5 years ago

That makes sense. Thank you, that's incredibly helpful! Will keep that in mind in the future.