Open Peter-J-Freeman opened 1 year ago
This gives us a bit of a conundrum. We could simply go for p.= but sometimes we like to see the affected amino acids, e.g. in this issue https://github.com/openvar/variantValidator/issues/482.
In this instance, the Mutalyzer output suggests that the duplication has simply shifted things around without affecting the protein sequence. VV picks this up. So the description is not necessarily incorrect, but the question is, should we state the amino acids queried? If not, what rules govern this? We need to be clear!
NP_065184.2:p.(Val275_Ala276=)
I do not think personally that NM_020451.2(NP_065184.2):p.(Ter127*) is correct because it suggests variation where no actual variation exists.
As far as I know, the correct description is p.(Ala276_Cys277insSer), see https://databases.lovd.nl/shared/refseq/SEPN1_codingDNA.html and a recent submission in the LOVD database.
Best regards,
Johan den Dunnen
GVsharedLOVD-team
facebook.com/LeidenOpenVariationDatabase
GTG GCC TGC CTG >> GTG GCC TCC TGC CTG
Val Ala Cys Leu >> Val Ala Ser Cys Leu
275 276 277 278 >> 275 276 277 278
Hence, I agree that p.(Ala276_Cys277insSer) is correct.
@leicray. Before you submit anything, have you checked the full length of the CDS? Something is tripping up both VV and Mutalyzer. If you simply look at the local variation, a few nucleotides either side, you may miss something
What I want to do is create and translate the complete CDS with and without the duplicated sequence.
Without scanning the consequence on the complete protein, we cannot determine what the bug, if any, actually is.
I will do transcript/protein alignments.
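A minimal sketch of that with/without comparison, restricted to the four codons quoted above rather than the complete CDS (which is not reproduced in this thread). The helper names `apply_dup` and `translate` are hypothetical and this is not VV or Mutalyzer code; the window is assumed to start at c.823, the first base of codon 275.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_dup(seq, start, end, first_pos=1):
    """Duplicate c.start_end (1-based, inclusive); seq[0] sits at c.first_pos."""
    i, j = start - first_pos, end - first_pos + 1
    return seq[:j] + seq[i:j] + seq[j:]


def translate(seq):
    """Translate complete codons only; seq must start on a codon boundary."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[k:k + 3]] for k in range(0, usable, 3))


window = "GTGGCCTGCCTG"  # codons 275..278, i.e. c.823_834 (assumed offset)
mutated = apply_dup(window, 827, 829, first_pos=823)

print(translate(window))   # VACL
print(translate(mutated))  # VASCL
```

On this local window the duplication adds a Ser between Ala276 and Cys277. Whether anything else changes can only be confirmed on the full CDS, which is the point being made above.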
Thanks. I'm curious because both MT and VV predict no change in the Protein sequence
This could be tricky. I have just tried to align the transcript and protein (using exonerate) and it reports an error:
** FATAL ERROR **: Unknown amino acid [U]
exiting ...
I have checked the fasta sequence file for NP_065184.2 and it does indeed contain two U's. The gotcha is revealed in the transcript description: Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA.
There are two selenocysteines (Sec, U) in the protein, which are encoded by UGA codons that would otherwise be stop codons. These codons are at reference transcript positions 434..436 and 1439..1441, which correspond to c. positions 379..381 and 1384..1386 respectively.
Somewhat confusingly, neither instance coincides with the c.827_829dup duplication.
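The coordinate arithmetic can be checked in a few lines. This is only a sketch: the CDS start at transcript position 56 is inferred from the 434 to 379 offset quoted above rather than read from the record, and the function names are made up for illustration.

```python
CDS_START = 56  # assumption: inferred from n.434 <-> c.379 (offset of 55)


def n_to_c(n_pos, cds_start=CDS_START):
    """Convert a 1-based transcript position to a c. position (CDS only)."""
    return n_pos - cds_start + 1


def codon_number(c_pos):
    """Return the 1-based codon (amino acid) number containing c_pos."""
    return (c_pos - 1) // 3 + 1


for n_start, n_end in [(434, 436), (1439, 1441)]:
    c_start, c_end = n_to_c(n_start), n_to_c(n_end)
    print(f"n.{n_start}_{n_end} -> c.{c_start}_{c_end} -> codon {codon_number(c_start)}")

# The duplicated bases straddle two codons, neither of which is a Sec codon.
print(codon_number(827), codon_number(829))  # 276 277
```

Codons 127 and 462 are where the two U residues sit in NP_065184.2, while c.827_829 falls in codons 276 and 277.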
How do we proceed from here?
I knew there would be something fishy going on. This is a whole project in itself. How do we deal with these genes?
I currently have no time to spend on this, but I'll just drop this link here: https://github.com/openvar/variantValidator/issues/154
I have not forgotten this issue. The best tool to perform the alignment between the protein and the transcript is exonerate because it's aware of selenocysteine being encoded by UGA. However, the recently updated version of exonerate on SPECTRE is not the latest version and is not selenocysteine-aware. The SPECTRE people are working on it.
I'm not entirely sure we need to do a protein alignment. What we need is the protein sequence from vvta and the transcript sequence plus CDS start and end. In theory we might be able to make naive translations. We may need to capture additional annotation to state read-through of the stop codons. Let me check the protein sequence. It might be possible to spot this.
Ok Selenocysteine has the symbol U.
https://www.ddbj.nig.ac.jp/ddbj/code-e.html
So, what we can potentially do during all translation is:
Check for U in the protein sequence. If U is present, then set UGA to translate to U.
There will be additional tweaks, but this could well be a solution. I'll take a look and see what can be done. Ultimately, it's gonna be a guess because we won't know anything about why UGA is read through. Note the 3-letter code for U is Sec.
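A rough sketch of that idea, not the code that went into VV. It assumes a plain-Python codon table, a hypothetical `naive_translate` helper, and a made-up toy CDS. The rule that the last codon always terminates is an extra assumption added here so that a TGA used as the real stop is not read through.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def naive_translate(cds, reference_protein):
    """Translate cds, reading internal TGA as Sec (U) if the reference has a U."""
    sec_aware = "U" in reference_protein
    codons = [cds[k:k + 3] for k in range(0, len(cds) - len(cds) % 3, 3)]
    protein = []
    for index, codon in enumerate(codons):
        aa = CODON_TABLE[codon]
        is_last = index == len(codons) - 1
        if codon == "TGA" and sec_aware and not is_last:
            aa = "U"  # a guess: we only know the reference contains Sec somewhere
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_cds = "ATGAATTGATCTTAA"  # made-up example: Met Asn TGA Ser stop
print(naive_translate(toy_cds, "MNUS"))  # MNUS
print(naive_translate(toy_cds, "MNS"))   # MN
```

Because every internal TGA becomes U under this rule, a TGA newly created by a variant would be read through as well, so it really is a guess.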
https://www.ncbi.nlm.nih.gov/protein/NP_065184.2 is the MANE Select protein translation from https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.3. The reference sequence has U at the correct location.
https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000162430;r=1:25800193-25818221;t=ENST00000361547 is the MANE Select Translation of https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000162430;r=1:25800193-25818221;t=ENST00000361547.
The alignment is perfect
NP_065184.2 selenoprotein N isoform 2 [Homo sapiens]
Sequence ID: Query_14769, Length: 590, Number of Matches: 1
Range 1: 1 to 590
Alignment statistics for match #1
Score | Expect | Method | Identities | Positives | Gaps
-- | -- | -- | -- | -- | --
1211 bits(3134) | 0.0 | Compositional matrix adjust. | 590/590(100%) | 590/590(100%) | 0/590(0%)
Query 1 MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR 60
MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR
Sbjct 1 MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR 60
Query 61 QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP 120
QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP
Sbjct 61 QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP 120
Query 121 QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV 180
QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV
Sbjct 121 QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV 180
Query 181 SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF 240
SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF
Sbjct 181 SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF 240
Query 241 YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVACLTAISDFYYTVMFRIHAEFQLSE 300
YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVACLTAISDFYYTVMFRIHAEFQLSE
Sbjct 241 YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVACLTAISDFYYTVMFRIHAEFQLSE 300
Query 301 PPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDIG 360
PPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDIG
Sbjct 301 PPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDIG 360
Query 361 YIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLEV 420
YIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLEV
Sbjct 361 YIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLEV 420
Query 421 AMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPILT 480
AMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPILT
Sbjct 421 AMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPILT 480
Query 481 LLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHINA 540
LLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHINA
Sbjct 481 LLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHINA 540
Query 541 NYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP 590
NYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP
Sbjct 541 NYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP 590
So, it seems Ensembl do use the U translation of UGA
Special handling is probably only needed for sequence variants that affect UGA codons which already encode Sec (U) in selenocysteine-containing proteins. Variants that might naively be predicted to create a Sec codon (e.g., GGA > UGA) should probably be treated as creating a premature stop codon. The reason for this is that the use of UGA codons to genuinely specify Sec is quite tightly controlled.
More reading around the subject is required.
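One way to express that rule is to read TGA as Sec only at codon numbers where the reference protein already has U, and as a stop everywhere else. The sketch below uses a made-up toy CDS and a hypothetical `translate_with_sec` helper. It illustrates the rule only, is not VV's implementation, and ignores the fact that upstream indels would shift the codon numbers.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_with_sec(cds, sec_codons):
    """Translate cds; sec_codons holds 1-based codon numbers that encode Sec."""
    protein = []
    usable = len(cds) - len(cds) % 3
    for number, k in enumerate(range(0, usable, 3), start=1):
        codon = cds[k:k + 3]
        aa = CODON_TABLE[codon]
        if codon == "TGA" and number in sec_codons:
            aa = "U"  # known Sec position in the reference protein
        if aa == "*":
            break  # any other TGA (or TAA/TAG) terminates translation
        protein.append(aa)
    return "".join(protein)


reference = "ATGGGATGATCTTAA"  # toy: Met Gly Sec(TGA) Ser stop
variant = "ATGTGATGATCTTAA"    # toy: GGA>TGA at codon 2
print(translate_with_sec(reference, {3}))  # MGUS
print(translate_with_sec(variant, {3}))    # M
```

Here the new TGA at codon 2 truncates the protein, while the existing TGA at codon 3 is still read as Sec in the reference.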
OK, about to push to a new branch. VV can now read-through stop codons in these genes.
We need more examples though. This is for NM_020451.2:c.827_829dup, which now gives NP_065184.2:p.(Ser276dup)
{
"NM_020451.2:c.827_829dup": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "1",
"db_xref": {
"CCDS": "CCDS41282.1",
"ensemblgene": null,
"hgnc": "HGNC:15999",
"ncbigene": "57190",
"select": false
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "1p36.11",
"note": "selenoprotein N",
"refseq_select": false,
"variant": "2"
},
"gene_ids": {
"ccds_ids": [
"CCDS41282",
"CCDS41283"
],
"ensembl_gene_id": "ENSG00000162430",
"entrez_gene_id": "57190",
"hgnc_id": "HGNC:15999",
"omim_id": [
"606210"
],
"ucsc_id": "uc021ojk.2"
},
"gene_symbol": "SELENON",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "LRG_857t1:c.827_829dup",
"hgvs_lrg_variant": "LRG_857:g.13930_13932dup",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "LRG_857p1:p.(S276dup)",
"lrg_tlr": "LRG_857p1:p.(Ser276dup)",
"slr": "NP_065184.2:p.(S276dup)",
"tlr": "NP_065184.2:p.(Ser276dup)"
},
"hgvs_refseqgene_variant": "NG_009930.1:g.13930_13932dup",
"hgvs_transcript_variant": "NM_020451.2:c.827_829dup",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
"vcf": {
"alt": "GCCT",
"chr": "1",
"pos": "26135595",
"ref": "G"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
"vcf": {
"alt": "GCCT",
"chr": "1",
"pos": "25809104",
"ref": "G"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
"vcf": {
"alt": "GCCT",
"chr": "chr1",
"pos": "26135595",
"ref": "G"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
"vcf": {
"alt": "GCCT",
"chr": "chr1",
"pos": "25809104",
"ref": "G"
}
}
},
"reference_sequence_records": {
"lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/pending/LRG_857.xml",
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_065184.2",
"refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_009930.1",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.2"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh37",
"submitted_variant": "NM_020451.2:c.827_829dup",
"transcript_description": "Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA",
"validation_warnings": [
"A more recent version of the selected reference sequence NM_020451.2 is available (NM_020451.3): NM_020451.3:c.827_829dup MUST be fully validated prior to use in reports: select_variants=NM_020451.3:c.827_829dup",
"The current status of LRG_857 is pending therefore changes may be made to the LRG reference sequence"
],
"variant_exonic_positions": {
"NC_000001.10": {
"end_exon": "6",
"start_exon": "6"
},
"NC_000001.11": {
"end_exon": "6",
"start_exon": "6"
},
"NG_009930.1": {
"end_exon": "6",
"start_exon": "6"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.1.1.dev82+g131fae9",
"vvdb_version": "vvdb_2022_11",
"vvseqrepo_db": "VV_SR_2022_11/master",
"vvta_version": "vvta_2022_11_1"
}
}
@leicray proposed
GTG GCC TGC CTG >> GTG GCC TCC TGC CTG
Val Ala Cys Leu >> Val Ala Ser Cys Leu
275 276 277 278 >> 275 276 277 278
Hence, I agree that p.(Ala276_Cys277insSer) is correct.
So is this correct, or is NP_065184.2:p.(Ser276dup) correct? It cannot be p.(Ser276dup), since p.276 is indeed Ala, but we are hopefully heading in the right direction. More debugging to go.
ORIGIN
1 mgrarpgqrg ppspgpaaqp papprrrars lallgallaa aaaaavrvca rhaeaqaaar
61 qelalktlgt dglflfssld tdgdmyispe efkpiaeklt gscsvtqtgv qwcshsslqp
121 qlpwlnussc lsllrstpaa sceeeelppd pseetltiea rfqpllpetm tkskdgflgv
181 srlalsglrn wtaaaspsav fatrhfqpfl pppgqelgep wwiipselsm ftgylsnnrf
241 yppppkgkev iihrllsmfh prpfvktrfa pqgavaclta isdfyytvmf rihaefqlse
301 ppdfpfwfsp aqftghiils kdathvrdfr lfvpnhrsln vdmewlygas essnmevdig
361 yipqmeleat gpsvpsvild edgsmidshl psgeplqfvf eeikwqqels weeaarrlev
421 amypfkkvsy lpfteafdra kaenklvhsi llwgalddqs cugsgrtlre tvlesspilt
481 llnesfistw slvkeleelq nnqensshqk laglhlekys fpvemmiclp ngtvvhhina
541 nyflditsvk peeiesnlfs fsstfedpst atymqflkeg lrrglpllqp
It looks like VV is mishandling the ins and calling it a dup
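For reference, the ins-versus-dup decision at the protein level comes down to whether the inserted residues repeat the residues immediately before them. The sketch below is a simplified illustration with a hypothetical `describe_insertion` helper. It only handles a single internal insertion and is not VV's code. The ten-residue window is copied from positions 271 to 280 of the ORIGIN block above.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "U": "Sec",
}


def describe_insertion(ref, alt, first_pos=1):
    """Describe a single internal insertion in alt relative to ref."""
    ins_len = len(alt) - len(ref)
    if ins_len <= 0:
        raise ValueError("alt must be longer than ref")
    # The longest common prefix places the change as far 3' as possible.
    n = 0
    while n < len(ref) and ref[n] == alt[n]:
        n += 1
    if n == 0 or n == len(ref) or alt[n + ins_len:] != ref[n:]:
        raise ValueError("not a single internal insertion")
    inserted = alt[n:n + ins_len]
    # A dup only if the inserted residues copy the residues directly before them.
    if n >= ins_len and ref[n - ins_len:n] == inserted:
        start, end = first_pos + n - ins_len, first_pos + n - 1
        first = f"{THREE_LETTER[ref[n - ins_len]]}{start}"
        if ins_len == 1:
            return f"p.({first}dup)"
        return f"p.({first}_{THREE_LETTER[ref[n - 1]]}{end}dup)"
    left = f"{THREE_LETTER[ref[n - 1]]}{first_pos + n - 1}"
    right = f"{THREE_LETTER[ref[n]]}{first_pos + n}"
    ins = "".join(THREE_LETTER[aa] for aa in inserted)
    return f"p.({left}_{right}ins{ins})"


print(describe_insertion("PQGAVACLTA", "PQGAVASCLTA", first_pos=271))
# p.(Ala276_Cys277insSer)
```

The inserted Ser follows Ala276, not another Ser, so under this check it comes out as an insertion rather than a duplication.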
The translations are correct
unnamed protein product
Sequence ID: Query_13239, Length: 591, Number of Matches: 1
Range 1: 1 to 591
Alignment statistics for match #1
Score | Expect | Method | Identities | Positives | Gaps
-- | -- | -- | -- | -- | --
1207 bits(3122) | 0.0 | Compositional matrix adjust. | 590/591(99%) | 590/591(99%) | 1/591(0%)
Query 1 MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR 60
MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR
Sbjct 1 MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR 60
Query 61 QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP 120
QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP
Sbjct 61 QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP 120
Query 121 QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV 180
QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV
Sbjct 121 QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV 180
Query 181 SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF 240
SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF
Sbjct 181 SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF 240
Query 241 YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVA-CLTAISDFYYTVMFRIHAEFQLS 299
YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVA CLTAISDFYYTVMFRIHAEFQLS
Sbjct 241 YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVASCLTAISDFYYTVMFRIHAEFQLS 300
Query 300 EPPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDI 359
EPPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDI
Sbjct 301 EPPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDI 360
Query 360 GYIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLE 419
GYIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLE
Sbjct 361 GYIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLE 420
Query 420 VAMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPIL 479
VAMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPIL
Sbjct 421 VAMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPIL 480
Query 480 TLLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHIN 539
TLLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHIN
Sbjct 481 TLLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHIN 540
Query 540 ANYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP 590
ANYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP
Sbjct 541 ANYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP 591
"OK, about to push to a new branch. VV can now read-through stop codons in these genes."
What's the detail on this new behaviour? It's important that, for example, when a GGA glycine codon is changed to TGA that the variant is treated as creation of a stop codon, rather than creation of a Sec codon.
Robust handling of TGA stop codons has to be based on a full understanding of the underlying biological processes. A tandem duplication of a TGA codon that is intended to be translated to Sec will probably result in a tandem duplication of the Sec amino acid. This is because the duplication occurs in the vicinity of a Sec codon.
However, is current biological understanding of how TGA translates to Sec sufficient to make such predictions? More reading is required.
"What's the detail on this new behaviour? It's important that, for example, when a GGA glycine codon is changed to TGA that the variant is treated as creation of a stop codon, rather than creation of a Sec codon."
It's a secret until I push :)
OK enough for one day.
Now we have the correct nomenclature
{
"NM_020451.2:c.827_829dup": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "1",
"db_xref": {
"CCDS": "CCDS41282.1",
"ensemblgene": null,
"hgnc": "HGNC:15999",
"ncbigene": "57190",
"select": false
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "1p36.11",
"note": "selenoprotein N",
"refseq_select": false,
"variant": "2"
},
"gene_ids": {
"ccds_ids": [
"CCDS41282",
"CCDS41283"
],
"ensembl_gene_id": "ENSG00000162430",
"entrez_gene_id": "57190",
"hgnc_id": "HGNC:15999",
"omim_id": [
"606210"
],
"ucsc_id": "uc021ojk.2"
},
"gene_symbol": "SELENON",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "LRG_857t1:c.827_829dup",
"hgvs_lrg_variant": "LRG_857:g.13930_13932dup",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "LRG_857p1:p.(A276_C277insS)",
"lrg_tlr": "LRG_857p1:p.(Ala276_Cys277insSer)",
"slr": "NP_065184.2:p.(A276_C277insS)",
"tlr": "NP_065184.2:p.(Ala276_Cys277insSer)"
},
"hgvs_refseqgene_variant": "NG_009930.1:g.13930_13932dup",
"hgvs_transcript_variant": "NM_020451.2:c.827_829dup",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
"vcf": {
"alt": "GCCT",
"chr": "1",
"pos": "26135595",
"ref": "G"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
"vcf": {
"alt": "GCCT",
"chr": "1",
"pos": "25809104",
"ref": "G"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
"vcf": {
"alt": "GCCT",
"chr": "chr1",
"pos": "26135595",
"ref": "G"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
"vcf": {
"alt": "GCCT",
"chr": "chr1",
"pos": "25809104",
"ref": "G"
}
}
},
"reference_sequence_records": {
"lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/pending/LRG_857.xml",
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_065184.2",
"refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_009930.1",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.2"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "GRCh37",
"submitted_variant": "NM_020451.2:c.827_829dup",
"transcript_description": "Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA",
"validation_warnings": [
"A more recent version of the selected reference sequence NM_020451.2 is available (NM_020451.3): NM_020451.3:c.827_829dup MUST be fully validated prior to use in reports: select_variants=NM_020451.3:c.827_829dup",
"The current status of LRG_857 is pending therefore changes may be made to the LRG reference sequence"
],
"variant_exonic_positions": {
"NC_000001.10": {
"end_exon": "6",
"start_exon": "6"
},
"NC_000001.11": {
"end_exon": "6",
"start_exon": "6"
},
"NG_009930.1": {
"end_exon": "6",
"start_exon": "6"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.1.1.dev82+g131fae9",
"vvdb_version": "vvdb_2022_11",
"vvseqrepo_db": "VV_SR_2022_11/master",
"vvta_version": "vvta_2022_11_1"
}
}
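For anyone scripting against output like the above, here is a minimal sketch of pulling out the predicted protein consequence. It assumes the response has already been fetched and has the layout shown (per-variant records alongside `flag` and `metadata`). `protein_consequences` is a hypothetical helper name, not part of VariantValidator, and the JSON is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the response above; only the fields used here are kept.
response_text = """
{
  "NM_020451.2:c.827_829dup": {
    "hgvs_predicted_protein_consequence": {
      "slr": "NP_065184.2:p.(A276_C277insS)",
      "tlr": "NP_065184.2:p.(Ala276_Cys277insSer)"
    }
  },
  "flag": "gene_variant",
  "metadata": {"variantvalidator_version": "2.1.1.dev82+g131fae9"}
}
"""


def protein_consequences(response):
    """Map each variant key to its three-letter (tlr) protein description.

    Hypothetical helper: top-level entries such as "flag" and "metadata"
    carry no protein consequence and are skipped.
    """
    result = {}
    for key, record in response.items():
        if not isinstance(record, dict):
            continue
        consequence = record.get("hgvs_predicted_protein_consequence")
        if consequence:
            result[key] = consequence.get("tlr", "")
    return result


consequences = protein_consequences(json.loads(response_text))
print(consequences)
```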
The code has been altered to
To test
For what it's worth, there are 25 human selenoproteins:
There is an in frame CGA at
NM_020451.2:c.481_483=
{
"end_position": "483",
"error": "",
"sequence": "CGA",
"start_position": "481",
"variant": "NM_020451.2:c.481_483=",
"warning": ""
}
Change this to NM_020451.2:c.481C>T and I hope to see a substitution not a termination
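To make the test case concrete, a small sketch of what c.481C>T does to the codon reported above. It only uses the CGA at c.481_483 from that output. `apply_substitution` is a hypothetical helper written for this illustration, not VV code.

```python
def apply_substitution(codon, codon_start, position, ref, alt):
    """Return the codon after a single-base substitution.

    codon_start and position are c. coordinates (1-based).
    Hypothetical helper for illustration only.
    """
    offset = position - codon_start
    if not 0 <= offset < len(codon):
        raise ValueError("position lies outside the codon")
    if codon[offset] != ref:
        raise ValueError("reference base does not match the codon")
    return codon[:offset] + alt + codon[offset + 1:]


# c.481_483 is CGA (Arg161) in NM_020451.2, per the output above.
new_codon = apply_substitution("CGA", 481, 481, "C", "T")
print(new_codon)
```

The result is a new in-frame TGA at codon 161, which is exactly the case where the Sec versus Ter decision has to be made.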
Switched off the native hgvs translation and put in the VV code and Pif Paf Puf
NP_065184.2(LRG_857p1):p.(Arg161Sel)
Some JSON processing has been missed. Will trace and kill.
{
"NM_020451.2:c.481C>T": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "1",
"db_xref": {
"CCDS": "CCDS41282.1",
"ensemblgene": null,
"hgnc": "HGNC:15999",
"ncbigene": "57190",
"select": false
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": false,
"map": "1p36.11",
"note": "selenoprotein N",
"refseq_select": false,
"variant": "2"
},
"gene_ids": {
"ccds_ids": [
"CCDS41282",
"CCDS41283"
],
"ensembl_gene_id": "ENSG00000162430",
"entrez_gene_id": "57190",
"hgnc_id": "HGNC:15999",
"omim_id": [
"606210"
],
"ucsc_id": "uc021ojk.2"
},
"gene_symbol": "SELENON",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "LRG_857t1:c.481C>T",
"hgvs_lrg_variant": "LRG_857:g.10044C>T",
"hgvs_predicted_protein_consequence": {
"slr": "",
"tlr": "NP_065184.2(LRG_857p1):p.(Arg161Sel)"
},
"hgvs_refseqgene_variant": "NG_009930.1:g.10044C>T",
"hgvs_transcript_variant": "NM_020451.2:c.481C>T",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000001.10:g.26131710C>T",
"vcf": {
"alt": "T",
"chr": "1",
"pos": "26131710",
"ref": "C"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000001.11:g.25805219C>T",
"vcf": {
"alt": "T",
"chr": "1",
"pos": "25805219",
"ref": "C"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000001.10:g.26131710C>T",
"vcf": {
"alt": "T",
"chr": "chr1",
"pos": "26131710",
"ref": "C"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000001.11:g.25805219C>T",
"vcf": {
"alt": "T",
"chr": "chr1",
"pos": "25805219",
"ref": "C"
}
}
},
"reference_sequence_records": {
"lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/pending/LRG_857.xml",
"refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_009930.1",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.2"
},
"refseqgene_context_intronic_sequence": "",
"rna_variant_descriptions": null,
"selected_assembly": "hg19",
"submitted_variant": "NM_020451.2:c.481C>T",
"transcript_description": "Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA",
"validation_warnings": [
"A more recent version of the selected reference sequence NM_020451.2 is available (NM_020451.3): NM_020451.3:c.481C>T MUST be fully validated prior to use in reports: select_variants=NM_020451.3:c.481C>T",
"The current status of LRG_857 is pending therefore changes may be made to the LRG reference sequence"
],
"variant_exonic_positions": {
"NC_000001.10": {
"end_exon": "4",
"start_exon": "4"
},
"NC_000001.11": {
"end_exon": "4",
"start_exon": "4"
},
"NG_009930.1": {
"end_exon": "4",
"start_exon": "4"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.2.0",
"variantvalidator_version": "2.1.1.dev82+g131fae9",
"vvdb_version": "vvdb_2022_11",
"vvseqrepo_db": "VV_SR_2022_11/master",
"vvta_version": "vvta_2022_11_1"
}
}
The issue is to do with vv_hgvs. It does not like the use of p.(Arg161Sel). May need to modify it. A job for another day.
I guess another question is whether we should read-through and add U if there is a frameshift. This could be more tricky.
Resolved an issue
"hgvs_predicted_protein_consequence": {
"lrg_slr": "LRG_857p1:p.(R161U)",
"lrg_tlr": "LRG_857p1:p.(Arg161Sec)",
"slr": "NP_065184.2:p.(R161U)",
"tlr": "NP_065184.2:p.(Arg161Sec)"
},
Was using Sel not Sec
@leicray @ifokkema I am doing some debugging because I'm currently in need of a break from teaching material
For Sec genes, the code will now assume that Sec is always incorporated into the translated protein, because we have not yet decided what would break this assumption.
For now, is this enough?
I have tried to find more information about the mechanism by which Sec (U) is incorporated into selenoproteins, but have made no new progress.
What's clear is that only certain TGA codons are interpreted as encoding Sec. Such codons are in specific regions of the mRNA that interact with the factor that binds to the 3´-UTR. I have made this argument already in an earlier post to this thread.
For now, let's go with the code changes and watch for complaints or comments which might prompt a rethink.
Can we define which TGA codons are translated? I guess this will be transcript / gene specific? This is what makes this issue difficult. We need to define a "recipe" for determining whether TGA should or should not be translated. Fancy having a go at writing this out as a simple bullet list @leicray? I can then code it up. It may be that frameshifts etc. stop TGA from being translated and that the current logic is incorrect.
Will re-open since further development is needed
I could have a bash at wording it out, but it might have to be on a gene-by-gene basis. Not a trivial amount of work.
We don't need anything quickly. Perhaps do it for this one gene, for the purposes of bouncing ideas. It will be easier after that.
Can we define which TGA codons are translated? I guess this will be transcript / gene specific? This is what makes this issue difficult. We need to define a "recipe" for determining whether TGA should or should not be translated.
Do all these genes naturally terminate at either a TAG or TAA? If so, that could be the simple logic? All TGA is read-through, but TAG or TAA are still respected as a stop? If some of these genes naturally terminate at a TGA, then, well... we've got issues 😂
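A rough sketch of that "simple logic", in case it helps the discussion: TGA is read through as Sec (U), while TAA and TAG still terminate. The codon table holds only the codons needed for the example and the sequence is a made-up toy, not taken from any selenoprotein transcript. `translate` and `tga_as_sec` are hypothetical names.

```python
# Only the codons needed for the toy example; a real implementation
# would use a complete codon table.
CODON_TABLE = {
    "GTG": "V", "GCC": "A", "TGC": "C", "CTG": "L", "TCC": "S",
    "TAA": "*", "TAG": "*", "TGA": "*",
}


def translate(cds, tga_as_sec=False):
    """Translate a CDS, optionally reading TGA as Sec (U).

    TAA and TAG always terminate; TGA terminates unless tga_as_sec is set.
    """
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon == "TGA" and tga_as_sec:
            amino_acid = "U"
        else:
            amino_acid = CODON_TABLE[codon]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


toy_cds = "GTGGCCTGACTGTAA"  # made-up sequence with an internal TGA
print(translate(toy_cds))                   # VA*
print(translate(toy_cds, tga_as_sec=True))  # VAUL*
```

This would not cope with a gene that naturally terminates at a TGA, which is the concern raised above.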
Funny you should ask.
These are the notes on the first iteration we are gonna make
So for now we can say
- If a variant does not change the reading frame, all TGAs are used to encode Sec (except for stop gained, which we will treat as Ter)
- If the frame shifts, TGA encodes Ter (*)
- For SEPH, we need an additional flag to ensure we Ter at the correct TGA (CDS end), which will be defined in the transcript records
In a second iteration we can tweak the parameters for the first bullet
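The first-iteration rules can be sketched as a single decision function. This is only an illustration of the notes above, not the code in VV. `tga_meaning` and its parameter names are hypothetical, and the flag for the annotated CDS end is assumed to come from the transcript record.

```python
def tga_meaning(frameshift, stop_gained, is_annotated_cds_end=False):
    """Return "Sec" or "Ter" for a TGA codon in a selenoprotein transcript.

    frameshift: the variant changes the reading frame.
    stop_gained: the TGA was created by the variant (not in the reference).
    is_annotated_cds_end: the TGA is the CDS end recorded for the
        transcript (needed for a gene that terminates at a TGA).
    """
    if is_annotated_cds_end:
        return "Ter"
    if frameshift:
        return "Ter"
    if stop_gained:
        return "Ter"
    # In frame and already present in the reference: treat as Sec.
    return "Sec"


print(tga_meaning(frameshift=False, stop_gained=False))
print(tga_meaning(frameshift=False, stop_gained=True))
```

Keeping the three conditions as separate branches should make it easier to relax one of them in a second iteration.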
So, SEPH is the strange one :P
So this section https://github.com/openvar/variantValidator/issues/503#issuecomment-1592987524 is now incorrect
Currently
NM_020451.2:c.481C>T is producing
"hgvs_predicted_protein_consequence": {
"lrg_slr": "LRG_857p1:p.(R161U)",
"lrg_tlr": "LRG_857p1:p.(Arg161Sec)",
"slr": "NP_065184.2:p.(R161U)",
"tlr": "NP_065184.2:p.(Arg161Sec)"
},
but according to https://github.com/openvar/variantValidator/issues/503#issuecomment-1681088384
If a variant does not change the reading frame, all TGAs are used to encode Sec (except for stop gained which we will treat as Ter)
This would be a stop gained so should revert to Ter
OK, so NM_020451.2:c.481C>T is now producing
"hgvs_predicted_protein_consequence": {
"lrg_slr": "LRG_857p1:p.(R161*)",
"lrg_tlr": "LRG_857p1:p.(Arg161Ter)",
"slr": "NP_065184.2:p.(R161*)",
"tlr": "NP_065184.2:p.(Arg161Ter)"
},
Have applied a little logic. The way things currently work is that if the amino acid U is found in the reference range and U is also found in the variant amino acid sequence, then U is maintained in the alt. But if U is not found in the ref, U is turned to *.
This sets us up for some parameters which we can add in, for example if the U in the alt is > x bases from the U in the Ref, we can make it *.
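A minimal sketch of the logic as described, with the optional distance parameter added. This is not the actual VV code. `resolve_sec` and `max_shift` are hypothetical names, the distance is measured in residues within the compared ranges rather than in bases, and truncating the protein at the first * is not handled here.

```python
def resolve_sec(ref_aa, alt_aa, max_shift=None):
    """Decide whether each U in the alt sequence stays U or becomes *.

    ref_aa / alt_aa: single-letter amino acid strings for the affected range.
    max_shift: optional limit (in residues) on how far a U in the alt may
        sit from the nearest U in the ref before it is treated as *.
    """
    if "U" not in alt_aa:
        return alt_aa
    if "U" not in ref_aa:
        # No Sec in the reference range: any new U is treated as a stop.
        return alt_aa.replace("U", "*")
    if max_shift is None:
        return alt_aa
    ref_positions = [i for i, aa in enumerate(ref_aa) if aa == "U"]
    resolved = []
    for i, aa in enumerate(alt_aa):
        if aa == "U" and min(abs(i - r) for r in ref_positions) > max_shift:
            aa = "*"
        resolved.append(aa)
    return "".join(resolved)


print(resolve_sec("R", "U"))        # the c.481C>T case: no U in the ref
print(resolve_sec("AUC", "AUSC"))   # U present in both, so it is kept
```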
Do I understand correctly from your comments, that SEPH is the only selenoprotein gene that terminates at a TGA? I'm really not sure if an inserted TGA would, in reality, be likely to stop translation. My limited understanding of selenoproteins is that the recognition sequence is located in the 3' UTR, and it can affect multiple TGA codons.
I quickly checked, and found it's gene-specific.
Quote:
The data indicate that mammals evolved the ability to limit Sec insertion into natural positions within selenoproteins, but do so in a selenoprotein-specific manner, and that this process is controlled by the SECIS element in the 3′-UTR.
(source)
So some genes don't stop translation at TGA at all, while other genes are quite specific about where TGAs are recognized as a stop and where not.
Although tested in only one gene, this may also be relevant:
Quote:
Sec incorporation at high efficiency appears to require that the UGA be >21 nucleotides from the AUG-start and >204 nucleotides from the selenocysteine insertion sequence element.
(source)
The selenoprotein gene in question is SEPHS2, not SEPH, or SEPHS.
To be absolutely clear, I only considered the sequence of the stop codon for the MANE Select transcript for each gene. Some genes have multiple transcripts and some transcripts are non-coding. The stop codon for every coding transcript has yet to be checked. Although it is improbable, there might be TGA stop codons used in other transcripts.
Here is a spreadsheet collating what I have found so far: Selonoproteins.xlsx
There is a very comprehensive published analysis of the variant evidence: https://pubmed.ncbi.nlm.nih.gov/31560400. There is a massive amount of data, much of which is in the supplementary files.
Sec incorporation at high efficiency appears to require that the UGA be >21 nucleotides from the AUG-start and >204 nucleotides from the selenocysteine insertion sequence element.
This is useful. Can use this to add a setting to switch from Sec to Ter. Just need to know a little more about the selenocysteine insertion sequence element.
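As a starting point for such a setting, a sketch of the two distance thresholds from the quote. The thresholds come from the quoted paper, which tested only one gene, so they are assumptions rather than a general rule. How the distances are measured (here, between the first base of each feature in transcript coordinates) is also an assumption, and all names are hypothetical.

```python
MIN_NT_FROM_START = 21   # UGA must be more than this many nt from the AUG
MIN_NT_FROM_SECIS = 204  # and more than this many nt from the SECIS element


def sec_incorporation_likely(tga_pos, start_pos, secis_start):
    """Return True if a UGA passes both distance thresholds.

    All positions are 1-based transcript coordinates of the first base
    of the UGA, the AUG and the SECIS element respectively.
    """
    far_from_start = (tga_pos - start_pos) > MIN_NT_FROM_START
    far_from_secis = (secis_start - tga_pos) > MIN_NT_FROM_SECIS
    return far_from_start and far_from_secis


# Made-up coordinates, purely to exercise the function.
print(sec_incorporation_likely(tga_pos=100, start_pos=1, secis_start=1500))
print(sec_incorporation_likely(tga_pos=1400, start_pos=1, secis_start=1500))
```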
According to the entry on the selenocysteine insertion sequence element, they are in the UTR in eukaryotes.
In the record https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.3 I found the annotation
1431..1469
/regulatory_class="recoding_stimulatory_region"
/gene="SELENON"
/gene_synonym="CFTD; CMYP3; MDRS1; RSMD1; RSS; SELN; SEPN1"
/note="stop_codon_readthrough_signal; stop-codon redefinition element (SRE)"
/function="stimulates readthrough at the UGA codon"
So what we need to put together is a table of every transcript that uses Sec and the location of this element within it.
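One possible shape for that table, as a sketch only. The single entry is the annotation quoted above for NM_020451.3. The record labels it a stop-codon redefinition element (SRE), so whether that is the element we want to measure from is still to be confirmed. `RecodingElement`, `parse_location` and `ELEMENTS` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecodingElement:
    transcript: str
    start: int  # 1-based, inclusive, transcript coordinates
    end: int    # 1-based, inclusive
    note: str

    @property
    def length(self):
        return self.end - self.start + 1


def parse_location(location):
    """Parse a simple 'start..end' feature location into two integers."""
    start, end = location.split("..")
    return int(start), int(end)


start, end = parse_location("1431..1469")

# Keyed by versioned transcript accession; other transcripts still to be added.
ELEMENTS = {
    "NM_020451.3": RecodingElement(
        transcript="NM_020451.3",
        start=start,
        end=end,
        note="stop-codon redefinition element (SRE)",
    ),
}

print(ELEMENTS["NM_020451.3"])
```

Keying on the versioned accession matters because the coordinates may differ between transcript versions.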
There is an excellent review in Physiological Reviews (PubMed) that displays (Figure 9) the locations of the Sec residues in all 25 human selenoproteins. Sec location(s) and protein size are usefully listed.
What's clear is that Sec is often incorporated a long way from the 3´-UTR where the SECIS element resides. This suggests that the control of the insertion process is complex and is not determined simply by position within the protein sequence. The data in Figure 9 form the basis for creating a table of Sec locations. However, this ought to be done for all current coding transcripts for the 25 selenoproteins.
Describe the bug NM_020451.2:c.827_829dup > NP_065184.2:p.(Val275_Ala276=)
Expected behavior To be confirmed
This is a weird one. Mutalyzer shows 2 alternate descriptions. I have added Ivo and Raymond to discuss.
NM_020451.2(NP_065184.2):p.(Ter127*) and the equivalent description NM_020451.2(NP_065184.2):p.(=)
https://mutalyzer.nl/normalizer/NM_020451.2(NP_065184.2):p.(Ter127*)
So the description VV produces NP_065184.2:p.(Val275_Ala276=) sort of makes sense, but I agree it is not necessarily the best way to go.