Protein mishandled for NM_020451.2:c.827_829dup

Peter-J-Freeman commented 1 year ago

Describe the bug NM_020451.2:c.827_829dup > NP_065184.2:p.(Val275_Ala276=)

Expected behavior To be confirmed

This is a weird one. Mutalyzer shows 2 alternate descriptions. I have added Ivo and Raymond to discuss.

NM_020451.2(NP_065184.2):p.(Ter127*) and the equivalent description NM_020451.2(NP_065184.2):p.(=)

https://mutalyzer.nl/normalizer/NM_020451.2(NP_065184.2):p.(Ter127*)

So the description VV produces NP_065184.2:p.(Val275_Ala276=) sort of makes sense, but I agree it is not necessarily the best way to go.

Peter-J-Freeman commented 1 year ago

This gives us a bit of a conundrum. We could simply go for p.(=), but sometimes we like to see the affected amino acids, e.g. in this issue: https://github.com/openvar/variantValidator/issues/482.

In this instance, it looks from the Mutalyzer output as though the duplication has simply shifted things around without affecting the protein sequence. VV picks this up. So the description is not necessarily incorrect, but the question is: should we state the amino acids queried? If not, what rules govern this? We need to be clear!

NP_065184.2:p.(Val275_Ala276=)

I do not think personally that NM_020451.2(NP_065184.2):p.(Ter127*) is correct because it suggests variation where no actual variation exists.

Peter-J-Freeman commented 1 year ago

As far as I know, the correct description is p.(Ala276_Cys277insSer); see https://databases.lovd.nl/shared/refseq/SEPN1_codingDNA.html and a recent submission in the LOVD database.

Best regards,

    Johan den Dunnen
    GVsharedLOVD-team
            facebook.com/LeidenOpenVariationDatabase

leicray commented 1 year ago

GTG GCC TGC CTG >> GTG GCC TCC TGC CTG
Val Ala Cys Leu >> Val Ala Ser Cys Leu
275 276 277 278 >> 275 276 277 278

Hence, I agree that p.(Ala276_Cys277insSer) is correct.
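The codon-level reasoning above can be reproduced with a short sketch. It assumes only the four reference codons quoted in this comment (codons 275 to 278, i.e. c.823_834, since codon n spans c.3n-2 to c.3n) and the standard genetic code; `apply_dup` and `translate` are illustrative helper names, not VariantValidator functions.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string codon by codon (no stop handling)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def apply_dup(seq, first_c, start, end):
    """Duplicate c.start_end in seq, where seq[0] is position c.first_c."""
    lo, hi = start - first_c, end - first_c + 1
    return seq[:hi] + seq[lo:hi] + seq[hi:]


# Codons 275-278 as quoted in the comment above (c.823_834).
ref = "GTGGCCTGCCTG"
var = apply_dup(ref, first_c=823, start=827, end=829)

print(translate(ref))  # VACL
print(translate(var))  # VASCL
```

This only looks at the local window, so it says nothing about effects elsewhere in the CDS.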

Peter-J-Freeman commented 1 year ago

@leicray, before you submit anything, have you checked the full length of the CDS? Something is tripping up both VV and Mutalyzer. If you look only at the local variation, a few nucleotides either side, you may miss something.

What I want to do is create and translate the complete CDS with and without the duplicated sequence.

Without scanning the consequence across the complete protein, we cannot determine what the bug, if any, actually is.

leicray commented 1 year ago

I will do transcript/protein alignments.

Peter-J-Freeman commented 1 year ago

Thanks. I'm curious because both Mutalyzer and VV predict no change in the protein sequence.

leicray commented 1 year ago

This could be tricky. I have just tried to align the transcript and protein (using exonerate) and it reports an error:

** FATAL ERROR **: Unknown amino acid [U] exiting ...

I have checked the fasta sequence file for NP_065184.2 and it does indeed contain two U's. The gotcha is revealed in the transcript description Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA.

There are two selenocysteines (Sec, U) in the protein, which are encoded by UGA codons that would otherwise be stop codons. These codons are at reference transcript positions 434..436 and 1439..1441, which correspond to c. positions 379..381 and 1384..1386 respectively.

Somewhat confusingly, neither instance coincides with the c.827_829dup duplication.

How do we proceed from here?
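For reference, the coordinate arithmetic behind the positions quoted above can be sketched as follows. The CDS start of 56 is an assumption inferred from the quoted pair (transcript 434 corresponds to c.379), not a value read from the RefSeq record, and the function names are illustrative only.

```python
# Assumed CDS start in transcript coordinates, inferred from the quoted pair
# transcript 434 <-> c.379 (434 - 379 + 1 = 56); not taken from the record.
CDS_START = 56


def transcript_to_c(pos, cds_start=CDS_START):
    """Convert a 1-based transcript position to a 1-based c. position."""
    return pos - cds_start + 1


def codon_number(c_pos):
    """Return the 1-based codon (amino acid) number containing a c. position."""
    return (c_pos + 2) // 3


for tx_start in (434, 1439):
    c_start = transcript_to_c(tx_start)
    print(tx_start, c_start, codon_number(c_start))
# 434 379 127
# 1439 1384 462

print(codon_number(827), codon_number(829))  # 276 277
```

Under these assumptions the two UGA codons are codons 127 and 462, and codon 127 is the position named in the Mutalyzer description p.(Ter127*) quoted earlier, while c.827_829 falls in codons 276 and 277.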

Peter-J-Freeman commented 1 year ago

I knew there would be something fishy going on. This is a whole project in itself. How do we deal with these genes?

ifokkema commented 1 year ago

I currently have no time to spend on this, but I'll just drop this link here: https://github.com/openvar/variantValidator/issues/154

leicray commented 1 year ago

I have not forgotten this issue. The best tool to perform the alignment between the protein and the transcript is exonerate because it's aware of selenocysteine being encoded by UGA. However, the recently updated version of exonerate on SPECTRE is not the latest version and is not selenocysteine-aware. The SPECTRE people are working on it.

Peter-J-Freeman commented 1 year ago

I'm not entirely sure we need to do a protein alignment. What we need is the protein sequence from vvta and the transcript sequence plus CDS start and end. In theory, we might be able to make naive translations. We may need to capture additional annotation to state read-through of the stop codons. Let me check the protein sequence. It might be possible to spot this.

Peter-J-Freeman commented 1 year ago

Ok Selenocysteine has the symbol U.

https://www.ddbj.nig.ac.jp/ddbj/code-e.html

So, what we can potentially do during all translation is

Check for U in the protein sequence. If U, then set UGA to translate to U.

There will be additional tweaks, but this could well be a solution. I'll take a look and see what can be done. Ultimately, it's going to be a guess because we won't know anything about why UGA is read through. Note that the three-letter code for U is Sec.
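A minimal sketch of the rule proposed above, not the code that ended up in VV. It adds one assumption that goes beyond the comment: an in-frame TGA is read as U only where the reference protein has U at the same codon index, and any other stop codon ends translation. The function name and the toy sequences are hypothetical and are not SELENON data.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_sec_aware(cds, ref_protein):
    """Translate cds, reading TGA as U only where ref_protein has U.

    Assumption: the codon index in cds lines up with the residue index in
    ref_protein, which holds for the reference CDS but not necessarily
    downstream of an insertion or deletion.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE[codon]
        idx = i // 3
        if aa == "*":
            is_sec = (
                codon == "TGA"
                and idx < len(ref_protein)
                and ref_protein[idx] == "U"
            )
            if not is_sec:
                break  # genuine stop codon: end translation here
            aa = "U"
        protein.append(aa)
    return "".join(protein)


# Toy example: the second codon is TGA and the reference protein has U there.
print(translate_sec_aware("ATGTGAGGCTGA", "MUG"))  # MUG
# Same CDS, but the reference has no U at that position: TGA acts as a stop.
print(translate_sec_aware("ATGTGAGGCTGA", "MWG"))  # M
```

Tying the read-through to the reference position means a newly created TGA is still treated as a stop, at the cost of needing extra logic when an upstream indel shifts the codon index.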

Peter-J-Freeman commented 1 year ago

https://www.ncbi.nlm.nih.gov/protein/NP_065184.2 is the MANE Select protein translation from https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.3. The reference sequence has U at the correct location.

https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000162430;r=1:25800193-25818221;t=ENST00000361547 is the MANE Select Translation of https://www.ensembl.org/Homo_sapiens/Transcript/ProteinSummary?db=core;g=ENSG00000162430;r=1:25800193-25818221;t=ENST00000361547.

The alignment is perfect

NP_065184.2 selenoprotein N isoform 2 [Homo sapiens]
Sequence ID: Query_14769
Length: 590
Number of Matches: 1
Range 1: 1 to 590

Alignment statistics for match #1

Score | Expect | Method | Identities | Positives | Gaps
-- | -- | -- | -- | -- | --
1211 bits(3134) | 0.0 | Compositional matrix adjust. | 590/590(100%) | 590/590(100%) | 0/590(0%)

Query  1    MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR  60
            MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR
Sbjct  1    MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR  60

Query  61   QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP  120
            QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP
Sbjct  61   QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP  120

Query  121  QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV  180
            QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV
Sbjct  121  QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV  180

Query  181  SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF  240
            SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF
Sbjct  181  SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF  240

Query  241  YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVACLTAISDFYYTVMFRIHAEFQLSE  300
            YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVACLTAISDFYYTVMFRIHAEFQLSE
Sbjct  241  YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVACLTAISDFYYTVMFRIHAEFQLSE  300

Query  301  PPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDIG  360
            PPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDIG
Sbjct  301  PPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDIG  360

Query  361  YIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLEV  420
            YIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLEV
Sbjct  361  YIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLEV  420

Query  421  AMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPILT  480
            AMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPILT
Sbjct  421  AMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPILT  480

Query  481  LLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHINA  540
            LLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHINA
Sbjct  481  LLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHINA  540

Query  541  NYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP  590
            NYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP
Sbjct  541  NYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP  590

Peter-J-Freeman commented 1 year ago

So, it seems Ensembl do use the U translation of UGA

leicray commented 1 year ago

It is probable that special handling is only needed for sequence variants that affect UGA codons that genuinely encode Sec (U) in selenocysteine-containing proteins. Variants that might naively be predicted to create a Sec codon (e.g., GGA > UGA) will probably be treated as creating a premature stop codon, because the use of UGA codons to specify Sec is quite tightly controlled.

More reading around the subject is required.

Peter-J-Freeman commented 1 year ago

OK, about to push to a new branch. VV can now read-through stop codons in these genes.

We need more examples though. This is for

NM_020451.2:c.827_829dup which now gives NP_065184.2:p.(Ser276dup)

{
    "NM_020451.2:c.827_829dup": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS41282.1",
                "ensemblgene": null,
                "hgnc": "HGNC:15999",
                "ncbigene": "57190",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "1p36.11",
            "note": "selenoprotein N",
            "refseq_select": false,
            "variant": "2"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS41282",
                "CCDS41283"
            ],
            "ensembl_gene_id": "ENSG00000162430",
            "entrez_gene_id": "57190",
            "hgnc_id": "HGNC:15999",
            "omim_id": [
                "606210"
            ],
            "ucsc_id": "uc021ojk.2"
        },
        "gene_symbol": "SELENON",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "LRG_857t1:c.827_829dup",
        "hgvs_lrg_variant": "LRG_857:g.13930_13932dup",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_857p1:p.(S276dup)",
            "lrg_tlr": "LRG_857p1:p.(Ser276dup)",
            "slr": "NP_065184.2:p.(S276dup)",
            "tlr": "NP_065184.2:p.(Ser276dup)"
        },
        "hgvs_refseqgene_variant": "NG_009930.1:g.13930_13932dup",
        "hgvs_transcript_variant": "NM_020451.2:c.827_829dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "1",
                    "pos": "26135595",
                    "ref": "G"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "1",
                    "pos": "25809104",
                    "ref": "G"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "chr1",
                    "pos": "26135595",
                    "ref": "G"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "chr1",
                    "pos": "25809104",
                    "ref": "G"
                }
            }
        },
        "reference_sequence_records": {
            "lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/pending/LRG_857.xml",
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_065184.2",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_009930.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.2"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_020451.2:c.827_829dup",
        "transcript_description": "Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_020451.2 is available (NM_020451.3): NM_020451.3:c.827_829dup MUST be fully validated prior to use in reports: select_variants=NM_020451.3:c.827_829dup",
            "The current status of LRG_857 is pending therefore changes may be made to the LRG reference sequence"
        ],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "6",
                "start_exon": "6"
            },
            "NC_000001.11": {
                "end_exon": "6",
                "start_exon": "6"
            },
            "NG_009930.1": {
                "end_exon": "6",
                "start_exon": "6"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.1.1.dev82+g131fae9",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11_1"
    }
}

@leicray proposed

GTG GCC TGC CTG >> GTG GCC TCC TGC CTG
Val Ala Cys Leu >> Val Ala Ser Cys Leu
275 276 277 278 >> 275 276 277 278

Hence, I agree that p.(Ala276_Cys277insSer) is correct.

So is this correct, or is NP_065184.2:p.(Ser276dup) correct? The latter cannot be, since p.276 is indeed Ala, but we are hopefully heading in the right direction. More debugging to go.

ORIGIN      
        1 mgrarpgqrg ppspgpaaqp papprrrars lallgallaa aaaaavrvca rhaeaqaaar
       61 qelalktlgt dglflfssld tdgdmyispe efkpiaeklt gscsvtqtgv qwcshsslqp
      121 qlpwlnussc lsllrstpaa sceeeelppd pseetltiea rfqpllpetm tkskdgflgv
      181 srlalsglrn wtaaaspsav fatrhfqpfl pppgqelgep wwiipselsm ftgylsnnrf
      241 yppppkgkev iihrllsmfh prpfvktrfa pqgavaclta isdfyytvmf rihaefqlse
      301 ppdfpfwfsp aqftghiils kdathvrdfr lfvpnhrsln vdmewlygas essnmevdig
      361 yipqmeleat gpsvpsvild edgsmidshl psgeplqfvf eeikwqqels weeaarrlev
      421 amypfkkvsy lpfteafdra kaenklvhsi llwgalddqs cugsgrtlre tvlesspilt
      481 llnesfistw slvkeleelq nnqensshqk laglhlekys fpvemmiclp ngtvvhhina
      541 nyflditsvk peeiesnlfs fsstfedpst atymqflkeg lrrglpllqp
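The ins-versus-dup question can be checked mechanically. Below is a rough sketch that assumes a simple internal insertion and calls it a dup only when the inserted residues repeat the residues immediately before them. `describe_insertion` is a hypothetical helper, not VV's implementation, and the three-letter lookup covers only the residues used in this example.

```python
# Three-letter codes for the residues used in this example only.
THREE = {"A": "Ala", "C": "Cys", "L": "Leu", "S": "Ser", "V": "Val"}


def describe_insertion(ref, var, first_pos):
    """Describe a simple internal insertion; ref[0] is residue first_pos."""
    n = len(var) - len(ref)
    if n <= 0:
        raise ValueError("var must be longer than ref")
    # Longest common prefix, which places the change as far C-terminal as possible.
    p = 0
    while p < len(ref) and ref[p] == var[p]:
        p += 1
    inserted = var[p:p + n]
    if p == 0 or p == len(ref) or var[p + n:] != ref[p:]:
        raise ValueError("not a simple internal insertion")
    # A duplication requires the inserted residues to copy the preceding ones.
    if p >= n and ref[p - n:p] == inserted:
        start, end = first_pos + p - n, first_pos + p - 1
        if n == 1:
            return f"p.({THREE[inserted]}{end}dup)"
        return f"p.({THREE[ref[p - n]]}{start}_{THREE[ref[p - 1]]}{end}dup)"
    left, right = first_pos + p - 1, first_pos + p
    ins = "".join(THREE[a] for a in inserted)
    return f"p.({THREE[ref[p - 1]]}{left}_{THREE[ref[p]]}{right}ins{ins})"


# Residues 275-278 of the reference and the predicted variant protein.
print(describe_insertion("VACL", "VASCL", 275))  # p.(Ala276_Cys277insSer)
# For comparison, a true duplication of Ala276.
print(describe_insertion("VACL", "VAACL", 275))  # p.(Ala276dup)
```

On this logic p.(Ser276dup) would require Ser at position 276 of the reference, which the ORIGIN sequence above shows is Ala.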

Peter-J-Freeman commented 1 year ago

It looks like VV is mishandling the ins and calling it a dup

The translations are correct

unnamed protein product
Sequence ID: Query_13239
Length: 591
Number of Matches: 1
Range 1: 1 to 591

Alignment statistics for match #1

Score | Expect | Method | Identities | Positives | Gaps
-- | -- | -- | -- | -- | --
1207 bits(3122) | 0.0 | Compositional matrix adjust. | 590/591(99%) | 590/591(99%) | 1/591(0%)

Query  1    MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR  60
            MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR
Sbjct  1    MGRARPGQRGPPSPGPAAQPPAPPRRRARSLALLGALLAAAAAAAVRVCARHAEAQAAAR  60

Query  61   QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP  120
            QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP
Sbjct  61   QELALKTLGTDGLFLFSSLDTDGDMYISPEEFKPIAEKLTGSCSVTQTGVQWCSHSSLQP  120

Query  121  QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV  180
            QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV
Sbjct  121  QLPWLNUSSCLSLLRSTPAASCEEEELPPDPSEETLTIEARFQPLLPETMTKSKDGFLGV  180

Query  181  SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF  240
            SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF
Sbjct  181  SRLALSGLRNWTAAASPSAVFATRHFQPFLPPPGQELGEPWWIIPSELSMFTGYLSNNRF  240

Query  241  YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVA-CLTAISDFYYTVMFRIHAEFQLS  299
            YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVA CLTAISDFYYTVMFRIHAEFQLS
Sbjct  241  YPPPPKGKEVIIHRLLSMFHPRPFVKTRFAPQGAVASCLTAISDFYYTVMFRIHAEFQLS  300

Query  300  EPPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDI  359
            EPPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDI
Sbjct  301  EPPDFPFWFSPAQFTGHIILSKDATHVRDFRLFVPNHRSLNVDMEWLYGASESSNMEVDI  360

Query  360  GYIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLE  419
            GYIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLE
Sbjct  361  GYIPQMELEATGPSVPSVILDEDGSMIDSHLPSGEPLQFVFEEIKWQQELSWEEAARRLE  420

Query  420  VAMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPIL  479
            VAMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPIL
Sbjct  421  VAMYPFKKVSYLPFTEAFDRAKAENKLVHSILLWGALDDQSCUGSGRTLRETVLESSPIL  480

Query  480  TLLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHIN  539
            TLLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHIN
Sbjct  481  TLLNESFISTWSLVKELEELQNNQENSSHQKLAGLHLEKYSFPVEMMICLPNGTVVHHIN  540

Query  540  ANYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP  590
            ANYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP
Sbjct  541  ANYFLDITSVKPEEIESNLFSFSSTFEDPSTATYMQFLKEGLRRGLPLLQP  591

leicray commented 1 year ago

"OK, about to push to a new branch. VV can now read-through stop codons in these genes."

What's the detail on this new behaviour? It's important that, for example, when a GGA glycine codon is changed to TGA, the variant is treated as creation of a stop codon rather than creation of a Sec codon.

leicray commented 1 year ago

Robust handling of TGA stop codons has to be based on a full understanding of the underlying biological processes. A tandem duplication of a TGA codon that is intended to be translated to Sec will probably result in a tandem duplication of the Sec amino acid, because the duplication occurs in the vicinity of a Sec codon.

However, is current biological understanding of how TGA translates to Sec sufficient to make such predictions? More reading is required.

Peter-J-Freeman commented 1 year ago

What's the detail on this new behaviour? It's important that, for example, when a GGA glycine codon is changed to TGA, the variant is treated as creation of a stop codon rather than creation of a Sec codon.

It's a secret until I push :)

Peter-J-Freeman commented 1 year ago

OK enough for one day.

Now we have the correct nomenclature

{
    "NM_020451.2:c.827_829dup": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS41282.1",
                "ensemblgene": null,
                "hgnc": "HGNC:15999",
                "ncbigene": "57190",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "1p36.11",
            "note": "selenoprotein N",
            "refseq_select": false,
            "variant": "2"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS41282",
                "CCDS41283"
            ],
            "ensembl_gene_id": "ENSG00000162430",
            "entrez_gene_id": "57190",
            "hgnc_id": "HGNC:15999",
            "omim_id": [
                "606210"
            ],
            "ucsc_id": "uc021ojk.2"
        },
        "gene_symbol": "SELENON",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "LRG_857t1:c.827_829dup",
        "hgvs_lrg_variant": "LRG_857:g.13930_13932dup",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_857p1:p.(A276_C277insS)",
            "lrg_tlr": "LRG_857p1:p.(Ala276_Cys277insSer)",
            "slr": "NP_065184.2:p.(A276_C277insS)",
            "tlr": "NP_065184.2:p.(Ala276_Cys277insSer)"
        },
        "hgvs_refseqgene_variant": "NG_009930.1:g.13930_13932dup",
        "hgvs_transcript_variant": "NM_020451.2:c.827_829dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "1",
                    "pos": "26135595",
                    "ref": "G"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "1",
                    "pos": "25809104",
                    "ref": "G"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.26135596_26135598dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "chr1",
                    "pos": "26135595",
                    "ref": "G"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000001.11:g.25809105_25809107dup",
                "vcf": {
                    "alt": "GCCT",
                    "chr": "chr1",
                    "pos": "25809104",
                    "ref": "G"
                }
            }
        },
        "reference_sequence_records": {
            "lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/pending/LRG_857.xml",
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_065184.2",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_009930.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.2"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_020451.2:c.827_829dup",
        "transcript_description": "Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_020451.2 is available (NM_020451.3): NM_020451.3:c.827_829dup MUST be fully validated prior to use in reports: select_variants=NM_020451.3:c.827_829dup",
            "The current status of LRG_857 is pending therefore changes may be made to the LRG reference sequence"
        ],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "6",
                "start_exon": "6"
            },
            "NC_000001.11": {
                "end_exon": "6",
                "start_exon": "6"
            },
            "NG_009930.1": {
                "end_exon": "6",
                "start_exon": "6"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.1.1.dev82+g131fae9",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11_1"
    }
}

The code has been altered to:

Look for U in the reference protein sequence
If True, TGA is translated to "U"

To test:

Should not work for frameshifts!
Should it work for in-frame subs that create TGA?
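For discussion, here is a minimal sketch of the rule described above. It is not the VV code: the function name `translate` and the codon table construction are illustrative assumptions, and it deliberately ignores the frameshift and stop-gained questions listed under "To test".

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str, ref_protein: str) -> str:
    """Translate a CDS, reading TGA as U only if the reference protein has a U.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    sec_gene = "U" in ref_protein  # step 1: look for U in the reference protein
    protein = []
    # Walk complete codons only; trailing 1-2 bases are ignored.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[pos:pos + 3].upper()
        aa = CODON_TABLE[codon]
        if codon == "TGA" and sec_gene:
            aa = "U"  # step 2: TGA is translated to U in Sec genes
        protein.append(aa)
        if aa == "*":
            break  # TAA/TAG (and TGA in non-Sec genes) still terminate
    return "".join(protein)
```

With this naive rule every in-frame TGA in a Sec gene becomes U, including one created by a substitution, which is exactly the behaviour the "To test" items question.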

leicray commented 1 year ago

For what it's worth, there are 25 human selenoproteins:

https://www.genenames.org/data/genegroup/#!/group/890

Peter-J-Freeman commented 1 year ago

There is an in frame CGA at

NM_020451.2:c.481_483=

https://rest.variantvalidator.org/VariantValidator/tools/hgvs2reference/NM_020451.2%3Ac.481_483%3D?content-type=application%2Fjson

{
  "end_position": "483",
  "error": "",
  "sequence": "CGA",
  "start_position": "481",
  "variant": "NM_020451.2:c.481_483=",
  "warning": ""
}

Change this to NM_020451.2:c.481C>T and I hope to see a substitution not a termination

Switched off the native hgvs translation and put in the VV code and, Pif Paf Puf, we get NP_065184.2(LRG_857p1):p.(Arg161Sel). Some JSON processing has been missed. Will trace and kill.

{
    "NM_020451.2:c.481C>T": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS41282.1",
                "ensemblgene": null,
                "hgnc": "HGNC:15999",
                "ncbigene": "57190",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "1p36.11",
            "note": "selenoprotein N",
            "refseq_select": false,
            "variant": "2"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS41282",
                "CCDS41283"
            ],
            "ensembl_gene_id": "ENSG00000162430",
            "entrez_gene_id": "57190",
            "hgnc_id": "HGNC:15999",
            "omim_id": [
                "606210"
            ],
            "ucsc_id": "uc021ojk.2"
        },
        "gene_symbol": "SELENON",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "LRG_857t1:c.481C>T",
        "hgvs_lrg_variant": "LRG_857:g.10044C>T",
        "hgvs_predicted_protein_consequence": {
            "slr": "",
            "tlr": "NP_065184.2(LRG_857p1):p.(Arg161Sel)"
        },
        "hgvs_refseqgene_variant": "NG_009930.1:g.10044C>T",
        "hgvs_transcript_variant": "NM_020451.2:c.481C>T",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.26131710C>T",
                "vcf": {
                    "alt": "T",
                    "chr": "1",
                    "pos": "26131710",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000001.11:g.25805219C>T",
                "vcf": {
                    "alt": "T",
                    "chr": "1",
                    "pos": "25805219",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.26131710C>T",
                "vcf": {
                    "alt": "T",
                    "chr": "chr1",
                    "pos": "26131710",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000001.11:g.25805219C>T",
                "vcf": {
                    "alt": "T",
                    "chr": "chr1",
                    "pos": "25805219",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/pending/LRG_857.xml",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_009930.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.2"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "hg19",
        "submitted_variant": "NM_020451.2:c.481C>T",
        "transcript_description": "Homo sapiens selenoprotein N (SELENON), transcript variant 2, mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_020451.2 is available (NM_020451.3): NM_020451.3:c.481C>T MUST be fully validated prior to use in reports: select_variants=NM_020451.3:c.481C>T",
            "The current status of LRG_857 is pending therefore changes may be made to the LRG reference sequence"
        ],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "4",
                "start_exon": "4"
            },
            "NC_000001.11": {
                "end_exon": "4",
                "start_exon": "4"
            },
            "NG_009930.1": {
                "end_exon": "4",
                "start_exon": "4"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.1.1.dev82+g131fae9",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11_1"
    }
}

Peter-J-Freeman commented 1 year ago

The issue is to do with vv_hgvs. It does not like the use of p.(Arg161Sel). May need to modify it. Job for another day.

Peter-J-Freeman commented 1 year ago

I guess another question is whether we should read-through and add U if there is a frameshift. This could be more tricky.

Peter-J-Freeman commented 1 year ago

Resolved an issue

        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_857p1:p.(R161U)",
            "lrg_tlr": "LRG_857p1:p.(Arg161Sec)",
            "slr": "NP_065184.2:p.(R161U)",
            "tlr": "NP_065184.2:p.(Arg161Sec)"
        },

Was using Sel not Sec

Peter-J-Freeman commented 1 year ago

@leicray @ifokkema I am doing some debugging because I'm currently in need of a break from teaching material

For Sec genes, the code will now assume that Sec is always incorporated into the translated protein, because we have not yet decided what would break this assumption.

For now, is this enough?

leicray commented 1 year ago

I have tried to find more information about the mechanism by which Sec (U) is incorporated into selenoproteins, but have made no new progress.

What's clear is that only certain TGA codons are interpreted as encoding Sec. Such codons are in specific regions of the mRNA that interact with the factor that binds to the 3′-UTR. I have made this argument already in an earlier post to this thread.

For now, let's go with the code changes and watch for complaints or comments which might prompt a rethink.

Peter-J-Freeman commented 1 year ago

Can we define which TGA codons are translated? I guess this will be transcript / gene specific? This is what makes this issue difficult. We need to define a "recipe" for determining whether TGA should or should not be translated. Fancy having a go at wording this out in a simple bullet list, @leicray? I can then code it up. It may be that frameshifts etc. stop TGA from being translated and that the current logic is incorrect.

Peter-J-Freeman commented 1 year ago

Will re-open since further development is needed

leicray commented 1 year ago

I could have a bash at wording it out, but it might have to be on a gene-by-gene basis. Not a trivial amount of work.

Peter-J-Freeman commented 1 year ago

We don't need anything quickly. Perhaps do it for this one gene, for the purposes of bouncing ideas. It will be easier after that.

ifokkema commented 1 year ago

Can we define which TGA codons are translated? I guess this will be transcript / gene specific? This is what makes this issue difficult. We need to define a "recipe" for determining whether TGA should or should not be translated.

Do all these genes naturally terminate at either a TAG or TAA? If so, that could be the simple logic? All TGA is read-through, but TAG or TAA are still respected as a stop? If some of these genes naturally terminate at a TGA, then, well... we've got issues 😂

Peter-J-Freeman commented 1 year ago

Funny you should ask.

These are the notes on the first iteration we are gonna make

So for now we can say

If a variant does not change the reading frame, all TGAs are used to encode Sec (except for stop gained which we will treat as Ter)
If the frame shifts, TGA encodes Ter (*)
For SEPH, we need an additional flag to ensure we Ter at the correct TGA (CDS end) which will be defined in the transcript records

In a second iteration we can tweak the parameters for the first bullet

So, SEPH is the strange one :P
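The three first-iteration rules can be sketched as a single decision function. This is only an illustration under stated assumptions: `tga_meaning` and all of its parameter names are hypothetical, and the CDS-end flag stands in for whatever ends up in the transcript records.

```python
def tga_meaning(
    is_sec_transcript: bool,
    frame_shifted: bool,
    tga_in_reference: bool,
    is_cds_end: bool = False,
) -> str:
    """Return "Sec" or "Ter" for one in-frame TGA in the variant CDS.

    Hypothetical helper; the flags are assumptions, not VV fields.
    """
    if not is_sec_transcript:
        return "Ter"  # ordinary transcripts: TGA is always a stop
    if is_cds_end:
        return "Ter"  # extra flag for a transcript that terminates at TGA
    if frame_shifted:
        return "Ter"  # rule 2: after a frameshift, TGA encodes Ter
    if not tga_in_reference:
        return "Ter"  # rule 1 exception: stop gained is treated as Ter
    return "Sec"      # rule 1: in-frame reference TGA encodes Sec
```

Keeping each rule as its own early return should make it easy to tweak the first rule in a second iteration without touching the others.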

Peter-J-Freeman commented 1 year ago

So this section https://github.com/openvar/variantValidator/issues/503#issuecomment-1592987524 is now incorrect

Currently

NM_020451.2:c.481C>T is producing

        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_857p1:p.(R161U)",
            "lrg_tlr": "LRG_857p1:p.(Arg161Sec)",
            "slr": "NP_065184.2:p.(R161U)",
            "tlr": "NP_065184.2:p.(Arg161Sec)"
        },

but according to https://github.com/openvar/variantValidator/issues/503#issuecomment-1681088384

If a variant does not change the reading frame, all TGAs are used to encode Sec (except for stop gained which we will treat as Ter)

This would be a stop gained so should revert to Ter

Peter-J-Freeman commented 1 year ago

OK, so NM_020451.2:c.481C>T is now producing

        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_857p1:p.(R161*)",
            "lrg_tlr": "LRG_857p1:p.(Arg161Ter)",
            "slr": "NP_065184.2:p.(R161*)",
            "tlr": "NP_065184.2:p.(Arg161Ter)"
        },

Have applied a little logic. The way things currently work is that if the amino acid U is found in the reference range and U is found in the variant amino acid sequence, then U is maintained in the alt. But if U is not found in the ref, U is turned to *.

This sets us up for some parameters which we can add in, for example if the U in the alt is > x bases from the U in the Ref, we can make it *.
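As a rough sketch of the logic just described (not the actual VV implementation; `resolve_alt` is a hypothetical name, and the inputs are assumed to be single-letter amino acid strings for the affected range):

```python
def resolve_alt(ref_aa: str, alt_aa: str) -> str:
    """Keep U in the alt only when the reference range also contains U.

    ref_aa and alt_aa are single-letter amino acid strings covering the
    affected range. Hypothetical helper for illustration only.
    """
    if "U" in ref_aa:
        return alt_aa  # U in ref: any U in the alt is maintained
    return alt_aa.replace("U", "*")  # no U in ref: U in the alt becomes *
```

For NM_020451.2:c.481C>T the reference residue is R and the naive translation of the alt is U, so `resolve_alt("R", "U")` gives `"*"`, matching the p.(Arg161Ter) output. The distance parameter x mentioned above is not modelled here.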

ifokkema commented 1 year ago

Do I understand correctly from your comments, that SEPH is the only selenoprotein gene that terminates at a TGA? I'm really not sure if an inserted TGA would, in reality, be likely to stop translation. My limited understanding of selenoproteins is that the recognition sequence is located in the 3' UTR, and it can affect multiple TGA codons.

I quickly checked, and found it's gene-specific.

Quote:

The data indicate that mammals evolved the ability to limit Sec insertion into natural positions within selenoproteins, but do so in a selenoprotein-specific manner, and that this process is controlled by the SECIS element in the 3′-UTR.

(source)

So some genes don't stop translation at TGA at all, while other genes are quite specific about where TGAs are recognized as a stop and where not.

ifokkema commented 1 year ago

Although tested in only one gene, this may also be relevant:

Quote:

Sec incorporation at high efficiency appears to require that the UGA be >21 nucleotides from the AUG-start and >204 nucleotides from the selenocysteine insertion sequence element.

(source)

leicray commented 1 year ago

The selenoprotein gene in question is SEPHS2, not SEPH, or SEPHS.

To be absolutely clear, I only considered the sequence of the stop codon for the MANE Select transcript for each gene. Some genes have multiple transcripts and some transcripts are non-coding. The stop codon for every coding transcript has yet to be checked. Although it is improbable, there might be TGA stop codons used in other transcripts.

Here is a spreadsheet collating what I have found so far: Selonoproteins.xlsx

There is a very comprehensive published analysis of the variant evidence: https://pubmed.ncbi.nlm.nih.gov/31560400. There is a massive amount of data, much of which is in the supplementary files.

Peter-J-Freeman commented 1 year ago

Sec incorporation at high efficiency appears to require that the UGA be >21 nucleotides from the AUG-start and >204 nucleotides from the selenocysteine insertion sequence element.

This is useful. Can use this to add a setting to switch from Sec to Ter. Just need to know a little more about the selenocysteine insertion sequence element.
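A possible shape for such a setting, as a sketch only. The thresholds come from the quoted sentence; everything else is an assumption: the function name, the use of 1-based transcript coordinates, and measuring each distance from the first base of the TGA to the first base of the AUG or of the SECIS element. The quoted result was also obtained in only one gene.

```python
MIN_NT_FROM_START = 21   # UGA must be >21 nt from the AUG start (quoted figure)
MIN_NT_FROM_SECIS = 204  # UGA must be >204 nt from the SECIS element (quoted figure)


def sec_position_ok(tga_start: int, aug_start: int, secis_start: int) -> bool:
    """Check the two positional constraints for efficient Sec incorporation.

    All arguments are 1-based transcript positions of the first base of the
    TGA codon, the AUG start codon, and the SECIS element. How distances
    should really be measured is an assumption of this sketch.
    """
    far_from_start = (tga_start - aug_start) > MIN_NT_FROM_START
    far_from_secis = (secis_start - tga_start) > MIN_NT_FROM_SECIS
    return far_from_start and far_from_secis
```

A TGA that fails either check could then be switched from Sec to Ter, with both thresholds exposed as parameters so they can be tuned per gene.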

According to the reference on the selenocysteine insertion sequence element, they are in the UTR in eukaryotes.

In the record https://www.ncbi.nlm.nih.gov/nuccore/NM_020451.3 I found the annotation

1431..1469
/regulatory_class="recoding_stimulatory_region"
/gene="SELENON"
/gene_synonym="CFTD; CMYP3; MDRS1; RSMD1; RSS; SELN; SEPN1"
/note="stop_codon_readthrough_signal; stop-codon redefinition element (SRE)"
/function="stimulates readthrough at the UGA codon"

So what we need to put together is a table of every transcript that uses Sec and the location of this element within it.
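One way such a table could look, sketched as a simple lookup. Only the NM_020451.3 entry is taken from the annotation above; the `RecodingElement` type, the `SEC_ELEMENTS` name and the lookup function are hypothetical, and the coordinates are assumed to be 1-based inclusive transcript positions as in the GenBank record.

```python
from typing import NamedTuple, Optional


class RecodingElement(NamedTuple):
    """Location of a readthrough element within a transcript (1-based, inclusive)."""
    start: int
    end: int
    note: str


# Hypothetical table keyed by versioned transcript accession.
# Only this one entry comes from the thread; the rest would need curating.
SEC_ELEMENTS = {
    "NM_020451.3": RecodingElement(
        1431, 1469, "stop-codon redefinition element (SRE)"
    ),
}


def lookup_element(transcript: str) -> Optional[RecodingElement]:
    """Return the annotated element for a transcript, or None if not listed."""
    return SEC_ELEMENTS.get(transcript)
```

Keying on the versioned accession matters here, because element coordinates may differ between transcript versions such as NM_020451.2 and NM_020451.3.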

leicray commented 1 year ago

There is an excellent review in Physiological Reviews (PubMed) that displays (Figure 9) the locations of the Sec residues in all 25 human selenoproteins. Sec location(s) and protein size are usefully listed.

What's clear is that Sec is often incorporated a long way from the 3′-UTR where the SECIS element resides. This suggests that the control of the insertion process is complex and is not determined simply by position within the protein sequence. The data in Figure 9 form the basis for creating a table of Sec locations. However, this ought to be done for all current coding transcripts for the 25 selenoproteins.
