Incorrect handling of multiple unknown nucleotides

leicray commented 1 year ago

Describe the bug A user has attempted to validate several delins variants where the inserted nucleotides are unknown (N), e.g.

chr17:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN for GRCh37

The HGVS syntax in incorrect and ought to be chr17:g.7578512_7578519delinsN[20] but this still does not validate correctly.

The reference sequence is correctly identified as being NC_000017.10 but an error message is generated:

NC_000017.10:g.7578512_7578519delinsN[20]: char 38: expected EOF

In this output string, position 38 is the 'N'. If chr17:g.7578512_7578519delinsN (just a single N) is submitted, then the variant description validates correctly. This suggests that the opening bracket ([) is not being parsed correctly.

Peter-J-Freeman commented 1 year ago

All we are going to be able to do with this is report whether the sybtax is correct. We also need a decent warning for the initial submission. Perhaps an auto correct and the warnings. Personally, i dislike these formats but they are what they are

leicray commented 1 year ago

Even when the syntax is correct, e.g. NC_000017.10:g.7578512_7578519delinsN[20], an error message is still generated. It's not just about syntax.

ifokkema commented 1 year ago

Actually, the HGVS nomenclature doesn't currently describe when sequences like NNN should be rewritten to N[3]. In that sense, the N[20] is a correct (likely preferred) alternative of a long sequence consisting of just Ns, but I wouldn't say using just Ns is incorrect.

Peter-J-Freeman commented 1 year ago

I dislike it even more when more than 1 format is acceptable. I think we should enforce one. We can only do so much. I think the brackets approach is more recent?

ifokkema commented 1 year ago

It's more recently described, but I'm not sure how recent it truly is since it's obviously just an extension of the existing repeat syntax. Or, I should maybe say, an application of an existing rule that is also applied in the repeat syntax? Either way, it lacks a formal definition in the HGVS nomenclature, including a lack of clarification on when NNN should become N[3]. So any enforcement on VV's end will not be according to the HGVS specs.

Peter-J-Freeman commented 1 year ago

True enough. Its not a repeat. Perhaps the repeat format could be considered incorrect in this case even

ifokkema commented 1 year ago

Sorry for the late reply... I'm not even sure what this syntax is called, as the HGVS website currently doesn't really explain the use of [ and ] outside of the use for alleles and insertions with reference sequences.

“[ ]” (square brackets) are used for alleles (see DNA, RNA, protein), which includes multiple inserted sequences at one position and insertions from a second reference sequence

(Source)

However, it's used in examples:

NC_000023.10:g.32717298_32717299insN[100] the insertion of 100 nucleotides (not specified) between position g.32717298 and g.32717299

(Source)

So, according to the examples, it's not incorrect. But here we are again, at a place where we use examples from the site to explain HGVS nomenclature, even though there is no official specification about this particular use.

Peter-J-Freeman commented 1 year ago

I'm more worried about this

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'chr17:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN'
genome_build = 'hg19'
select_transcripts = 'all'
transcript_set = 'refseq'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

"hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_001263690.1:p.(L98delinsCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD*)",
            "tlr": "NP_001263690.1:p.(Leu98delinsCysProValGlnLeuTrpValAspSerThrProProProGlyThrArgValArgAlaMetAlaIleTyrLysGlnSerGlnHisMetThrGluValValArgArgCysProHisHisGluArgCysSerAspSerAspGlyLeuAlaProProGlnHisLeuIleArgValGluGlyAsnLeuArgValGluTyrLeuAspAspArgAsnThrPheArgHisSerValValValProTyrGluProProGluValGlySerAspCysThrThrIleHisTyrAsnTyrMetCysAsnSerSerCysMetGlyGlyMetAsnArgArgProIleLeuThrIleIleThrLeuGluAspSerSerGlyAsnLeuLeuGlyArgAsnSerPheGluValArgValCysAlaCysProGlyArgAspArgArgThrGluGluGluAsnLeuArgLysLysGlyGluProHisHisGluLeuProProGlySerThrLysArgAlaLeuProAsnAsnThrSerSerSerProGlnProLysLysLysProLeuAspGlyGluTyrPheThrLeuGlnIleArgGlyArgGluArgPheGluMetPheArgGluLeuAsnGluAlaLeuGluLeuLysAspAlaGlnAlaGlyLysGluProGlyGlySerArgAlaHisSerSerHisLeuLysSerLysLysGlyGlnSerThrSerArgHisLysLysLeuMetPheLysThrGluGlyProAspSerAspTer)"
        },

???????? :P

Peter-J-Freeman commented 1 year ago

OK,

Here is the solution

{
    "NM_000546.6:c.411_418delinsNNNNNNNNNNNNNNNNNNNN": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "17",
            "db_xref": {
                "CCDS": "CCDS11118.1",
                "ensemblgene": null,
                "hgnc": "HGNC:11998",
                "ncbigene": "7157",
                "select": "MANE"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": true,
            "map": "17p13.1",
            "note": "tumor protein p53",
            "refseq_select": true,
            "variant": "1"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS73964",
                "CCDS73971",
                "CCDS11118",
                "CCDS45606",
                "CCDS73963",
                "CCDS73970",
                "CCDS73966",
                "CCDS73968",
                "CCDS73967",
                "CCDS45605",
                "CCDS73969",
                "CCDS73965"
            ],
            "ensembl_gene_id": "ENSG00000141510",
            "entrez_gene_id": "7157",
            "hgnc_id": "HGNC:11998",
            "omim_id": [
                "191170"
            ],
            "ucsc_id": "uc060aur.1"
        },
        "gene_symbol": "TP53",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_000546.6:c.411_418delinsNNNNNNNNNNNNNNNNNNNN",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000017.10:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN",
                "vcf": {
                    "alt": "NNNNNNNNNNNNNNNNNNNN",
                    "chr": "17",
                    "pos": "7578512",
                    "ref": "TCTTGGCC"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000017.11:g.7675194_7675201delinsNNNNNNNNNNNNNNNNNNNN",
                "vcf": {
                    "alt": "NNNNNNNNNNNNNNNNNNNN",
                    "chr": "17",
                    "pos": "7675194",
                    "ref": "TCTTGGCC"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000017.10:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN",
                "vcf": {
                    "alt": "NNNNNNNNNNNNNNNNNNNN",
                    "chr": "chr17",
                    "pos": "7578512",
                    "ref": "TCTTGGCC"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000017.11:g.7675194_7675201delinsNNNNNNNNNNNNNNNNNNNN",
                "vcf": {
                    "alt": "NNNNNNNNNNNNNNNNNNNN",
                    "chr": "chr17",
                    "pos": "7675194",
                    "ref": "TCTTGGCC"
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_000546.6"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": false,
        "submitted_variant": "NC_000017.10:g.7578512_7578519delinsN[20]",
        "transcript_description": "Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA",
        "validation_warnings": [
            "NC_000017.10:g.7578512_7578519delinsN[20] is better written as NC_000017.10:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN"
        ],
        "variant_exonic_positions": {
            "NC_000017.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NC_000017.11": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev37+g695bab9.d20230810",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11_1"
    }
}

So it accepts NC_000017.10:g.7578512_7578519delinsN[20] but warns

    "validation_warnings": [
        "NC_000017.10:g.7578512_7578519delinsN[20] is better written as NC_000017.10:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN"
    ],

I'm not sure I agree. Which do we think is better????

Peter-J-Freeman commented 1 year ago

Note, any sequence between ins and the [ will be treated this way, not just "N"

Peter-J-Freeman commented 1 year ago

Gonna update to "validation_warnings": [ "NC_000017.10:g.7578512_7578519delinsN[20] may also be written as NC_000017.10:g.7578512_7578519delinsNNNNNNNNNNNNNNNNNNNN" ],

ifokkema commented 1 year ago

I'm not sure I agree. Which do we think is better???? (...) Note, any sequence between ins and the [ will be treated this way, not just "N"

Hmm... the HGVS nomenclature makes no statements on this, but in general, variants should be described in the shortest way. So, I guess, insN[20] is valid. And especially when insN[20] is given, insN[20] should be returned. But again, there are no HGVS rules for this at all, it's just my opinion. Currently, LOVD's HGVS syntax-checking interface understands your message and rewrites the variant to the longer form when VV throws this error. Without validation with VV, it accepts the shorter form as correct.

Peter-J-Freeman commented 1 year ago

The issue is we would need to define a rule that applies this across the board.

Would we apply the short format only if the short format is submitted or always?
If so is the cut-off that defines which format is correct

The simplest way would be to only return the short format if that is what is submitted, but that breaks the consistency of the returns. I think this is an issue for the nomenclature committee to define the rule

I personally want the outputs to be consistent, reproducible and correct. Returning the short format just because that is what was submitted would break this

ifokkema commented 1 year ago

OK, that's a good point. I, anyway, fully agree that the HVNC should decide on this. Although our agenda is already quite full, I think it's a good idea to add it to the agenda of the next meeting so we can make a decision on this. If we won't have time for it, at least it's documented to be discussed in the future.

Peter-J-Freeman commented 1 year ago

Yes, this is something that could be closed pretty quickly. Will add to the agenda. I also want to close the discussion around so-called Genomic representations of transcripts. Its a very broken concept and it belittles what a reference sequence is. A reference. We cannot just change a reference to suit our needs

Sure, I want to be able to visit @ifokkema in person for a beer, but I can't just ignore an ocean because it doesn't suit my needs!!!!

openvar / variantValidator

Incorrect handling of multiple unknown nucleotides #504