openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Improved error messages for ins and delins variants #359

Open leicray opened 2 years ago

leicray commented 2 years ago

Describe the bug: Error messages relating to ins and delins variants are not as clear as they might be.

To Reproduce Steps to reproduce the behaviour:

  1. Validate the variant description NM_001371623.1:c.483ins
  2. The error message is "NM_001371623.1:c.483ins: char 24: end of input"
  3. Validate the variant description NM_001371623.1:c.483delins
  4. The error message is "NM_001371623.1:c.483delins: char 28: expected EOF"

Expected behaviour: A better error message for the first variant description would point out that there are two separate errors. The inserted bases need to be reported, and the insertion point needs to be specified as lying between two adjacent bases.

For the second variant, the error message needs to point out that the inserted bases need to be reported.
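
A minimal Python sketch of the two checks described above, not VariantValidator's actual code. The function name `check_insertion` and the regular expression are assumptions made for illustration, and the message texts are borrowed from the LOVD output quoted later in this thread. The sketch only checks that a position range is given, not that the two positions are adjacent.

```python
import re

# Message texts copied from the LOVD output quoted later in this thread.
POSITION_MISSING = (
    "An insertion must be provided with the two positions between which "
    "the insertion has taken place."
)
SUFFIX_MISSING = (
    "The inserted sequence must be provided for insertions or deletion-insertions."
)

# Hypothetical pattern: "<positions><ins|delins><suffix>" after the "c." prefix.
_VARIANT = re.compile(r"^(?P<pos>[^a-z]+?)(?P<type>delins|ins)(?P<suffix>.*)$")


def check_insertion(description: str) -> list[str]:
    """Return the problems found in an ins/delins description (empty if none)."""
    _, _, variant = description.partition(":c.")
    match = _VARIANT.match(variant)
    if match is None:
        return []  # not an ins/delins description; out of scope for this sketch
    problems = []
    # An insertion needs a range such as 483_484; a delins may have one position.
    if match["type"] == "ins" and "_" not in match["pos"]:
        problems.append(POSITION_MISSING)
    # Both variant types need something after "ins".
    if not match["suffix"]:
        problems.append(SUFFIX_MISSING)
    return problems


print(check_insertion("NM_001371623.1:c.483ins"))
print(check_insertion("NM_001371623.1:c.483delins"))
```

Returning a list rather than raising on the first problem is what allows both errors to be reported for c.483ins.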

ifokkema commented 2 years ago

For your reference, LOVD in this case reports:

NM_001371623.1:c.483ins

{
  "errors": {
    "EPOSITIONMISSING": "An insertion must be provided with the two positions between which the insertion has taken place.",
    "ESUFFIXMISSING": "The inserted sequence must be provided for insertions or deletion-insertions."
  }
}

NM_001371623.1:c.483delins

{
  "errors": {
    "ESUFFIXMISSING": "The inserted sequence must be provided for insertions or deletion-insertions."
  }
}
Peter-J-Freeman commented 2 years ago

@ifokkema, thanks. I was just about to ask what LOVD does with these. If you can pass on the outputs then I will make the changes.

Thanks

ifokkema commented 2 years ago

@ifokkema, thanks. I was just about to ask what LOVD does with these. If you can pass on the outputs then I will make the changes.

Sure! What other outputs do you need?

Peter-J-Freeman commented 2 years ago

Ah, my mistake, I didn't notice there were only a couple of variant descriptions here. Do you have any other similar inputs and outputs that relate to this topic in your test set????

ifokkema commented 2 years ago

Sure, we have lots, depends on what you would consider a similarity. Insertions and deletion-insertions are the only variants where suffixes are required, so they are handled separately from all other variant types. Also, the suffixes can be much more complex. We recognize:

Positions or lengths are also further checked for order and uniqueness. If not correct;

Peter-J-Freeman commented 2 years ago

Thanks. I'll start working on these.

Issue 1 sorted NM_001371623.1:c.483ins

        "validation_warnings": [
            "The inserted sequence must be provided for insertions or deletion-insertions",
            "An insertion must be provided with the two positions between which the insertion has taken place."
        ],
Peter-J-Freeman commented 2 years ago

Hi @ifokkema For clarification, what is meant by "N[10]"???

ifokkema commented 2 years ago

N, as the IUPAC code for an unknown nucleotide, and [10], as a stretch of 10 of those. So, as c.1_2insAAAAAAAAAA can also be written as c.1_2insA[10], insertion of 10 unknown nucleotides can be written both as c.1_2insNNNNNNNNNN and as c.1_2insN[10].
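
A minimal Python sketch of the equivalence described above, assuming a repeat is written as a run of bases followed by a count in square brackets. The helper name `expand_repeats` is hypothetical and is not taken from LOVD or VariantValidator.

```python
import re

# A run of IUPAC bases (restricted here to A, C, G, T, N) followed by "[count]".
_REPEAT = re.compile(r"([ACGTN]+)\[(\d+)\]")


def expand_repeats(sequence: str) -> str:
    """Replace every UNIT[count] with UNIT repeated count times."""
    return _REPEAT.sub(lambda m: m.group(1) * int(m.group(2)), sequence)


print(expand_repeats("A[10]"))  # ten As
print(expand_repeats("N[10]"))  # ten Ns
```

Anything that does not match the pattern, such as a plain sequence, is returned unchanged.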

Peter-J-Freeman commented 2 years ago

Perfect, thanks for the clarification. I will look up the correct page on the HGVS website.

ifokkema commented 2 years ago

Sure! Let me know if you can't find it!

leicray commented 2 years ago

It's never easy to find specific bits of information on the website. Each of the rules needs to be individually numbered in a hierarchical fashion and update histories for each individual rule need to be accessible in some simple manner.

I think that I suggested this some time ago, but never received any feedback on the idea. Then again, I cannot now remember the forum/medium through which I made the suggestion.

ifokkema commented 2 years ago

It may have been in our meeting last February. I did mention it to Johan later, but I doubt it will be changed. I'm in favor - let's see if my application for the HVNC membership will be accepted :wink:

Peter-J-Freeman commented 2 years ago

N, as the IUPAC code for an unknown nucleotide, and [10], as a stretch of 10 of those. So, as c.1_2insAAAAAAAAAA can also be written as c.1_2insA[10], insertion of 10 unknown nucleotides can be written both as c.1_2insNNNNNNNNNN and as c.1_2insN[10].

This is really unhelpful in terms of consistent reporting because we have set up a scenario where 2 different descriptions have the same meaning. I really dislike it when this happens. I think VV will accept c.1_2insA[10] but will convert it to c.1_2insAAAAAAAAAA and warn that it is better written in full

Thanks for bringing this up

Peter-J-Freeman commented 2 years ago

Also what about c.1_2insGATC[10]? Is this also correct???? For inserted repeats????

ifokkema commented 2 years ago

I can't find an example of it on the varnomen site, but as far as I know, that is allowed syntax. LOVD also recognizes it as valid syntax.

ifokkema commented 2 years ago

This is really unhelpful in terms of consistent reporting because we have set up a scenario where 2 different descriptions have the same meaning. I really dislike it when this happens. I think VV will accept c.1_2insA[10] but will convert it to c.1_2insAAAAAAAAAA and warn that it is better written in full

You're of course completely right. Perhaps there should be a rule for when a sequence should be shortened. Note also that, e.g., c.1_2insAAAAAAAAAAT could be rewritten as c.1_2ins[A[10];T]. So it quickly gets more complex. LOVD handles this by splitting the inserted sequence (everything between the square brackets), exploding it on semi-colon, and then checking the syntax of each part.
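
The split-and-check approach described above can be sketched in Python as follows. This is an illustration under stated assumptions, not LOVD's code: the name `expand_inserted` is hypothetical, and each part is assumed to be either a plain sequence or a sequence followed by a repeat count.

```python
import re

# One part: a run of bases, optionally followed by "[count]".
_PART = re.compile(r"^([ACGTN]+)(?:\[(\d+)\])?$")


def expand_inserted(suffix: str) -> str:
    """Expand an inserted-sequence suffix such as "[A[10];T]" to plain bases."""
    # Take everything between the outer square brackets and split on ";".
    if suffix.startswith("[") and suffix.endswith("]"):
        parts = suffix[1:-1].split(";")
    else:
        parts = [suffix]
    expanded = []
    for part in parts:
        match = _PART.match(part)
        if match is None:
            raise ValueError(f"unsupported part: {part!r}")
        unit, count = match.group(1), match.group(2)
        expanded.append(unit * int(count) if count else unit)
    return "".join(expanded)


print(expand_inserted("[A[10];T]"))
```

Parts that carry uncertainty, or positions instead of bases, are rejected here with a `ValueError` rather than guessed at.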

Peter-J-Freeman commented 2 years ago

Oh wow. What a total nightmare.

So far I have handled a simple case and made the following

{
    "NM_001371623.1:c.483_484insAAAAAAAAAA": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "5",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": null,
                "hgnc": "HGNC:11654",
                "ncbigene": "6949",
                "select": "MANE"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": true,
            "map": "5q32-q33.1",
            "note": "treacle ribosome biogenesis factor 1",
            "refseq_select": true,
            "variant": "8"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS47306",
                "CCDS4306",
                "CCDS47307",
                "CCDS54936",
                "CCDS47305"
            ],
            "ensembl_gene_id": "ENSG00000070814",
            "entrez_gene_id": "6949",
            "hgnc_id": "HGNC:11654",
            "omim_id": [
                "606847"
            ],
            "ucsc_id": "uc003lry.4"
        },
        "gene_symbol": "TCOF1",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_001358552.1:p.(E162Kfs*16)",
            "tlr": "NP_001358552.1:p.(Glu162LysfsTer16)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAA",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAA",
                "vcf": {
                    "alt": "CAAAAAAAAAA",
                    "chr": "5",
                    "pos": "149748382",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAA",
                "vcf": {
                    "alt": "CAAAAAAAAAA",
                    "chr": "5",
                    "pos": "150368819",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAA",
                "vcf": {
                    "alt": "CAAAAAAAAAA",
                    "chr": "chr5",
                    "pos": "149748382",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAA",
                "vcf": {
                    "alt": "CAAAAAAAAAA",
                    "chr": "chr5",
                    "pos": "150368819",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001358552.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001371623.1"
        },
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": false,
        "submitted_variant": "NM_001371623.1:c.483_484insA[10]",
        "transcript_description": "Homo sapiens treacle ribosome biogenesis factor 1 (TCOF1), transcript variant 8, mRNA",
        "validation_warnings": [
            "NM_001371623.1:c.483_484insA[10] is better written as NM_001371623.1:c.483_484insAAAAAAAAAA",
            "RefSeqGene record not available"
        ],
        "variant_exonic_positions": {
            "NC_000005.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NC_000005.9": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.1.dev2+g58fc52a",
        "variantvalidator_version": "1.0.5.dev228+gee3fee4.d20211116",
        "vvdb_version": "vvdb_2021_4",
        "vvseqrepo_db": "VV_SR_2021_2/master",
        "vvta_version": "vvta_2021_2"
    }
}

Note, the code will also handle variants like GATC[10]

Now to look for the ; versions, e.g. c.1_2ins[A[10];T]. Thanks for the info.

Peter-J-Freeman commented 2 years ago

OK, will now do the following too. Should be agnostic to the content of the insertion

{
    "NM_001371623.1:c.483_484insAAAAAAAAAAT": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "5",
            "db_xref": {
                "CCDS": null,
                "ensemblgene": null,
                "hgnc": "HGNC:11654",
                "ncbigene": "6949",
                "select": "MANE"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": true,
            "map": "5q32-q33.1",
            "note": "treacle ribosome biogenesis factor 1",
            "refseq_select": true,
            "variant": "8"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS47306",
                "CCDS4306",
                "CCDS47307",
                "CCDS54936",
                "CCDS47305"
            ],
            "ensembl_gene_id": "ENSG00000070814",
            "entrez_gene_id": "6949",
            "hgnc_id": "HGNC:11654",
            "omim_id": [
                "606847"
            ],
            "ucsc_id": "uc003lry.4"
        },
        "gene_symbol": "TCOF1",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_001358552.1:p.(E162Kfs*61)",
            "tlr": "NP_001358552.1:p.(Glu162LysfsTer61)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAAT",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAAT",
                "vcf": {
                    "alt": "AAAAAAAAAAAT",
                    "chr": "5",
                    "pos": "149748383",
                    "ref": "A"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAAT",
                "vcf": {
                    "alt": "AAAAAAAAAAAT",
                    "chr": "5",
                    "pos": "150368820",
                    "ref": "A"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAAT",
                "vcf": {
                    "alt": "AAAAAAAAAAAT",
                    "chr": "chr5",
                    "pos": "149748383",
                    "ref": "A"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAAT",
                "vcf": {
                    "alt": "AAAAAAAAAAAT",
                    "chr": "chr5",
                    "pos": "150368820",
                    "ref": "A"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001358552.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001371623.1"
        },
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": false,
        "submitted_variant": "NM_001371623.1:c.483_484ins[A[10];T]",  # I assume this is a correct description Ivo????
        "transcript_description": "Homo sapiens treacle ribosome biogenesis factor 1 (TCOF1), transcript variant 8, mRNA",
        "validation_warnings": [
            "NM_001371623.1:c.483_484ins[A[10];T] is better written as NM_001371623.1:c.483_484insAAAAAAAAAAT",
            "RefSeqGene record not available"
        ],
        "variant_exonic_positions": {
            "NC_000005.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NC_000005.9": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.1.dev2+g58fc52a",
        "variantvalidator_version": "1.0.5.dev228+gee3fee4.d20211116",
        "vvdb_version": "vvdb_2021_4",
        "vvseqrepo_db": "VV_SR_2021_2/master",
        "vvta_version": "vvta_2021_2"
    }
}
Peter-J-Freeman commented 2 years ago

Pulling this section down so I can see what I still need to do:

  - ins(10) (length, still not OK)
    'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10)" to "N[10]".'
  - ins(10_20) (length, range, not OK)
    'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]".'
  - insN[10] (length, OK)
  - insN[(10_20)] (length, range, OK)
  - ins100_200 (positions, OK)
  - ins[...] (contents split on ; and checked for all of the above including positions prefixed by refseqs)

Positions or lengths are also further checked for order and uniqueness. If not correct:

  - The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(30_30)" to "N[30]".
  - The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(50_30)" to "N[(30_50)]".
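
The rewrites quoted above follow a simple rule that can be sketched in Python. The function name `suggest_length_suffix` is an assumption made for illustration; the behaviour is inferred only from the four example messages and is not LOVD's actual implementation.

```python
import re
from typing import Optional

# A bare length "(n)" or length range "(n_m)" without the leading N[...].
_LENGTH = re.compile(r"^\((\d+)(?:_(\d+))?\)$")


def suggest_length_suffix(suffix: str) -> Optional[str]:
    """Return the suggested rewrite for "(n)" or "(n_m)", or None if not applicable."""
    match = _LENGTH.match(suffix)
    if match is None:
        return None
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    # Order check: the smaller value goes first.
    low, high = min(low, high), max(low, high)
    # Uniqueness check: a range of identical values is a single length.
    if low == high:
        return f"N[{low}]"
    return f"N[({low}_{high})]"


for example in ["(10)", "(10_20)", "(30_30)", "(50_30)"]:
    print(example, "->", suggest_length_suffix(example))
```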

Peter-J-Freeman commented 2 years ago

@ifokkema , why the parentheses in "The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]"? Surely N[10_20] would be cleaner????? Can you please confirm this is correct

I'm guessing that it's because the range is uncertain?

Also, I don't understand this example ins100_200 (positions, OK). Again, should this not be N[(10_20)] or N[10_20]?

Peter-J-Freeman commented 2 years ago

OK assuming @ifokkema confirms the above syntaxes are correct, which I believe they are, here is another test set completed

{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.1.dev2+g58fc52a",
        "variantvalidator_version": "1.0.5.dev228+gee3fee4.d20211116",
        "vvdb_version": "vvdb_2021_4",
        "vvseqrepo_db": "VV_SR_2021_2/master",
        "vvta_version": "vvta_2021_2"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_001371623.1:c.483_484ins[(10_20)]",
        "transcript_description": "",
        "validation_warnings": [
            "The variant description is syntactically correct but no further validation is possible because the description contains uncertainty"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_2": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_001371623.1:c.483ins[(10_20)]",
        "transcript_description": "",
        "validation_warnings": [
            "An insertion must be provided with the two positions between which the insertion has taken place"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_3": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_001371623.1:c.483ins[(20_20)]",
        "transcript_description": "",
        "validation_warnings": [
            "The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_20) to N[(20)]",
            "An insertion must be provided with the two positions between which the insertion has taken place"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_4": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_001371623.1:c.483ins[(20_10)]",
        "transcript_description": "",
        "validation_warnings": [
            "The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_10) to N[(10_20)]",
            "An insertion must be provided with the two positions between which the insertion has taken place"
        ],
        "variant_exonic_positions": null
    }
}
ifokkema commented 2 years ago

Oh wow. What a total nightmare.

So far I have handled a simple case and made the following

(cut the JSON down to the relevant bits)

{
    "NM_001371623.1:c.483_484insAAAAAAAAAA": {
        "hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAA",
        "submitted_variant": "NM_001371623.1:c.483_484insA[10]",
        "validation_warnings": [
            "NM_001371623.1:c.483_484insA[10] is better written as NM_001371623.1:c.483_484insAAAAAAAAAA"
        ]
    }
}

Nice! Will you put a cutoff at a certain point? Like, from where would the A[...] syntax be preferred? Or just never?

Note, the code will also handle variants like GATC[10]

Excellent!

OK, will now do the following too. Should be agnostic to the content of the insertion

(cut the JSON down to the relevant bits)

{
    "NM_001371623.1:c.483_484insAAAAAAAAAAT": {
        "hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAAT",
        "submitted_variant": "NM_001371623.1:c.483_484ins[A[10];T]",  # I assume this is a correct description Ivo????
        "validation_warnings": [
            "NM_001371623.1:c.483_484ins[A[10];T] is better written as NM_001371623.1:c.483_484insAAAAAAAAAAT"
        ]
    }
}

Nice! Yes, NM_001371623.1:c.483_484ins[A[10];T] is valid syntax.

@ifokkema , why the parentheses in "The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]"? Surely N[10_20] would be cleaner????? Can you please confirm this is correct

I'm guessing that it's because the range is uncertain?

Correct. Just as the parentheses in c.(100_200)_(300_400) and in p.(...) indicate uncertainty, N[(10_20)] should be written with parentheses to indicate that the number of Ns is uncertain. Also, it should prevent confusion with the next example:

Also, I don't understand this example ins100_200 (positions, OK). Again, should this not be N[(10_20)] or N[10_20]?

No, because ins100_200 is a position range, not an insertion length range. So c.1_2ins100_200 means "insert c.100_200 between c.1_2". This is correct syntax, and something else entirely. NM_123456.1:c.1_2ins100_200 is like saying NM_123456.1:c.1_2ins[NM_123456.1:c.100_200].
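
To make the distinction concrete, here is a small Python sketch that tells the suffix forms discussed so far apart. The labels and the name `classify_suffix` are hypothetical, and the patterns cover only the simple forms mentioned in this thread, not the full HGVS grammar.

```python
import re

# Checked in order; the first matching pattern wins.
_PATTERNS = [
    ("positions", re.compile(r"^\d+_\d+$")),                  # ins100_200
    ("uncertain length", re.compile(r"^N\[\(\d+_\d+\)\]$")),  # insN[(10_20)]
    ("length", re.compile(r"^N\[\d+\]$")),                    # insN[10]
    ("sequence", re.compile(r"^[ACGTN]+$")),                  # insAAAT
]


def classify_suffix(suffix: str) -> str:
    """Label the part after "ins" according to the forms listed above."""
    for label, pattern in _PATTERNS:
        if pattern.match(suffix):
            return label
    return "unrecognised"


for example in ["100_200", "N[(10_20)]", "N[10]", "AAAT", "(10_20)"]:
    print(example, "->", classify_suffix(example))
```

Because the parentheses and the leading N are required for a length, a bare 100_200 can only be read as a position range.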

OK assuming @ifokkema confirms the above syntaxes are correct, which I believe they are, here is another test set completed

(cut the JSON down to the relevant bits)

{
    "flag": "warning",
    "validation_warning_1": {
        "submitted_variant": "NM_001371623.1:c.483_484ins[(10_20)]",
        "validation_warnings": [
            "The variant description is syntactically correct but no further validation is possible because the description contains uncertainty"
        ]
    },
    "validation_warning_2": {
        "submitted_variant": "NM_001371623.1:c.483ins[(10_20)]",
        "validation_warnings": [
            "An insertion must be provided with the two positions between which the insertion has taken place"
        ]
    },
    "validation_warning_3": {
        "submitted_variant": "NM_001371623.1:c.483ins[(20_20)]",
        "validation_warnings": [
            "The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_20) to N[(20)]",
            "An insertion must be provided with the two positions between which the insertion has taken place"
        ]
    },
    "validation_warning_4": {
        "submitted_variant": "NM_001371623.1:c.483ins[(20_10)]",
        "validation_warnings": [
            "The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_10) to N[(10_20)]",
            "An insertion must be provided with the two positions between which the insertion has taken place"
        ]
    }
}

Some issues remain. All inputs have ins[(... instead of insN[(... (note the N). Therefore, none are syntactically correct. Assuming all inputs are given as insN[(...:

Peter-J-Freeman commented 2 years ago

Nice! Will you put a cutoff at a certain point? Like, from where would the A[...] syntax be preferred? Or just never?

This needs to be discussed by the HGVS SVD WG. We cannot just add an arbitrary value. My preference is that the description always be written in full for data sharing and journal metadata, but that it can be written in the A[...] shorthand in the journal text or clinical report, so long as the full description is stored somewhere and linked/attached.

No, because ins100_200 is a position range, not an insertion length range. So c.1_2ins100_200 means "insert c.100_200 between c.1_2". This is correct syntax, and something else entirely. NM_123456.1:c.1_2ins100_200 is like saying NM_123456.1:c.1_2ins[NM_123456.1:c.100_200].

Of course it is. Sorry. Was late when I was working on this and my brain was mush. Thanks for the reminder!

All inputs have ins[(... instead of insN[(... (note the N). Therefore, none are syntactically correct. Assuming all inputs are given as insN[(...:

I guess we need another warning in this case. Any suggested text?

ifokkema commented 2 years ago

Nice! Will you put a cutoff at a certain point? Like, from where would the A[...] syntax be preferred? Or just never?

This needs to be discussed by the HGVS SVD WG. We cannot just add an arbitrary value. My preference is that the description always be written in full for data sharing and journal metadata, but that it can be written in the A[...] shorthand in the journal text or clinical report, so long as the full description is stored somewhere and linked/attached.

Makes perfect sense to align it. LOVD has the issue of a limit of 255 characters for the DNA field. Other databases may have the same. So above that, LOVD has no way of storing the variant. I guess we might build in an optimizer for that, that shortens the variant if possible but only in that case.
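
One way such an optimizer might look, as a rough Python sketch only. Everything here is an assumption made for illustration: the name `shorten_if_needed`, the minimum run length of 5 (the point at which A[5] becomes shorter than AAAAA), and the choice to compress only runs of a single repeated base. No cutoff has been agreed, and the result can still exceed the limit.

```python
from itertools import groupby

MAX_LENGTH = 255  # the DNA field limit mentioned above
MIN_RUN = 5       # hypothetical: shortest run worth writing as X[n]


def shorten_if_needed(prefix: str, inserted: str, limit: int = MAX_LENGTH) -> str:
    """Return prefix + inserted, compressing single-base runs only when over the limit."""
    full = prefix + inserted
    if len(full) <= limit or not inserted:
        return full  # short enough: keep the description written in full
    parts = []
    for base, run in groupby(inserted):
        count = len(list(run))
        if count >= MIN_RUN:
            parts.append(f"{base}[{count}]")
        elif parts and "[" not in parts[-1]:
            parts[-1] += base * count  # extend the previous uncompressed part
        else:
            parts.append(base * count)
    body = parts[0] if len(parts) == 1 else "[" + ";".join(parts) + "]"
    return prefix + body


print(shorten_if_needed("NM_123456.1:c.1_2ins", "A" * 300 + "T"))
```

Short descriptions pass through untouched, which matches the "only in that case" condition.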

All inputs have ins[(... instead of insN[(... (note the N). Therefore, none are syntactically correct. Assuming all inputs are given as insN[(...:

I guess we need another warning in this case. Any suggested text?

We fail at this point. These examples aren't recognized and result in:

{
    "WSUFFIXFORMAT": "The part after \"ins\" does not follow HGVS guidelines."
}
Peter-J-Freeman commented 2 years ago

Makes perfect sense to align it. LOVD has the issue of a limit of 255 characters for the DNA field. Other databases may have the same. So above that, LOVD has no way of storing the variant. I guess we might build in an optimizer for that, that shortens the variant if possible but only in that case.

I think this is why there needs to be consensus: to make sure that if we must shorten, we all do it at the same cutoff.

{ "WSUFFIXFORMAT": "The part after \"ins\" does not follow HGVS guidelines." }

Thanks. Will add this

Peter-J-Freeman commented 2 years ago

@leicray, we need to bring this up with Johan.