Open leicray opened 2 years ago
For your reference, LOVD in this case reports:
NM_001371623.1:c.483ins
{
"errors": {
"EPOSITIONMISSING": "An insertion must be provided with the two positions between which the insertion has taken place.",
"ESUFFIXMISSING": "The inserted sequence must be provided for insertions or deletion-insertions."
}
}
NM_001371623.1:c.483delins
{
"errors": {
"ESUFFIXMISSING": "The inserted sequence must be provided for insertions or deletion-insertions."
}
}
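The two checks behind these messages can be sketched roughly as follows. This is only an illustration: the function name `check_insertion` and the regular expression are assumptions for the example, not LOVD or VariantValidator code, and only simple integer c. positions are handled. The message texts are copied from the output above.

```python
import re

# Messages copied from the LOVD output quoted above.
EPOSITIONMISSING = ("An insertion must be provided with the two positions "
                    "between which the insertion has taken place.")
ESUFFIXMISSING = ("The inserted sequence must be provided for insertions "
                  "or deletion-insertions.")

# Hypothetical, simplified pattern: accession, then c. with one or two
# plain integer positions, then "ins" or "delins", then the suffix.
PATTERN = re.compile(r"^[^:]+:c\.(\d+)(?:_(\d+))?(delins|ins)(.*)$")


def check_insertion(description):
    """Return a dict of error codes for a simple ins/delins description."""
    match = PATTERN.match(description)
    if match is None:
        raise ValueError("not a simple ins/delins description")
    start, end, kind, suffix = match.groups()
    errors = {}
    # Only a plain insertion needs two flanking positions;
    # a delins may replace a single position.
    if kind == "ins" and end is None:
        errors["EPOSITIONMISSING"] = EPOSITIONMISSING
    # Both variant types need something after "ins".
    if not suffix:
        errors["ESUFFIXMISSING"] = ESUFFIXMISSING
    return errors
```

Reporting both problems at once, rather than stopping at the first, lets the user fix the description in one pass.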
@ifokkema, thanks. I was just about to ask what LOVD does with these. If you can pass on the outputs then I will make the changes.
Thanks
Sure! What other outputs do you need?
Ah, my mistake, I didn't notice there were only a couple of variant descriptions here. Do you have any other similar inputs and outputs that relate to this topic in your test set?
Sure, we have lots, depends on what you would consider a similarity. Insertions and deletion-insertions are the only variants where suffixes are required, so they are handled separately from all other variant types. Also, the suffixes can be much more complex. We recognize:
insA (sequence, OK)
ins10 (length, not OK)
'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "10" to "N[10]".'
ins(10) (length, still not OK)
'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10)" to "N[10]".'
ins(10_20) (length, range, not OK)
'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]".'
insN[10] (length, OK)
insN[(10_20)] (length, range, OK)
ins100_200 (positions, OK)
ins[...] (contents split on ; and checked for all of the above including positions prefixed by refseqs)
Positions or lengths are also further checked for order and uniqueness. If not correct;
Thanks. I'll start working on these.
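As a starting point, the suffix forms in the list above could be told apart with a handful of regular expressions. This is a minimal sketch under stated assumptions: `classify_suffix`, `suggest_rewrite` and the patterns are made up for illustration and are not LOVD's implementation. Only the simple, non-bracketed forms are covered.

```python
import re

# Hypothetical rules: (kind, follows HGVS guidelines?, pattern).
RULES = [
    ("sequence", True, re.compile(r"^[ACGTN]+$")),              # insA
    ("length", False, re.compile(r"^\d+$")),                    # ins10
    ("length", False, re.compile(r"^\(\d+\)$")),                # ins(10)
    ("length range", False, re.compile(r"^\(\d+_\d+\)$")),      # ins(10_20)
    ("length", True, re.compile(r"^N\[\d+\]$")),                # insN[10]
    ("length range", True, re.compile(r"^N\[\(\d+_\d+\)\]$")),  # insN[(10_20)]
    ("positions", True, re.compile(r"^\d+_\d+$")),              # ins100_200
]


def classify_suffix(suffix):
    """Return (kind, is_ok) for the part after "ins", or None if unrecognized."""
    for kind, is_ok, pattern in RULES:
        if pattern.match(suffix):
            return kind, is_ok
    return None


def suggest_rewrite(suffix):
    """Return the suggested N[...] form for a badly formatted length, else None."""
    result = classify_suffix(suffix)
    if result is None or result[1]:
        return None
    kind, _ = result
    inner = suffix.strip("()")
    return f"N[{inner}]" if kind == "length" else f"N[({inner})]"
```

For example, `suggest_rewrite("(10)")` gives `N[10]` and `suggest_rewrite("(10_20)")` gives `N[(10_20)]`, matching the rewrites in the WSUFFIXFORMAT messages.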
Issue 1 sorted NM_001371623.1:c.483ins
"validation_warnings": [
"The inserted sequence must be provided for insertions or deletion-insertions",
"An insertion must be provided with the two positions between which the insertion has taken place."
],
Hi @ifokkema. For clarification, what is meant by "N[10]"?
N, as the IUPAC code for an unknown nucleotide, and [10], as a stretch of 10 of those. So, as c.1_2insAAAAAAAAAA can also be written as c.1_2insA[10], insertion of 10 unknown nucleotides can be written both as c.1_2insNNNNNNNNNN and as c.1_2insN[10].
Perfect, thanks for clarification. I will look up the correct page in the hgvs website
Sure! Let me know if you can't find it!
It's never easy to find specific bits of information on the website. Each of the rules needs to be individually numbered in a hierarchical fashion and update histories for each individual rule need to be accessible in some simple manner.
I think that I suggested this some time ago, but never received any feedback on the idea. Then again, I cannot now remember the forum/medium through which I made the suggestion.
It may have been in our meeting last February. I did mention it to Johan later, but I doubt it will be changed. I'm in favor - let's see if my application for the HVNC membership will be accepted :wink:
N, as the IUPAC code for an unknown nucleotide, and [10], as a stretch of 10 of those. So, as c.1_2insAAAAAAAAAA can also be written as c.1_2insA[10], insertion of 10 unknown nucleotides can be written both as c.1_2insNNNNNNNNNN and as c.1_2insN[10].
This is really unhelpful in terms of consistent reporting because we have set up a scenario where 2 different descriptions have the same meaning. I really dislike it when this happens. I think VV will accept c.1_2insA[10] but will convert it to c.1_2insAAAAAAAAAA and warn that it is better written in full
Thanks for bringing this up
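The "accept the shorthand, convert it, and warn" behaviour described above might look roughly like this. It is a sketch, not VariantValidator's actual code: `write_in_full` and the pattern are assumptions for illustration, and only a single repeat unit directly after "ins" is handled.

```python
import re

# Hypothetical pattern: "<anything>ins<unit>[<count>]", e.g. "c.1_2insA[10]".
SHORTHAND = re.compile(r"^(.*ins)([ACGTN]+)\[(\d+)\]$")


def write_in_full(description):
    """Return (full_description, warnings) for a repeat-shorthand insertion."""
    match = SHORTHAND.match(description)
    if match is None:
        # Not shorthand; nothing to convert and nothing to warn about.
        return description, []
    prefix, unit, count = match.group(1), match.group(2), int(match.group(3))
    full = prefix + unit * count
    return full, [f"{description} is better written as {full}"]
```

Because both forms reduce to the same full description, comparing the converted output is one way to see that two different-looking descriptions mean the same thing.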
Also, what about c.1_2insGATC[10]? Is this also correct for inserted repeats?
I can't find an example of it on the varnomen site, but as far as I know, that is allowed syntax. LOVD also recognizes it as valid syntax.
This is really unhelpful in terms of consistent reporting because we have set up a scenario where 2 different descriptions have the same meaning. I really dislike it when this happens. I think VV will accept c.1_2insA[10] but will convert it to c.1_2insAAAAAAAAAA and warn that it is better written in full
You're of course completely right. Perhaps there should be a rule for when a sequence should be shortened. Note also that, e.g., c.1_2insAAAAAAAAAAT could be rewritten as c.1_2ins[A[10];T]. So it quickly gets more complex. LOVD handles this by splitting the inserted sequence (everything between the square brackets), exploding it on semi-colon, and then checking the syntax of each part.
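The split-then-check approach described here can be sketched as follows. This is an illustration only, not LOVD's code: `expand_insertion` and the `PART` pattern are hypothetical, and each part is assumed to be either a plain sequence or a sequence followed by a repeat count.

```python
import re

# Hypothetical pattern for one part: a sequence with an optional [count].
PART = re.compile(r"^([ACGTN]+)(?:\[(\d+)\])?$")


def expand_insertion(suffix):
    """Expand e.g. '[A[10];T]' or 'GATC[2]' to the full inserted sequence."""
    # A composite suffix is wrapped in one outer pair of square brackets.
    if suffix.startswith("[") and suffix.endswith("]"):
        parts = suffix[1:-1].split(";")
    else:
        parts = [suffix]
    expanded = []
    for part in parts:
        match = PART.match(part)
        if match is None:
            raise ValueError(f"unsupported part: {part!r}")
        unit, count = match.group(1), match.group(2)
        expanded.append(unit * int(count) if count else unit)
    return "".join(expanded)
```

Checking each part on its own keeps the rules for a single suffix and the rules for a composite suffix in one place.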
Oh wow. What a total nightmare.
So far I have handled a simple case and made the following
{
"NM_001371623.1:c.483_484insAAAAAAAAAA": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "5",
"db_xref": {
"CCDS": null,
"ensemblgene": null,
"hgnc": "HGNC:11654",
"ncbigene": "6949",
"select": "MANE"
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": true,
"map": "5q32-q33.1",
"note": "treacle ribosome biogenesis factor 1",
"refseq_select": true,
"variant": "8"
},
"gene_ids": {
"ccds_ids": [
"CCDS47306",
"CCDS4306",
"CCDS47307",
"CCDS54936",
"CCDS47305"
],
"ensembl_gene_id": "ENSG00000070814",
"entrez_gene_id": "6949",
"hgnc_id": "HGNC:11654",
"omim_id": [
"606847"
],
"ucsc_id": "uc003lry.4"
},
"gene_symbol": "TCOF1",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_001358552.1:p.(E162Kfs*16)",
"tlr": "NP_001358552.1:p.(Glu162LysfsTer16)"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAA",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAA",
"vcf": {
"alt": "CAAAAAAAAAA",
"chr": "5",
"pos": "149748382",
"ref": "C"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAA",
"vcf": {
"alt": "CAAAAAAAAAA",
"chr": "5",
"pos": "150368819",
"ref": "C"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAA",
"vcf": {
"alt": "CAAAAAAAAAA",
"chr": "chr5",
"pos": "149748382",
"ref": "C"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAA",
"vcf": {
"alt": "CAAAAAAAAAA",
"chr": "chr5",
"pos": "150368819",
"ref": "C"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001358552.1",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001371623.1"
},
"refseqgene_context_intronic_sequence": "",
"selected_assembly": false,
"submitted_variant": "NM_001371623.1:c.483_484insA[10]",
"transcript_description": "Homo sapiens treacle ribosome biogenesis factor 1 (TCOF1), transcript variant 8, mRNA",
"validation_warnings": [
"NM_001371623.1:c.483_484insA[10] is better written as NM_001371623.1:c.483_484insAAAAAAAAAA",
"RefSeqGene record not available"
],
"variant_exonic_positions": {
"NC_000005.10": {
"end_exon": "5",
"start_exon": "5"
},
"NC_000005.9": {
"end_exon": "5",
"start_exon": "5"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.0.1.dev2+g58fc52a",
"variantvalidator_version": "1.0.5.dev228+gee3fee4.d20211116",
"vvdb_version": "vvdb_2021_4",
"vvseqrepo_db": "VV_SR_2021_2/master",
"vvta_version": "vvta_2021_2"
}
}
Note, the code will also handle variants like GATC[10]
Now to look for the ; versions, e.g. c.1_2ins[A[10];T]. Thanks for the info.
OK, will now do the following too. Should be agnostic to the content of the insertion
{
"NM_001371623.1:c.483_484insAAAAAAAAAAT": {
"alt_genomic_loci": [],
"annotations": {
"chromosome": "5",
"db_xref": {
"CCDS": null,
"ensemblgene": null,
"hgnc": "HGNC:11654",
"ncbigene": "6949",
"select": "MANE"
},
"ensembl_select": false,
"mane_plus_clinical": false,
"mane_select": true,
"map": "5q32-q33.1",
"note": "treacle ribosome biogenesis factor 1",
"refseq_select": true,
"variant": "8"
},
"gene_ids": {
"ccds_ids": [
"CCDS47306",
"CCDS4306",
"CCDS47307",
"CCDS54936",
"CCDS47305"
],
"ensembl_gene_id": "ENSG00000070814",
"entrez_gene_id": "6949",
"hgnc_id": "HGNC:11654",
"omim_id": [
"606847"
],
"ucsc_id": "uc003lry.4"
},
"gene_symbol": "TCOF1",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "NP_001358552.1:p.(E162Kfs*61)",
"tlr": "NP_001358552.1:p.(Glu162LysfsTer61)"
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAAT",
"primary_assembly_loci": {
"grch37": {
"hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAAT",
"vcf": {
"alt": "AAAAAAAAAAAT",
"chr": "5",
"pos": "149748383",
"ref": "A"
}
},
"grch38": {
"hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAAT",
"vcf": {
"alt": "AAAAAAAAAAAT",
"chr": "5",
"pos": "150368820",
"ref": "A"
}
},
"hg19": {
"hgvs_genomic_description": "NC_000005.9:g.149748383_149748384insAAAAAAAAAAT",
"vcf": {
"alt": "AAAAAAAAAAAT",
"chr": "chr5",
"pos": "149748383",
"ref": "A"
}
},
"hg38": {
"hgvs_genomic_description": "NC_000005.10:g.150368820_150368821insAAAAAAAAAAT",
"vcf": {
"alt": "AAAAAAAAAAAT",
"chr": "chr5",
"pos": "150368820",
"ref": "A"
}
}
},
"reference_sequence_records": {
"protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_001358552.1",
"transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_001371623.1"
},
"refseqgene_context_intronic_sequence": "",
"selected_assembly": false,
"submitted_variant": "NM_001371623.1:c.483_484ins[A[10];T]", # I assume this is a correct description Ivo????
"transcript_description": "Homo sapiens treacle ribosome biogenesis factor 1 (TCOF1), transcript variant 8, mRNA",
"validation_warnings": [
"NM_001371623.1:c.483_484ins[A[10];T] is better written as NM_001371623.1:c.483_484insAAAAAAAAAAT",
"RefSeqGene record not available"
],
"variant_exonic_positions": {
"NC_000005.10": {
"end_exon": "5",
"start_exon": "5"
},
"NC_000005.9": {
"end_exon": "5",
"start_exon": "5"
}
}
},
"flag": "gene_variant",
"metadata": {
"variantvalidator_hgvs_version": "2.0.1.dev2+g58fc52a",
"variantvalidator_version": "1.0.5.dev228+gee3fee4.d20211116",
"vvdb_version": "vvdb_2021_4",
"vvseqrepo_db": "VV_SR_2021_2/master",
"vvta_version": "vvta_2021_2"
}
}
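One detail in the two outputs above: the VCF `pos` and `ref` differ (382/C for the A[10] insertion, 383/A for the A[10];T insertion) even though both insertions sit between the same positions. That pattern is consistent with the VCF record being left-aligned, since a run of inserted A bases can slide left past the reference A at 383 but the insertion ending in T cannot. A minimal sketch of that shift, assuming a toy reference string in place of the chromosome (the function name and the toy sequence are illustrative, not VariantValidator code):

```python
def insertion_to_vcf(reference, left_pos, inserted):
    """Convert an insertion after 1-based `left_pos` into (pos, ref, alt).

    `reference` is a toy sequence standing in for the chromosome. The
    insertion is shifted left for as long as that leaves the resulting
    allele unchanged.
    """
    pos = left_pos
    # If the last inserted base equals the reference base just left of the
    # insertion point, rotating the insertion one step left gives the same
    # final sequence.
    while pos > 1 and inserted[-1] == reference[pos - 1]:
        inserted = reference[pos - 1] + inserted[:-1]
        pos -= 1
    anchor = reference[pos - 1]
    return pos, anchor, anchor + inserted


# Toy reference: position 2 is C and position 3 is A, mimicking the
# C at ...382 and A at ...383 implied by the outputs above.
toy = "GCAT"
print(insertion_to_vcf(toy, 3, "A" * 10))        # anchored on the C at position 2
print(insertion_to_vcf(toy, 3, "A" * 10 + "T"))  # stays anchored on the A at position 3
```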
Pulling this section down so I can see what I still need to do
ins(10) (length, still not OK)
'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10)" to "N[10]".'
ins(10_20) (length, range, not OK)
'WSUFFIXFORMAT': 'The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]".'
insN[10] (length, OK)
insN[(10_20)] (length, range, OK)
ins100_200 (positions, OK)
ins[...] (contents split on ; and checked for all of the above including positions prefixed by refseqs)
Positions or lengths are also further checked for order and uniqueness. If not correct;
The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(30_30)" to "N[30]".
The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(50_30)" to "N[(30_50)]".
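The order and uniqueness checks on a length range amount to sorting the two bounds and collapsing them when they are equal. A small sketch of that rule, with `normalise_length_range` as a made-up name and the output forms taken from the two messages above:

```python
import re

# Matches a parenthesized length range such as "(50_30)".
LENGTH_RANGE = re.compile(r"^\((\d+)_(\d+)\)$")


def normalise_length_range(suffix):
    """Rewrite '(a_b)' to N[...] with ordered, non-redundant bounds."""
    match = LENGTH_RANGE.match(suffix)
    if match is None:
        raise ValueError(f"not a length range: {suffix!r}")
    low, high = sorted(int(value) for value in match.groups())
    if low == high:
        # Identical bounds carry no uncertainty, so a single length suffices.
        return f"N[{low}]"
    return f"N[({low}_{high})]"
```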
@ifokkema, why the parentheses in "The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]""? Surely N[10_20] would be cleaner? Can you please confirm this is correct?
I'm guessing that it's because the range is uncertain?
Also, I don't understand this example: ins100_200 (positions, OK). Again, should this not be N[(10_20)] or N[10_20]?
OK assuming @ifokkema confirms the above syntaxes are correct, which I believe they are, here is another test set completed
{
"flag": "warning",
"metadata": {
"variantvalidator_hgvs_version": "2.0.1.dev2+g58fc52a",
"variantvalidator_version": "1.0.5.dev228+gee3fee4.d20211116",
"vvdb_version": "vvdb_2021_4",
"vvseqrepo_db": "VV_SR_2021_2/master",
"vvta_version": "vvta_2021_2"
},
"validation_warning_1": {
"alt_genomic_loci": [],
"annotations": {},
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh37",
"submitted_variant": "NM_001371623.1:c.483_484ins[(10_20)]",
"transcript_description": "",
"validation_warnings": [
"The variant description is syntactically correct but no further validation is possible because the description contains uncertainty"
],
"variant_exonic_positions": null
},
"validation_warning_2": {
"alt_genomic_loci": [],
"annotations": {},
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh37",
"submitted_variant": "NM_001371623.1:c.483ins[(10_20)]",
"transcript_description": "",
"validation_warnings": [
"An insertion must be provided with the two positions between which the insertion has taken place"
],
"variant_exonic_positions": null
},
"validation_warning_3": {
"alt_genomic_loci": [],
"annotations": {},
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh37",
"submitted_variant": "NM_001371623.1:c.483ins[(20_20)]",
"transcript_description": "",
"validation_warnings": [
"The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_20) to N[(20)]",
"An insertion must be provided with the two positions between which the insertion has taken place"
],
"variant_exonic_positions": null
},
"validation_warning_4": {
"alt_genomic_loci": [],
"annotations": {},
"gene_ids": {},
"gene_symbol": "",
"genome_context_intronic_sequence": "",
"hgvs_lrg_transcript_variant": "",
"hgvs_lrg_variant": "",
"hgvs_predicted_protein_consequence": {
"lrg_slr": "",
"lrg_tlr": "",
"slr": "",
"tlr": ""
},
"hgvs_refseqgene_variant": "",
"hgvs_transcript_variant": "",
"primary_assembly_loci": {},
"reference_sequence_records": "",
"refseqgene_context_intronic_sequence": "",
"selected_assembly": "GRCh37",
"submitted_variant": "NM_001371623.1:c.483ins[(20_10)]",
"transcript_description": "",
"validation_warnings": [
"The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_10) to N[(10_20)]",
"An insertion must be provided with the two positions between which the insertion has taken place"
],
"variant_exonic_positions": null
}
}
Oh wow. What a total nightmare.
So far I have handled a simple case and made the following
(cut the JSON down to the relevant bits)
{
  "NM_001371623.1:c.483_484insAAAAAAAAAA": {
    "hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAA",
    "submitted_variant": "NM_001371623.1:c.483_484insA[10]",
    "validation_warnings": [
      "NM_001371623.1:c.483_484insA[10] is better written as NM_001371623.1:c.483_484insAAAAAAAAAA"
    ]
  }
}
Nice! Will you put a cutoff at a certain point? Like, from where would the A[...] syntax be preferred? Or just never?
Note, the code will also handle variants like GATC[10]
Excellent!
OK, will now do the following too. Should be agnostic to the content of the insertion
(cut the JSON down to the relevant bits)
{
  "NM_001371623.1:c.483_484insAAAAAAAAAAT": {
    "hgvs_transcript_variant": "NM_001371623.1:c.483_484insAAAAAAAAAAT",
    "submitted_variant": "NM_001371623.1:c.483_484ins[A[10];T]", # I assume this is a correct description Ivo????
    "validation_warnings": [
      "NM_001371623.1:c.483_484ins[A[10];T] is better written as NM_001371623.1:c.483_484insAAAAAAAAAAT"
    ]
  }
}
Nice! Yes, NM_001371623.1:c.483_484ins[A[10];T] is valid syntax.
@ifokkema, why the parentheses in "The length of the variant is not formatted following the HGVS guidelines. Please rewrite "(10_20)" to "N[(10_20)]""? Surely N[10_20] would be cleaner? Can you please confirm this is correct?
I'm guessing that it's because the range is uncertain?
Correct; just as the parentheses in c.(100_200)_(300_400) and in p.(...) indicate uncertainty, N[(10_20)] should be written as such to indicate that the number of Ns is uncertain. Also, it should prevent confusion with the next example:
Also, I don't understand this example: ins100_200 (positions, OK). Again, should this not be N[(10_20)] or N[10_20]?
No, because ins100_200 is a position range, not an insertion length range. So c.1_2ins100_200 means "insert c.100_200 between c.1_2". This is correct syntax, and something else entirely. NM_123456.1:c.1_2ins100_200 is like saying NM_123456.1:c.1_2ins[NM_123456.1:c.100_200].
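To make the difference concrete, the implicit reference in a position-range insertion can be spelled out mechanically. A rough sketch, where `make_reference_explicit` and the pattern are assumptions for illustration and only plain integer positions are handled:

```python
import re

# Hypothetical pattern: accession, coordinate type, everything up to "ins",
# then a bare position range such as "100_200".
POSITION_INSERT = re.compile(r"^([^:]+):([cgmnr])\.(.+ins)(\d+_\d+)$")


def make_reference_explicit(description):
    """Rewrite '...ins100_200' as '...ins[<same refseq>:c.100_200]'."""
    match = POSITION_INSERT.match(description)
    if match is None:
        # Length ranges such as insN[(10_20)] are left untouched.
        return description
    refseq, coord, prefix, positions = match.groups()
    return f"{refseq}:{coord}.{prefix}[{refseq}:{coord}.{positions}]"
```

A length range such as insN[(10_20)] does not match the pattern, so it is never mistaken for a position range.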
OK assuming @ifokkema confirms the above syntaxes are correct, which I believe they are, here is another test set completed
(cut the JSON down to the relevant bits)
{
  "flag": "warning",
  "validation_warning_1": {
    "submitted_variant": "NM_001371623.1:c.483_484ins[(10_20)]",
    "validation_warnings": [
      "The variant description is syntactically correct but no further validation is possible because the description contains uncertainty"
    ]
  },
  "validation_warning_2": {
    "submitted_variant": "NM_001371623.1:c.483ins[(10_20)]",
    "validation_warnings": [
      "An insertion must be provided with the two positions between which the insertion has taken place"
    ]
  },
  "validation_warning_3": {
    "submitted_variant": "NM_001371623.1:c.483ins[(20_20)]",
    "validation_warnings": [
      "The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_20) to N[(20)]",
      "An insertion must be provided with the two positions between which the insertion has taken place"
    ]
  },
  "validation_warning_4": {
    "submitted_variant": "NM_001371623.1:c.483ins[(20_10)]",
    "validation_warnings": [
      "The length of the variant is not formatted following the HGVS guidelines. Please rewrite (20_10) to N[(10_20)]",
      "An insertion must be provided with the two positions between which the insertion has taken place"
    ]
  }
}
Some issues remain;
All inputs have ins[(... instead of insN[(... (note the N). Therefore, none are syntactically correct.
Assuming all inputs are given as insN[(...:
Nice! Will you put a cutoff at a certain point? Like, from where would the A[...] syntax be preferred? Or just never?
This needs to be discussed by the HGVS SVD WG. We cannot just add an arbitrary value. My preference is that the description always be written in full for data sharing and journal metadata, but that it can be written in an annotation shorthand (the A[...] syntax) in the journal text or clinical report, so long as the full description is stored somewhere and linked/attached.
No, because ins100_200 is a position range, not an insertion length range. So c.1_2ins100_200 means "insert c.100_200 between c.1_2". This is correct syntax, and something else entirely. NM_123456.1:c.1_2ins100_200 is like saying NM_123456.1:c.1_2ins[NM_123456.1:c.100_200].
Of course it is. Sorry. Was late when I was working on this and my brain was mush. Thanks for the reminder!
All inputs have ins[(... instead of insN[(... (note the N). Therefore, none are syntactically correct. Assuming all inputs are given as insN[(...:
I guess we need another warning in this case. Any suggested text?
Nice! Will you put a cutoff at a certain point? Like, from where would the A[...] syntax be preferred? Or just never?
This needs to be discussed by the HGVS SVD WG. We cannot just add an arbitrary value. My preference is that the description always be written in full for data sharing and journal metadata, but that it can be written in an annotation shorthand (the A[...] syntax) in the journal text or clinical report, so long as the full description is stored somewhere and linked/attached.
Makes perfect sense to align it. LOVD has the issue of a limit of 255 characters for the DNA field. Other databases may have the same. So above that, LOVD has no way of storing the variant. I guess we might build in an optimizer for that, that shortens the variant if possible but only in that case.
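Such an optimizer might look something like the sketch below: leave the description alone unless it exceeds the field limit, and only then collapse runs of identical bases into the repeat syntax. Everything here is hypothetical (the function name, the pattern, and the choice to compress only single-base runs); it is not existing LOVD code, and longer repeat units such as GATC[10] are not detected.

```python
import re
from itertools import groupby

MAX_LENGTH = 255  # the DNA field limit mentioned above


def shorten_if_needed(description, max_length=MAX_LENGTH):
    """Collapse single-base runs into X[n] syntax, only when over the limit."""
    if len(description) <= max_length:
        return description
    # Hypothetical pattern: a plain inserted sequence after "ins".
    match = re.match(r"^(.*ins)([ACGTN]+)$", description)
    if match is None:
        return description
    prefix, sequence = match.groups()
    parts = []
    for base, run in groupby(sequence):
        count = len(list(run))
        parts.append(f"{base}[{count}]" if count > 1 else base)
    # One part needs no outer brackets; several parts are joined with ";".
    suffix = parts[0] if len(parts) == 1 else "[" + ";".join(parts) + "]"
    shortened = prefix + suffix
    # Keep the original if compression did not actually help.
    return shortened if len(shortened) < len(description) else description
```

The result may still exceed the limit for sequences without long runs, so the caller would need to check the length again.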
All inputs have ins[(... instead of insN[(... (note the N). Therefore, none are syntactically correct. Assuming all inputs are given as insN[(...:
I guess we need another warning in this case. Any suggested text?
We fail at this point. These examples aren't recognized and result in:
{
"WSUFFIXFORMAT": "The part after \"ins\" does not follow HGVS guidelines."
}
Makes perfect sense to align it. LOVD has the issue of a limit of 255 characters for the DNA field. Other databases may have the same. So above that, LOVD has no way of storing the variant. I guess we might build in an optimizer for that, that shortens the variant if possible but only in that case.
I think this is why there needs to be consensus. To make sure that if we must shorten, we all do it at the same cutoff
{ "WSUFFIXFORMAT": "The part after \"ins\" does not follow HGVS guidelines." }
Thanks. Will add this
@leicray, we need to bring this up with Johan.
Describe the bug
Error messages relating to ins and delins variants are not as clear as they might be.
To Reproduce
Steps to reproduce the behaviour:
Expected behaviour
A better error message for the first variant description would point out that there are two separate errors. The inserted bases need to be reported, and the insertion point needs to be specified as lying between two adjacent bases.
For the second variant, the error message needs to point out that the inserted bases need to be reported.