openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Revisit development of code to handle less common syntaxes #556

Open Peter-J-Freeman opened 7 months ago

Peter-J-Freeman commented 7 months ago

Is your feature request related to a problem? Please describe. There are a few formats that are used less regularly which we need to develop code to handle

Describe the solution you'd like We need a few examples of each. One for @leicray. Then to map out workflows for each

Describe alternatives you've considered A couple of STP students did start to tackle some of this so we have code we can merge in and adapt


Peter-J-Freeman commented 7 months ago

@leicray, there are examples in the files so no need to trawl

Peter-J-Freeman commented 7 months ago

OK, @ifokkema.

I took a look at https://varnomen.hgvs.org/recommendations/DNA/variant/repeated/ and there is a relevant section that will change the nomenclature for expanded repeat syntax.

NOTE: a Community Consultation is prepared which will suggest to allow only one format where the entire range of the repeated sequence must be indicated, e.g. g.123_191CAG[23] not g.123CAG[23]

I was aware because I believe Ivo and I were amongst those who asked for the consultation. So, I will leave this work for now until the consultation is completed.
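To make the difference between the two repeat formats concrete, here is a minimal sketch that tells them apart. The regular expression and the function name `check_repeat` are my own illustration, not VariantValidator code, and the pattern only covers the simple `g.` form quoted in the note. The unit-length check is an illustrative sanity check, not a rule taken from the consultation.

```python
import re

# Assumed, simplified pattern: g.<start>[_<end>]<unit>[<count>]
REPEAT_RE = re.compile(
    r"^g\.(?P<start>\d+)(?:_(?P<end>\d+))?(?P<unit>[ACGT]+)\[(?P<count>\d+)\]$"
)


def check_repeat(description: str) -> str:
    """Classify a simple repeat description (hypothetical helper)."""
    m = REPEAT_RE.match(description)
    if m is None:
        return "not a simple repeat description"
    if m.group("end") is None:
        # e.g. g.123CAG[23]: only the first position is given
        return "single position: range of the repeated sequence not given"
    span = int(m.group("end")) - int(m.group("start")) + 1
    if span % len(m.group("unit")) != 0:
        # Illustrative check only: the reference range should hold whole units
        return "range length is not a multiple of the repeat unit"
    return "full range given"


print(check_repeat("g.123_191CAG[23]"))
print(check_repeat("g.123CAG[23]"))
```

With the two examples from the note, the first is reported as giving the full range and the second as a single position.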

Peter-J-Freeman commented 7 months ago

We need to also look at uncertain positions

For example NC_000005.9:g.(90136803_90144453)_(90159675_90261231)dup

Here is the nomenclature page https://varnomen.hgvs.org/recommendations/uncertain/

Peter-J-Freeman commented 7 months ago

OK, here are the web page examples

NC_000003.12:g.(63912602_63912844)delN[15] or NM_000333.3:c.(4_246)delN[15]
compared to the reference the amplified fragment is 15 nucleotides shorter.

Format based on Repeated sequences:
NC_000003.12:g.(63912602_63912844)insN[(150_180)] or NM_000333.3:c.(4_246)insN[(150_180)]
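As a starting point for mapping out a workflow, here is a rough sketch of how the size-only syntax could be split into its parts. The pattern and the name `parse_sized_variant` are hypothetical and only cover the `delN[#]` and `insN[#]` / `insN[(low_high)]` shapes shown in these examples. Fragment positions are kept as strings because `c.` positions can carry intronic offsets.

```python
import re
from typing import Optional

# Assumed pattern for <ref>:<g|c>.(<start>_<end>)<del|ins>N[<size>] where
# <size> is either a number or an uncertain range such as (150_180)
SIZE_RE = re.compile(
    r"^(?P<ref>[^:]+):(?P<type>[gc])\.\((?P<start>[^_()]+)_(?P<end>[^_()]+)\)"
    r"(?P<kind>del|ins)N\[(?:(?P<exact>\d+)|\((?P<low>\d+)_(?P<high>\d+)\))\]$"
)


def parse_sized_variant(description: str) -> Optional[dict]:
    """Return the parts of a size-only del/ins description, or None."""
    m = SIZE_RE.match(description)
    if m is None:
        return None
    if m.group("exact") is not None:
        low = high = int(m.group("exact"))
    else:
        low, high = int(m.group("low")), int(m.group("high"))
    return {
        "reference": m.group("ref"),
        "coordinate_type": m.group("type"),
        "fragment": (m.group("start"), m.group("end")),
        "kind": m.group("kind"),
        "size_range": (low, high),
    }


print(parse_sized_variant("NC_000003.12:g.(63912602_63912844)delN[15]"))
print(parse_sized_variant("NM_000333.3:c.(4_246)insN[(150_180)]"))
```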

NC_000023.11:g.(31729716_31774235)_(32216847_32287541)del (LRG_199t1:c.(6278_6438+69)_(7310-43_7575)del)

NC_000023.11:g.(31729663_31774080)_(32216973_32287624)del (LRG_199t1:c.(6195_6381)_(7422_7628)del)

NC_000023.10:g.(32218983_32238146)_(32984039_33252615)del
a deletion on the X chromosome, based on reference genome build hg19 (reference sequence NC_000023.10), starting between nucleotides 32,218,983 to 32,238,146 and ending between nucleotides 32,984,039 to 33,252,615.

NC_000023.10:g.(?_32238146)_(32984039_?)del
a deletion on the X chromosome, based on reference genome build hg19, starting upstream of nucleotide 32,238,146 and ending downstream of nucleotide 32,984,039. NOTE: from the description indicated it is unclear how far the deletion extends, suggesting no up- or downstream probes were tested (and scored positive).

NC_000013.11:g.(19385993_19394916)_(25045592_25059364)del
a deletion on chromosome 13, based on reference genome build hg38, detected using a SNP-array. The deletion spans from dbSNP entries rs3929856 (g.19394916) to rs10507342 (g.25045592), both yielding a 50% reduced signal. On the centromeric side (q-arm) the closest probe tested normal was rs2342234 (g.19385993), on the telomeric side the closest probe tested normal was rs947283 (g.25059364).

NC_000013.11:g.(?_19394916)_(25045592_?)del
a deletion on chromosome 13, based on reference genome build hg38, detected using a SNP-array. The deletion spans from rs3929856 (g.19394916) to rs10507342 (g.25045592). The description indicates no flanking probes were tested normal, making it unclear how far the deletion extends.
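For the workflow, a sketch of how the `(A_B)_(C_D)` shape with optional `?` boundaries might be broken down for genomic descriptions. Everything here (the `UncertainRange` type, the pattern, the function name) is hypothetical and only handles plain integer `g.` positions with `del` or `dup`. An unknown outer boundary is returned as `None`.

```python
import re
from typing import NamedTuple, Optional


class UncertainRange(NamedTuple):
    reference: str
    outer_start: Optional[int]  # A: None when written as '?'
    inner_start: int            # B
    inner_end: int              # C
    outer_end: Optional[int]    # D: None when written as '?'
    kind: str                   # 'del' or 'dup'


# Assumed pattern: <ref>:g.(<A|?>_<B>)_(<C>_<D|?>)<del|dup>
UNCERTAIN_RE = re.compile(
    r"^(?P<ref>[^:]+):g\.\((?P<a>\d+|\?)_(?P<b>\d+)\)_"
    r"\((?P<c>\d+)_(?P<d>\d+|\?)\)(?P<kind>del|dup)$"
)


def parse_uncertain(description: str) -> Optional[UncertainRange]:
    """Parse an uncertain-position g. description, or return None."""
    m = UNCERTAIN_RE.match(description)
    if m is None:
        return None

    def to_int(text: str) -> Optional[int]:
        return None if text == "?" else int(text)

    return UncertainRange(
        reference=m.group("ref"),
        outer_start=to_int(m.group("a")),
        inner_start=int(m.group("b")),
        inner_end=int(m.group("c")),
        outer_end=to_int(m.group("d")),
        kind=m.group("kind"),
    )


print(parse_uncertain("NC_000023.10:g.(32218983_32238146)_(32984039_33252615)del"))
print(parse_uncertain("NC_000023.10:g.(?_32238146)_(32984039_?)del"))
```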

rearrangements detected using FISH (Fluorescence In Situ Hybridisation) can be described using ISCN guidelines. When probe positions are known, variants can be described using genomic coordinates. The basic format is (position-last-normal-probe_position-first-variant-probe)_(position-last-variant-probe_position-first-normal-probe) (see also ISCN<>HGVS). In this description the “probe position” is based on the center of the labelled probe used during hybridisation.

NC_000023.10:g.(32057077_32364657)_(32975163_33394206)del
a deletion on the X-chromosome detected using FISH. The deletion, based on reference genome build hg38, spans from PAC probes RP4-556A22 (central position g.32364657) to RP11-151J4 (central position 32975163), both yielding no signal. On the telomeric side (p-arm) the closest probe tested positive was PAC RP11-509C1 (g.32057077), on the centromeric side the closest probe tested positive was RP6-60B16 (g.33394206).

NC_000023.10:g.(?_32364657)_(32975163_?)del
a deletion on the X-chromosome detected using FISH. The deletion, based on reference genome build hg38, spans from PAC probes RP4-556A22 (central position g.32364657) to RP11-151J4 (central position 32975163), both yielding no signal. No flanking positive probes were tested, making it unclear how far the deletion extends.

chrX:g.(32057077_32364657)_(32894352_33055973)del
a deletion on the X-chromosome detected using FISH. The deletion, based on reference genome build hg38, spans from PAC probe RP4-556A22 (central position g.32364657) to within RP11-151J4 (g.32894352_33055973), as indicated by the reduced signal. On the telomeric side (p-arm) the closest probe tested positive was PAC RP11-509C1 (g.32057077).

NC_000023.11:g.(31775822_31819974)_(32217064_32278336)del (LRG_199t1:c.(6290+9193_6291-1)_(7309+1_7310-1630)del)
a deletion of exons 44 to 51 in the DMD gene as detected by a HindIII digestion, Southern blotting and cDNA hybridisation. The variant is described based on the normal intensity of the hybridizing exon 43 fragment (g.32278336_32289141del / c.6118-1440_6290+9193), the absence of the hybridizing exon 44 fragment (g.32214437_32218461 / c.6291-1398_6438+2479), the absence of the hybridizing exon 51 fragment (g.31817965_31821709 / c.7201-1626_7309+2010), and the normal intensity of the hybridizing exon 52 fragment (g.31772670_31775822 / c.7310-1630_7542+1290) relative to the coding DNA reference sequence. NOTE: the deletion is assumed to involve the entire exon.

insertion: the format to describe insertions that have not been fully characterised (sequenced) depends on the method used. The same recommendations apply as described above for deletions.

size: when a fragment containing an insertion has been amplified but only its size was determined (and not its sequence), the variant should be reported as g.(position-fragment-start_position-fragment-end)insN[#].

NC_000003.12:g.(63912602_63912844)insN[15] or NM_000333.3:c.(4_246)insN[15]
where compared to the reference the amplified fragment is fifteen nucleotides larger giving an estimated 13-unit CAG/Gln repeat in the ATXN7 gene

present/absent: the format to describe insertions that have not been fully characterised, i.e. the inserted sequence and/or the insertion break point has not been sequenced, is g.(left-ins-position_right-ins-position)ins(last-normal_first-inserted)_(last-inserted_first-normal)

NOTE: the description of the insertion, “ins(last-normal_first-inserted)_(last-inserted_first-normal)”, is based on the uncertainty of the extent of the inserted sequence. To describe the inserted sequence, follow the standard recommendations, i.e. try to describe it as precisely as possible.

NC_000002.11:g.47643464_47643465ins[NC_000022.10:35788169_35788352]
the insertion on chromosome 2, between nucleotides g.47643464 and g.47643465 (in the MSH2 gene), of sequences from chromosome 22, nucleotides g.35788169 to g.35788352 (of the HMOX1 gene).

NC_000002.11:g.?_?ins[NC_000022.10:35788169_35788352]
the insertion of sequences from chromosome 22, nucleotides g.35788169 to g.35788352 (of the HMOX1 gene) at an unknown position in chromosome 2.

duplication: the standard format to describe a duplication for which the break point has not been sequenced is (A_B)_(C_D)dup, where B_C describes the minimal extent and A_D the maximal extent of the duplication, i.e. g.(last-normal_first-duplicated)_(last-duplicated_first-normal)dup.
NOTE: many assays detect the presence of an additional copy of specific sequences, not the location of the extra copy. When there is no evidence the additional copy is in tandem with the original copy but might be anywhere in a genome, the variant should be described as an insertion (see above).

PCR
NC_000023.11:g.(31729716_31773911)_(32217064_32287541)dup (LRG_199t61:c.(6278_6291-1)_(7542+49_7575)dup)
a duplication of exons 44 to 51 in the DMD gene as detected by a multiplex PCR assay. The variant is described based on the last amplified nucleotide of exon 43 (g.32287541/c.6278), the first duplicate amplified nucleotide of exon 44 (g.32216847/c.6291-1), the last duplicate amplified nucleotide of exon 51 (g.31774235/c.7542+49) and the first amplified nucleotide of exon 52 (g.31729716/c.7575).

MLPA
NC_000023.11:g.(31729662_31774079)_(32216972_32287623)dup (LRG_199t1:c.(6196_6382)_(7423_7629)dup)
a duplication of exons 44 to 51 in the DMD gene as detected by an MLPA assay. The duplication is described based on the last normal signal (exon 43 position g.32287623/c.6196), the first duplicated signal (exon 44 position g.32216972/c.6382), the last duplicated signal (exon 51 position g.31774079/c.7423) and the first normal signal (exon 52 position g.31729662/c.7629). NOTE: in samples containing 2 alleles, an increased signal indicates the presence of an extra copy of the sequence tested. The result has no information regarding the location of the extra copy, it can be anywhere in the genome!

Southern blotting
NC_000023.11:g.(31775822_31817965)_(32218461_32278336)dup (LRG_199t1:c.(6290+9193_6291-1398)_(7309+2010_7310-1630)dup)
a duplication of exons 44 to 51 in the DMD gene as detected by a HindIII digestion, Southern blotting and cDNA hybridisation. The variant is described based on the normal intensity of the hybridizing exon 43 fragment (g.32278336_32289141/c.6118-1440_6290+9193), the double intensity of the hybridizing exon 44 fragment (g.32214437_32218461/c.6291-1398_6438+2479), the double intensity of the hybridizing exon 51 fragment (g.31817965_31821709/c.7201-1626_7309+2010), and the normal intensity of the hybridizing exon 52 fragment (g.31772670_31775822/c.7310-1630_7542+1290) relative to the coding DNA reference sequence.
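The minimal (B_C) and maximal (A_D) extents from the duplication format can be worked out directly from the four boundaries. A small sketch, assuming genomic coordinates in ascending order. The function name `extents` is hypothetical, and the maximal extent follows the A_D wording of the recommendation literally, so it includes the flanking normal positions.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]


def extents(
    a: Optional[int], b: int, c: int, d: Optional[int]
) -> Tuple[Span, Optional[Span]]:
    """Return (minimal, maximal) extent for a g.(A_B)_(C_D) description.

    a and d may be None when the outer boundary is unknown ('?'),
    in which case no maximal extent can be given.
    """
    if not b <= c:
        raise ValueError("expected ascending genomic coordinates (B <= C)")
    minimal = (b, c)
    maximal = None if a is None or d is None else (a, d)
    return minimal, maximal


def length(span: Span) -> int:
    """Number of nucleotides covered by an inclusive span."""
    return span[1] - span[0] + 1


# MLPA example: NC_000023.11:g.(31729662_31774079)_(32216972_32287623)dup
minimal, maximal = extents(31729662, 31774079, 32216972, 32287623)
print("minimal", minimal, length(minimal))
print("maximal", maximal, length(maximal))
```

For the MLPA example this gives a minimal extent of 442,894 nucleotides and a maximal extent of 557,962 nucleotides.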

Peter-J-Freeman commented 7 months ago

@ifokkema To save duplication of effort. Have you handled these in your syntax checker?

ifokkema commented 7 months ago

> So, I will leave this work for now until the consultation is completed.

Agreed!

Also related: #328

I've been wanting to make a finalized list of variant descriptions, with reference sequences so we can test them as well with VV, where we have defined:

Also related to https://github.com/LOVDnl/LOVD3/issues/573. And perhaps to Reece's HGVS eval.

Peter-J-Freeman commented 7 months ago

Made a little progress for the unknown position syntax

```python
import json

import VariantValidator

# Create the validator (needs a configured VariantValidator installation)
vval = VariantValidator.Validator()

variant = 'NM_006138.4:c.(1_20)_(30_36)del'  # variant 1
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'refseq'

# Validate the variant and print the result, including metadata, as JSON
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
```

```json
{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev465+gd6addb3.d20231211",
        "vvdb_version": "vvdb_2023_8",
        "vvseqrepo_db": "VV_SR_2023_05/master",
        "vvta_version": "vvta_2023_05"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_006138.4:c.(1_20)_(30_36)del",
        "primary_assembly_loci": {
            "grch38": {
                "hgvs_genomic_description": "NC_000011.10:g.(60061161_60061180)_(60061190_60061196)del"
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_006138.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_006138.4:c.(1_20)_(30_36)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    }
}
```
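For anyone wanting to test a list of these descriptions, here is a sketch of how the useful bits could be pulled out of a result shaped like the output above. It works on a plain dict, so it does not need VariantValidator installed. The `summarise_validation` helper is hypothetical and the sample dict is a trimmed copy of the output above, not a new result.

```python
from typing import Any, Dict, List


def summarise_validation(validation: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect submitted variant, genomic descriptions and warnings per record."""
    rows = []
    for key, record in validation.items():
        # 'flag' and 'metadata' are top-level entries, not variant records
        if key in ("flag", "metadata") or not isinstance(record, dict):
            continue
        loci = record.get("primary_assembly_loci") or {}
        rows.append({
            "record": key,
            "submitted": record.get("submitted_variant"),
            "genomic": {
                build: locus.get("hgvs_genomic_description")
                for build, locus in loci.items()
            },
            "warnings": record.get("validation_warnings") or [],
        })
    return rows


# Trimmed copy of the output shown above
sample = {
    "flag": "warning",
    "metadata": {"vvdb_version": "vvdb_2023_8"},
    "validation_warning_1": {
        "submitted_variant": "NM_006138.4:c.(1_20)_(30_36)del",
        "primary_assembly_loci": {
            "grch38": {
                "hgvs_genomic_description":
                    "NC_000011.10:g.(60061161_60061180)_(60061190_60061196)del"
            }
        },
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
    },
}

for row in summarise_validation(sample):
    print(row["submitted"], "->", row["genomic"], row["warnings"])
```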

g. variants will map to a transcript (either a Select transcript or a single selected transcript). Transcripts will map to g. for the selected assembly only.