openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Unsupported transcripts #595

Closed: leicray closed this 3 months ago

leicray commented 3 months ago

Describe the bug

An anonymous user has been trying to validate NM_000109.2:r.4540G>C, but this triggers an ERROR message to the sysadmins.

On screen, the error message is: Unable to validate the submitted variant NM_000109.2:r.4540G>C against the GRCh37 assembly.

The error is triggered because transcript NM_000109.2 is not in the database, whereas versions 3 and 4 are included. If the description is corrected to NM_000109.4:r.4540g>c then it validates.

VV ought to trap the fact that the reference sequence is not supported and output a user-friendly warning.
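A minimal sketch of the kind of trap being requested, not VariantValidator's actual code. The `SUPPORTED_VERSIONS` table and the `check_reference` helper are hypothetical stand-ins for a database lookup; the version numbers come from the report above (versions 3 and 4 present, version 2 absent), and the warning wording is only illustrative.

```python
import re

# Hypothetical stand-in for the transcript database: accession -> supported versions.
SUPPORTED_VERSIONS = {"NM_000109": {3, 4}}


def check_reference(description):
    """Return a user-facing warning if the reference sequence is unsupported, else None."""
    match = re.match(r"^([A-Z]{2}_\d+)\.(\d+):", description)
    if match is None:
        return "Unable to identify a versioned reference sequence in " + description

    accession, version = match.group(1), int(match.group(2))
    supported = SUPPORTED_VERSIONS.get(accession)
    if not supported:
        return f"Reference sequence {accession}.{version} is not supported"
    if version not in supported:
        # Point the user at the versions that are available.
        available = ", ".join(f"{accession}.{v}" for v in sorted(supported))
        return (f"Reference sequence {accession}.{version} is not supported; "
                f"supported versions: {available}")
    return None


print(check_reference("NM_000109.2:r.4540G>C"))
```

Returning a message rather than raising keeps the unsupported-version case on the ordinary warning path, so it would not need to surface as an ERROR to the sysadmins.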

Peter-J-Freeman commented 3 months ago

NM_000109.2:r.4540G>C triggers:

"validation_warnings": [ "This not a valid HGVS description, due to characters being in the wrong case. Please check the use of upper- and lowercase characters.", "RNA sequence must be lower-case" ],

Or it will do once I update. Will test g>c now.

Peter-J-Freeman commented 3 months ago

On the dev version I get this for GRCh38:

import json
import VariantValidator

# Create the validator and describe the query
vval = VariantValidator.Validator()
variant = "NM_000109.2:r.4540g>c"
genome_build = 'GRCh38'
select_transcripts = 'all'
transcript_set = 'refseq'

# Run the validation and print the result as formatted JSON
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

# Output:
{
    "NM_000109.2:c.4540G>C": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "X",
            "db_xref": {
                "CCDS": null,
                "HPRD": "02303",
                "ensemblgene": null,
                "hgnc": "HGNC:2928",
                "ncbigene": "1756",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "Xp21.2-p21.1",
            "note": "dystrophin",
            "refseq_select": false,
            "variant": "DP427C"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS14234",
                "CCDS48091",
                "CCDS14230",
                "CCDS14233",
                "CCDS14229",
                "CCDS94585",
                "CCDS94586",
                "CCDS14232",
                "CCDS55394",
                "CCDS55395",
                "CCDS14231"
            ],
            "ensembl_gene_id": "ENSG00000198947",
            "entrez_gene_id": "1756",
            "hgnc_id": "HGNC:2928",
            "omim_id": [
                "300377"
            ],
            "ucsc_id": "uc004dda.2"
        },
        "gene_symbol": "DMD",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_000100.2:p.(V1514L)",
            "tlr": "NP_000100.2:p.(Val1514Leu)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_000109.2:c.4540G>C",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000023.10:g.32404537C>G",
                "vcf": {
                    "alt": "G",
                    "chr": "X",
                    "pos": "32404537",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000023.11:g.32386420C>G",
                "vcf": {
                    "alt": "G",
                    "chr": "X",
                    "pos": "32386420",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000023.10:g.32404537C>G",
                "vcf": {
                    "alt": "G",
                    "chr": "chrX",
                    "pos": "32404537",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000023.11:g.32386420C>G",
                "vcf": {
                    "alt": "G",
                    "chr": "chrX",
                    "pos": "32386420",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_000100.2",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_000109.2"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": {
            "rna_variant": "NM_000109.2:r.4540g>c",
            "translation": "NP_000100.2:p.Val1514Leu",
            "translation_slr": "NP_000100.2:p.V1514L",
            "usage_warnings": [
                "RNA (r.) descriptions are independent of cDNA descriptions (c.)",
                "RNA descriptions must only be used if the RNA has been sequenced and must not be inferred from a cDNA description",
                "c. and g. descriptions provided by VariantValidator must only be used if the DNA sequence has been confirmed"
            ]
        },
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_000109.2:r.4540g>c",
        "transcript_description": "Homo sapiens dystrophin (DMD), transcript variant Dp427c, mRNA",
        "validation_warnings": [
            "NM_000109.2:r.4540G>C automapped to NM_000109.2:c.4540G>C",
            "A more recent version of the selected reference sequence NM_000109.2 is available (NM_000109.4): NM_000109.4:c.4540G>C MUST be fully validated prior to use in reports: select_variants=NM_000109.4:c.4540G>C"
        ],
        "variant_exonic_positions": {
            "NC_000023.11": {
                "end_exon": "33",
                "start_exon": "33"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev530+g520a21a.d20240222",
        "vvdb_version": "vvdb_2023_8",
        "vvseqrepo_db": "VV_SR_2024_01/master",
        "vvta_version": "vvta_2024_01"
    }
}

So this will work.

Peter-J-Freeman commented 3 months ago

Same for GRCh37, so I will close this as it seems to be resolved. It requires server updates.

It could be down to the new databases though, so I cannot replicate it in dev.

leicray commented 3 months ago

If this has only been checked on the dev server, you are perhaps not seeing what I am seeing on the live interactive validator.

This means that the wrong-case letters warning does already exist in the live interactive validator but that the warning is not triggered because the reference sequence has presumably not been found. However, that situation ought to trigger a "sequence not found" warning.
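The ordering described here can be sketched as follows. This is an illustration of the suspected behaviour, not the validator's real control flow: `KNOWN_REFERENCES` and `collect_warnings` are hypothetical names, and the "was not found" wording is an assumption. Only the "RNA sequence must be lower-case" message is taken from the output quoted earlier in this thread.

```python
import re

# Hypothetical set of reference sequences held in the database.
KNOWN_REFERENCES = {"NM_000109.3", "NM_000109.4"}


def collect_warnings(description):
    """Look up the reference first; only then check the case of an r. description."""
    reference, _, variant = description.partition(":")

    # If the reference is missing, say so explicitly instead of failing silently.
    if reference not in KNOWN_REFERENCES:
        return [f"Reference sequence {reference} was not found"]

    warnings = []
    # RNA (r.) descriptions must use lowercase nucleotides.
    if variant.startswith("r.") and re.search(r"[ACGTU]", variant[2:]):
        warnings.append("RNA sequence must be lower-case")
    return warnings


print(collect_warnings("NM_000109.2:r.4540G>C"))
print(collect_warnings("NM_000109.4:r.4540G>C"))
```

With this ordering, a missing reference always yields a "not found" warning, and the case warning is reached only for references that exist.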

There might still be more to check.

Peter-J-Freeman commented 3 months ago

It won't be. The servers need updating with the latest versions of the software. I'm not ready for release yet, but hope to do it soon.