openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Submission of protein variants needs a better error message #465

Open leicray opened 1 year ago

leicray commented 1 year ago

Describe the bug

Some users try to validate protein-level variants such as NP_061966.1:p.(Gln289_Gln332del), but the error message is not sufficiently informative. Submission of this variant produces the on-screen error message:

Unable to validate the submitted variant NP_061966.1:p.(Gln289_Gln332del) against the GRCh38 assembly.

Please check your submission and re-submit.

The submission process also creates an ERROR message that is sent to the admins.

However, amino acid substitution variants such as NP_000079.2:p.(Gly197Cys) are correctly handled and produce the useful on-screen message:

Unable to validate the submitted variant NP_000079.2:p.(Gly197Cys) against the GRCh37 assembly. The following warnings were returned:

Protein level variant descriptions are not fully supported due to redundancy in the genetic code

NP_000079.2:p.(Gly197Cys) is HGVS compliant and contains a valid reference amino acid description

Please check your submission and re-submit.

Expected behaviour

VariantValidator needs to better handle amino acid variant descriptions that are not just simple substitutions and provide appropriate error messages.

Peter-J-Freeman commented 1 year ago

Unable to validate the submitted variant NP_000079.2:p.(Gly197Cys) against the GRCh37 assembly

I think these warnings need to be suppressed

Protein level variant descriptions are not fully supported due to redundancy in the genetic code

NP_000079.2:p.(Gly197Cys) is HGVS compliant and contains a valid reference amino acid description

These warnings should always be displayed though

Is this sufficient? If so, I will fix it for the next release.

leicray commented 1 year ago

Suppressing the warning Unable to validate the submitted variant NP_000079.2:p.(Gly197Cys) against the GRCh37 assembly is certainly appropriate.

However, I suspect that more might need to be done for the handling of non-substitution protein-level variants. At present, the variants NP_000079.2:p.(Gly197Cys) and NP_061966.1:p.(Gln289_Gln332del) both trigger error messages, both on-screen and to the admins. We need to ensure that variant descriptions of these types are also being checked with respect to the amino acids and that their locations are valid. It currently looks as though variant descriptions of these types are not being checked in the same fashion as simple-substitution descriptions.
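The kind of check described above, confirming that each reference amino acid named in a description really sits at the stated position, can be sketched in a few lines. This is only an illustration under stated assumptions, not how VariantValidator implements it: the helper names (`reference_residues`, `check_reference_residues`) are hypothetical, the protein sequence is passed in as a plain string rather than fetched from a sequence repository, and only simple descriptions with one residue or a two-residue range are parsed.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches the position part of p.(Gly197Cys) or p.(Gln289_Gln332del):
# one reference residue, optionally followed by "_" and a second one.
POSITION_RE = re.compile(r"p\.\(?([A-Z][a-z]{2})(\d+)(?:_([A-Z][a-z]{2})(\d+))?")


def reference_residues(description):
    """Return [(three_letter_code, position), ...] named in a description."""
    match = POSITION_RE.search(description)
    if match is None:
        raise ValueError(f"Cannot parse protein description: {description}")
    start_aa, start_pos, end_aa, end_pos = match.groups()
    residues = [(start_aa, int(start_pos))]
    if end_aa is not None:
        residues.append((end_aa, int(end_pos)))
    return residues


def check_reference_residues(description, protein_sequence):
    """Return a list of problems; an empty list means every residue matches."""
    problems = []
    for aa, pos in reference_residues(description):
        if aa not in THREE_TO_ONE:
            problems.append(f"{aa}{pos}: unknown amino acid code")
        elif pos < 1 or pos > len(protein_sequence):
            problems.append(f"{aa}{pos}: position is outside the sequence")
        elif protein_sequence[pos - 1] != THREE_TO_ONE[aa]:
            found = protein_sequence[pos - 1]
            problems.append(f"{aa}{pos}: sequence has {found} at this position")
    return problems


# Toy sequence for illustration only (not a real RefSeq protein).
toy_sequence = "MGQAQ"
print(check_reference_residues("NP_TEST.1:p.(Gln3_Gln5del)", toy_sequence))
print(check_reference_residues("NP_TEST.1:p.(Gln3_Gln4del)", toy_sequence))
```

Handling the range form in the same function as the single-residue form means deletions and substitutions go through one code path, which is the behaviour being asked for here.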

Peter-J-Freeman commented 1 year ago

Ah, I wonder if they are not validating then. I will look into it

Peter-J-Freeman commented 1 year ago

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NP_000079.2:p.(Gly197Cys)' # variant 1
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'refseq'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

This is the current output

{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev5+g69b1a7c",
        "variantvalidator_version": "2.1.1.dev69+g4e3c76e",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_1p1:p.(G197C)",
            "lrg_tlr": "LRG_1p1:p.(Gly197Cys)",
            "slr": "NP_000079.2:p.(G197C)",
            "tlr": "NP_000079.2:p.(Gly197Cys)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_000079.2"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NP_000079.2:p.(Gly197Cys)",
        "transcript_description": "",
        "validation_warnings": [
            "Protein level variant descriptions are not fully supported due to redundancy in the genetic code",
            "NP_000079.2:p.(Gly197Cys) is HGVS compliant and contains a valid reference amino acid description"
        ],
        "variant_exonic_positions": null
    }
}

The warnings are here and seem relatively appropriate?

        "validation_warnings": [
            "Protein level variant descriptions are not fully supported due to redundancy in the genetic code",
            "NP_000079.2:p.(Gly197Cys) is HGVS compliant and contains a valid reference amino acid description"
        ],
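For anyone consuming this output programmatically, the warnings can be pulled out of the dictionary without knowing the record keys in advance. The sketch below assumes only the layout visible in the output above (top-level `flag` and `metadata` entries plus one record per variant); the function name `collect_validation_warnings` is made up for this example and is not part of VariantValidator.

```python
def collect_validation_warnings(validation):
    """Map each submitted variant to its list of validation warnings."""
    warnings_by_variant = {}
    for key, record in validation.items():
        # "flag" and "metadata" are top-level bookkeeping entries,
        # not per-variant records, so skip them.
        if key in ("flag", "metadata") or not isinstance(record, dict):
            continue
        submitted = record.get("submitted_variant", key)
        warnings_by_variant[submitted] = list(record.get("validation_warnings", []))
    return warnings_by_variant


# Trimmed copy of the output shown above, kept to the fields used here.
example = {
    "flag": "warning",
    "metadata": {"vvdb_version": "vvdb_2022_11"},
    "validation_warning_1": {
        "submitted_variant": "NP_000079.2:p.(Gly197Cys)",
        "validation_warnings": [
            "Protein level variant descriptions are not fully supported "
            "due to redundancy in the genetic code",
            "NP_000079.2:p.(Gly197Cys) is HGVS compliant and contains a "
            "valid reference amino acid description",
        ],
    },
}

for variant, warnings in collect_validation_warnings(example).items():
    print(variant)
    for warning in warnings:
        print("  -", warning)
```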
Peter-J-Freeman commented 1 year ago

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NP_061966.1:p.(Gln289_Gln332del)' # variant 2
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'refseq'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

Throws an error which I will try and resolve

leicray commented 1 year ago

Yes, the first example looks as expected.

Peter-J-Freeman commented 1 year ago

The second example will now return

{
    "flag": "warning",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev5+g69b1a7c",
        "variantvalidator_version": "2.1.1.dev69+g4e3c76e",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "NP_061966.1:p.(Q289_Q332del)",
            "tlr": "NP_061966.1:p.(Gln289_Gln332del)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_061966.1"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NP_061966.1:p.(Gln289_Gln332del)",
        "transcript_description": "",
        "validation_warnings": [
            "Protein level variant descriptions are not fully supported due to redundancy in the genetic code",
            "NP_061966.1:p.(Gln289_Gln332del) is HGVS compliant and contains a valid reference amino acid description"
        ],
        "variant_exonic_positions": null
    }
}

leicray commented 1 year ago

That looks better.

Peter-J-Freeman commented 1 year ago

Excellent. I think the error message Unable to validate the submitted variant NP_061966.1:p.(Gln289_Gln332del) against the GRCh38 assembly is a VVweb message. I will see if I can spin up my laptop dev version. Sometimes it will, sometimes it won't!

Peter-J-Freeman commented 1 year ago

OK, the VV engine has been updated to fix this issue. It is the VVweb interface that needs to be updated. I'm having difficulty spinning up a dev version. @John-F-Wagstaff needs to get on with more important jobs. Leave it with me @leicray. I'll try to get my system running.

Peter-J-Freeman commented 1 year ago

OK, managed to spin it up @leicray

Currently on dev it looks like this

[image: screenshot of the warnings as shown on the dev VVweb site]

Comments please

leicray commented 1 year ago

The second warning looks fine for now. When time allows, it could be modified to reflect that two amino acids are mentioned in the variant description. It looks like it was written for the situation where the variant description relates to a single amino acid.

A simple change to the warning might be:

NP_061966.1:p.(Gln289_Gln332del) is HGVS compliant and the reference amino acid(s) in the description is/are valid
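As a rough illustration of how the wording could follow the number of reference amino acids, the sketch below keeps the existing text for single-residue descriptions and switches to a plural form for ranges. The function name `compliance_warning` and the exact plural wording are assumptions for this example only; nothing here reflects the actual VariantValidator code.

```python
def compliance_warning(description):
    """Build the HGVS-compliance warning for a protein description."""
    # Look only at the part after ":p." because RefSeq accessions such as
    # NP_061966.1 contain an underscore themselves.
    body = description.split(":p.", 1)[-1]
    if "_" in body:
        # A range such as Gln289_Gln332 names two reference residues.
        return (f"{description} is HGVS compliant and the reference "
                "amino acids in the description are valid")
    # Single residue: keep the wording that is already in use.
    return (f"{description} is HGVS compliant and contains a valid "
            "reference amino acid description")


print(compliance_warning("NP_000079.2:p.(Gly197Cys)"))
print(compliance_warning("NP_061966.1:p.(Gln289_Gln332del)"))
```

Leaving the single-residue text unchanged means anything that already matches on that message keeps working for substitutions.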

Peter-J-Freeman commented 1 year ago

The more complicated we make it, the more likely it is to go wrong. However, this seems to be a simple change to implement as it is a simple text replacement. Will get it done ASAP

Peter-J-Freeman commented 1 year ago

Hold on @leicray . Just realised that some folks may be using this warning as a search term. @ifokkema for example. Let's check before we change it
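To make the concern concrete: a downstream tool that searches warning text for a fixed phrase will silently stop matching if that phrase is reworded. The sketch below is hypothetical (the helper `has_warning` and both search terms are invented for illustration) and compares a long search term with a shorter one against the current and proposed wording.

```python
CURRENT_WARNING = (
    "NP_061966.1:p.(Gln289_Gln332del) is HGVS compliant and contains a "
    "valid reference amino acid description"
)
PROPOSED_WARNING = (
    "NP_061966.1:p.(Gln289_Gln332del) is HGVS compliant and the reference "
    "amino acid(s) in the description is/are valid"
)


def has_warning(warnings, search_term):
    """Return True if any warning contains the search term."""
    return any(search_term in warning for warning in warnings)


# A long search term is tied to the exact current wording.
long_term = "contains a valid reference amino acid description"
# A short term survives the proposed rewording.
short_term = "is HGVS compliant"

for label, text in (("current", CURRENT_WARNING), ("proposed", PROPOSED_WARNING)):
    print(label, has_warning([text], long_term), has_warning([text], short_term))
```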

leicray commented 1 year ago

Do not bother with changing the message if there might be knock-on effects.

ifokkema commented 1 year ago

Hold on @leicray . Just realised that some folks may be using this warning as a search term. @ifokkema for example. Let's check before we change it

Thanks for thinking about me, but we don't look for warnings used for protein variants. Those should get caught and handled before LOVD sends them to VV.

Peter-J-Freeman commented 1 year ago

Thanks, @ifokkema. I have a feeling that we may use it though, so for now I will leave it and revisit if needed.