openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Error on submitting specific sets of NM_016373.4 variants with uncertain locations #643

Closed John-F-Wagstaff closed 2 months ago

John-F-Wagstaff commented 3 months ago

Bug description: When multiple variant descriptions from NM_016373.4 with uncertain locations are submitted to VariantValidator as a set, some combinations cause errors. All individual variant descriptions work when submitted on their own.

To Reproduce: Steps to reproduce the behaviour, in Python:

import VariantValidator
import json
vval = VariantValidator.Validator()
transcript_set, select_transcripts, genome_build = 'refseq', 'all', 'GRCh37'
variant = [
        'NM_016373.4:c.(1056+1_1057-1)_(1245_?)del',
        'NM_016373.4:c.(1056+1_1057-1)_(1245_?)del',
        'NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del',
        'NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup',
        'NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del',
        'NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del',
        'NM_016373.4:c.(?_1-1)_(107+1_108-1)del',
        'NM_016373.4:c.(?_1-1)_(409+1_410-1)del'
        ]
# the line below may be changed for different error/non error combinations (as described below):
variant = json.dumps([variant[2],variant[6]])
# error triggered on next line in bugged combinations
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
# only returns/continues to this point for non bugged combinations
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))

The combinations of variant descriptions that work are: any single variant; variant = json.dumps([variant[0],variant[1],variant[6],variant[7]]) or any subset of it; and variant = json.dumps(variant[2:5]) or any subset of it. The combinations that fail are either variant[6] or variant[7] plus any of variant[2] to variant[5], for example variant = json.dumps([variant[2],variant[6]]), variant = json.dumps([variant[5],variant[7]]) or variant = json.dumps([variant[4],variant[6],variant[7]]).
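The working and failing combinations described above could be enumerated automatically rather than by editing the script by hand. The following is a minimal sketch, not part of VariantValidator: `failing_combinations` and its `run` parameter are hypothetical names, and `run` is assumed to be any callable that takes the JSON-encoded batch string and raises on failure.

```python
import json
from itertools import combinations


def failing_combinations(variants, run, size=2):
    """Return (index_tuple, exception_name) for every batch of `size`
    variants whose submission raises an exception.

    `run` is a caller-supplied callable (an assumption of this sketch)
    that accepts the JSON-encoded batch string.
    """
    failures = []
    for idxs in combinations(range(len(variants)), size):
        batch = json.dumps([variants[i] for i in idxs])
        try:
            run(batch)
        except Exception as exc:  # broad on purpose: the failure type is unknown
            failures.append((idxs, type(exc).__name__))
    return failures
```

With the reproduction script above, `run` could be `lambda batch: vval.validate(batch, genome_build, select_transcripts, transcript_set)`, called with the original list before it is overwritten by `json.dumps`.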

Expected behavior: Either they should all succeed, or the failures should not be unhandled and should also apply when the variants are validated individually.

Peter-J-Freeman commented 3 months ago

The issue seems to be with intronic positions and fuzzy ends.

leicray commented 3 months ago

Intronic positions and fuzzy ends sound plausible, but the fact remains that each variant description validates individually. If certain combinations cause problems, that suggests that there is some unintended interaction between the processing of each variant, either concurrently or sequentially.

leicray commented 3 months ago

It looks like the order of the submitted variant descriptions makes a difference.

A job to the batch validator with variants [2] and [7] fails to complete but swapping the order to [7] and [2] is successful:
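The order dependence noted here can be checked systematically by submitting each pair in both orders. This is a rough sketch under the same assumption as before: `run` is a hypothetical caller-supplied callable that takes a JSON-encoded batch and raises on failure, and `order_sensitive_pairs` is an illustrative name, not a VariantValidator function.

```python
import json
from itertools import permutations


def order_sensitive_pairs(variants, run):
    """Return index pairs (i, j) that fail when submitted as [i, j]
    but succeed when submitted as [j, i]."""

    def succeeds(i, j):
        try:
            run(json.dumps([variants[i], variants[j]]))
            return True
        except Exception:  # any exception counts as a failed batch
            return False

    return [
        (i, j)
        for i, j in permutations(range(len(variants)), 2)
        if not succeeds(i, j) and succeeds(j, i)
    ]
```

A non-empty result would support the idea that state from one variant leaks into the processing of the next, although it does not by itself identify where.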

batch_job (5).txt

Peter-J-Freeman commented 3 months ago

I have confirmed that the fuzzy end code cannot handle intronic positions, so this is the issue. Will resolve ASAP.

Peter-J-Freeman commented 3 months ago

OK, in a future release we will get:

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = '["NM_016373.4:c.(1056+1_1057-1)_(1245_?)del","NM_016373.4:c.(1056+1_1057-1)_(1245_?)del","NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del","NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup","NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del","NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del","NM_016373.4:c.(?_1-1)_(107+1_108-1)del","NM_016373.4:c.(?_1-1)_(409+1_410-1)del"]'
genome_build = 'GRCh37'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000016.9:g.(78198187_78420756)_(78466650_79245504)del",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000016.9:g.(78198187_78420756)_(78466650_79245504)dup",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000016.9:g.(78198187_78420756)_(78420846_78458766)del",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000016.9:g.(78420846_78458766)_(78458953_78466384)del",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev677+g36d03e7.d20240725",
        "vvdb_version": "vvdb_2024_8",
        "vvseqrepo_db": "VV_SR_2024_06/master",
        "vvta_version": "vvta_2024_06"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(1056+1_1057-1)_(1245_?)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid",
            "Fuzzy/unknown variant end position in submitted variant description"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_2": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(1056+1_1057-1)_(1245_?)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid",
            "Fuzzy/unknown variant end position in submitted variant description"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_3": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(?_1-1)_(107+1_108-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "ExonBoundaryError: Position 1 does not correspond to an exon boundary for transcript NM_016373.4 aligned to GRCh37 genomic reference sequence NC_000016.9"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_4": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NM_016373.4:c.(?_1-1)_(409+1_410-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "ExonBoundaryError: Position 1 does not correspond to an exon boundary for transcript NM_016373.4 aligned to GRCh37 genomic reference sequence NC_000016.9"
        ],
        "variant_exonic_positions": null
    }
}

So there are errors in the exon boundaries. More specifically, in this case I think they misnumbered the UTR.

Peter-J-Freeman commented 3 months ago

Can someone please look through this and check the outputs?
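To make checking output of this size less tedious, the warnings can be pulled out per submitted variant. The helper below is a small sketch that relies only on the keys visible in the output above (`submitted_variant`, `validation_warnings`, and the `validation_warning_N` key prefix); the function name `summarise_warnings` is hypothetical.

```python
def summarise_warnings(validation):
    """Split the records of a format_as_dict() result into two lists of
    (submitted_variant, validation_warnings) tuples: records keyed by the
    variant description, and records keyed as validation_warning_N."""
    validated, flagged = [], []
    for key, record in validation.items():
        # Skips "flag" (a string) and "metadata" (no submitted_variant).
        if not isinstance(record, dict) or "submitted_variant" not in record:
            continue
        row = (record["submitted_variant"], record["validation_warnings"])
        if key.startswith("validation_warning_"):
            flagged.append(row)
        else:
            validated.append(row)
    return validated, flagged
```

Calling `summarise_warnings(validation)` on the dictionary printed above should list four variants under the first group and four under the second, which is easier to compare against the submitted list than the full JSON.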

leicray commented 3 months ago

Might this be an issue specifically with the mapping of intron boundaries for GRCh37? The user who raised the issue did not specify the genome build that was used. It was @John-F-Wagstaff who used GRCh37 for his investigations. Can you generate the output using the patched software version and genome build GRCh38, just in case that makes a difference? I'm thinking about possible direction-of-liftover effects.

I will do some checking tomorrow.

Peter-J-Freeman commented 3 months ago

I think it is a user who does not know how to number 3 prime UTRs. c.1-1 is very unlikely, as it implies an exon starting with the translation initiation codon. Will do the GRCh38 output now too though.

Peter-J-Freeman commented 3 months ago

{
    "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del",
        "primary_assembly_loci": {
            "grch38": {
                "hgvs_genomic_description": "NC_000016.10:g.(78164290_78386859)_(78432753_79211607)del",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup",
        "primary_assembly_loci": {
            "grch38": {
                "hgvs_genomic_description": "NC_000016.10:g.(78164290_78386859)_(78432753_79211607)dup",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)dup",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del",
        "primary_assembly_loci": {
            "grch38": {
                "hgvs_genomic_description": "NC_000016.10:g.(78164290_78386859)_(78386949_78424869)del",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(516+1_517-1)_(605+1_606-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {
            "ccds_ids": [
                "CCDS42197",
                "CCDS42196"
            ],
            "ensembl_gene_id": "ENSG00000186153",
            "entrez_gene_id": "51741",
            "hgnc_id": "HGNC:12799",
            "omim_id": [
                "605131"
            ],
            "ucsc_id": "uc002ffk.4"
        },
        "gene_symbol": "WWOX",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del",
        "primary_assembly_loci": {
            "grch38": {
                "hgvs_genomic_description": "NC_000016.10:g.(78386949_78424869)_(78425056_78432487)del",
                "vcf": {
                    "alt": null,
                    "chr": null,
                    "pos": null,
                    "ref": null
                }
            }
        },
        "reference_sequence_records": {
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_016373.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(605+1_606-1)_(791+1_792-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid"
        ],
        "variant_exonic_positions": null
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.2.0",
        "variantvalidator_version": "2.2.1.dev677+g36d03e7.d20240725",
        "vvdb_version": "vvdb_2024_8",
        "vvseqrepo_db": "VV_SR_2024_06/master",
        "vvta_version": "vvta_2024_06"
    },
    "validation_warning_1": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(1056+1_1057-1)_(1245_?)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid",
            "Fuzzy/unknown variant end position in submitted variant description"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_2": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(1056+1_1057-1)_(1245_?)del",
        "transcript_description": "",
        "validation_warnings": [
            "Uncertain positions are not fully supported, however the syntax is valid",
            "Fuzzy/unknown variant end position in submitted variant description"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_3": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(?_1-1)_(107+1_108-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "ExonBoundaryError: Position 1 does not correspond to an exon boundary for transcript NM_016373.4 aligned to GRCh38 genomic reference sequence NC_000016.10"
        ],
        "variant_exonic_positions": null
    },
    "validation_warning_4": {
        "alt_genomic_loci": [],
        "annotations": {},
        "gene_ids": {},
        "gene_symbol": "",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "",
            "lrg_tlr": "",
            "slr": "",
            "tlr": ""
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "",
        "primary_assembly_loci": {},
        "reference_sequence_records": "",
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh38",
        "submitted_variant": "NM_016373.4:c.(?_1-1)_(409+1_410-1)del",
        "transcript_description": "",
        "validation_warnings": [
            "ExonBoundaryError: Position 1 does not correspond to an exon boundary for transcript NM_016373.4 aligned to GRCh38 genomic reference sequence NC_000016.10"
        ],
        "variant_exonic_positions": null
    }
}

leicray commented 3 months ago

I think it is a user who does not know how to number 3 prime UTRs. c.1-1 is very unlikely, as it implies an exon starting with the translation initiation codon.

Position c.1 is definitely not the start of an exon for this gene. Hence, c.1-1 (which would be the last nucleotide of the 5 prime UTR) is invalid. Are such exon boundary errors not trapped automatically?

Peter-J-Freeman commented 3 months ago

Sorry, I meant 5-prime.

This code is independent of the existing code that normally traps the boundaries. This is because uncertain positions cannot be parsed into an HGVS.py object, in the version we use at least. So I added code to trap boundary errors, which is why the error is generated. The code looks for submitted intronic positions, pulls the exon boundaries for the transcript based on the genomic reference sequence, and moans if the stated exonic position does not match a listed boundary position (i.e. the first or last base in an exon).
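The check described above can be illustrated roughly as follows. This is a simplified sketch and not the actual VariantValidator code: the regular expression, the function name `boundary_errors`, and the boundary set are all assumptions made for illustration. In particular, `ILLUSTRATIVE_BOUNDARIES` is inferred from the positions used in this thread rather than pulled from the transcript alignment, and the pattern deliberately ignores UTR-style coordinates such as `-15+1` or `*12+1`.

```python
import re

# Intronic coordinate: exonic base, sign, offset (e.g. "516+1", "517-1").
# The lookbehind skips coordinates whose base is prefixed by "-" or "*".
INTRONIC = re.compile(r"(?<![\d*+-])(\d+)([+-])(\d+)")

# Hypothetical first/last exon bases, inferred from this thread only.
ILLUSTRATIVE_BOUNDARIES = {107, 108, 409, 410, 516, 517,
                           605, 606, 791, 792, 1056, 1057}


def boundary_errors(description, exon_boundaries):
    """Return exonic bases that carry an intronic offset but are not the
    first or last base of an exon listed in `exon_boundaries`."""
    coords = description.split(":c.", 1)[-1]  # drop the accession
    bad = []
    for match in INTRONIC.finditer(coords):
        base = int(match.group(1))
        if base not in exon_boundaries and base not in bad:
            bad.append(base)
    return bad
```

For `NM_016373.4:c.(?_1-1)_(107+1_108-1)del` this returns `[1]`, mirroring the "Position 1 does not correspond to an exon boundary" message above, while `NM_016373.4:c.(516+1_517-1)_(1056+1_1057-1)del` returns an empty list. A stricter version could also require that `+` offsets follow a last base and `-` offsets precede a first base.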

leicray commented 3 months ago

The outputs for GRCh37 and GRCh38 look correct. The proof will come when the code is pushed out to the live servers.
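One quick sanity check on the two outputs is that the GRCh37 and GRCh38 coordinates differ by the same amount at every position. The sketch below computes that shift from the genomic descriptions printed in this thread; the helper names are hypothetical, and a constant shift only suggests the two builds were mapped consistently, not that either mapping is correct.

```python
import re


def genomic_positions(hgvs_g):
    """Extract the integer coordinates from a g. description,
    including those inside uncertain ranges."""
    return [int(n) for n in re.findall(r"\d+", hgvs_g.split(":g.", 1)[1])]


def build_offsets(g37, g38):
    """Coordinate-wise GRCh37 minus GRCh38 differences."""
    return [a - b for a, b in zip(genomic_positions(g37), genomic_positions(g38))]


g37 = "NC_000016.9:g.(78198187_78420756)_(78466650_79245504)del"
g38 = "NC_000016.10:g.(78164290_78386859)_(78432753_79211607)del"
print(build_offsets(g37, g38))  # [33897, 33897, 33897, 33897]
```

The same shift of 33897 holds for the other genomic descriptions shown in the two outputs above.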

Peter-J-Freeman commented 2 months ago

OK, will close for now, ready to be pushed up at the next update.