openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Submitted variant description cannot be validated as it is located in a region of the reference sequence represented by base N and not GATC #512

Closed juliejurgens closed 1 year ago

juliejurgens commented 1 year ago

Describe the bug: Hello, I'm attempting to validate the following deletion: chr10(GRCh38):g.125976998_133427130del. On submission I receive this error: "Submitted variant description cannot be validated as it is located in a region of the reference sequence represented by base N and not GATC"

Are there any additional steps I can take to validate this variant? Or will validation not be possible? Thank you!

leicray commented 1 year ago

Thank you for using VariantValidator.

For GRCh38.p14 chromosome 10 (NC_000010.11) there are several sequence gaps which have not been sequenced completely and for which the nucleotides are represented by base N and not GATC.

The region g.125976998_133427130 overlaps two of these gap regions:

The start and the end points of the deletion that you wish to validate do fall within defined chromosomal sequence. However, this is an enormous region which contains 115 genes:

NC_000010.11[125976998..133427130].pdf

Might I ask why you wish to validate this deletion? Do you have sequence data across the deletion break points?

To be honest, I do not think that this deletion can be validated in any useful way. However, it would be better to describe the deletion as NC_000010.11:g.125976998_133427130del
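For anyone scripting this rewrite, here is a minimal sketch (not part of VariantValidator) that converts the submitted description into the accession-based form suggested above and reports the size of the deleted span. The function name `to_accession_description` and the one-entry lookup table are hypothetical; only the pairing of GRCh38 chromosome 10 with NC_000010.11 comes from this thread.

```python
import re

# Hypothetical lookup table; only this pairing is taken from the thread.
ACCESSIONS = {("GRCh38", "chr10"): "NC_000010.11"}

# Matches descriptions shaped like chr10(GRCh38):g.START_ENDdel only.
_PATTERN = re.compile(r"^(chr[0-9XYM]+)\((GRCh\d+)\):(g\.(\d+)_(\d+)del)$")


def to_accession_description(description):
    """Return (accession-based description, deleted length in bases)."""
    match = _PATTERN.match(description)
    if match is None:
        raise ValueError(f"unsupported description: {description!r}")
    chrom, build, posedit, start, end = match.groups()
    accession = ACCESSIONS[(build, chrom)]  # KeyError if not in the table
    length = int(end) - int(start) + 1  # both positions are inclusive
    return f"{accession}:{posedit}", length


print(to_accession_description("chr10(GRCh38):g.125976998_133427130del"))
# ('NC_000010.11:g.125976998_133427130del', 7450133)
```

The length of roughly 7.45 Mb is consistent with the point above that this is a very large region.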

Peter-J-Freeman commented 1 year ago

This is an interesting topic. @leicray, I would have assumed this error would only come up if the breakpoints provided fall within N sequence rather than GATC sequence. In this case, are the breakpoints in GATC sequence, with N sequence elsewhere in the range?

This will be an interesting edge case. N is a valid character in reference sequences, but we rarely come across instances where it needs to be handled. Thanks for the report, @juliejurgens. I cannot advise on additional steps until I have looked at the code. I have limited development time for the next few weeks, but I will look at this when I find some time and keep you posted.
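The distinction discussed here (breakpoints inside N sequence versus breakpoints in GATC sequence with N runs somewhere in between) can be illustrated with a small sketch. This is not VariantValidator's code; `classify_n_overlap` and its return labels are hypothetical, and the toy sequence stands in for a real reference sequence.

```python
def classify_n_overlap(sequence, start, end):
    """Classify a 1-based, inclusive range against runs of N in `sequence`.

    Returns one of three hypothetical labels:
      "breakpoint_in_gap" - the first or last base of the range is N
      "spans_gap"         - both ends are GATC, but N occurs in between
      "no_gap"            - the range contains no N at all
    """
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("range outside sequence")
    region = sequence[start - 1:end].upper()
    if region[0] == "N" or region[-1] == "N":
        return "breakpoint_in_gap"
    if "N" in region:
        return "spans_gap"
    return "no_gap"


toy = "GATCNNNNGATC"  # toy reference with one gap at positions 5-8
print(classify_n_overlap(toy, 2, 10))  # spans_gap
print(classify_n_overlap(toy, 5, 10))  # breakpoint_in_gap
print(classify_n_overlap(toy, 1, 4))   # no_gap
```

The deletion reported in this issue would correspond to the "spans_gap" case, since both breakpoints were said to fall within defined sequence.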

juliejurgens commented 1 year ago

Hi @leicray and @Peter-J-Freeman, thank you very much for looking into this. This chr10q26-qter deletion was called with the identical breakpoints denoted above in 2 unrelated individuals by genome sequencing. However, prior SNP arrays denoted slightly different deletion breakpoints in these same 2 individuals. We are including this deletion as one of a few findings identified in our genome sequencing cohort and plan to publish in Genetics in Medicine, which requires confirmation of variant annotation with VariantValidator. If there is no good way of validating this variant and no reasonable workaround, I can annotate with the nomenclature above and let the editorial team know. Just let me know what makes most sense for your team. Thank you for developing such a useful tool!

Peter-J-Freeman commented 1 year ago

Hi @leicray and @juliejurgens, I have switched this validation on. The warning will read:

        "validation_warnings": [
            "This is not a valid HGVS variant description, because no reference sequence ID has been provided",
            "Submitted variant description cannot be fully validated because it spans a region of the reference sequence represented by base N and not bases GATC",
            "No transcripts found that fully overlap the described variation in the genomic sequence"
        ],
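A script consuming this output could flag the new warning by matching on its text. The sketch below assumes the warnings are available as a list of strings under a `validation_warnings` key, as in the fragment above; the rest of the record structure, the helper name `spans_n_region`, and the choice of marker substring are assumptions, and the exact wording of the warning may change between releases.

```python
import json

# Substring shared by the original error and the new warning quoted in this thread.
N_REGION_MARKER = "represented by base N"


def spans_n_region(record):
    """Return True if any validation warning mentions the N-region condition."""
    warnings = record.get("validation_warnings", [])
    return any(N_REGION_MARKER in warning for warning in warnings)


# Hypothetical record built from the warnings quoted above.
record = json.loads("""
{
    "validation_warnings": [
        "This is not a valid HGVS variant description, because no reference sequence ID has been provided",
        "Submitted variant description cannot be fully validated because it spans a region of the reference sequence represented by base N and not bases GATC",
        "No transcripts found that fully overlap the described variation in the genomic sequence"
    ]
}
""")

print(spans_n_region(record))  # True
```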