openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Add mapping support for variants only partially intergenic #333

Open ifokkema opened 2 years ago

ifokkema commented 2 years ago

Is your feature request related to a problem? Please describe.

For LOVD to store a gene-specific effect of a variant, it must store the mapped gene-level representation of that variant. While it is understandable that intergenic variants cannot be mapped to genes, variants that do overlap genes should always have a gene-level representation, even if they are also partially intergenic. However, variants that delete entire genes, with both of the deletion's endpoints outside the gene's bounds, currently do not report any mapping (tested with both the LOVD endpoint and the VV endpoint). Variants that delete half of a gene, with the other endpoint outside the gene's bounds, also do not report any mapping. E.g., NC_000016.9:g.2106894_2161281del.
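To make the three cases concrete, here is a minimal sketch of how a genomic deletion could be classified relative to a gene. It is only an illustration: the `Interval` type, the `classify_overlap` function, the category names, and the gene bounds are all hypothetical and are not taken from VariantValidator or LOVD. Only the deletion coordinates come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: int  # 1-based, inclusive genomic position
    end: int    # 1-based, inclusive genomic position


def classify_overlap(variant: Interval, gene: Interval) -> str:
    """Classify a variant by how its endpoints relate to a gene's bounds."""
    # No shared bases at all: nothing to map to this gene.
    if variant.end < gene.start or variant.start > gene.end:
        return "intergenic"
    start_inside = gene.start <= variant.start <= gene.end
    end_inside = gene.start <= variant.end <= gene.end
    if start_inside and end_inside:
        return "within_gene"           # already mappable today
    if not start_inside and not end_inside:
        return "whole_gene_deleted"    # both endpoints outside the gene
    return "partially_intergenic"      # exactly one endpoint outside


# Deletion from the issue; the gene bounds below are made up for illustration.
deletion = Interval(2106894, 2161281)
partly_deleted_gene = Interval(2130000, 2190000)   # hypothetical bounds
fully_deleted_gene = Interval(2110000, 2150000)    # hypothetical bounds
unaffected_gene = Interval(2200000, 2250000)       # hypothetical bounds

print(classify_overlap(deletion, partly_deleted_gene))
print(classify_overlap(deletion, fully_deleted_gene))
print(classify_overlap(deletion, unaffected_gene))
```

The request in this issue concerns the `whole_gene_deleted` and `partially_intergenic` cases, which currently return no gene-level mapping.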

Describe the solution you'd like

In order for LOVD to "discover" an effect on the gene, VV should return a mapping. An issue, however, is that the HGVS nomenclature currently has no valid rules that can describe such a variant. We conducted a poll among 400 LOVD curators, asking them which description would be best to use at the gene level. It was highlighted to them that none of the possibilities were HGVS compliant, so they were asked purely about their preference. In total, 54 curators replied.

Note that for all descriptions given, the intended reference sequence is NC_000016.9(NM_001009944.2), as is used for intronic variants. The given options were:

A. "-" (an empty description) This gives no details on what region of the coding DNA reference sequence is affected.

B. "c.3887_*32834del" This is the current output of the Mutalyzer tool, which is linked to the LOVD database to check and create variant descriptions. Mutalyzer maps the variant's endpoint assuming c.* numbering continues forever. It gives a clear indication of the full deletion's size but is not supported by HGVS.

C. "c.3887_*1017del" This does give details on what region of the coding DNA reference sequence is affected (c.*1017 is the last base of this reference sequence), but it suggests the deletion has been sequenced as c.3887_*1017del. Since, in fact, the deletion extends beyond c.*1017, this description is not correct.

D. "c.3887(*1017?)del" This does give details on what region of the coding DNA reference sequence is affected and shows more sequence has been deleted, although it suggests the endpoint of the deletion is not known while it is.

E. "c.3887_*1017[0]" This new format does give details on what region of the coding DNA reference sequence is affected, and the [0] suggests it is present in 0 copies, so deleted. However, the format may be confused with the HGVS allele format, which also uses []. NOTE: For a duplication, we would use [2].

F. "c.3887_*1017{0}" This new format does give details on what region of the coding DNA reference sequence is affected, and the {0} suggests it is present in 0 copies. Since HGVS does not use {}, there cannot be any confusion. NOTE: For a duplication, we would use {2}.

In response to the survey, Peter Taschner suggested another option:

G. "c.3887_*1017+d31817del" This has been proposed before but was rejected by the HGVS. It clearly indicates the extent of the deletion, including the extent of the reference sequence, and more closely resembles the intronic variant notation.
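The options differ mainly in whether they keep the distance between the end of the transcript (c.*1017) and the real deletion endpoint. As a rough sketch of that relationship, the snippet below builds options B, C, E, F, and G from the same three numbers. The function name `describe_options` and its parameters are hypothetical; this is not how Mutalyzer, LOVD, or VariantValidator generate descriptions, and none of these formats is HGVS compliant. Options A and D are left out because they are not simple string templates.

```python
def describe_options(start: str, last_base: int, overhang: int) -> dict[str, str]:
    """Build the candidate gene-level descriptions for a 3' overhanging deletion.

    start:     c. position of the 5' endpoint, e.g. "3887"
    last_base: last c.* position present in the transcript, e.g. 1017
    overhang:  number of deleted bases beyond the end of the transcript
    """
    return {
        # B extends c.* numbering past the transcript end.
        "B": f"c.{start}_*{last_base + overhang}del",
        # C stops at the transcript end and drops the overhang.
        "C": f"c.{start}_*{last_base}del",
        # E and F mark the affected region as present in 0 copies.
        "E": f"c.{start}_*{last_base}[0]",
        "F": f"c.{start}_*{last_base}{{0}}",
        # G keeps the transcript end and the overhang as a separate offset.
        "G": f"c.{start}_*{last_base}+d{overhang}del",
    }


# Only B and G carry the overhang, so only they retain the genomic endpoint.
KEEPS_GENOMIC_ENDPOINT = {"B", "G"}

options = describe_options("3887", 1017, 31817)
for label, description in options.items():
    keeps = "yes" if label in KEEPS_GENOMIC_ENDPOINT else "no"
    print(f"{label}: {description} (keeps genomic endpoint: {keeps})")
```

With the values from this issue, the overhang of 31817 bases in option G added to *1017 gives the *32834 endpoint used in option B.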

The results were as follows:

[Image: survey_results_2020-07-06]

I am also personally worried about generating any description that cannot be mapped back to the genome. That is, options A, C, D, E, and F cannot be mapped back to the genome if their source was the transcript, so information is lost. Personally, I feel that the objection "we cannot describe positions not mentioned in the reference sequence" is solved by using the NC(NM) construct, just as intronic variants are handled now. I haven't heard any argument against this approach that would not also apply to how we describe intronic variants.

Describe alternatives you've considered

Note that Mutalyzer currently uses option B and that descriptions like these are currently widespread in LOVD.

Additional context

Note that Johan decided to ignore the wishes of the curators and implemented option F in the GV shared LOVD. For many "new" submissions (up to one and a half years old or so), option F is used and not B.

Peter-J-Freeman commented 2 years ago

My initial thought on this, @ifokkema and @leicray, is to set up a video call. I have huge concerns about reopening the can of worms of allowing variant descriptions in the context of transcript reference sequences to describe variation beyond the boundaries of the reference sequence. It is not a good idea, so I think we need to think about this very carefully and make recommendations to the HGVS SVD group.

ifokkema commented 2 years ago

Sounds good! Another thing that popped into my head is that this is also related to fusion transcripts. Deletions like these can cause fusion transcripts, and those do have a transcript-based description. So we might also go in that direction, even though that doesn't solve whole-gene deletions yet, only deletions where half of a gene is deleted. On a related note: did "recruiting" for the SVD group already start? I'm interested in joining. Same for the VIJ group. Even though I'm already incredibly busy, it's important for me to be involved in these.

Peter-J-Freeman commented 2 years ago

Fusions are on the agenda for description formats that we need to crack. @leicray has certainly been working in this area. We should definitely talk about those too. Let's sort out some dates via email.

Not sure about the SVD recruitment. Another thing to chat about.

leicray commented 2 years ago

Please include me in any proposed chat session.

Peter-J-Freeman commented 2 years ago

You are needed