openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Use the "full" variant description for intronic variants #616

Open ifokkema opened 2 months ago

ifokkema commented 2 months ago

Describe the bug VariantValidator provides incorrect HGVS descriptions for intronic variants. E.g.,

Obviously, these descriptions are valid in the context of a given genome build but not as a standalone description. Also the "HGVS-compliant variant descriptions" table contains the description NM_001283009.2:c.1266+3_1266+80del, but it's not HGVS-compliant. The only place on the page where the correct description is given is further down the page in the table "Transcript and protein descriptions".

To Reproduce See links above.

Expected behavior All mentions of NM_001283009.2:c.1266+3_1266+80del should be changed to NC_000020.11(NM_001283009.2):c.1266+3_1266+80del, using the genome build that was used for the input.

Additional context Although VV validates NM_001283009.2:c.1266+3_1266+80del well because a genome build must always be selected, it would make sense to educate users to always use the NC in the description for intronic variants.
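The requested change can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not VariantValidator's code: the function names and the regular expression are hypothetical, and the chromosome accession is assumed to be supplied by the caller from the selected genome build.

```python
import re

# Matches a c. position carrying an intronic offset, e.g. 1266+3, 100-5 or *20-7.
# Assumption: this simple pattern is enough for illustration; it is not a full HGVS parser.
INTRONIC_POSITION = re.compile(r"[-*]?\d+[+-]\d+")


def is_intronic(description: str) -> bool:
    """Return True if the c. part of a description contains an intronic offset."""
    _, _, variant = description.partition(":c.")
    return bool(INTRONIC_POSITION.search(variant))


def to_full_description(description: str, chromosome_accession: str) -> str:
    """Wrap the transcript in NC(NM) form when the variant is intronic."""
    reference, sep, variant = description.partition(":")
    if not sep or "(" in reference or not is_intronic(description):
        return description  # already bracketed, or not intronic: leave untouched
    return f"{chromosome_accession}({reference}):{variant}"


print(to_full_description("NM_001283009.2:c.1266+3_1266+80del", "NC_000020.11"))
```

Exonic descriptions pass through unchanged, which matches the scope of the request: only intronic variants need the NC.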

leicray commented 2 months ago

It is probable that the layout of the validation results page could be altered to accommodate your request.

That said, I have a long-standing dislike of descriptions such as NC_000020.11(NM_001283009.2):c.1266+3_1266+80del because the genome and transcript reference sequences are logically in the wrong order. The description ought to be NM_001283009.2(NC_000020.11):c.1266+3_1266+80del because the genomic sequence is subsidiary to the transcript based description NM_001283009.2:c.1266+3_1266+80del. The genomic reference sequence is included purely to indicate the genome build which could be recorded as GRCh37 or GRCh38 as appropriate, e.g. NM_001283009.2(GRCh38):c.1266+3_1266+80del.

I had this argument several years ago when I was a member of the HGVS nomenclature committee, but the others decided to adopt the format that is now used.

Just out of interest, are there any known examples where it's essential that the genome build be specified to enable validation with tools other than VariantValidator?

ifokkema commented 2 months ago

That said, I have a long-standing dislike of descriptions such as NC_000020.11(NM_001283009.2):c.1266+3_1266+80del because the genome and transcript reference sequences are logically in the wrong order. The description ought to be NM_001283009.2(NC_000020.11):c.1266+3_1266+80del because the genomic sequence is subsidiary to the transcript based description NM_001283009.2:c.1266+3_1266+80del. The genomic reference sequence is included purely to indicate the genome build which could be recorded as GRCh37 or GRCh38 as appropriate, e.g. NM_001283009.2(GRCh38):c.1266+3_1266+80del.

That actually depends on the exact interpretation of NC_000020.11(NM_001283009.2), which is, sadly, under debate. Mutalyzer and VariantValidator have two completely different interpretations of this mapping, and, therefore, that should be resolved before the reference sequence description could possibly be altered. To explain a bit more, Mutalyzer actually takes the NC sequence, finds (within the GenBank annotation) the positional information of the exons of the given NM, and then constructs the gene's sequence using the NC's sequence. For Mutalyzer, the NC sequence is the only sequence used, and the given NM is just used to decide what parts of the NC sequence should be used for what. On the other hand, VariantValidator takes the NM sequence, and the NCs are used to fill in the gaps. VV does neatly detect differences between the NC and the NM, but any exonic NC(NM) nucleotide is the same as the NM nucleotide at that position.
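To make the two readings concrete, here is a deliberately simplified toy sketch. The sequences and exon coordinates are invented, the function names are hypothetical, and neither function is the actual implementation of Mutalyzer or VariantValidator; alignment gaps are ignored entirely.

```python
# Toy data, invented for illustration: a 20-nt "chromosome" and a transcript with two exons.
NC_SEQ = "AAACCCGGGTTTACGTACGT"
EXONS = [(3, 6), (12, 16)]  # 0-based, half-open exon coordinates on NC_SEQ
NM_SEQ = "CCCTCGT"          # transcript record; differs from the genome at one base


def mutalyzer_style_exons(nc_seq, exons):
    """Exonic sequence cut from the NC only; the NM merely selects the exons."""
    return "".join(nc_seq[start:end] for start, end in exons)


def variantvalidator_style_exons(nm_seq):
    """Exonic sequence taken from the NM record itself."""
    return nm_seq


def intron(nc_seq, exons, index):
    """Intronic sequence comes from the NC in both readings."""
    return nc_seq[exons[index][1]:exons[index + 1][0]]


from_nc = mutalyzer_style_exons(NC_SEQ, EXONS)
from_nm = variantvalidator_style_exons(NM_SEQ)
# Transcript positions (0-based) where the two readings disagree.
mismatches = [i for i, (a, b) in enumerate(zip(from_nc, from_nm)) if a != b]
print(from_nc, from_nm, mismatches, intron(NC_SEQ, EXONS, 0))
```

In this toy case both readings agree on the intron but disagree on one exonic base, which is the kind of difference the paragraph above describes.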

Just out of interest, are there any known examples where it's essential that the genome build be specified to enable validation with tools other than VariantValidator?

Do you mean as a separate input field? I'm not sure - Mutalyzer simply requires the NC-based syntax; idem for LOVD. I'm not that familiar with all the other tools out there.

leicray commented 2 months ago

That actually depends on the exact interpretation of NC_000020.11(NM_001283009.2), which is, sadly, under debate.

For what reason is the interpretation of NC_000020.11(NM_001283009.2) under debate?

@Peter-J-Freeman and/or @John-F-Wagstaff are better placed than I am to explain the process by which VariantValidator validates a description. The sequential steps may differ from the process used by Mutalyzer. I wonder if the adopted format for describing intronic sequence variants was influenced by what best suited Mutalyzer.

Do you mean as a separate input field?

I think that I did not ask my question clearly enough. Let's try again. Are there any real-world examples where a given intronic variant description is valid with respect to just one genome build, but not the other?

Peter-J-Freeman commented 2 months ago

We already make these recommendations in the Recommended Variant Descriptions table. Also, if you print the PDF, this is very clear.

"HGVS-compliant variant descriptions" table contains the description NM_001283009.2:c.1266+3_1266+80del, but it's not HGVS-compliant.

I disagree. The HGVS nomenclature states that variants should be described at g. c. and p. If the g. is provided, i.e. in the top table, the format NC_000020.10(NM_001283009.2):c.1266+3_1266+80del is redundant and needlessly long. It is also not very interoperable since most databases use both the g. and the c. My preference when advising in publications is to ditch NC_000020.10(NM_001283009.2):c.1266+3_1266+80del and use the g. and the c.

I think this is related to education not processing and I think we do a decent job already.

As for the way Mutalyzer and other platforms like VEP handle NM vs NC, we have reference sequences. The variant MUST be in the context of the actual reference sequence, hence VV does make corrections to descriptions based on the content of the reference sequence under review. There is a simple solution for Mutalyzer and others: drop NM_ and use ENST, then this goes away, as do any worries over alignment gaps :)

ifokkema commented 2 months ago

That actually depends on the exact interpretation of NC_000020.11(NM_001283009.2), which is, sadly, under debate.

For what reason is the interpretation of NC_000020.11(NM_001283009.2) under debate?

Simply because the HVNC didn't specify how it should be implemented on a technical level, and Mutalyzer and VariantValidator implemented it in different ways, which then highlighted the ambiguity of how the HVNC "defines" the NC(NM) mapping (mostly, the lack of a clear definition).

@Peter-J-Freeman and/or @John-F-Wagstaff are better placed than I am to explain the process by which VariantValidator validates a description. The sequential steps may differ from the process used by Mutalyzer. I wonder if the adopted format for describing intronic sequence variants was influenced by what best suited Mutalyzer.

I don't know. I don't think there are meeting notes from that period, so we may need to rely on what people remember if we would want to find out.

Do you mean as a separate input field?

I think that I did not ask my question clearly enough. Let's try again. Are there any real-world examples where a given intronic variant description is valid with respect to just one genome build, but not the other?

Yes, but I don't think I tagged it or stored it somewhere with a label so that I can find it again. I ran into it by accident because an intronic variant didn't validate, and I had just always used the hg19 NCs. The variant was valid in the context of hg38, so it clarified for me that the variant must have been called on hg38.

We already make these recommendations in the Recommended Variant Descriptions table. Also, if you print the PDF, this is very clear.

People rarely read properly, though...

"HGVS-compliant variant descriptions" table contains the description NM_001283009.2:c.1266+3_1266+80del, but it's not HGVS-compliant.

I disagree. The HGVS nomenclature states that variants should be described at g. c. and p. If the g. is provided, i.e. in the top table, the format NC_000020.10(NM_001283009.2):c.1266+3_1266+80del is redundant and needlessly long.

The NC(NM) description may be redundant when the genomic variant is present, but that doesn't make the NM-based description valid HGVS nomenclature.

It is also not very interoperable since most databases use both the g. and the c. My preference when advising in publications is to ditch NC_000020.10(NM_001283009.2):c.1266+3_1266+80del and use the g. and the c.

What is not interoperable? Do you mean that most databases use NM-based c. descriptions instead of NC(NM)-based descriptions? It's fine to recommend ditching the NC(NM), but the HGVS nomenclature explicitly states that NM-based descriptions of (partial) intronic variants are invalid. And since the interpretation of NM-based intronic variants requires the genome build, simply because the variant may be a completely different variant depending on the genome build, supporting NM-based descriptions without also providing the genome build will only cause ambiguity. Again, sure, the g.-based description is (often) leading, but why support ambiguous descriptions alongside it? People, in general, don't understand why it's a bad idea, will copy it, will use it outside of the context of a genome build, and we will be supporting/enabling bad practices.

I think this is related to education not processing and I think we do a decent job already.

Education is indeed lacking, in the sense that, in general, people don't understand the complexity. That's also why they often don't use valid descriptions at all. And I hope I don't seem to suggest you're not doing a decent job already! My thought is, though, that tools like VV are educational as well. Therefore, I'm afraid users will say: "Why is NM:c.100-5A>G invalid? I checked it with VV!" Gene-focused papers rarely show the genomic variant description. Having NM-based intronic descriptions in there will just cause ambiguity and prevent education.

As for the way Mutalyzer and other platforms like VEP handle NM vs NC, we have reference sequences. The variant MUST be in the context of the actual reference sequence, hence VV does make corrections to descriptions based on the content of the reference sequence under review. There is a simple solution for Mutalyzer and others: drop NM_ and use ENST, then this goes away, as do any worries over alignment gaps :)

That will then first require the entire user base of Mutalyzer, VEP, and VV, to switch over 😉 Oh, and LOVD 🤣

Peter-J-Freeman commented 2 months ago

That will then first require the entire user base of Mutalyzer, VEP, and VV, to switch over 😉 Oh, and LOVD 🤣

your point is? ;)

It's fine to recommend ditching the NC(NM), but the HGVS nomenclature explicitly states that NM-based descriptions of (partial) intronic variants are invalid

However, we also try to make descriptions precise, so providing the g. and the c. is fine and negates the need for NC(NM); otherwise we are needlessly complicating descriptions. So, the simple solution is to ensure the correct NC is added to the top table (since, as you say, the intronic sequence varies depending on the genome build), since we provide both in a separate table. We could make this clearer, but I personally think it is pointless to provide the NC(NM) if the correct g. is provided. And, as you say, all users would have to switch to the NC(NM_) format because I know of pretty much nowhere that it is used :)

"Why is NM:c.100-5A>G invalid? I checked it with VV!". Gene-focused papers rarely show the genomic variant description.

FYI, Genetics in Medicine and GiMo require the genome build, HGVS g. and HGVS c. to be provided. I am very strict!!!!!!

But I totally agree and hopefully the professional standard will address this

So I think the only real action could be to add the correct g. into the top table and perhaps add some information to the interface to state what authors should use when publishing. This we could do for sure

John-F-Wagstaff commented 2 months ago

We currently have 3039 transcripts with variation in the internal transcript exon structure between mappings, though this includes alt mappings (NG_ and older blat alignments were excluded). Many of these will just be truncated mappings, but other more complex differences do exist.

For those transcripts with multiple mappings sourced from identical within transcript exon position sets we have 11182 transcripts where the introns for different mappings of these transcripts have differing lengths for the same intron. Of these 7464 are between the different versions of the main (NC_) chromosomes, so the rest are main versus alt differences for the same genome build. This ignores SNPs or other length neutral changes between reference genomes or main and alt mappings.

As such the same NM definition definitely could mean different things depending on the NC or alt version it is paired with. Just specifying GRCh37/8 won't be enough for alt sequences. I think this means that any support of alts particularly in batch inputs is basically dependent on the 2 sequence bracketed form just to make the input boxes make sense, let alone provide accurate answers. We need to be able to support alts, particularly if we want to be able to handle rare disease data.
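The comparison behind the counts above can be sketched roughly as follows. This is an illustrative sketch, not the query actually run against the VariantValidator database: the function names and exon coordinates are hypothetical, and a plus-strand transcript is assumed so that genomic order equals transcript order.

```python
def intron_lengths(exons):
    """Lengths of the gaps between consecutive exons given as (start, end), 1-based inclusive."""
    ordered = sorted(exons)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def differing_introns(mapping_a, mapping_b):
    """1-based intron numbers whose length differs between two mappings of one transcript."""
    pairs = zip(intron_lengths(mapping_a), intron_lengths(mapping_b))
    return [n for n, (a, b) in enumerate(pairs, start=1) if a != b]


# Hypothetical exon coordinates for the same transcript on two chromosome versions.
on_build_a = [(100, 199), (300, 399), (500, 599)]
on_build_b = [(150, 249), (350, 449), (552, 651)]

print(differing_introns(on_build_a, on_build_b))
```

Here the exon lengths are identical in both mappings, yet the second intron is two bases longer on the second mapping, so an offset such as c.200+101 would point at a different nucleotide depending on which NC is paired with the NM.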

I unfortunately also have to agree that if you give the users a pair of definitions that "should go together" they often won't bother with the genomic one. The original quote from Murphy that led to Murphy's law is after all "If there are two or more ways to do something and one of those results in a catastrophe, then someone will do it that way."

Edit: all of the problems with Ensembl having the same ID for sequences that had different sequence content in different genome builds arose from the reasoning that you should always know which genome build you are working with, so it won't be a problem...

Peter-J-Freeman commented 2 months ago

On a side note, and slightly relevant.

I mapped the MANE Select transcripts for COL1A1 from Ensembl and RefSeq to GRCh38:

NM_000088.4:c.589-1_589delGGinsG > NC_000017.11:g.50198002_50198003delCCinsC
ENST00000225964.10:c.589-1_589delGGinsG > NC_000017.11:g.50199085_50199086delGAinsC

I thought MANE Select transcripts were identical from start to finish, but I guess the alignments can be different. @John-F-Wagstaff can you please check these alignments against the source? I am really surprised by this because the UCSC database shows something different. I will keep trying to figure out what is going on.

Peter-J-Freeman commented 2 months ago

@John-F-Wagstaff, is this to do with the dodgy ENST data we got from an archive? It's COL1A1 playing up again, and this is post your patch on my local system.

John-F-Wagstaff commented 2 months ago

ugh yes, looks like we need a fix for the alignment table as well as the exons, I will get a fix to you ASAP

leicray commented 1 month ago

Do you mean that most databases use NM-based c. descriptions instead of NC(NM)-based descriptions?

Arguably, LOVD does just that. If I select a gene, say COL1A1, and then select the Variants tab, a long list of variants is displayed including intronic variants such as c.588+4A>T. Clicking on that variant takes me to another page showing the two instances of that variant in the database along with the information "The variants shown are described using the NM_000088.3 transcript reference sequence.".

However, no information is displayed regarding the corresponding genome build. That information can only be found by clicking on one or other of the two displayed instances of the variant. Once there, the header at the top of the page says: (NC_000017.10:g.48275518T>A, COL1A1(NM_000088.3):c.588+4A>T)

I cannot find where the variant is described as NC_000017.11(NM_000088.3):c.588+4A>T in accordance with the HGVS nomenclature, and COL1A1(NM_000088.3):c.588+4A>T is certainly ambiguous.

Am I missing something in the interface that ought to be more obvious to me?

Peter-J-Freeman commented 1 month ago

@leicray scroll down to the recommended variant description table or print the PDF.


leicray commented 1 month ago

My comment was entirely about LOVD. Nothing to do with VariantValidator. @ifokkema had asked about examples of databases that do not display "NC(NM)-based descriptions".

Peter-J-Freeman commented 1 month ago

I'm not sure if we need to do anything our end. We display all relevant descriptions including the genome build selected. So the full description can be used, but is really not necessary since we provide the Genome build and the relevant HGVS genomic description.

leicray commented 1 month ago

I agree that we do not have to do anything more at our end.

ifokkema commented 1 month ago

@Peter-J-Freeman

It's fine to recommend ditching the NC(NM), but the HGVS nomenclature explicitly states that NM-based descriptions of (partial) intronic variants are invalid

However, we also try to make descriptions precise, so providing the g. and the c. is fine and negates the need for NC(NM); otherwise we are needlessly complicating descriptions. So, the simple solution is to ensure the correct NC is added to the top table (since, as you say, the intronic sequence varies depending on the genome build), since we provide both in a separate table. We could make this clearer, but I personally think it is pointless to provide the NC(NM_) if the correct g. is provided.

This sounds like an argument for dropping the NM description entirely, not an argument for using the invalid NM-based description over the valid NC(NM)-based description. Like I said, the NC(NM) description may be redundant when the genomic variant is present, but that doesn't make the NM-based description valid HGVS nomenclature.

And, as you say, all users would have to switch to the NC(NM) format because I know of pretty much nowhere that is used :)

This is Heidi's argument for removing the parentheses from predicted protein descriptions - "Because others aren't using the standards, neither should we..." I still don't understand that reasoning, to be honest. Isn't the whole point of VV to educate people on how to use the standards? And if we don't want to educate the users on a certain thing (like LOVD is doing, see below), at least not show them how not to do it?

@leicray

Do you mean that most databases use NM-based c. descriptions instead of NC(NM)-based descriptions?

Arguably, LOVD does just that. (...)

LOVD is far from perfect, but it doesn't print NM_001283009.2:c.1266+3_1266+80del. We do need to fix our page titles, as that format is also incorrect.

However, no information is displayed regarding the corresponding genome build.

The genomic DNA field shows the build in the header, it's on the detailed page, and on all data entry forms. It's not perfect, but we do mention it. We'll run into issues when we start supporting multiple genome builds, but that's a different story for which I still don't have a solution. Anyway, we're not perfect, but we don't write NM_001283009.2:c.1266+3_1266+80del.

I cannot find where the variant is described as NC_000017.11(NM_000088.3):c.588+4A>T in accordance with the HGVS nomenclature (...)

I never said we do that 😅 The HGVS nomenclature states you can use the variant description without the reference sequence as long as that is mentioned elsewhere, and that is what we're currently doing. We mention the genome build, the NC, the NM, and then display the DNA descriptions (g. and c.) without the reference sequences. The page titles, a feature built separately, then uses an invalid gene-based format for all cDNA descriptions. I don't remember how that happened, but I'll fix it.

What I asked was what you meant with, "It is also not very interoperable since most databases use both the g. and the c.". I didn't understand what you meant and I still don't.

My comment was entirely about LOVD. Nothing to do with VariantValidator. @ifokkema had asked about examples of databases that do not display "NC(NM)-based descriptions".

I didn't ask for that; I know, for instance, that ClinVar shows invalid NM-based intronic variant descriptions.

Peter-J-Freeman commented 1 month ago

This is Heidi's argument for removing the parentheses from predicted protein descriptions - "Because others aren't using the standards, neither should we..." I still don't understand that reasoning, to be honest.

I see what you are saying but I slightly disagree. I'm not saying use it because others do. Rather, I see this as another area where there is duplication in the HGVS guidelines.

HGVS states that variants should be described at all relevant levels, usually g. c. p. In my opinion, and certainly based on my editing experience (especially when reviewing ACMG papers), this is essential since it stops a lot of significant errors which ought to be avoided. In this case, providing the full NC(NM):c. description contains redundancy, i.e. duplicates information.

It would be useful to discuss this in an HGVS meeting because I am not saying either is incorrect. I agree that the NC(NM):c. is correct, but I also argue that there is no need to use it if HGVS is correctly applied and the g., c. are both provided.

Isn't the whole point of VV to educate people on how to use the standards? And if we don't want to educate the users on a certain thing (like LOVD is doing, see below), at least not show them how not to do it?

Absolutely it is, and I think we can make a much stronger statement about the use of the variant descriptions in the recommended variant descriptions table. We could also pull the NC(NM):c. descriptions into the top table as in re-structure the layout.

My worry is that by dropping NM_:c. descriptions we lose a format that is used in all databases like LOVD and ClinVar, journals, dbSNP etc.

More than happy to look at the layout, but are you suggesting we do drop the NM:c. descriptions and just show the NC(NM_):c.. My feeling is that this would lose us users and open up a lot of complaints :).

So, we do provide all the correct descriptions, so this seems to be a matter of adjusting how we display the data. The alternatives I can think of are:

leicray commented 1 month ago

I would certainly not wish to see removal of NM_:c. descriptions from the VV output. Arguably, we could rearrange the results page order to emphasise that there are HGVS recommended variant descriptions. However, I would prefer just to emphasise use of HGVS recommended variant descriptions.

I would certainly be against any suggestion that an NM:c. description submitted by a user should be immediately converted to an NC(NM_):c. description at the top of the results page. That would be confusing for users.

I agree with @Peter-J-Freeman that this needs to be carefully discussed by the HVNC as there do seem to be two sets of recommendations. I would be happy to participate in such discussions in my role as an "emeritus" committee member.

Finally, as Garry Cutting has often said "Do not let perfection be the enemy of progress".

ifokkema commented 1 month ago

@Peter-J-Freeman

HGVS states that variants should be described at all relevant levels, usually g. c. p. In my opinion, and certainly based on my editing experience (especially when reviewing ACMG papers), this is essential since it stops a lot of significant errors which ought to be avoided. In this case, providing the full NC(NM):c. description contains redundancy, i.e. duplicates information.

Of the NC reference sequence? Or of the c. and p. descriptions? In principle, the c. and p. descriptions are redundant information, although different mapping tools exist and not everybody does it the same way... and, of course, the p. description could be some sort of observed change rather than a prediction. But anyway, assuming you meant that the NC was redundant; that's true... but only when these descriptions stay together. Also, when multiple g. descriptions are given (e.g., both the GRCh37 and GRCh38 g. descriptions, like in ClinVar and the Leiden LOVD), the NC(NM) becomes a requirement again (for intronic positions only!) to indicate what nucleotides are actually affected. (obviously, the c. notation can be derived from the given g. notations, but the above was assuming that the c. description itself wasn't considered redundant)

I believe, but the HVNC may want to correct me, that it's the "separation" that I mentioned that is the problem. As far as I know, each and every variant description should, by itself, be interpretable. That would mean that the c. notation should be interpretable, with or without a g. notation somewhere near (in the same table, sentence, etc). Therefore, with that assumption, the NC in the NC(NM) is not considered redundant, as it facilitates the interpretation of the variant description.

It would be useful to discuss this in an HGVS meeting because I am not saying either is incorrect. I agree that the NC(NM):c. is correct, but I also argue that there is no need to use it if HGVS is correctly applied and the g., c. are both provided.

I believe Alex opened up the agenda already for the next meeting; we could put it in there to discuss it?

My worry is that by dropping NM_:c. descriptions we lose a format that is used in all databases like LOVD and ClinVar, journals, dbSNP etc.

I totally get that feeling... and of course, the whole NM/NC(NM) debate only applies to intronic variants, so we're talking about a subset of all c. descriptions. LOVD doesn't use the NM:c. format in its human interface but mentions the reference sequence elsewhere. And when ClinVar describes intronic variants as NM:c., we'd have no way of knowing what they mean; we'd need to grab one of their genomic descriptions and use that. So, either way, simple string-to-string matching can't currently be used reliably. That said, this discussion triggered me to have a good look at how we do it in LOVD, and I will make updates there to make sure things are more clear.

More than happy to look at the layout, but are you suggesting we do drop the NM:c. descriptions and just show the NC(NM_):c.. My feeling is that this would lose us users and open up a lot of complaints :).

Only for intronic variants, but yes... since, as far as I interpret the HGVS rules, the NM:c. description of an intronic variant is always invalid, also when elsewhere the g. description is given. But unfortunately, I have no clue what users will think.

I'm wondering, though, related to the Aries integration, if authors provide a list of variant descriptions as used in their manuscript and the g. and c. notations end up on a different line, how will VV determine which genome build to use for intronic variants? Will it again be an input field? If so, we'd be assuming (but probably, rightfully so), that all descriptions in one manuscript will use the same genome build.
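One conceivable answer to that question, sketched under loud assumptions: if the manuscript's list contains any NC-based description, the genome build could be inferred from the NC accession versions and rejected when they disagree. Nothing here reflects how VariantValidator or the Aries integration actually works; `infer_build` and the lookup table are hypothetical, and the table covers only the two chromosomes mentioned in this thread.

```python
import re

# Illustrative subset only: primary-assembly accessions for chromosomes 17 and 20.
NC_TO_BUILD = {
    "NC_000017.10": "GRCh37",
    "NC_000017.11": "GRCh38",
    "NC_000020.10": "GRCh37",
    "NC_000020.11": "GRCh38",
}

NC_ACCESSION = re.compile(r"NC_\d{6}\.\d+")


def infer_build(descriptions):
    """Return the single genome build implied by the NC accessions, or None if unclear."""
    builds = {
        NC_TO_BUILD[acc]
        for text in descriptions
        for acc in NC_ACCESSION.findall(text)
        if acc in NC_TO_BUILD
    }
    return builds.pop() if len(builds) == 1 else None


manuscript = [
    "NC_000017.11:g.50198002_50198003delCCinsC",
    "NM_000088.4:c.589-1_589delGGinsG",
]
print(infer_build(manuscript))
```

A list with no NC at all, or with NCs from both builds, returns None, in which case a separate input field would still be needed.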

  • Pull the relevant NC(NM):c. (based on submitted genome build) into the top table and also emphasise the use of the recommended variants table.
  • Move the recommended variants table to the top of the page
  • Just emphasise the need to use the recommended variants table and leave the layout as it is
  • Other suggestions please @ifokkema @leicray

Perhaps, depending on what the HVNC says, any clear indication that the NM:c. description for intronic variants is not a valid description by itself would already be a great addition. Easier access to what is the valid standalone HGVS c. description for that variant, would help the user find the HGVS-compliant description they might want to use somewhere.

@leicray

I would certainly be against any suggestion that an NM:c. description submitted by a user should be immediately converted to an NC(NM_):c. description at the top of the results page. That would be confusing for users.

Even for intronic variants? I wasn't trying to suggest doing it for exonic variants...

I agree with @Peter-J-Freeman that this needs to be carefully discussed by the HVNC as there do seem to be two sets of recommendations. I would be happy to participate in such discussions in my role as an "emeritus" committee member.

If you could point out what in the HGVS documentation conflicts with what other part, there is a better chance of having a good discussion within the committee of what conflict needs resolution. Otherwise, I think the clearest question would be "Is an NM:c. intronic variant description valid when the genome build is mentioned elsewhere?"

Finally, as Garry Cutting has often said "Do not let perfection be the enemy of progress".

That is definitely true, although I don't know what progress we're holding back 😝 (other than that we're discussing this instead of something else). I definitely thought this issue was a simple one when I started it 😅 But it actually highlighted where I need to improve in LOVD, so there's progress either way! 😄

leicray commented 1 month ago

I would certainly be against any suggestion that an NM:c. description submitted by a user should be immediately converted to an NC(NM_):c. description at the top of the results page. That would be confusing for users.

Even for intronic variants? I wasn't trying to suggest doing it for exonic variants...

I only intended this comment to refer to intronic variants.

I still maintain that NC(NM) descriptions are inherently confusing. Parentheses are defined in HGVS as follows: "( ) (parentheses) are used to indicate uncertainties and predicted consequences; NC_000023.9:g.(123456_234567)_(345678_456789)del, p.(Ser123Arg)."

Placing parentheses around the NM_ implies uncertainty. I cannot find any alternative account of the use of parentheses, but I may have missed something.

The next point is that the parentheses could be interpreted to indicate that the NM_ component of the description is, in some way, additional but non-essential (subsidiary) information. (I am thinking here in terms of how parentheses are used in normal English grammar.)

If (NM) is just additional, but non-essential, information, an NC(NM):c. description becomes an NC:c. description, which is certainly not valid.

The variant description NM_000088.4:c.1767+6T>C is deemed to be incorrect and should, according to HGVS guidelines, be written instead as NC_000017.11(NM_000088.4):c.1767+6T>C. It would be far more logical to write a description that says "Here is a transcript-based intronic sequence variant that, by the way, is valid in the context of the reference sequence for genome build GRCh38". In other words, NM_000088.4(NC_000017.11):c.1767+6T>C.

ifokkema commented 1 month ago

I still maintain that NC(NM) descriptions are inherently confusing. Parentheses are defined in HGVS as follows ( ) (parentheses) are used to indicate uncertainties and predicted consequences; NC_000023.9:g.(123456_234567)_(345678_456789)del, p.(Ser123Arg).

The parentheses in NC(NM) format don't indicate uncertainty, indeed. The definition given on the "general" recommendations page doesn't explain the use of parentheses in reference sequences, indeed.

Placing parentheses around the NM implies uncertainty. (...) The next point is that the parentheses could be interpreted to indicate that the NM component of the description is, in some way, additional but non-essential (subsidiary) information. (...) It would be far more logical to write a description that says "Here is a transcript-based intronic sequence variant that, by the way, is valid in the context of the reference sequence for genome build GRCh38". In other words, NM_000088.4(NC_000017.11):c.1767+6T>C.

It's not meant as uncertainty or non-essential additional information. It is meant as additional information, but in the opposite way that you mention. It's the NM that provides the context, not the NC. This is also why Mutalyzer interprets NC(NM) so differently from VariantValidator. For Mutalyzer, the NC in NC(NM) provides all the sequence. For VariantValidator, both reference sequences provide sequence; exons are provided by the NM, and intronic sequences by the NC. This allows VariantValidator to drop the NC from the NC(NM) descriptions and create valid variant descriptions (for exonic positions, that is), while NC(NM_123456.1):c.100del in Mutalyzer may actually map to NM_123456.1:c.95del.

The NM in NC(NM) is meant as a form of selection; the NM annotation (positions) is selected from within the NC reference sequence. Mutalyzer takes the data from the GenBank file, VariantValidator takes the mappings from the official alignments and constructs a new sequence based on that. With this, we are actually moving into the domain of "what does NC(NM) actually mean?", where Mutalyzer and VV are doing completely opposite things. The HVNC does not define it well enough.
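To make the two reference components concrete, here is a minimal Python sketch that splits a description into its genomic reference, transcript reference, and variant part. The patterns and the `split_description` helper are illustrative assumptions; they cover only the NC(NM):c. and NM:c. shapes used in this thread, not the full HGVS grammar, and they are not how Mutalyzer or VariantValidator parse descriptions.

```python
import re

# Illustrative patterns only: they handle the two shapes discussed in this
# thread and nothing else (no NG_, LRG, ENST, n., or g. descriptions).
NC_NM = re.compile(r"^(NC_\d+\.\d+)\((NM_\d+\.\d+)\):(c\..+)$")
NM_ONLY = re.compile(r"^(NM_\d+\.\d+):(c\..+)$")


def split_description(description):
    """Return the genomic reference, transcript reference, and variant part."""
    match = NC_NM.match(description)
    if match:
        return {
            "genomic": match.group(1),
            "transcript": match.group(2),
            "variant": match.group(3),
        }
    match = NM_ONLY.match(description)
    if match:
        return {
            "genomic": None,  # no NC given; a tool must obtain it some other way
            "transcript": match.group(1),
            "variant": match.group(2),
        }
    raise ValueError(f"unsupported description: {description}")


full = split_description("NC_000017.11(NM_000088.4):c.1767+6T>C")
short = split_description("NM_000088.4:c.1767+6T>C")
```

Both calls yield the same transcript and variant part; only the `genomic` field differs, which is the piece the two tools treat differently.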

leicray commented 1 month ago

I think that I agree with you.

The parentheses in NC(NM) format don't indicate uncertainty, indeed. The definition given on the "general" recommendations page doesn't explain the use of parentheses in reference sequences, indeed.

The definitions need to be updated to indicate that there are two usages of parentheses.

The NM in NC(NM) is meant as a form of selection; the NM annotation (positions) is selected from within the NC reference sequence.

It's clear that VariantValidator and Mutalyzer work in different ways. What's also clear is that Mutalyzer is incapable of working out an intronic variant in the absence of the corresponding NC_ sequence record.

If I submit NM_000090.4:c.2553+2T>A to VariantValidator it copes with the validation process. However, if I submit the same variant to Mutalyzer it reports "DESCRIPTION COULD NOT BE INTERPRETED Intronic position 2553+2 given for a non-genomic reference sequence. Tip: make use of a genomic reference sequence like NC*(NM*)".

The wording of the second part is interesting. It implies that the designation of introns is inherent in genomic reference sequences.

As far as I can see, NC*(NM*) is part of the HGVS guidelines solely to satisfy the operational need of Mutalyzer. I would again argue that any need to specify the genome build because of possible intronic sequence differences between builds could, where necessary, be satisfied by descriptions such as NM_000090.4(GRCh38):c.2553+2T>A. Nobody would then need to look up the correct NC_ record for the chromosome in question when compiling a fully compliant variant description.

This needs open and honest discussion at HVNC with participation of people, such as me, from outside the committee.

ifokkema commented 3 weeks ago

@leicray Sorry, I got really busy and missed a lot of emails.

The definitions need to be updated to indicate that there are two usages of parentheses.

I agree, although I think the definitions currently explain the use of characters in variant descriptions. Technically, this is the use of a character in the reference sequence. But I've added it to my (very long) list to figure out where that should go.

It's clear that VariantValidator and Mutalyzer work in different ways. What's also clear is that Mutalyzer is incapable of working out an intronic variant in the absence of the corresponding NC_ sequence record.

Well... technically, VV has the same issue. VV just fetches the NC based on the genome build input. The issue is hidden, but, in reality, both tools have the same limitation.

If I submit NM_000090.4:c.2553+2T>A to VariantValidator it copes with the validation process.

Only because there is a genome build as an input.

However, if I submit the same variant to Mutalyzer it reports "DESCRIPTION COULD NOT BE INTERPRETED Intronic position 2553+2 given for a non-genomic reference sequence. Tip: make use of a genomic reference sequence like NC(NM)".

The wording of the second part is interesting. It implies that the designation of introns is inherent in genomic reference sequences.

Yes, that's what I meant when I said Mutalyzer uses the NC for all sequences in an NC(NM) context, while VV uses the NC only for the intronic sequences.

As far as I can see, NC(NM) is part of the HGVS guidelines solely to satisfy the operational need of Mutalyzer.

Although I can't be sure whether VV was considered when that syntax was invented, any tool processing intronic variants will require an NC for this. The logic on how to obtain that NC can differ (VV uses a genome build input for this), but requiring an NC input is unambiguous (genome build is not, actually).
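As a rough illustration of that requirement, the following Python sketch flags descriptions that use an intronic offset without an NC reference. The regular expression and both helper names are assumptions made for this example; neither tool is known to implement the check this way.

```python
import re

# An intronic offset is a '+' or '-' directly after a base position,
# e.g. the "+2" in c.2553+2T>A. A leading '-' as in c.-12A>G (5' UTR)
# is not an offset, because no digit precedes it.
INTRONIC_OFFSET = re.compile(r"\d[+-]\d+")


def is_intronic(variant):
    """True if the variant part contains at least one intronic offset."""
    return INTRONIC_OFFSET.search(variant) is not None


def needs_genomic_reference(description):
    """True if an intronic variant is described without an NC reference."""
    reference, _, variant = description.partition(":")
    return is_intronic(variant) and not reference.startswith("NC_")


bare_nm = needs_genomic_reference("NM_000090.4:c.2553+2T>A")
with_nc = needs_genomic_reference("NC_000017.11(NM_000088.4):c.1767+6T>C")
exonic = needs_genomic_reference("NM_123456.1:c.100del")
```

A tool that accepts a genome build as separate input would resolve the missing NC at this point; a tool without that input can only reject the description.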

I would again argue that any need to specify the genome build because of possible intronic sequence differences between builds could, where necessary, be satisfied by descriptions such as NM_000090.4(GRCh38):c.2553+2T>A.

That would be ambiguous; does NM_000451.3(GRCh38):c.277+12del mean NC_000023.11(NM_000451.3):c.277+12del or NC_000024.10(NM_000451.3):c.277+12del?
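A small Python sketch of why a build name alone cannot always be resolved. The `ALIGNMENTS` table and `resolve_genomic_reference` are hypothetical; the single entry holds only the two NC accessions named above, whereas a real tool would query alignment data.

```python
# Hypothetical lookup: (transcript, genome build) -> genomic reference
# sequences the transcript aligns to. Only the example above is listed.
ALIGNMENTS = {
    ("NM_000451.3", "GRCh38"): ["NC_000023.11", "NC_000024.10"],
}


def resolve_genomic_reference(transcript, build, alignments=ALIGNMENTS):
    """Return the single NC for a transcript and build, or raise if unclear."""
    candidates = alignments.get((transcript, build), [])
    if len(candidates) != 1:
        raise LookupError(
            f"{transcript}({build}) resolves to {len(candidates)} "
            f"genomic references: {candidates}"
        )
    return candidates[0]


try:
    resolve_genomic_reference("NM_000451.3", "GRCh38")
    ambiguous = False
except LookupError:
    ambiguous = True
```

With an NC in the description, no such lookup is needed and the ambiguity cannot arise.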

Nobody would then need to look up the correct NC_ record for the chromosome in question when compiling a fully compliant variant description.

Any tool will need to, however, and it will need to be able to do so unambiguously. Not only that, but I think we also already have lots of examples where people don't use tools to create their variant descriptions and, therefore, make mistakes. Tools can do this for people, just like applying the 3' forward rule. I wouldn't expect people to do that manually, either.

This needs open and honest discussion at HVNC with participation of people, such as me, from outside the committee.

Sure! The official way to go about this is to start a new discussion (see the list of discussions). A discussion is also easier to add to the HVNC meeting agenda, and it allows all committee members to catch up without long email threads.