openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
69 stars 21 forks source link

Combining allelic variants prior to generating the protein-level description #511

Open leicray opened 1 year ago

leicray commented 1 year ago

Describe the bug A user has queried why the variant description NM_000059.4:c.[1916dup;1929del] is handled as two separate variants for the purposes of validation: NM_000059.4:c.1916dup predicting protein-level description NP_000050.3:p.(Leu639PhefsTer13), and NM_000059.4:c.1929del predicting protein-level description NP_000050.3:p.(Arg645GlufsTer15).

Treated separately, each predicts a frameshift, but if the pair of variants are considered together as a delins variant, notwithstanding their separation, the predicting protein-level description becomes NP_000050.3:p.(Leu639_Val643delinsPheAlaPhePheCys).

Examples such as this need to be handled more intelligently. This probably requires that allelic descriptions be combined before generating the protein-level description.

Peter-J-Freeman commented 1 year ago

@leicray The Variant Nomenclature Committee now has a GitHub page. https://github.com/HGVSnomenclature/hgvs-nomenclature/issues?q=

Think this should map to there now. This is an open topic I believe

ifokkema commented 1 year ago

I'm curious how this was resolved? I mean, is this meant as a "won't fix" then? I think it's a very interesting use case to have one single protein description from a combined allele.

Peter-J-Freeman commented 1 year ago

Not currently resolved @ifokkema. Closed because as things stand the HGVS rule is being applied. Variants that can be brought proximal to each other are merged. Those then cannot are not, so nothing (currently) to do. HGVS rules currently state that these are separate variants. I agree with Raymond that for the case of perdicting the p. outcome, they should be allowed to be merged, but was overruled in the Variant Nomenclature Committee.

ifokkema commented 1 year ago

Actually, I disagree with that. :wink: These variants overlap on protein level, and therefore, since they're given as one allele, I believe they should be returned as one allele and not as two separate predictions.

Each variant should be considered separately on each level. Meaning, two variants far enough apart on DNA level to be considered two separate variants, on protein level may be directly adjacent. When that happens, HGVS nomenclature states they should be merged on protein level (assuming we're talking about an allele here). They shouldn't be merged on DNA level, as each level is considered separately.

Since p.(Leu639PhefsTer13) describes a variant technically from codon 639 to the end of the protein (or, if we disagree, at least we can agree to 13 codons downstream), this overlaps with p.(Arg645GlufsTer15). Both variants can not co-exist in the same molecule.

That said, separately from the discussion of whether or not the protein variants overlap, I also believe that since the user provided one variant description (an allele), they should not get two responses. Even if the two DNA variants in the allele do not overlap on protein level, I believe an allele syntax should be returned. Now, VV behaves as if the allele syntax is a way to submit a batch query.

leicray commented 1 year ago

The fact remains that the consequence of the two individual frameshifts is that there is no frameshift at all. The c.1929del variant restores the reading frame before the translation process reaches the stop codon predicted by the c.1916dup variant.

Whatever we decided about how to describe the allelic variants at the transcipt level, the protein-level description ought to be NP_000050.3:p.(Leu639_Val643delinsPheAlaPhePheCys).

Peter-J-Freeman commented 1 year ago

Actually, I disagree with that. 😉 These variants overlap on protein level, and therefore, since they're given as one allele, I believe they should be returned as one allele and not as two separate predictions.

Don't get me wrong, i disagree too, and brought this up un the HGVS committee and was over-ruled. However, this might be a good case to bring this up again and try and tweak the rules.

VV currently only merges c. descriptions if they normalize together. This is not incorrect!

However, a dimension that has not been considered is whether when 2 separate variants are cponsidered at the c. level, whether the p. needs to be merged. In this case it may be. That does sound like it should apply.

I can prototype it.

ifokkema commented 8 months ago

However, a dimension that has not been considered is whether when 2 separate variants are cponsidered at the c. level, whether the p. needs to be merged. In this case it may be. That does sound like it should apply.

I ran into this when reviewing the HGVS nomenclature pages:

NM_004006.2:c.145_147delinsTGG: two substitutions replacing codon CGC (position c.145 to c.147) by TGG: NOTE: two variants separated by one nucleotide, together affecting one amino acid, should be described as a "delins" so the description c.[145C>T;147C>G] is not correct (see deletion/insertion)

(source)

It seems the nomenclature actually understands that contradicting protein descriptions in one allele doesn't make sense. They even went as far as merging variants on the DNA level when they affect one codon. It seems to me the nomenclature supports this cause.

Peter-J-Freeman commented 8 months ago

Think we need to re-open this then. I have been working on the allele syntax so now is a good time.

NM_004006.2:c.145_147delinsTGG: two substitutions replacing codon CGC (position c.145 to c.147) by TGG: NOTE: two variants separated by one nucleotide, together affecting one amino acid, should be described as a "delins" so the description c.[145C>T;147C>G] is not correct (see deletion/insertion)

Interesting. This is one that can certainly be coded up as it is a fixed and simple rule with an example

A user has queried why the variant description NM_000059.4:c.[1916dup;1929del] is handled as two separate variants for the purposes of validation: NM_000059.4:c.1916dup predicting protein-level description NP_000050.3:p.(Leu639PhefsTer13), and NM_000059.4:c.1929del predicting protein-level description NP_000050.3:p.(Arg645GlufsTer15).

For variants like this, My thought is, could we use the r. syntax as an intermediate. Then we can maintain the "correct" nomenclature for c. i.e. not merge, but also generate a delins for the r. and additionally give the protein consequence based on the r. nomenclature

leicray commented 8 months ago

The solution that you propose for dup and del variant makes perfect sense.

ifokkema commented 8 months ago

Another interesting find:

https://groups.google.com/g/hgvs-nomenclature/c/xBJg2sNIM_s

Johan clearly combines the two DNA variants into one single protein variant, so I'd say we're good to go. We wouldn't even need the RNA for this (but it could work).

leicray commented 8 months ago

Is that the agreed opinion of the HGVS nomenclature committee? If so, it contradicts what's said on the web site regarding the maximum separation of the two allelic variants.

Peter-J-Freeman commented 8 months ago

The RNA would not be needed for c.[145C>T;147C>G]

However, I think it is more appropriate to use for NM_000059.4:c.[1916dup;1929del] where existing HGVS rules would prevent merging.

ifokkema commented 8 months ago

Is that the agreed opinion of the HGVS nomenclature committee? If so, it contradicts what's said on the web site regarding the maximum separation of the two allelic variants.

I believe so, and no, it doesn't. Two frameshifts can not co-exist in one molecule. So predicting that NM_000059.4:c.[1916dup;1929del] causes NP_000050.3:p.([Leu639PhefsTer13;Arg645GlufsTer15]) is incorrect because, in reality, the variants overlap.

Digging even further, I actually found this on the protein frameshift page:

the (predicted) amino acid changes of additional variants on the same allele (in cis) downstream of the frameshift are not described unless they change the amino acid sequence or length of the shifted reading frame (i.e. introduce an amino acid substitution, an earlier translational termination (stop) codon or affect the termination codon of the shifted frame).

This seems clear to me. Only one allele should be described, and it should describe the combined effect of the two variants.

Peter-J-Freeman commented 8 months ago

So far I have

        "submitted_variant": "NM_004006.2:c.[145C>T;147C>G]",
        "transcript_description": "",
        "validation_warnings": [
            "AlleleSyntaxError: Variants [145C>T;147C>G] should be merged into NM_004006.2:c.145_147delinsTGG"]

Which covers the c.[145C>T;147C>G] case. However, it also expands the logic so that for any variants where the last base of the 5 prime variant and the first base of the 3 prime variant are in the same amino acid, then we merge

However, is this against the current HGVS rules. In my opinion it is unless both variants are types that cannot be normalized e.g. substituttions.

Peter-J-Freeman commented 8 months ago

the (predicted) amino acid changes of additional variants on the same allele (in cis) downstream of the frameshift are not described unless they change the amino acid sequence or length of the shifted reading frame (i.e. introduce an amino acid substitution, an earlier translational termination (stop) codon or affect the termination codon of the shifted frame).

This may be clear, but it is not particularly practical to code up. I'm also looking at the practicality of what is being stated. In my opinion, we merge if 2 frame shifts restore the original reading frame. That makes sense. The rest of the scenarios listed in these rules would only really lead to predictions as to whether NMD would occur, but we are still gonna get LOF.

We can revisit this if we think that it is worth committing time to, but my opinion is that we at this stage only merge variants if the original frame is restored.

Peter-J-Freeman commented 8 months ago

This then raises another question. What if we have a non frameshifting variant between the 2 frame-shifts, these likely need to become part of the merge.

Its begnning to look to me as if we need to always merge if we see 2 frameshifts in a list of alleles but this would not really comply with strict HGVS rules. Not entirely sure where to go with this and what is feasible

leicray commented 8 months ago

I agree entirely that we ought to combine two frameshifting variants where their combination results in restoration of the reading frame beyond the second variant. If possible, we ought to also accommodate any non-frameshifting variants that lie between the two frameshifts.

I say this knowing that this might be contrary to the HGVS nomenclature guidelines, but sometimes the law is an ass.

ifokkema commented 8 months ago

@Peter-J-Freeman

However, is this against the current HGVS rules. In my opinion it is unless both variants are types that cannot be normalized e.g. substitutions.

I believe that this is part of the new proposal, which hasn't been finished yet. Do you know the status of this?

This may be clear, but it is not particularly practical to code up. I'm also looking at the practicality of what is being stated. In my opinion, we merge if 2 frame shifts restore the original reading frame. That makes sense. The rest of the scenarios listed in these rules would only really lead to predictions as to whether NMD would occur, but we are still gonna get LOF.

I can't comment on how hard it is to program in VV since I don't know its code; but I would try to prevent generating incorrect HGVS. A simple rule would be that an allele syntax on protein level is incorrect if the non-last variant is a frameshift variant. Because that means later variants are given relative to a sequence that no longer exists.

This then raises another question. What if we have a non frameshifting variant between the 2 frame-shifts, these likely need to become part of the merge.

Anything following a frameshift variant should become part of the merge.

Its begnning to look to me as if we need to always merge if we see 2 frameshifts in a list of alleles but this would not really comply with strict HGVS rules. Not entirely sure where to go with this and what is feasible

It fully complies; the frameshift causes all following variants to overlap with the frameshift variant.

@leicray

I agree entirely that we ought to combine two frameshifting variants where their combination results in restoration of the reading frame beyond the second variant. If possible, we ought to also accommodate any non-frameshifting variants that lie between the two frameshifts.

I say this knowing that this might be contrary to the HGVS nomenclature guidelines, but sometimes the law is an ass.

This would be fully HGVS-compliant, but it should include any variant that follows a frameshift variant.

leicray commented 8 months ago

I'm beginning to get confused here.

@ifokkema

This would be fully HGVS-compliant, but it should include any variant that follows a frameshift variant.

Does this mean every frameshift? What if another allelic variant lies 100 nucleotides downstream of the second frameshift in the scenario that we are discussing.

I suspect that this might need a TC rather than an exchange of comments.

Peter-J-Freeman commented 8 months ago

I have managed to write code that looks for the following pattern

However,

Anything following a frameshift variant should become part of the merge. It looks like @ifokkema you are saying that any cis variants following the first frame shift should be merged??? This is only a slight deviation from what I just completed

ifokkema commented 8 months ago

This would be fully HGVS-compliant, but it should include any variant that follows a frameshift variant.

Does this mean every frameshift? What if another allelic variant lies 100 nucleotides downstream of the second frameshift in the scenario that we are discussing.

When there's a second frameshift that restores the reading frame, there's essentially no frameshift anymore, and you get a deletion-insertion. So I meant frameshift, as a frameshift whose frame stays shifted :sweat_smile:

I suspect that this might need a TC rather than an exchange of comments.

That's also an option.

I have managed to write code that looks for the following pattern

  • Find a frameshift
  • Any variant following the frameshift is maintained
  • If we see another frame-shift merge the variants
  • if we do not see another frame shift, then do not merge the variants

However,

Anything following a frameshift variant should become part of the merge. It looks like @ifokkema you are saying that any cis variants following the first frame shift should be merged??? This is only a slight deviation from what I just completed

Basically, variants overwriting each other or touching each other on the protein level, should be merged. E.g.,

Peter-J-Freeman commented 8 months ago

Frameshift + restoring frameshift = delins Frameshift + variant + restoring frameshift = delins

These 2 I can do for now.