Promoting INS to DUP issue

drtconway commented 5 years ago

Hi Mutalyzer,

Thanks for your awesome work! We use mutalyzer on well over 1000 research & clinical assays per month and it's great!

We've run in to an issue which I think is actually a flaw in the HGVS specification, which Mutalyzer is faithfully implementing.

We have an annotation pipeline which has the following steps (mixed in with other stuff):

transliterate variants from the VCF to HGVSg.
use mutalyzer to do position conversion, from which we pick a preferred HGVSc name
use the name checker to normalise the nomenclature
convert the HGVSc back to "normalised" HGVSg
rewrite the VCF in "normalised" form (with the HGVS{g,c,p} in the info field) for downstream tools

We've recently had a couple of cases where this blows up in our faces.

Consider the transilterated HGVSg for a 4bp insertion:

chr1:g.33479001_33479002insATGTC

This is converted by Mutalyzer to the following HGVSc:

NM_013411.4:c.500_501insGACAT

Which in turn is normalised by Mutalyzer to:

NM_013411.4:c.c.496_500dup

And the corresponding HGVSg produced by Mutalyzer for this is:

chr1:g.33479002_33480125dup

What started out as a 4bp insertion has turned into a 1124bp duplication (insertion)!

The reason is clear when we examine (e.g. in the web interface) the HGVSc form. In particular, the coding positions of the 5th and 6th exons:

5 426 498 6 499 694

The original HGVSc is an insertion just after the splice junction joining exons 5 & 6. Just by coincidence, the inserted sequence happens be the same as the preceding 4bp of the transcript (i.e. the last 2bp of exon 5 and the first 2bp of exon 6) (1/(4^4)=1/256, so I guess we shouldn't be too shocked). Accordingly, and following the prioritisation rules, the INS variant is promoted to a DUP variant. However, like 3' shifting variants onto or across splice junctions, this is misleading.

It seems to me that just as the 3' shifting rule has explicit exceptions relating to splice junctions, the prioritisation rules for duplications and inversions should also disallow promotion from insertion or deletion-insertion. Both duplications and inversions describe variants in terms of the sequence content of a reference sequence, not just position in the reference. As such, it is problematic when the implied sequence contains a splice junction.

Would you consider modifying Mutalyzer to avoid promoting ins->dup and delins->inv in such circumstances.

Thanks, Tom.

ifokkema commented 5 years ago

Hi Tom,

Not as a Mutalyzer developer but as a long-time user; you can make your life much easier by combining steps 2-4 (or even 1-4 if you want) in only one Mutalyzer call, which also prevents you from having this issue. Your steps would then look like:

transliterate variants from the VCF to HGVSg (optional).
use Mutalyzer to normalize the HGVSg, generate HGVSc and provide the protein change predictions by calling runMutalyzerLight: https://mutalyzer.nl/json/runMutalyzerLight?variant=NC_000010.10:g.12129619A>C Note 1: Optionally, A>C can be rewritten to the delREFinsALT format (delAinsC) to save you some computational effort in your step 1. You'll get a warning sometimes that this is not an indel but something else, which you can ignore. Note 2: This method also takes care of the problem where the 3' "roll forward" rule for deletions and insertions has to be done in two different directions for genes on the reverse strand. The position converter does not handle this, and will generate incorrect descriptions. Note 3: You can cache this call if you want, to speed up your application even more (cache example). Then, pick your HGVSc and find the related protein change prediction in the output.
rewrite the VCF in "normalised" form (with the HGVS{g,c,p} in the info field) for downstream tools

Cheers, Ivo (LOVD developer)

mihailefter commented 5 years ago

Hi Tom,

Thanks for posting this issue. Indeed, it seems like Mutalyzer should not normalize NM_013411.4:c.500_501insGACAT to NM_013411.4:c.c.496_500dup. As Ivo suggested, directly making use of NCs would simplify and improve your process flow.

Best regards, Mihai

drtconway commented 5 years ago

Hi @ifokkema,

I'm going through your suggestion in detail and I tried sending the variant with an NC name (i.e. NC_000001.10:g.33479003_33479004insGTCAT) to the runMutalyzerLight service, and I see that it doesn't do so directly, it does give back the c. description after you collate the legend against the genomic description.

I'm just a bit confused about your second note on point 2, about rolling in both directions. My understanding, which now might be local dogma not canonical at all, was that 3' shifting should be applied only at the transcript level. It certainly makes sense in terms of our use case. We rewrite the VCF so that the variants line up with the c. description, so that when the curators in Molecular Pathology look at them in IGV, the c., g. and VCF descriptions all line up. (We don't however do any kind of realignment to the BAM, so occasionally the pileup doesn't line up.)

T.

jfjlaros commented 5 years ago

In the recommendations the following is mentioned: "the 3’rule applies to ALL descriptions (genome, gene, transcript and protein) of a given variant".

I hope this answers your question.

ifokkema commented 5 years ago

Hi @drtconway,

(...) I see that it doesn't do so directly (...)

Yeah, you do need to reason over the JSON reply a bit; open legend, find needed transcript in the id field, read out the name (e.g. "AK2_v012"), then loop over the transcriptDescriptions to find the needed cDNA notation, and take the protein description out of proteinDescriptions. But at least you'll need only one API call, which in the end is the most time consuming part of your script.

(...) 3' shifting should be applied only at the transcript level.

No, like @jfjlaros already mentions, it should be applied on all levels. This makes the variant "misalign" in the case of shifting in genes on the reverse strands, and makes genomic HGVS descriptions misalign with VCF notation always, as VCF likes to 5'-align the variant. Sources don't always fully 5'-shift their VCF files, so for comparison between data sets on a genomic level, both sources should be fully normalized to correctly compare them. (If I may shamelessly plug my own work, see the two "Unique variant descriptions (VCF and HGVS)" subsections in my recent paper.)

drtconway commented 5 years ago

Yeah, I totally understand and agree about the necessity for shifting to compare genomic descriptions. Possibly, we should compute both - for the reasons I cited. Thanks for the link - I'll take a look at your paper (I'm all for shamelessly plugging one's own work. :) ).

jfjlaros commented 5 years ago

We actually do calculate both, but at the moment this information is not exposed via the API, we will do so in an upcoming version of Mutalyzer.

mutalyzer / mutalyzer2

Promoting INS to DUP issue #488