Interesting problems handling a gap between genome and transcript in a repeat region

ifokkema commented 4 years ago

I've got another one to really think about...

I was playing around with HTT (huntingtin) gene which has a mismatch between genome and transcript because both reference sequences have a different number of repeats in that gene; the genome has 21 repeats, the transcript's repeat length is 23 (6 bases extra). So a WT genome should be a deletion on the transcript level, and creating duplications on the genome will slowly remove the deletion. But then something funny happened when I decided to increase the duplication by one or two bases instead of three.

1) WT genome, deletion on the transcript (I had to take a large range to trigger the del on the transcript). NC_000004.11:g.3076600_3076662= --> NM_002111.8:c.111_116del :+1: 2) One extra repeat on the genome, smaller deletion on the transcript. NC_000004.11:g.3076660_3076662dup --> NM_002111.8:c.114_116del :+1: 3) One more repeat, WT transcript. Perhaps a bit odd that the transcript positions suddenly move 5', but maybe not wrong? Not sure what the rules are and if 3' shifting applies here as well. NC_000004.11:g.3076657_3076662dup --> NM_002111.8:c.51_63= :thinking: 4) OK, adding one more repeat. NC_000004.11:g.3076654_3076662dup --> NM_002111.8:c.114_116dup :+1: But then... 5) Removing one base from the previous variant, so an insertion of two bases relative to the transcript, suddenly results in a duplication that is 6 bases (two repeats) longer than it should be. NC_000004.11:g.3076655_3076662dup --> NM_002111.8:c.109_116dup :open_mouth: 6) Removing another base, so an insertion of one base relative to the transcript, still adds these 6 bases. NC_000004.11:g.3076656_3076662dup --> NM_002111.8:c.111_117dup :open_mouth:

6 bases is also the length of the gap. So it seems that when not duplicating a perfect repeat, meaning, creating an insertion somewhere, the gap is suddenly gone. Or most likely, shifted somewhere else.

So this is, I think, what is happening. The gap between the genome and the transcript is somewhere, anywhere, in a large region. Because it's a repeated region, there's no telling where the gap precisely is. That's why my first variant had to hold the entire repeat region before the transcript annotation changed from WT to del. Sending NC_000004.11:g.3076657_3076662= got me nothing but a WT call on the transcript as well, which isn't entirely correct (as there's a gap!), but not really wrong either, because nobody knows if I meant to indicate the gap or not. The gap is not described or included, when it can be ignored. With this notation, VV looks at these bases on the genome, they match the transcript, so all good. The gap is "shifted" 5' and doesn't come near my variant. Then, when I indicate the entire genomic repeat region hasn't changed (variant 1), VV has to handle the gap, and reports a deletion. Also when inserting clean duplications (variants 2 through 4), VV includes the gap in the calculations, so the gap is "shifted" 3', just like the duplications. But variants 5 and 6 are strange. An imperfect duplication leads to an insertion in this repeat region. But, for an insertion in a repeat region with a gap, the question then becomes... where will we put the gap? Before or after the insertion? VV makes an interesting decision. The insertion is mapped to the transcript in such a way, that the sequence 3' of it includes the duplication on the genomic level. So the gap is "shifted" 5' and subsequently ignored, after which the duplication on the genome (which is the gap again!) is handled not as just an insertion, but as a duplication on the transcript level.

Things to consider:

When should a gap with an unclear position be handled and included in the variant description?
- Whenever possible?
- Only when necessary?
- Should VV try both to include and exclude the gap, and see which variant is simplest? Variants 5 and 6 should then be small insertions instead of bigger duplications, and short WT calls like NC_000004.11:g.3076657_3076662= will remain short WT calls on the transcript.
- Otherwise, are rules necessary to always 3' shift a gap, just like other variants?
Variants 5 and 6 do not contain a gap_statement and gapped_alignment_warning because the gap is shifted 5'. Should these warnings be included always when the gap can be shifted to overlap the variant?
Also, when handling clean, e.g. single-base, insertions in the genome, how will we determine where to insert this base into the transcript? Rules are needed for consistency. VV currently 5' shifts the gap in this case (see NC_000004.11:g.3076603_3076604insA).

I hope you'll enjoy this one, although I know you hate that gap-handling code...

Peter-J-Freeman commented 4 years ago

Now you are talking. I'll start digging with this one later. I am making great progress with https://github.com/openvar/variantValidator/issues/201. Code is now in place and I have added a small local cache that reduces the test time from >30 min to ~15 min. The liftover incurs a time penalty as we discussed, but the cache means that liftover is only done once using the slow method for each variant

I now need to update the tests and push up.

Before updating the servers, I also want to have another look at the issues around time-outs. See if I can find any more time savings to get the tests faster again

ifokkema commented 4 years ago

It all sounds great! I'm still making small improvements to the verification script to make sure it doesn't die anymore; being able to run it fully autonomously would speed up the process a lot as then it can run day and night. I'm considering whether I should use a cache on my side since I do need to repeat it as I've skipped large parts of the genome due to the known alignment issues, but on the other hand, at a certain point, we'll run on new variants only.

Peter-J-Freeman commented 2 years ago

Again, this might be fixed a little by https://github.com/openvar/variantValidator/issues/372 in terms of speed and we will get round to profiling

Peter-J-Freeman commented 2 years ago

@ifokkema , just spotted that this one is still open. Does it still need addressing? I think it probably does, but I played a lot with the gap code at this stage so may have addressed it. HMMMMM! Brain hurts

ifokkema commented 2 years ago

Hi Pete! Unfortunately, the links in the original report still show the same output. I have updated them to use rest.variantvalidator.org so you can easily test them for yourself. Cheers!

openvar / variantValidator

Interesting problems handling a gap between genome and transcript in a repeat region #205

Things to consider: