possibly incorrect determination of the frame

anoronh4 commented 1 year ago

We have a fusion called SCP2-NTRK1 and we are curious as to why the fusion is being called as out of frame. The nt sequence as reported by arriba is as follows:

ATAATTTAGGCATTGGAGGAGCTGTGGTTGTAACACTCTACAAGATGGGTTTTCCGGAAGCCGCCAG___TTCTTTTAGAACTCATCAAATTGAAGCTGTTCCAACCAGCTCTGCAAGTGATGGATTTAAGGCAAATCTTGTTTTTAAGGAGATTGAGAAGAAACTTGAAGAG___TCTCGCTCTGTCACCCAGGCTGGAGTGCAGTGGCGCGATCTTGGCTCACTGCAACCTCCACCTCCCGGGTTCAAGCGATTCTCGTGCCTCAGCCTCCCGAGTAGCTGGCATTACAGGCACGTGCCACCACACCCAG|GACGGAGAAACAAGTTTGGGATCAACC___GCCCGGCTGTGCTGGCTCCAGAGGATGGGCTGGCCATGTCCCTGCATTTCATGACATTGGGTGGCAGCTCCCTGTCCCCCACCGAGGGCAAAGGCTCTGGGCTCCAAGGCCACATCATCGAGAACCCACAATACTTCAGTGATGCCTGTGAGGGGCTATGCTGGG...CAAGGGCAGGGACGA...___GTGTTCACCACATCAAGCGC

The sequence between the breakpoint | the last ___ before it is actually intron sequence (site1 == intron and site2 == CDS). The aa sequence that arriba predicts is as follows:

NLGIGGAVVVTLYKMGFPEAASSFRTHQIEAVPTSSASDGFKANLVFKEIEKKLEEsrsvtqagvqwrdlgslqppppgfkrfsclslpsswhyrhvpphp|grrnkfginrpavlapedglamslhfmtlggsslsptegkgsglqghiienpqyfsdaceglcw

we are a bit confused because the protein sequence after the breakpoint looks to be in-frame. so i'm just wondering what about this causes arriba to say it is out-of-frame.

by the way, we are using 2.3.0

suhrig commented 1 year ago

What's the transcript that Arriba reports in column transcript_id2? Only if the protein sequence is in-frame with respect to this transcript, then Arriba will consider the fusion protein to be in-frame. Of course, this merely shifts the problem from incorrect reading frame prediction to incorrect transcript selection, but this would explain why Arriba considers the fusion to be a frameshift, even though it looks like an in-frame fusion, and this is where I would have to start to fix the problem.

anoronh4 commented 1 year ago

ENST00000530298 is the transcript_id2 which is a noted in Ensembl as a retained intron...which is weird because arriba also says that site2 is CDS. is this an instance of incorrect transcript selection, as you mentioned in your comment? how is a particular transcript prioritized?

suhrig commented 1 year ago

I have taken a close look at this case. The reason why Arriba chooses the intron retention transcript is because there is some intron retention going on. Namely, the bases GTGAGGGGCTATGCTGGGTCAAGGGCAGGGACGA are transcribed as part of the fusion, which are intronic bases.

Arriba chooses the transcript based on the expression level and splice patterns. Basically, it assembles a transcript sequence from the fusion-supporting reads. When there are multiple transcripts being expressed at the same time, it picks the one with the highest expression. Then, it tries to match this transcript sequence to the most closely matching annotated transcript ID. In this case, Arriba comes to the conclusion that it's the transcript ID with intron retention, because of a small stretch of intron being transcribed.

This is a bit of an ambiguous case, because there is both intronic transcription and splicing. The transcribed bases do not match any annotated transcript perfectly. And although it's not entirely wrong of Arriba to choose an intron retention transcript, it could have chosen one of the transcripts without intron retention, and it would be as good of a match.

I suspect that Arriba did not determine the highest expressed transcript correctly. This may be the real mistake here. Can you send me an IGV screenshot of the region chr1:156,875,483-156,876,233 (hg38 coordinates) or chr1:156,845,289-156,846,025 (hg19 coordinates)? I would like to see the coverage in this region to confirm that the above mentioned short intronic stretch really is the dominant transcript in this region.

anoronh4 commented 1 year ago

arriba.zip is this igv report ok for checking out the reads? from what i can see in the alignment, several of the split reads from the fusion are skipping the intronic regions. but i'm not practiced at looking at these.

suhrig commented 1 year ago

Apologies for the long delay and also for the fact that I still have not come up with a solution yet despite the long time. However, your provided data was very helpful. I can confirm that Arriba did indeed pick the wrong transcript and that this is the underlying cause of the wrong reading frame. I will need some time to play with this example and see if I can improve the transcript selection. My problem is that I have hardly any spare time for this at the moment due to many other obligations. I will get back to you as soon as I can.

suhrig / arriba

possibly incorrect determination of the frame #190