suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

intragenic inversions: intragenic_exonic filter #78

Closed tobzac closed 4 years ago

tobzac commented 4 years ago

First of all thanks for sharing this great tool!

I am planning to use it for a pipeline and am therefore also interested in its limitations for smaller variants. For this I usually prepare some smaller in silico RNA variants and look at how different tools call them. I was able to call internal duplications with Arriba, as described in the documentation (if one or more whole exons are duplicated and the exons are large enough), but I am not succeeding in calling in silico inversions (where I have inverted, e.g., 1, 2, or more exons). They get filtered by the intragenic_exonic filter, although I am quite sure that I have put the inversion to have splice-site breakpoints for both ends. Do you know anything about this? What is the rationale to have it filtered? This is what the line in the discarded.tsv file looks like:

EGFR EGFR +/- +/+ 7:55221704 7:55224452 CDS splice-site inversion/3'-3' upstream upstream 14 12 12 10276 5485 low . . duplicates(39),intragenic_exonic
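
For reference, the fields of this line can be labeled with a short Python sketch. The column names are an assumption based on my reading of the output format documented for Arriba 1.x (first 20 columns) and may differ in other versions. The line is split on whitespace here because the tabs were lost in the post, whereas the real file is tab-separated. The number in parentheses after a filter name appears to be a read count.

```python
import re

# Column names are ASSUMED from the Arriba 1.x output documentation.
COLUMNS = [
    "gene1", "gene2", "strand1(gene/fusion)", "strand2(gene/fusion)",
    "breakpoint1", "breakpoint2", "site1", "site2", "type",
    "direction1", "direction2", "split_reads1", "split_reads2",
    "discordant_mates", "coverage1", "coverage2", "confidence",
    "closest_genomic_breakpoint1", "closest_genomic_breakpoint2", "filters",
]

# The line from discarded.tsv exactly as posted above.
LINE = (
    "EGFR EGFR +/- +/+ 7:55221704 7:55224452 CDS splice-site "
    "inversion/3'-3' upstream upstream 14 12 12 10276 5485 low . . "
    "duplicates(39),intragenic_exonic"
)


def parse_discarded_line(line):
    """Map the whitespace-separated fields of one line to column names."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def parse_filters(value):
    """Split 'name(count),name' into {name: count or None}."""
    result = {}
    for item in value.split(","):
        match = re.fullmatch(r"([^()]+)(?:\((\d+)\))?", item)
        if match is None:
            raise ValueError(f"unexpected filter entry: {item!r}")
        name, count = match.groups()
        result[name] = int(count) if count is not None else None
    return result


record = parse_discarded_line(LINE)
filters = parse_filters(record["filters"])
print(record["site1"], record["site2"], filters)
```

With this labeling, the line reports one breakpoint as `CDS` and the other as `splice-site`, in addition to the two filters.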

Is this related to the assignment of spliced1/2, which depends on the strand and when inversions somehow flip the strand? Or are the inversions that are called by arriba of some different kind (I am currently taking the word inversions very literally and just invert a few exons irrespective of how reasonable or biologically relevant that might be)?

Please let me know if you need further information. By the way: when do you plan to update to arriba2.0, with the enhanced tandem duplication sensitivity?

Thanks a lot in advance for your help! Best,

Tobias

suhrig commented 4 years ago

For this I usually prepare some smaller in silico RNA variants

I know this is a popular approach for lack of better benchmarking data, but if you have any possibility to use real data, I recommend benchmarking on that rather than on in silico data. Simulated data never fully recapitulate the characteristics of real data, such as insertion of non-template bases, common germline variants, allele-specific expression, nonsense-mediated decay, etc.

I am quite sure that I have put the inversion to have splice-site breakpoints for both ends

I can confirm that you put both breakpoints at splice sites, BUT biologically splicing will happen at only one of the breakpoints, because the other one is on the antisense strand. Arriba's prediction, namely that one breakpoint must be on the sense strand and the other on the antisense strand, is correct. Which one is which is probably arbitrary in your example, because both work equally well. Think about it: if you invert an entire exon, BOTH splice sites are lost, because they are antisense now. In a real cell, this will usually lead to the exon being skipped completely during splicing, such that you won't see the inversion in the RNA-Seq data in the shape of an inversion, but rather as an exon skip (which Arriba is not able to detect and probably never will be, because this is better handled by tools with a focus on differential splicing). If you put the inversion breakpoint in the middle of an exon, then theoretically it could happen that transcription continues on the antisense strand for a while, until a new splice site (on the antisense strand) is encountered, triggering the spliceosome to jump to the next exon. In real life, however, I have usually seen this lead to exon skips as well, or to the generation of a novel splice site in the intron rather than at the next exon boundary.
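
The loss of both splice sites can be illustrated on a toy sequence. The following Python sketch is not Arriba code. It uses a made-up locus with the canonical AG acceptor and GT donor dinucleotides, and it assumes the inverted segment includes those two dinucleotides (i.e. the breakpoints sit just outside the splice sites). After the inversion, the sense strand reads AC and CT at the exon boundaries.

```python
# Toy illustration with a HYPOTHETICAL sequence; not taken from EGFR.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Sense strand, 5'->3': intron ending in acceptor AG, exon, intron starting with donor GT.
UPSTREAM_INTRON = "TTTTCCAG"
EXON = "ATGGCCAAAG"
DOWNSTREAM_INTRON = "GTAAGTTT"


def splice_signals(upstream, downstream):
    """Dinucleotides flanking the exon on the sense strand: (acceptor, donor)."""
    return upstream[-2:], downstream[:2]


def invert_exon_with_splice_sites(upstream, exon, downstream):
    """Invert the exon together with its two flanking splice dinucleotides."""
    inverted = reverse_complement(upstream[-2:] + exon + downstream[:2])
    new_upstream = upstream[:-2] + inverted[:2]
    new_exon = inverted[2:-2]
    new_downstream = inverted[-2:] + downstream[2:]
    return new_upstream, new_exon, new_downstream


before = splice_signals(UPSTREAM_INTRON, DOWNSTREAM_INTRON)
new_up, new_exon, new_down = invert_exon_with_splice_sites(
    UPSTREAM_INTRON, EXON, DOWNSTREAM_INTRON
)
after = splice_signals(new_up, new_down)

print("before inversion:", before)  # ('AG', 'GT')
print("after inversion: ", after)   # ('AC', 'CT')
```

The AG/GT signals still exist after the inversion, but only on the opposite strand, which is why the spliceosome no longer recognizes the exon on the sense transcript.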

And that is why I recommend using real data rather than synthetic data. BTW, if you come across some good real-world RNA-Seq samples with intragenic inversions, I would be happy to get my hands on them. They are hard to find, because intragenic inversions are really rare it seems - which is yet another argument why it's a good thing that the intragenic_exonic filter punishes them so hard.

What is the rationale to have it filtered?

The rationale of the intragenic_exonic filter is the following: Genes are mostly made up of introns. If we assume that genomic breakpoints are randomly scattered over a gene, then it is very unlikely that both breakpoints are located inside an exon. Much more likely is the situation where both breakpoints are in introns. On the transcriptomic level, this will lead to splicing at the next exon boundary, such that the transcriptomic breakpoints are NOT inside an exon, but rather at an exon boundary. Even if one of the breakpoints happens to hit an exon, it is very unlikely that the second breakpoint does, too. Much more likely is that the second breakpoint is in an intron, which will manifest as an intronic breakpoint on the transcriptomic level (or as a splice-site breakpoint if the spliceosome gets involved). Due to the small probability that both breakpoints hit exons, the intragenic_exonic filter discards such events. This makes even more sense in view of the fact that in RNA-Seq the most abundant type of artifact is an intragenic chimera with both breakpoints in exons. These artifacts are presumably introduced during library preparation. In order to get rid of them, you have to filter hard.
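
To put rough numbers on this argument, here is a small Python sketch. It assumes two independent breakpoints drawn uniformly over a gene, and the gene model (10 kb, three exons, 5% exonic) is entirely hypothetical and chosen only for illustration. If f is the exonic fraction, the chance that both breakpoints are exonic is f², that both are intronic is (1 - f)², and that exactly one is exonic is 2f(1 - f). The simulation is only a sanity check of that arithmetic.

```python
import random

# HYPOTHETICAL toy gene: 10 kb long, three exons as 0-based half-open intervals.
GENE_LENGTH = 10_000
EXONS = [(0, 200), (4_000, 4_150), (9_850, 10_000)]


def exonic_fraction(exons, gene_length):
    """Fraction of the gene covered by exons."""
    return sum(end - start for start, end in exons) / gene_length


def is_exonic(pos, exons):
    """True if the position falls inside any exon."""
    return any(start <= pos < end for start, end in exons)


def analytic_probabilities(f):
    """Probabilities for two independent, uniformly placed breakpoints."""
    return {
        "both_exonic": f * f,
        "one_exonic": 2 * f * (1 - f),
        "both_intronic": (1 - f) ** 2,
    }


def simulate(exons, gene_length, trials=100_000, seed=0):
    """Estimate the same probabilities by drawing random breakpoint pairs."""
    rng = random.Random(seed)
    labels = ("both_intronic", "one_exonic", "both_exonic")
    counts = dict.fromkeys(labels, 0)
    for _ in range(trials):
        hits = sum(is_exonic(rng.randrange(gene_length), exons) for _ in range(2))
        counts[labels[hits]] += 1
    return {label: n / trials for label, n in counts.items()}


f = exonic_fraction(EXONS, GENE_LENGTH)
expected = analytic_probabilities(f)
observed = simulate(EXONS, GENE_LENGTH)
for label in expected:
    print(f"{label:14s} analytic={expected[label]:.4f} simulated={observed[label]:.4f}")
```

Under these toy assumptions only about 0.25% of random breakpoint pairs would have both ends in exons, compared with about 90% having both ends in introns.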

By the way: when do you plan to update to arriba2.0, with the enhanced tandem duplication sensitivity?

I find it hard to name a release date, but I am working on it a lot currently and hope to finish soon. I just pushed some more commits to the develop branch which will improve internal tandem duplication detection substantially. Feel free to try it out if you want. Please be aware that this will call some germline ITDs, because the current blacklist does not yet filter them.

And lastly: If you are interested in a certain type of variant that is consistently lost due to a filter, you can always turn the filter off using -f. Don't hesitate to do so if you are convinced that a certain filter makes no sense in your particular use case. I designed Arriba to work fine for most users. I am sure that for some niche analyses not all of the filters make sense.
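
As a rough illustration of how that could look in a pipeline, the Python sketch below assembles an Arriba command line with the intragenic_exonic filter disabled. Only `-f` is confirmed in this thread. The other options (`-x`, `-g`, `-a`, `-o`, `-O`) are how I recall Arriba's command-line help and should be checked against `arriba -h` for your version, the file names are placeholders, and further options such as a blacklist are omitted. The helper `build_arriba_command` is hypothetical and not part of Arriba.

```python
import shlex


def build_arriba_command(alignments, annotation, assembly, output,
                         discarded, disabled_filters=()):
    """Assemble an Arriba command line (hypothetical helper, not part of Arriba)."""
    cmd = [
        "arriba",
        "-x", alignments,   # aligned reads (assumed option name)
        "-g", annotation,   # gene annotation (assumed option name)
        "-a", assembly,     # genome assembly (assumed option name)
        "-o", output,       # accepted predictions (assumed option name)
        "-O", discarded,    # discarded predictions (assumed option name)
    ]
    if disabled_filters:
        # -f is the option mentioned above; multiple filters are assumed
        # to be passed as one comma-separated list.
        cmd += ["-f", ",".join(disabled_filters)]
    return cmd


# Placeholder file names; replace with your own paths.
cmd = build_arriba_command(
    "Aligned.out.bam", "annotation.gtf", "assembly.fa",
    "fusions.tsv", "fusions.discarded.tsv",
    disabled_filters=["intragenic_exonic"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the options have been verified. Building the command as a list rather than a single string avoids shell quoting problems with unusual file names.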

tobzac commented 4 years ago

Thanks a lot for your comprehensive answer! That's very helpful, and thanks for your advice. I see that my approach to the inversions was too simplistic... I understand it better now. Looking forward to 2.0.

Best,

Tobias