Almost all of Arriba's predictions on the BEERS dataset are read-through fusions. These are transcripts which - under real-life conditions - arise from the RNA polymerase missing the transcription termination signal and continuing transcription beyond the end of the gene, creating a fusion-like transcript between the gene and a neighboring gene or some intergenic region in the vicinity. The second mechanism by which these transcripts can be generated is focal deletions. A common example of this is the GOPC-ROS1 fusion. There is a section on read-through fusions in Arriba's manual.
The BEERS dataset was generated using RefSeq annotation plus a number of additional annotation files. Apart from normal transcripts annotated in RefSeq, it simulates (read-through) transcripts which are not seen in normal tissue and which are not observed under real-life conditions (certainly not at the simulated expression levels). Hence, Arriba reports these as aberrant transcripts. This is sensible and even desirable, because in a cancer sample these transcripts would indicate aberrant transcription with a potentially oncogenic effect. The reason other tools do not report them is that they heavily penalize potential read-through transcripts. If you inspect the breakpoints of the transcripts closely, you will notice that almost all of them are only a few dozen kb apart. Most tools do not report fusion transcripts with breakpoints that close together, which means they are blind to these aberrant transcripts. You can achieve the same effect with Arriba by increasing the minimum read-through distance, e.g., by passing the parameter -R 200000. I do not recommend this, though, because with real (i.e., non-simulated) sequencing data, Arriba's filters do a decent job of removing common/frequent/benign read-through fusions and there is no need to increase this parameter. Doing so runs the risk of missing fusions arising from focal deletions.
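For illustration, such a call could look roughly like the following; only the -R value is taken from this discussion, the remaining arguments are placeholders for a typical Arriba invocation and may differ between versions (check arriba -h):

    # raise the minimum read-through distance to 200 kb (input/output paths are placeholders)
    arriba -x Aligned.out.bam -a assembly.fa -g annotation.gtf -b blacklist.tsv \
           -o fusions.tsv -O fusions.discarded.tsv -R 200000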
Generally speaking, simulated data are not well suited for benchmarking fusion detection tools. They do not reflect the artifacts inherent to real sequencing data, and they harbor artifacts that are not seen under real-life conditions (such as the read-through fusions in the BEERS dataset). This was also nicely demonstrated in the DREAM SMC-RNA Challenge, where most tools performed really well on simulated data (rounds 1-3) but suffered when real sequencing data was used (rounds 4-5). Have a look at the Leaderboards: the tools achieved high precision and recall on the simulated data, but the accuracy of all tools deteriorated substantially once real sequencing data was used. Arriba ranked only in the top third on the simulated data, but was the best-performing method on real sequencing data. Instead of a simulated dataset, you should use a real sequencing sample from benign tissue to measure the false positive rate. Here are a few sources for RNA-Seq samples from benign tissue:
Thank you very much for your answer, it's very interesting! I understand better now.
Evaluating the sensitivity of fusion detection tools is even harder. Benchmarking data on gene fusions are scarce. There are only a handful of somewhat well-characterized samples available (MCF-7, BT-474, SK-BR-3, ...). Here are two suggestions on how to benchmark the recall rate of fusion detection tools beyond those samples:
With the first approach you can easily generate an arbitrary number of true positives, and the background model is still realistic. The disadvantage is that the fusion transcripts are simulated and do not reflect some of the special circumstances observed in a real-world scenario (fusions with intergenic regions are not simulated, for instance).
The second approach is 100% realistic, but depends on the accuracy of the structural variant caller used to detect SVs in the WGS data. On top of that, WGS data are scarce, too.
Hi Evansef, I'm closing this issue, since your question seems to be answered. Feel free to reopen if you still need help/advice on how to benchmark fusion tools properly. Kind regards, Sebastian
Hello, I ran into a problem with Arriba. Before using it on my own samples, I am testing it and other tools on test datasets (positive, negative, and real breast cancer cell line data, paired-end). It appears that I get a lot of false positives on the negative dataset generated with BEERS, which I took from the JAFFA datasets available here: https://github.com/Oshlack/JAFFA/wiki/Download
For comparison, I ran the analysis on the same data with FusionCatcher, STAR-Fusion, and InFusion. With these three tools I got 1, 8, and 38 false positives respectively, while I got 196 fusions with Arriba! I tried changing these parameters to improve my results, but without success (rough commands are sketched after the list):
Max e-value set to 0.05 (instead of the default 0.3): 186 fusions.
Anchor length set to 40 bp (instead of the default 23): 196 fusions.
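Roughly, each run differed from the default call in only one option; here I assume -E is the max e-value and -A the anchor length parameter of the Arriba version used (check arriba -h), and the remaining arguments are just placeholders:

    arriba -x Aligned.out.bam -a assembly.fa -g annotation.gtf -b blacklist.tsv -o fusions.tsv -E 0.05   # max e-value 0.05 -> 186 fusions
    arriba -x Aligned.out.bam -a assembly.fa -g annotation.gtf -b blacklist.tsv -o fusions.tsv -A 40     # anchor length 40 bp -> 196 fusions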
Any ideas on how to improve this? I don't understand why there are so many false positives; I would have thought that changing the anchor length to 40 would drastically decrease the number of fusions, but it stays the same...
Thank you in advance for your answer.