suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
227 stars 49 forks source link

forever running at the filtering mismappers step #8

Closed snowlover closed 6 years ago

snowlover commented 6 years ago

Hi Sebastian,

I really like your tool arriba, it's really fast. But recently I encountered this weird issue, what's the possible reason for it ? is it because some GTF records ? very similar splicing junction structures ? or anything else ...

Thanks.

suhrig commented 6 years ago

Hi snowlover,

This sounds very much like a bug I discovered just a week ago, which manifested, when there were chimeric reads in regions where many genes overlap according to the GTF annotation. In my case it was a cluster of genes named PCDHA[1-13]. This caused the runtime of the mismappers filter to increase exponentially with the number of overlapping genes. I fixed it in the latest release, which I'll push to GitHub this or next week. The fix cut down the overall runtime of Arriba from 2h to the more usual 2 min. I suppose the same issue could be affecting you.

Alternatively, there might simply be a lot of events left for the mismappers filter to process. How many events remain after the filtering step prior to the mismappers step? It's usually not more than a few hundred. A few thousand or tens of thousands can cause the mismappers filter to take a while, because it does very sensitive re-alignment of the reads, which is CPU-intensive. How much time did you give it, before terminating Arriba?

Regards, Sebastian

snowlover commented 6 years ago

Hi Sebastian,

That's great, probably it's the first reason. I leave it running for the whole night when using hg19, while hg38 finished at 30min. It has about 3500 events left when mismappers filtering. Can I try the fix before your official release ? Another question, what if I turn off the mismappers ? is it sensitive to the read length (mine is ~50bp) ? Thank you very much.

suhrig commented 6 years ago

the whole night when using hg19, while hg38 finished at 30min

Since this means you were also using another GTF, it could very well be the same issue. Maybe the newer version of the annotation does not annotate as many overlapping genes as the old one, which is why it runs faster there.

A quick (but not 100% accurate check) is to look if there are many breakpoints in the output file with different gene names, but the exact same breakpoint coordinates. If that's the case, then it's almost certainly the same issue as I ran into.

It has about 3500 events left when mismappers filtering.

That is a lot, but it should certainly not take the whole night to process. Out of curiosity: what's your sequencing depth (total number of mapped reads)?

Can I try the fix before your official release ?

I'll do my best to push the new release as fast as possible. The new Arriba version comes with a streamlined workflow (no more extract_read-through_fusions) and I would not want to distribute it without documentation. Don't worry, the new workflow is simpler than before, but the new Arriba is not a drop-in replacement for the old version and requires some more explanation, hence.

Another question, what if I turn off the mismappers ? is it sensitive to the read length (mine is ~50bp) ?

You should not turn of the mismappers filter regardless of the read length (at least the ones I have tested Arriba with: 50bp - 250bp), unless you are willing to deal with a whole lot more false positives. The filter runs faster the shorter the read lengths, because there is less sequence to re-align.

Regards, Sebastian

snowlover commented 6 years ago

Thanks, Sebastian.

  1. the mapped reads is about 70m pairs.
  2. may i know if including too much overlapping genes will significantly affect arriba's performance (false positives) ?
  3. is the new version's results comparable to the current one ? (i.e. mostly just processing time or steps different ?)
suhrig commented 6 years ago
  1. may i know if including too much overlapping genes will significantly affect arriba's performance (false positives) ?

No, it does not impact performance negatively. Whenever Arriba finds a fusion in a region where two genes overlap, it first tries to determine, if the fusion can clearly be attributed to one of the genes (e.g., if the breakpoint is located at a splice site of gene A and in an intron of gene B, then Arriba will attribute the fusion to gene A). If the fusion cannot be attributed to one of the overlapping genes, Arriba reports multiple lines with different gene names for the same event.

Besides, the number of overlapping genes is beyond your control (it's dictated by the annotation), so this should not be your concern.

  1. is the new version's results comparable to the current one ? (i.e. mostly just processing time or steps different ?)

The algorithm has not changed dramatically. I made some incremental improvements. The sensitivity should be improved marginally and the specificity should improve a bit more (~10% fewer false positives). The format of the output file contains a few additional columns (peptide sequence, coverage).

snowlover commented 6 years ago

very glad to know that. looking forward to the new release :)

suhrig commented 6 years ago

The new release is out. Please give it a try whether it solves your issue.

As announced, the workflow has changed. So you will need to make some adaptations to your pipeline. extract_read-through_fusions is obsolete. Arriba now directly takes STAR alignments as input, which makes the overall workflow even simpler and faster (because alignments no longer need to be sorted prior to running Arriba). Moreover, according to the developer of STAR the output file Chimeric.out.sam might be deprecated in future releases, so at some point the main output file Aligned.out.bam is going to be the only input file that Arriba requires.

snowlover commented 6 years ago

Thanks. I've tried the new version. it indeed solve the long time running issue. but i don't know why the running time is a little longer than v0.12, maybe it's the problem of my machine. besides, does arriba fully use the cpus provided to it ? for v1.0, it seems even cpu usage for STAR now is at most 100% if using 10 cpus.

suhrig commented 6 years ago

why the running time is a little longer than v0.12

Arriba 1.0.0 is more sensitive towards intragenic rearrangements, which comes at additional computational expenses, particularly when there is a high number of artifacts (i.e., high fraction of chimeric alignments). The speed advantage compared to v0.12.0 is only really noticeable, when you run STAR solely for the sake of gene fusion detection, because then the alignment file no longer needs to be sorted and indexed, which used to be required for Arriba v0.12.0. Most users will want to sort and index the alignment file, though, for downstream analyses.

does arriba fully use the cpus provided to it ?

Arriba is single-threaded. It never uses more than 1 CPU. Parallelizing Arriba's filter would be technically challenging. At best I could parallelize reading the BAM file, but this should not be necessary, because Arriba can read BAM files multiple times faster than STAR can produce them (even when using 16 threads for STAR).

it seems even cpu usage for STAR now is at most 100%

That sounds odd. Are you sure STAR uses only a single CPU for the whole runtime? It is normal that STAR uses just 1 CPU for loading the index as well as for the final steps. But in-between it should use as many CPUs as given in --runThreadN. Arriba should not slow down STAR according to my tests.

suhrig commented 6 years ago

it seems even cpu usage for STAR now is at most 100%

Your question made me aware of a related issue, which I hadn't noticed before: When STAR is instructed to write BAM files with --outSAMtype BAM, then it does not use all CPUs beyond a certain threshold. On my system it is futile to specify more CPUs than 8, because STAR hardly ever exceeds 8 CPUs anyway. I suspect this is because it uses only a single thread to compress the output BAM file, which becomes the bottleneck when using more than 8 CPUs. When I disable compression using --outBAMcompression 0, then STAR is capable of utilizing more CPUs than 8. This is another aspect how the workflow of Arriba v1.0.0 and v0.12.0 differ, which might explain your observations to some degree (STAR should definitely still use substantially more than 1 CPU). But Arriba cannot really be blamed for this.

snowlover commented 6 years ago

thanks, Sebastian. i'm very glad to use this new version and know the details. since my problem is solved, i'm closing this issue. looking forward to your future improvement, if possible, on this and more outstanding tools :)