nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

Time-consuming to run RSeQC geneBody_coverage2.py on bigwig #81

Closed - jun-wan closed this 5 years ago

jun-wan commented 6 years ago

Running geneBody_coverage2.py on bigwig files instead of the entire BAM as input requires much less memory, but it is still time-consuming (32 hrs for 15M reads, 1.1 GB BAM).

jun-wan commented 6 years ago

More details:

  1. For human data, running .command.sh directly takes ~6 hrs to finish, but the same command takes >50 hrs when run as a Nextflow task.
  2. Sometimes Nextflow doesn't detect that genebody_coverage has exited:
    • all result/output files are generated properly
    • no .exitcode file is generated
    • the job fails due to the time limit
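
For context, what this step computes is conceptually simple: coverage is sampled at 100 percentile positions along each transcript and summed across transcripts, giving the 5'->3' profile. A minimal pure-Python sketch of the idea (hypothetical names, not the RSeQC implementation):

```python
def gene_body_profile(transcript_coverages, bins=100):
    """Sum coverage at `bins` percentile positions along each transcript.

    transcript_coverages: list of per-base coverage lists, one per transcript.
    """
    profile = [0.0] * bins
    for cov in transcript_coverages:
        n = len(cov)
        if n < bins:
            continue  # transcript too short to sample `bins` positions
        for b in range(bins):
            # base at the b-th percentile of the transcript body
            pos = int(b * (n - 1) / (bins - 1))
            profile[b] += cov[pos]
    return profile

# toy example: one transcript whose coverage rises linearly along its body
profile = gene_body_profile([[float(i) for i in range(200)]])
```

Nothing here should be slow for a few tens of thousands of transcripts, which is why the multi-hour runtimes point at the I/O layer rather than the algorithm.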
apeltzer commented 6 years ago

Thanks, Jun, for looking into this.

Maybe it's a good time to think about using something else; QualiMap can do gene body coverage too:

http://kokonech.github.io/qualimap/kidney_rnaseqqc/qualimapReport.html#Coverage%20Profile%20Along%20Genes%20(Total)

jun-wan commented 6 years ago

Thank you for the suggestion. I will look into it. I have no idea why Nextflow takes so long to run genebody_coverage2.py. Any ideas?

ewels commented 6 years ago

MultiQC also doesn't run with this output. That can be fixed, but I wonder if we should look into using a different tool to do this step (and others..?).

apeltzer commented 6 years ago

Parts could be replaced at least by QualiMap: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#cmdline-rnaseqqc
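
For anyone who wants to try it, the invocation would be along these lines (file names hypothetical; see the linked docs for the full option list):

```bash
qualimap rnaseq \
    -bam sample.bam \
    -gtf annotation.gtf \
    -outdir qualimap_rnaseq_results
```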

apeltzer commented 6 years ago

Hm, so I guess we could just go back and reintroduce the subsampling procedure to make the gene body issues go away. The bigwig-based process seems to take ages - I've had a pipeline run going on some published data for ~2 days, and another project manager here at QBiC is unfortunately seeing the same extremely long runtimes for geneBody_coverage2.py.

I'd also be happy to implement the QualiMap step in a separate branch and test it locally before pushing anything here - any concerns about this?

ewels commented 6 years ago

Reverting would be a quick fix, but it feels dirty. The BigWig outputs are useful and I feel that we should be able to generate the gene body plots with this pretty easily. I'd quite like to have a proper look into this / profile the job and try to see what's going wrong. It's a really short script so I'm curious as to where the time is going.

Alternatively, we can indeed switch wholesale to another package. I'd prefer not to mix up QC packages too much as it can get pretty complicated and the software requirements are tougher.

jun-wan commented 6 years ago

I did quite a lot of tests on this, and it feels like it might be a Nextflow problem. Using the bigwig file actually takes much less memory and less time (~6 hrs, human) than using the BAM file (~25 hrs). But when run as a Nextflow task it took a very long time (>50 hrs), and sometimes Nextflow couldn't detect the exit of the task. I looked at the trace file and saw rchar > 10 TB - maybe that's the cause?

ewels commented 6 years ago

Aha! @jun-wan you are confirming my suspicions! Now we just need to figure out what's happening 😆

apeltzer commented 6 years ago

Can you be a bit more explicit about what you mean by rchar > 10 TB? Then I could also have a look at my issues... happy to help here - we're quite affected by this, unfortunately :-(

jun-wan commented 6 years ago

It's the I/O counter (characters read/written) in .command.trace, also summarised in nfcore-rnaseq_trace.txt.
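
For anyone unfamiliar with these counters: they are the cumulative Linux /proc/<pid>/io values, which Nextflow copies into .command.trace. A small illustration of what they look like and why they can exceed the input size (sample values taken from a trace later in this thread):

```python
# The rchar/wchar numbers in .command.trace are cumulative byte counters from
# Linux /proc/<pid>/io. rchar counts *every* byte passed through read(), so a
# process that re-reads the same file regions over and over can rack up TBs of
# rchar against a ~1 GB input file. Parsing a sample (real /proc field names):

sample = """\
rchar: 12746009
wchar: 679299
syscr: 1082
syscw: 29972
read_bytes: 12900352
write_bytes: 0"""

counters = {k.strip(): int(v) for k, v in
            (line.split(":") for line in sample.splitlines())}

print(f"{counters['rchar'] / 1e6:.1f} MB read")  # a well-behaved task: ~12.7 MB
```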

apeltzer commented 6 years ago

To narrow things down: same input dataset, same pipeline release (1.0), same computing system (binac) - still high rchar values of 3.5 TB for a single sample (human RNA-seq).

I had a power outage today, so the run didn't finish. Running again with 0.30.1 to test whether it's version-specific.

apeltzer commented 6 years ago

Update: Nextflow 0.30.1 (double-checked: same input, same script - same behaviour). It hasn't failed yet, but rchar is 2763774903995, so around 2.7 TB already.

ewels commented 6 years ago

Gah, why the hell is that process doing so much disk access? Can anyone see a reason for this in the code? What happens if we try running the job outside of Nextflow - does it take a comparable amount of time?

apeltzer commented 6 years ago

I'll take one of the BAMs and run it on a workstation to test - can't really test it on the cluster atm... This script is a super weird thing...

Update: Test running!

orzechoj commented 6 years ago

I think I'm having the same problem: I'm running the pipeline on Uppmax/rackham, with Singularity, using Nextflow v0.30.1.

geneBody_coverage2.py finished OK according to the log file, and all results are there. Still, the job was cancelled due to the time limit.

Looking at the trace file, the disk I/O is not so high: rchar is 12.2 MB and wchar is 663.4 KB.

.command.trace in the work directory says

pid state %cpu %mem vmem rss peak_vmem peak_rss rchar wchar syscr syscw read_bytes write_bytes
2 0 1035 0 225292 34100 225292 34100 12746009 679299 1082 29972 12900352 0
orzechoj commented 6 years ago

I should also mention that I get a similar error using the ChIP-seq pipeline https://github.com/nf-core/chipseq when running BWA. There I ran on Uppmax/bianca, with Singularity, and I tried Nextflow v0.31.0 and v0.30.1.

drpatelh commented 6 years ago

CollectRnaSeqMetrics from Picard can do this too if you set CHART=<OUTPUT_FILE>

It looks like Picard is already part of the software requirements, but as Phil mentioned it won't be as tidy...

ewels commented 6 years ago

CollectRnaSeqMetrics could be a nice solution actually if we can't figure this out soon... 👍

drpatelh commented 6 years ago

Yes, that's what I routinely use. You can also feed it ribosomal intervals to look for rRNA contamination (mainly in total RNA preps) - another useful metric to have - but you need to create a *.list file containing the intervals from the GTF. Parameters like strandedness can also be set.
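
For reference, an invocation along these lines should produce both the chart and the rRNA metrics (file names hypothetical; check `CollectRnaSeqMetrics --help` for the exact option names in the pinned Picard version):

```bash
picard CollectRnaSeqMetrics \
    INPUT=sample.bam \
    OUTPUT=sample.rna_metrics.txt \
    REF_FLAT=refFlat.txt \
    STRAND_SPECIFICITY=NONE \
    CHART_OUTPUT=sample.gene_body_coverage.pdf \
    RIBOSOMAL_INTERVALS=rRNA.interval_list
```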

drpatelh commented 6 years ago

I should also mention that the annotation has to be in refFlat format as far as I know, which might be a deal-breaker.
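
If it came to that, GTF -> refFlat is at least scriptable with UCSC's gtfToGenePred; a commonly used recipe (untested here, tool from the UCSC kent utilities):

```bash
# -genePredExt puts the gene symbol in column 12; refFlat wants it prepended
gtfToGenePred -genePredExt -geneNameAsName2 genes.gtf genes.genepred
paste <(cut -f 12 genes.genepred) <(cut -f 1-10 genes.genepred) > refFlat.txt
```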

santiagorevale commented 6 years ago

Hi there,

I took the time to read the script and made some changes, the main one being replacing the Python package bx.bbi with pyBigWig for reading the BigWig file. This resulted in a 21x performance increase in the script.

I've sent the script to its creator to see if he can make the changes permanent in a new release.

In the meantime, I'm attaching the script geneBody_coverage3.txt to this post (the extension txt should be changed back to py). If you add it to the bin folder of the pipeline, and add the conda package pybigwig=0.3.12 to environment.yml and the Docker image, we should be able to fix this issue - or at least buy enough time to properly test alternatives and decide whether we want to change.
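
For anyone curious where the 21x comes from: as I understand the change, bx.bbi was being queried once per position/bin, whereas pyBigWig can hand back a whole region in one bw.values(chrom, start, end) call. Schematically, with a plain numpy array standing in for the BigWig reader (names hypothetical):

```python
import numpy as np

# stand-in for the coverage signal stored in the BigWig file
coverage = np.arange(1_000_000, dtype=float)

def fetch_per_base(start, end):
    # old pattern: one lookup per base (like querying bx.bbi in a loop)
    return [coverage[i] for i in range(start, end)]

def fetch_region(start, end):
    # new pattern: a single call returns the whole region at once
    # (analogous to pyBigWig's bw.values(chrom, start, end))
    return coverage[start:end]

# both patterns return the same values; only the call count differs
assert list(fetch_per_base(10, 20)) == list(fetch_region(10, 20))
```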

I've also noticed that MultiQC is not producing several of the RSeQC plots, like:

On another note, I've seen the rchar issue (values in the GB/TB range) in other processes too. Maybe it's overhead created by Nextflow or some other intermediary (I'm using Singularity)?

Please, give it a try and let me know if it works for you.

Cheers, Santiago

santiagorevale commented 6 years ago
  • gene_body_coverage: for this one I've noticed that the folder where files are stored (geneBodyCoverage) is different from the one expected by MultiQC (gene_body_coverage); also, when publishing the files to the results folder, some of the rules have incorrect file names;

Sorry - regarding this point, the folder name is unrelated to why this metric is not being produced. That was my ignorance of how MultiQC works.

However, there are a couple of errors in the module, but I will open a ticket in MultiQC for that.

@ewels, do you have a test to check if all the plots for the tools in the pipeline are being properly generated (or generated at all)?

ewels commented 6 years ago

@santiagorevale - no pipeline tests like this yet, no. It could be nice to add. We've thought about validating outputs; perhaps just grepping the MultiQC log output would be an easy start down that road.

santiagorevale commented 5 years ago

Liguo Wang (RSeQC's developer) has released version 3.0 of RSeQC, with my changes to the Gene Body Coverage script. If we update the package in the Docker image, we should see the performance improvement. I haven't checked whether all the other scripts still behave the same, though (no reason why they shouldn't).

matrulda commented 5 years ago

Hello everyone! Does anyone know if this is still an issue in rnaseq v1.1 with nextflow>=0.32.0?

ewels commented 5 years ago

Probably still an issue yes, hoping to get to it very soon.

alneberg commented 5 years ago

RSeQC 3.0 does include the modified script by @santiagorevale (fortunately still named geneBody_coverage2.py). However, RSeQC 3.0 is not yet available on Bioconda, and it requires Python >= 3. Does this cause any trouble, or are we lucky enough not to have any Python 2-only packages?
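
Once it lands on Bioconda, the bump itself should just be a pin change in environment.yml - something like this (exact version strings assumed):

```yaml
# environment.yml (fragment)
dependencies:
  - python=3.6
  - rseqc=3.0.0
```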

ewels commented 5 years ago

Thanks @alneberg - I've moved this item to a new issue (#148), as this issue has gotten to be very long and difficult to track.

Hopefully integrating the new RSeQC will solve the problems in this issue and we can close it. If not, we can keep this issue open to track them and perhaps look into migrating to Picard instead.

ewels commented 5 years ago

@apeltzer has now updated the dev branch to use RSeQC 3.0 🎉

Would be great if anyone gets a chance to try running the pipeline with some big inputs to see if there's an improvement.

Thanks all!

Phil

apeltzer commented 5 years ago
grep -e "genebody*" execution_trace.txt
165     29/ee512c       88280   genebody_coverage (QPB05024AJ__AS184983LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.587 3h 28m 51s      1h 37m 51s      101,5%  5042 MB 7301 MB 4846 GB 22 MB
166     ac/a264b6       88281   genebody_coverage (QPB05022A3__AS184981LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.612 3h 29m 41s      1h 38m 31s      101,6%  504 MB  7304 MB 4974 GB 23 MB
164     d1/47f623       88282   genebody_coverage (QRDGM008ALAS225023LR34148AlignedByCoord.out) COMPLETED       0       2019-03-19 13:57:29.633 4h 6m 16s       1h 34m 49s      101,4%  5037 MB 7292 MB 4806 GB 22 MB
163     1c/2ae0ad       88283   genebody_coverage (QRDGM011A6AS225026LR34148AlignedByCoord.out) COMPLETED       0       2019-03-19 13:57:29.668 4h 6m 46s       1h 35m 12s      100,8%  5038 MB 7296 MB 4799 GB 22 MB
167     30/0767d9       88285   genebody_coverage (QPB05018AA__AS184977LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.713 4h 46m 36s      1h 38m 57s      100,6%  5041 MB 7288 MB 4842 GB 22 MB
168     6f/0ef199       88286   genebody_coverage (QRDGM010AWAS225025LR34148AlignedByCoord.out) COMPLETED       0       2019-03-19 13:57:29.736 5h 1s   1h 31m 9s       101,2%  5033 MB 7288 MB 4681 GB 21 MB
169     86/74b6df       88287   genebody_coverage (QRDGM009ATAS225024LR34148AlignedByCoord.out) COMPLETED       0       2019-03-19 13:57:29.757 5h 6m 16s       1h 36m 33s      101,7%  5041 MB 7302 MB 4884 GB 22 MB
170     a5/9a0053       88288   genebody_coverage (QPB05013A4__AS184972LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.777 5h 39m 1s       1h 32m 47s      101,4%  504 MB  7294 MB 4766 GB 22 MB
171     6d/c0a079       88289   genebody_coverage (QPB05017A2__AS184976LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.797 5h 42m 16s      1h 35m 24s      100,9%  5036 MB 7325 MB 4849 GB 22 MB
173     50/1d6001       88290   genebody_coverage (QPB05023AB__AS184982LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.823 5h 53m 36s      1h 38m 9s       101,1%  5041 MB 7294 MB 4914 GB 23 MB
172     8d/274122       88291   genebody_coverage (QPB05014AC__AS184973LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.856 6h 27m 21s      1h 40m 45s      100,7%  504 MB  7297 MB 4843 GB 22 MB
175     9c/b0217d       88292   genebody_coverage (QPB05015AK__AS184974LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.886 6h 35m 51s      1h 35m 47s      100,8%  504 MB  7298 MB 4764 GB 22 MB
176     8e/00bae9       88293   genebody_coverage (QRDGM012AEAS225027LR34148AlignedByCoord.out) COMPLETED       0       2019-03-19 13:57:29.914 6h 40m 26s      1h 34m 6s       100,9%  5039 MB 7295 MB 4833 GB 22 MB
174     77/65462b       88294   genebody_coverage (QRDGM007ADAS225022LR34148AlignedByCoord.out) COMPLETED       0       2019-03-19 13:57:29.943 7h 12m 21s      1h 33m 16s      100,7%  5037 MB 7297 MB 4719 GB 22 MB
178     83/9091e1       88295   genebody_coverage (QPB05016AS__AS184975LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:29.973 7h 20m 51s      1h 38m 33s      100,8%  5042 MB 7295 MB 4961 GB 23 MB
180     c2/d3f413       88296   genebody_coverage (QPB05020AL__AS184979LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:30.002 7h 38m 36s      1h 44m 54s      100,7%  5047 MB 7307 MB 5119 GB 24 MB
177     90/41d727       88297   genebody_coverage (QPB05019AI__AS184978LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:30.031 8h 8m 7s        1h 40m 41s      101,4%  5039 MB 7297 MB 4741 GB 22 MB
179     a3/686b0c       88298   genebody_coverage (QPB05021AT__AS184980LR28503AlignedByCoord.out)       COMPLETED       0       2019-03-19 13:57:30.059 8h 17m 47s      1h 41m 52s      100,7%  5047 MB 7302 MB 5144 GB 24 MB

These were some mouse samples that I just ran for integration testing of the dev branch these days. Can't really say whether this is much faster than before unfortunately.

So 3-8.5 hours per sample, which is a lot IMHO.

I'm going to run some human data today, too.

apeltzer commented 5 years ago

Human Data:

79      18/f4cc73       88447   genebody_coverage (I16R003f01_01_S5_L003_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.167     2h 14m 41s      2h 1m 14s       100,6%  9395 MB 11 GB   9504 GB 25 MB
76      3f/6c2192       88444   genebody_coverage (I16R003f02_01_S6_L002_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.031     2h 27m 2s       2h 26m 58s      101,6%  9393 MB 11 GB   9321 GB 24 MB
77      47/63af20       88443   genebody_coverage (I16R003f02_01_S6_L004_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:29.988     2h 28m 47s      2h 28m 36s      101,5%  9393 MB 11 GB   9348 GB 24 MB
80      ab/1bee49       88446   genebody_coverage (I16R003f01_01_S5_L002_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.134     3h 28m 38s      3h 28m 31s      100,7%  9397 MB 11 GB   9506 GB 25 MB
78      32/a4c8ba       88449   genebody_coverage (I16R003f02_01_S6_L003_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.227     3h 42m 33s      2h 27m 13s      100,5%  9395 MB 11 GB   9278 GB 24 MB
82      75/c10a0e       88450   genebody_coverage (I16R003f01_01_S5_L001_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.258     4h 17m 13s      2h 2m 31s       101,6%  9397 MB 11 GB   9494 GB 25 MB
75      76/a10aa6       88445   genebody_coverage (I16R003f02_01_S6_L001_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.078     4h 20m 23s      4h 20m 17s      100,3%  9395 MB 11 GB   9272 GB 24 MB
81      85/b42b38       88451   genebody_coverage (I16R003f01_01_S5_L004_R1_001AlignedByCoord.out)      COMPLETED       0       2019-03-22 16:41:30.290     4h 29m 58s      2h 2m 55s       100,9%  940 MB  11 GB   9445 GB 25 MB
apeltzer commented 5 years ago

I think it's fine now - but we should maybe still keep an eye on it...