Closed: jun-wan closed this issue 5 years ago.
More details on genebody_coverage: the .exitcode file was generated.

Thanks, Jun, for looking into this.
Maybe it's a good time to think about using something else; QualiMap can do gene body coverage too:
Thank you for the suggestion. I will look at it.
I have no idea why Nextflow took such a long time to run geneBody_coverage2.py. Any ideas?
MultiQC also doesn't run with this output. That can be fixed, but I wonder if we should look into using a different tool to do this step (and others..?).
Parts could be replaced at least by QualiMap: http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#cmdline-rnaseqqc
Hm, so I guess we could just go back and reintroduce the subsampling procedure to make the gene body issues go away. The bigwig-based process seems to take ages - I've had a pipeline run going on some published data for ~2 days, and another project manager here at QBiC unfortunately has similar issues with the extremely long runtime of geneBody_coverage2.py as well.
I'd also be happy to implement the QualiMap step and test it here locally before pushing anything on here in a separate branch - any concerns about this?
Reverting would be a quick fix, but it feels dirty. The BigWig outputs are useful and I feel that we should be able to generate the gene body plots with this pretty easily. I'd quite like to have a proper look into this / profile the job and try to see what's going wrong. It's a really short script so I'm curious as to where the time is going.
Alternatively, we can indeed switch wholesale to another package. I'd prefer not to mix up QC packages too much as it can get pretty complicated and the software requirements are tougher.
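For the profiling idea, a minimal sketch of how one might run the script under cProfile in-process (the file names and the -i/-r/-o flags are assumptions based on RSeQC's usual interface, not a tested recipe):

import cProfile
import pstats
import runpy
import sys

# Hypothetical invocation: a local copy of the script plus small test inputs.
# The -i/-r/-o flags (input bigwig, gene model BED, output prefix) follow
# RSeQC's usual conventions and may need adjusting.
sys.argv = ["geneBody_coverage2.py",
            "-i", "sample.bigwig", "-r", "genes.bed", "-o", "profile_test"]

prof = cProfile.Profile()
prof.enable()
runpy.run_path("geneBody_coverage2.py", run_name="__main__")  # run the script in-process
prof.disable()

# Show the 20 functions with the highest cumulative time to locate the hotspot.
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)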
I did quite a lot of tests for this and feel like it might be a Nextflow problem. Actually, using the bigwig file takes much less memory and a shorter time (~6 hrs, human) than using the BAM file (~25 hrs). But when run as a Nextflow task, it took a very long time (>50 hrs), and sometimes Nextflow couldn't detect the exit of the task. I looked at the trace file and saw rchar > 10 TB - maybe the problem is caused by this?
Aha! @jun-wan you are confirming my suspicions! Now we just need to figure out what's happening 😆
Can you be a bit more explicit about what you mean by rchar > 10 TB? Then I could also have a look at my issues... happy to help here - we're quite affected by this, unfortunately :-(
It's the I/O counter - characters read and written - reported in .command.trace and also summarised in nfcore-rnaseq_trace.txt.
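For anyone wanting to spot these quickly, a small throwaway sketch (mine, not part of the pipeline) that ranks tasks by rchar from the trace file; it assumes the default tab-separated trace layout with name and rchar columns:

import csv

UNITS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}

def to_bytes(size):
    # Trace values are either raw bytes ("2763774903995") or human-readable
    # ("4846 GB", sometimes with a decimal comma); "-" means not recorded.
    if size == "-":
        return 0.0
    if " " not in size:
        return float(size)
    num, unit = size.split()
    return float(num.replace(",", ".")) * UNITS[unit]

with open("nfcore-rnaseq_trace.txt") as fh:  # hypothetical trace file name
    rows = list(csv.DictReader(fh, delimiter="\t"))

# Highest readers first - a runaway genebody_coverage task should jump out.
rows.sort(key=lambda r: to_bytes(r["rchar"]), reverse=True)
for r in rows[:10]:
    print(r["rchar"].rjust(10), r["name"])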
To narrow things down: same input dataset, same pipeline release (1.0), same computing system (binac) - still high rchar values of 3.5 TB for a single sample (human RNA-seq).
I had a power outage today, so the run didn't finish. Running again with 0.30.1 to test whether it's version-specific.
Update: Nextflow 0.30.1 (double-checked, same input script - same behaviour). It hasn't failed yet, but rchar is 2763774903995, so around 2.7 TB already.
Gah, why the hell is that process doing so much disk access? Can anyone see a reason for this in the code? What happens if we try running the job outside of Nextflow - does it take a comparable amount of time?
I'll take one of the BAMs and run it on a workstation to test - can't really test it on the cluster atm... Super weird thing, this script...

Update: Test running!
I think I'm having the same problem: I'm running the pipeline on Uppmax/rackham, with Singularity, using Nextflow v0.30.1. geneBody_coverage2.py finished OK according to the log file, and all results are there. Still, it was cancelled due to the time limit.
Looking at the trace file, the disk I/O is not so high: rchar is 12.2 MB and wchar is 663.4 KB.
.command.trace in the work directory says:
pid state %cpu %mem vmem rss peak_vmem peak_rss rchar wchar syscr syscw read_bytes write_bytes
2 0 1035 0 225292 34100 225292 34100 12746009 679299 1082 29972 12900352 0
I should also mention that I get a similar error using the ChIP-seq pipeline (https://github.com/nf-core/chipseq) when running BWA. Here I ran on Uppmax/bianca, with Singularity, and I tried Nextflow v0.31.0 and v0.30.1.
CollectRnaSeqMetrics from Picard can do this too if you set CHART=<OUTPUT_FILE>.
It looks like Picard is already part of the software requirements, but as Phil mentioned, it won't be as tidy...
CollectRnaSeqMetrics could be a nice solution actually if we can't figure this out soon... 👍
Yes, that's what I routinely use. You can also feed it ribosomal intervals to look for rRNA contamination (mainly in total RNA preps) - another metric that is useful to have, but you need to create a *.list file containing the intervals from the GTF. Parameters like strandedness can also be set.
I should also mention that the annotation has to be in refFlat format as far as I know, which might be a deal breaker.
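As a rough illustration of the interval-list step mentioned above (my own sketch, not pipeline code; the paths and the gene_biotype attribute assume an Ensembl-style GTF):

gtf = "genes.gtf"          # hypothetical annotation file
fai = "genome.fa.fai"      # samtools faidx index, reused for the @SQ header
out = "rRNA.interval_list"

import re

with open(out, "w") as o:
    # Picard interval lists start with a SAM-style header describing sequences.
    o.write("@HD\tVN:1.6\n")
    for line in open(fai):
        name, length = line.split("\t")[:2]
        o.write(f"@SQ\tSN:{name}\tLN:{length}\n")
    for line in open(gtf):
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if f[2] != "gene" or 'gene_biotype "rRNA"' not in f[8]:
            continue
        m = re.search(r'gene_id "([^"]+)"', f[8])
        gene_id = m.group(1) if m else "rRNA"
        # Body lines: chrom, 1-based start, end, strand, name (GTF is 1-based too).
        o.write("\t".join([f[0], f[3], f[4], f[6], gene_id]) + "\n")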
Hi there,
I took the time to read the script and made some changes, the main one being replacing the Python package bx.bbi with pyBigWig for reading the BigWig file. This resulted in a 21x performance increase in the script.

I've sent the script to its creator to see if he can make the changes permanent in a new release.
In the meantime, I'm attaching the script geneBody_coverage3.txt to this post (the extension txt should be changed back to py). If you add it to the bin folder of the pipeline, and add the conda package pybigwig=0.3.12 to the environment.yml and to the Docker image, we should be able to fix this issue. Or at least it will give us enough time to properly test alternatives and decide whether we want to change.
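To illustrate the gist of the change (a sketch only - the file name and coordinates are hypothetical, and this is not the attached script):

import pyBigWig

bw = pyBigWig.open("sample.bigwig")

# One vectorised call fetches per-base coverage for a whole region; pyBigWig
# returns a list of floats with nan where the bigwig has no data.
cov = bw.values("chr1", 1000000, 1005000)

# Or let the library bin the region directly, e.g. 100 gene-body percentiles:
means = bw.stats("chr1", 1000000, 1005000, type="mean", nBins=100)

bw.close()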
I've also noticed that MultiQC is not producing several of the RSeQC plots, like:
- gene_body_coverage: for this one I've noticed that the folder where files are being stored (geneBodyCoverage) is different from the one expected by MultiQC (gene_body_coverage); also, when publishing the files to the results folder, some of the rules have incorrect file names.

On another note, I've seen the rchar issue (it being GB/TB big) in other processes too. Maybe it could be an overhead created by Nextflow or another intermediary (I'm using Singularity)?

Please give it a try and let me know if it works for you.

Cheers, Santiago
> - gene_body_coverage: for this one I've noticed that the folder where files are being stored (geneBodyCoverage) is different from the one expected by MultiQC (gene_body_coverage); also, when publishing the files to the results folder, some of the rules have incorrect file names.
Sorry, regarding this point, "folder name" is unrelated to why this metric is not being produced. This was part of my ignorance regarding how MultiQC works.
However, there are a couple of errors in the module, but I will open a ticket in MultiQC for that.
@ewels, do you have a test to check if all the plots for the tools in the pipeline are being properly generated (or generated at all)?
@santiagorevale - no pipeline tests like this yet, no. It could be nice to add. We've thought about validating outputs; perhaps just grepping the MultiQC log output would be an easy start on that road...
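A sketch of what such a check could look like, assuming the run leaves a multiqc_data/multiqc.log that mentions each section it rendered (the results path and section names here are illustrative, not guaranteed by MultiQC):

import sys

# Section names are illustrative; adjust to what the pipeline should produce.
expected = ["gene_body_coverage", "read_distribution", "junction_annotation"]

# Path is an assumption about where the pipeline publishes MultiQC output.
log_text = open("results/MultiQC/multiqc_data/multiqc.log").read()

missing = [s for s in expected if s not in log_text]
if missing:
    sys.exit("Missing from MultiQC log: " + ", ".join(missing))
print("All expected sections mentioned in the MultiQC log.")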
Liguo Wang (RSeQC's developer) has released version 3.0 of RSeQC, with my changes to the gene body coverage script. If we update the package in the Docker image, we should notice the improvement in performance. Although I haven't checked whether all the other scripts still behave the same (no reason why they shouldn't, though).
Hello everyone! Does anyone know if this is still an issue in rnaseq v1.1 with nextflow>=0.32.0?
Probably still an issue yes, hoping to get to it very soon.
RSeQC 3.0 does include the modified script by @santiagorevale (fortunately still named geneBody_coverage2.py). However, RSeQC 3.0 is not yet available on Bioconda, and it requires Python >= 3. Does this cause any trouble, or are we lucky enough not to have any Python 2-only packages?
Thanks @alneberg - I've moved this item to a new issue (#148), as this issue has gotten to be very long and difficult to track.
Hopefully integrating rseqc will solve the problems on this issue and we can close it. If not, we can keep this issue to track them and perhaps look into migrating to Picard instead.
@apeltzer has now updated the dev branch to use RSeQC 3.0 🎉
Would be great if anyone gets a chance to try running the pipeline with some big inputs to see if there's an improvement.
Thanks all!
Phil
grep -e "genebody*" execution_trace.txt
(default trace columns: task_id, hash, native_id, name, status, exit, submit, duration, realtime, %cpu, peak_rss, peak_vmem, rchar, wchar)
165 29/ee512c 88280 genebody_coverage (QPB05024AJ__AS184983LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.587 3h 28m 51s 1h 37m 51s 101,5% 5042 MB 7301 MB 4846 GB 22 MB
166 ac/a264b6 88281 genebody_coverage (QPB05022A3__AS184981LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.612 3h 29m 41s 1h 38m 31s 101,6% 504 MB 7304 MB 4974 GB 23 MB
164 d1/47f623 88282 genebody_coverage (QRDGM008ALAS225023LR34148AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.633 4h 6m 16s 1h 34m 49s 101,4% 5037 MB 7292 MB 4806 GB 22 MB
163 1c/2ae0ad 88283 genebody_coverage (QRDGM011A6AS225026LR34148AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.668 4h 6m 46s 1h 35m 12s 100,8% 5038 MB 7296 MB 4799 GB 22 MB
167 30/0767d9 88285 genebody_coverage (QPB05018AA__AS184977LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.713 4h 46m 36s 1h 38m 57s 100,6% 5041 MB 7288 MB 4842 GB 22 MB
168 6f/0ef199 88286 genebody_coverage (QRDGM010AWAS225025LR34148AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.736 5h 1s 1h 31m 9s 101,2% 5033 MB 7288 MB 4681 GB 21 MB
169 86/74b6df 88287 genebody_coverage (QRDGM009ATAS225024LR34148AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.757 5h 6m 16s 1h 36m 33s 101,7% 5041 MB 7302 MB 4884 GB 22 MB
170 a5/9a0053 88288 genebody_coverage (QPB05013A4__AS184972LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.777 5h 39m 1s 1h 32m 47s 101,4% 504 MB 7294 MB 4766 GB 22 MB
171 6d/c0a079 88289 genebody_coverage (QPB05017A2__AS184976LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.797 5h 42m 16s 1h 35m 24s 100,9% 5036 MB 7325 MB 4849 GB 22 MB
173 50/1d6001 88290 genebody_coverage (QPB05023AB__AS184982LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.823 5h 53m 36s 1h 38m 9s 101,1% 5041 MB 7294 MB 4914 GB 23 MB
172 8d/274122 88291 genebody_coverage (QPB05014AC__AS184973LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.856 6h 27m 21s 1h 40m 45s 100,7% 504 MB 7297 MB 4843 GB 22 MB
175 9c/b0217d 88292 genebody_coverage (QPB05015AK__AS184974LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.886 6h 35m 51s 1h 35m 47s 100,8% 504 MB 7298 MB 4764 GB 22 MB
176 8e/00bae9 88293 genebody_coverage (QRDGM012AEAS225027LR34148AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.914 6h 40m 26s 1h 34m 6s 100,9% 5039 MB 7295 MB 4833 GB 22 MB
174 77/65462b 88294 genebody_coverage (QRDGM007ADAS225022LR34148AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.943 7h 12m 21s 1h 33m 16s 100,7% 5037 MB 7297 MB 4719 GB 22 MB
178 83/9091e1 88295 genebody_coverage (QPB05016AS__AS184975LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:29.973 7h 20m 51s 1h 38m 33s 100,8% 5042 MB 7295 MB 4961 GB 23 MB
180 c2/d3f413 88296 genebody_coverage (QPB05020AL__AS184979LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:30.002 7h 38m 36s 1h 44m 54s 100,7% 5047 MB 7307 MB 5119 GB 24 MB
177 90/41d727 88297 genebody_coverage (QPB05019AI__AS184978LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:30.031 8h 8m 7s 1h 40m 41s 101,4% 5039 MB 7297 MB 4741 GB 22 MB
179 a3/686b0c 88298 genebody_coverage (QPB05021AT__AS184980LR28503AlignedByCoord.out) COMPLETED 0 2019-03-19 13:57:30.059 8h 17m 47s 1h 41m 52s 100,7% 5047 MB 7302 MB 5144 GB 24 MB
These were some mouse samples that I just ran for integration testing of the dev branch these days. Can't really say whether this is much faster than before, unfortunately. Between 3 and 8.5 hours per sample is a lot, IMHO. I'm going to run some human data today, too.
Human data (same columns as above):
79 18/f4cc73 88447 genebody_coverage (I16R003f01_01_S5_L003_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.167 2h 14m 41s 2h 1m 14s 100,6% 9395 MB 11 GB 9504 GB 25 MB
76 3f/6c2192 88444 genebody_coverage (I16R003f02_01_S6_L002_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.031 2h 27m 2s 2h 26m 58s 101,6% 9393 MB 11 GB 9321 GB 24 MB
77 47/63af20 88443 genebody_coverage (I16R003f02_01_S6_L004_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:29.988 2h 28m 47s 2h 28m 36s 101,5% 9393 MB 11 GB 9348 GB 24 MB
80 ab/1bee49 88446 genebody_coverage (I16R003f01_01_S5_L002_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.134 3h 28m 38s 3h 28m 31s 100,7% 9397 MB 11 GB 9506 GB 25 MB
78 32/a4c8ba 88449 genebody_coverage (I16R003f02_01_S6_L003_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.227 3h 42m 33s 2h 27m 13s 100,5% 9395 MB 11 GB 9278 GB 24 MB
82 75/c10a0e 88450 genebody_coverage (I16R003f01_01_S5_L001_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.258 4h 17m 13s 2h 2m 31s 101,6% 9397 MB 11 GB 9494 GB 25 MB
75 76/a10aa6 88445 genebody_coverage (I16R003f02_01_S6_L001_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.078 4h 20m 23s 4h 20m 17s 100,3% 9395 MB 11 GB 9272 GB 24 MB
81 85/b42b38 88451 genebody_coverage (I16R003f01_01_S5_L004_R1_001AlignedByCoord.out) COMPLETED 0 2019-03-22 16:41:30.290 4h 29m 58s 2h 2m 55s 100,9% 940 MB 11 GB 9445 GB 25 MB
I think it's fine now - we should maybe still keep an eye on it...
Running geneBody_coverage2.py on bigwig files instead of the entire BAM as input requires much less memory, but is still time-consuming (32 hrs for 15M reads, 1.1 GB BAM).