Closed muffato closed 1 year ago
nf-core lint
Overall result: Passed :white_check_mark: :warning:
Posted for pipeline commit a5d21f4
| ✅ 132 tests passed |
| ❔ 20 tests were ignored |
| ❗ 1 test had warnings |
Now that I've generated all the charts, I realise that some resource requirements are actually too low! I should be asking for 150 MB for MultiQC, not 50 MB. I guess it worked because the jobs finish too quickly for MEMLIMIT to have time to kill them.
Some COOLER_ZOOMIFY processes also take more than the 12 GB I'm requesting. Those processes take about 10 min, so I would have expected MEMLIMIT to kick in? Anyway, I'll sort all of this out in another commit.
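An override along these lines could raise those two requests. This is a hedged sketch only: the process selectors match the tools mentioned above, but the exact values, the `task.attempt` scaling, and the config location are assumptions, not the commit's actual contents.

```groovy
// conf/base.config (illustrative) — bump the two under-provisioned processes.
// Values are examples; the closures rescale on retry via task.attempt.
process {
    withName: 'MULTIQC' {
        memory = { 150.MB * task.attempt }   // was 50 MB, observed usage above that
    }
    withName: 'COOLER_ZOOMIFY' {
        memory = { 16.GB * task.attempt }    // was 12 GB, occasionally exceeded
    }
}
```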
I've made the few changes I mentioned in https://github.com/sanger-tol/genomenote/pull/92/#issuecomment-1816628012: just adjusted some requirements up or down. I reran the pipeline on all species and it worked fine.
I also merged the dev branch in to solve the conflict coming from #93.
@BethYates: this PR just needs an approval and then I can merge it.
I merged #90 by accident, so reopening a new PR.
Closes #16, #18, #20, #91
Like in https://github.com/sanger-tol/readmapping/pull/82, the goal is to stop using the `process_*` labels and instead optimise the resource requests of every process. I'm using the same dataset: 10 genomes of increasing size, with 1 Hi-C library and 1 PacBio library each.
I found much less correlation than in the read-mapping pipeline. The only input size that I found useful was the genome size, now collected at the beginning of the pipeline and added to the `meta` map. There is some correlation between the number of Hi-C reads and some process runtimes, but not memory usage. Since runtime estimates don't need to be very accurate (really, it's only normal/long/week that matters), I don't even pull in that input size.

I am using helper functions to grow values (like the number of CPUs) in a logarithmic fashion. In effect, this limits the increase in the number of CPUs, especially as the advantage of multi-threading tends to decrease with a higher number of threads.
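A logarithmic-growth helper could look like the sketch below. The function name, constants, and the process it is applied to are all illustrative assumptions; only the idea (one extra CPU per doubling of genome size, so requests grow slowly) reflects what is described above.

```groovy
// Illustrative helper: grow a resource request logarithmically with input size.
// Returns the number of doublings of `size` over the reference `unit`, floored at 0.
def log_increase(size, unit) {
    def doublings = Math.log((double) size / unit) / Math.log(2)
    return Math.max(0, Math.ceil(doublings) as int)
}

process {
    withName: 'COOLER_CLOAD' {
        // genome_size is assumed to be carried in the meta map, as described above:
        // a 100 Mbp genome gets 2 CPUs, 200 Mbp gets 3, 400 Mbp gets 4, etc.
        cpus = { 2 + log_increase(meta.genome_size, 100_000_000) }
    }
}
```

The point of the logarithm is that doubling the genome again only adds one CPU, which matches the diminishing returns of extra threads.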
Also:

- Replaced the `GrabFiles` process with some Groovy magic, as per https://community.seqera.io/t/is-it-bad-practice-to-try-and-pluck-files-from-an-element-in-a-channel-that-is-a-directory-with-channel-manipulation/224/2 . This saves 1 LSF job.
- Adjusted the `GNU_SORT` parameters to fix #91.

In this PR, the new resource requirements make every process succeed at the first attempt. The formulas are the simplest legible-ish correlations I could find.
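The `GrabFiles` replacement follows the pattern from the linked Seqera thread: pluck the wanted file out of a directory element with channel operators instead of a dedicated process. The channel names and glob below are assumptions for illustration, not the pipeline's actual code.

```groovy
// Hypothetical sketch: instead of a GrabFiles process (one extra LSF job),
// map over the channel and pick the file(s) out of the directory directly.
ch_process_out
    .map { meta, dir -> [ meta, file("${dir}/*.mcool") ] }  // glob is illustrative
    .set { ch_grabbed_files }
```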
Detailed charts showing the memory/CPU/time used/requested for every process: before (PDF), after (PDF)
If we want to tolerate processes failing at the first attempt and being resubmitted once or twice before completing, I'm sure some requirements could be lowered even further. We would have to make sure that the resources wasted on those first attempts don't outweigh the savings made on other processes. Something to investigate later...
PR checklist

- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).