Nextflow DSL2 pipeline to generate a Genome Note, including assembly statistics, quality metrics, and Hi-C contact maps. This workflow is part of the Tree of Life production suite.
Like in https://github.com/sanger-tol/readmapping/pull/82, the goal is to stop using the `process_*` labels and instead optimise the resource requests of every process. I'm using the same dataset: 10 genomes of increasing size, with 1 Hi-C library and 1 PacBio library each.
| Accession | Fasta size (bytes) | PacBio size (# reads) | Hi-C (# reads) |
|---|---:|---:|---:|
| GCA_939531405.1 | 13,824,461 | 1,546,435 | 955,654,834 |
| GCA_937625935.1 | 26,683,271 | 189,202 | 980,890,138 |
| GCA_951394315.1 | 58,010,196 | 1,965,084 | 704,258,466 |
| GCA_947172415.1 | 118,858,594 | 799,796 | 87,833,110 |
| GCA_910589235.2 | 232,212,321 | 1,586,931 | 727,465,652 |
| GCA_949987625.1 | 417,566,504 | 2,211,570 | 705,705,280 |
| GCA_946406115.1 | 810,357,340 | 1,872,695 | 842,629,084 |
| GCA_963513935.1 | 1,803,897,959 | 7,338,871 | 3,305,634,916 |
| GCA_951213105.1 | 3,609,437,155 | 1,121,856 | 3,127,898,040 |
| GCA_946902985.2 | 9,152,113,672 | 1,537,548 | 886,707,886 |
I found much less correlation than in the read-mapping pipeline. The only input size I found useful was the genome size, which is now collected at the beginning of the pipeline and added to the `meta` map. There is some correlation between the number of Hi-C reads and the runtime of some processes, but not their memory usage. Since runtime estimates don't need to be very accurate (really, only the normal/long/week distinction matters), I don't even pull that input size.
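As a minimal sketch of what "adding the genome size to the `meta` map" can look like, assuming a FASTA input channel and using the FASTA size in bytes as a stand-in for the genome size (the channel wiring and key names below are illustrative, not the pipeline's exact code):

```nextflow
// Sketch only: record the genome size in the meta map as soon as the assembly
// is read, so every downstream process can derive its resource request from it.
workflow {
    Channel
        .fromPath(params.fasta)                  // assembly FASTA (assumed parameter name)
        .map { fasta -> [ [ id: fasta.baseName, genome_size: fasta.size() ], fasta ] }
        .view()                                  // e.g. [[id:GCA_939531405.1, genome_size:13824461], ...]
}
```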
I am using helper functions to grow values (like the number of CPUs) in a logarithmic fashion, as sketched below. The point is to limit the increase in the number of CPUs, since the benefit of multi-threading tends to diminish as the thread count grows.
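A sketch of such a helper, assuming a hypothetical `log_increase_cpus` function (the name, base, and step values are illustrative): doubling the genome size adds a roughly constant number of CPUs instead of doubling the request.

```nextflow
// Hypothetical helper: grow the CPU count with the logarithm of the input size.
// `base` is the minimum CPU count; `step` is (roughly) how many CPUs are added
// each time the size doubles relative to `unit`.
def log_increase_cpus(base, step, size, unit) {
    base + step * (int) Math.ceil( Math.log( (size as double) / unit + 1 ) / Math.log(2) )
}

// For example, with base=2, step=2 and unit=1e9 (1 GB of FASTA):
//   log_increase_cpus(2, 2, 1e8, 1e9)  ->  4 CPUs   (small genome)
//   log_increase_cpus(2, 2, 9e9, 1e9)  -> 10 CPUs   (large genome)
```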
Also:

- Replaced the `GrabFiles` process with some Groovy magic, as per https://community.seqera.io/t/is-it-bad-practice-to-try-and-pluck-files-from-an-element-in-a-channel-that-is-a-directory-with-channel-manipulation/224/2. This saves 1 LSF job.
- New `GNU_SORT` parameters to fix #91.

In this PR, the new resource requirements make every process succeed at the first attempt. The formulas are the lowest correlations I could find that remain reasonably legible. I will later investigate lowering the requirements even more, in the hope that the savings on some processes will balance out others having to be rerun.
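For illustration only, here is the shape a per-process request can take in the config once `meta.genome_size` is available; the process name, constants, and formula below are placeholders, not the values used in this PR.

```nextflow
// Illustrative only: a per-process request derived from meta.genome_size,
// replacing a shared process_* label. Dynamic directives can read the task's
// `meta` input, so each genome gets its own request.
process {
    withName: 'SOME_PROCESS' {
        // assumes log_increase_cpus from the sketch above is defined in this config
        cpus   = { log_increase_cpus(2, 2, meta.genome_size, 1e9) }
        memory = { 500.MB + 1.B * meta.genome_size }
        time   = { meta.genome_size < 1e9 ? 4.h : 24.h }
    }
}
```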
| Metric | Before | After | Improvement |
|---|---:|---:|---:|
| Total memory requested (GB) | 3,660.0 | 997.1 | ÷3.7 |
| Memory efficiency (used/requested, %) | 21.2 | 78.0 | |
| Total memory reservation (GB-hours) | 2,792.4 | 2,405.4 | ÷1.2 |
| Memory reservation efficiency (used/requested, %) | 86.0 | 89.4 | |
| Total CPUs requested (n) | 610.0 | 510.0 | ÷1.2 |
| CPU efficiency (used/requested, %) | 47.6 | 60.5 | |
| Total CPU reservation (CPU-hours) | 465.4 | 332.0 | ÷1.4 |
| CPU reservation efficiency (used/requested, %) | 62.0 | 86.3 | |
| Job failures (%) | 0.5 | 0.0 | |
PR checklist

- [ ] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage documentation in `docs/usage.md` is updated.
- [ ] Output documentation in `docs/output.md` is updated.
- [ ] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).