The current memory bottleneck is the `MERGE_PREDICTIONS` process, which uses `csvtk concat` to concatenate the prediction TSV files (this likely loads everything into memory at once). I replaced it with a custom Pandas script that processes the files one at a time in chunks and simply appends each chunk to the output.
For a full-size test dataset, this changed the resource usage of the `MERGE_PREDICTIONS` process as follows:
- Runtime: 1h 09min -> 1h 24min
- Memory: 208 GB -> 1 GB
For `MERGE_PREDICTIONS_BUFFER`:
- Runtime: 4-7 min -> ~5 min
- Memory: 14-15 GB -> ~308 MB
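The chunk-wise merge described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual script; the function name `merge_tsvs`, the chunk size, and the file names in the usage note are hypothetical. The key point is that `pd.read_csv(..., chunksize=...)` yields one chunk at a time, and each chunk is appended to the output with `mode="a"`, so peak memory stays at roughly one chunk instead of the full concatenated table.

```python
import pandas as pd


def merge_tsvs(input_files, output_file, chunksize=100_000):
    """Concatenate TSV files chunk-wise, holding only one chunk in memory.

    Assumes all inputs share the same columns; the header is written
    exactly once, with the first chunk of the first file.
    """
    header_written = False
    for path in input_files:
        # chunksize makes read_csv return an iterator of DataFrames
        for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize):
            chunk.to_csv(
                output_file,
                sep="\t",
                mode="a",                      # append each chunk
                header=not header_written,     # header only once
                index=False,
            )
            header_written = True
```

Called as e.g. `merge_tsvs(["predictions_1.tsv", "predictions_2.tsv"], "merged.tsv")` (hypothetical names), this replaces the single in-memory concatenation that `csvtk concat` performs.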
## PR checklist
- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
- [ ] If necessary, also make a PR on the nf-core/metapep branch on the nf-core/test-datasets repository.
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage documentation in `docs/usage.md` is updated.
- [ ] Output documentation in `docs/output.md` is updated.
- [x] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).