I optimized the `prepare_entity_binding_ratios.py` script to reduce memory usage. The main changes are:
- processing of predictions chunk-wise, based on chunks of `peptide_id` (restructured code); see the sketch after this list
- using `join()` on a sorted index (`peptide_id`) instead of `merge()` where possible and critical
- added the parameter `--ds_prep_chunk_size`
- downcast numerical columns
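
To make the restructuring concrete, here is a minimal sketch of how the chunk-wise processing, the index-based `join()`, and the downcasting fit together. The column names (`peptide_id`, `prediction_score`), the function name, and the default chunk size are assumptions for illustration, not the actual script code:

```python
# Minimal sketch of the chunk-wise idea (illustrative only, not the actual
# script code; column and variable names are placeholders).
import pandas as pd


def process_chunkwise(predictions: pd.DataFrame,
                      protein_peptides: pd.DataFrame,
                      chunk_size: int = 500_000) -> pd.DataFrame:
    # Downcast numerical columns to the smallest fitting dtypes.
    predictions["prediction_score"] = pd.to_numeric(
        predictions["prediction_score"], downcast="float")
    predictions["peptide_id"] = pd.to_numeric(
        predictions["peptide_id"], downcast="unsigned")

    # Sort both tables on peptide_id so that an index-based join() can be
    # used instead of merge().
    predictions = predictions.set_index("peptide_id").sort_index()
    protein_peptides = protein_peptides.set_index("peptide_id").sort_index()

    results = []
    peptide_ids = predictions.index.unique()
    # Work on one chunk of peptide_ids at a time instead of the full table.
    for start in range(0, len(peptide_ids), chunk_size):
        chunk_ids = peptide_ids[start:start + chunk_size]
        chunk = predictions.loc[chunk_ids].join(protein_peptides, how="inner")
        results.append(chunk)  # per-chunk aggregation would happen here
    return pd.concat(results)
```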
Additionally:
- fixed a bug in the computation of entity-wise binding ratios (it seems the number of binders was previously divided by the total number of peptides instead of the number of peptides of the specific entity :( ); a sketch of the corrected ratio follows this list
- included protein-wise peptide counts (these were ignored before)
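
For clarity, here is a hedged sketch of the corrected ratio. The column names and the simple score threshold are assumptions for illustration; the actual script also takes the protein-wise peptide counts mentioned above into account:

```python
# Illustrative sketch of the corrected entity-wise binding ratio
# (hypothetical column names and a simple score threshold).
import pandas as pd


def entity_binding_ratios(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    # df: one row per (entity_id, peptide_id) with a prediction score.
    df = df.assign(is_binder=df["prediction_score"] >= threshold)
    per_entity = df.groupby("entity_id").agg(
        n_binders=("is_binder", "sum"),
        n_peptides=("peptide_id", "nunique"),
    )
    # Bug fix: divide by the entity-specific peptide count, not by the total
    # number of peptides across all entities.
    per_entity["binding_ratio"] = per_entity["n_binders"] / per_entity["n_peptides"]
    return per_entity
```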
Change in memory and runtime:
With the given parameters, a full-size test showed memory usage reduced from ~156 GB to ~64 GB, while the runtime decreased slightly (~30 min to ~27 min).
Memory usage should now scale mainly with the size of `predictions.tsv` and `proteins_peptides.tsv`. It could be reduced further by reading those files directly in chunks, but since that would hurt readability I would suggest keeping it as is for now, as long as this step is not the bottleneck (a sketch of what file-level chunking could look like is below).
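
Should file-level chunking ever become worthwhile, it could look roughly like this (the file name is the one mentioned above; the chunk size is arbitrary):

```python
# Hedged sketch of file-level chunking, in case it ever becomes necessary.
import pandas as pd

for chunk in pd.read_csv("predictions.tsv", sep="\t", chunksize=1_000_000):
    ...  # process and aggregate each chunk instead of loading the full table
```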
PR checklist
- [x] This comment contains a description of changes (with reason).
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] If you've added a new tool - have you followed the pipeline conventions in the contribution docs
- [ ] If necessary, also make a PR on the nf-core/metapep branch on the nf-core/test-datasets repository.
- [ ] Make sure your code lints (`nf-core lint`).
- [ ] Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- [ ] Usage Documentation in `docs/usage.md` is updated.
- [ ] Output Documentation in `docs/output.md` is updated.
- [x] `CHANGELOG.md` is updated.
- [ ] `README.md` is updated (including new tool citations and authors/contributors).