Closed skrakau closed 2 years ago
nf-core lint
Overall result: Failed :x:
Posted for pipeline commit 8f494ee
+| ✅ 148 tests passed |+
!| ❗ 13 tests had warnings |!
-| ❌ 4 tests failed |-
Linting will be fixed when updating to the new template, so no worries there. Unfortunately (?), the template update will heavily reformat Python scripts via black, and you seem to have some here.
I understand this might be too much right now, but it might be useful to specify the size of the test dataset you used (genome size, gene counts, peptide counts), along with an estimate of how much a decrease (or increase?) of `--proc_chunk_size` will affect memory and runtime. This might help users find a suitable value more quickly.
Thanks for looking at this @d4straub!
Since this PR will be only one of multiple changes in this context, I will add this more detailed information for the user after everything works, i.e. prior to the release :)
Since some processes that merge db tables containing peptides, such as `SPLIT_PRED_TASKS`, currently cause problems because they require large amounts of memory for larger input datasets, I rewrote the code in `gen_prediction_chunks.py`. Main changes:

- `peptides` input is read in and processed chunk-wise (without changing the final output chunks). Added parameter `proc_chunk_size`
- `.join()` on sorted indices instead of applying `merge()`
- `allele_name` in df
- `usecols` Pandas parameter of `read_csv()` function

With the given parameters, on a full-size test this reduced the memory usage from ~214GB to 36GB, while the runtime was slightly reduced (~1h07min to ~57min). The impact will of course depend on the chunk-size parameters used.
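
For readers who want a feel for the approach, here is a minimal sketch of chunk-wise reading combined with `usecols` and an index-based `.join()`. It is not the code from `gen_prediction_chunks.py`: the column names (`peptide_id`, `peptide_sequence`, `protein_id`), the toy tables and the helper `process_in_chunks` are assumptions made up for illustration; only `read_csv()` with `usecols`/`chunksize` and `DataFrame.join()` are real Pandas features.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the real input tables (column names are assumptions).
PEPTIDES_TSV = "peptide_id\tpeptide_sequence\textra\n" + "".join(
    f"{i}\tPEP{i}\tx\n" for i in range(10)
)
PROTEINS_PEPTIDES_TSV = "protein_id\tpeptide_id\n" + "".join(
    f"{100 + i % 3}\t{i}\n" for i in range(10)
)


def process_in_chunks(peptides_file, protein_peptides, proc_chunk_size):
    """Yield joined tables, reading only `proc_chunk_size` peptide rows at a time."""
    # Index and sort the second table once so every join can align on the index.
    protein_peptides = protein_peptides.set_index("peptide_id").sort_index()
    reader = pd.read_csv(
        peptides_file,
        sep="\t",
        usecols=["peptide_id", "peptide_sequence"],  # skip columns that are not needed
        chunksize=proc_chunk_size,  # makes read_csv return an iterator of DataFrames
    )
    for chunk in reader:
        chunk = chunk.set_index("peptide_id").sort_index()
        # Index-on-index join instead of a column-based merge().
        yield chunk.join(protein_peptides, how="inner")


protein_peptides = pd.read_csv(io.StringIO(PROTEINS_PEPTIDES_TSV), sep="\t")
joined = list(process_in_chunks(io.StringIO(PEPTIDES_TSV), protein_peptides, 4))
result = pd.concat(joined)
```

In this sketch only the peptide table is streamed; the second table is still loaded in full, so peak memory depends on its size plus one chunk of size `proc_chunk_size`.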
Now, one of the biggest impacts on memory usage comes from loading the (full) `protein_peptides` data table, which as a Pandas df requires ~3x the size of the original input. For now this should be fine; further optimisations can still be considered in the future if necessary.

PR checklist
- Code lints (`nf-core lint`).
- Test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).