Closed: AntoniaSchuster closed this pull request 1 year ago
nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit b31dbf5

- ✅ 147 tests passed
- ❗ 12 tests had warnings
Hi @AntoniaSchuster,

Thanks for the implementation and this PR! I did some tests and decided in the end against integrating this PR, for the following reasons:

- CSVTK_CONCAT failed also on the (relatively small) full-size test dataset, so we would have to implement or include further adjustments to avoid this problem.
- I optimised the corresponding scripts (see also #21 - #44). This reduced the memory usage such that, for the above-mentioned Chung et al. data, the peak memory usage was ~150 GB for the process with the highest memory usage.
I compared the two different pipeline versions (ignoring the failed CSVTK_CONCAT process and the corresponding downstream processes of this PR):
If memory becomes a problem again, we will keep some of the proposed changes of this PR in mind (e.g. for the downstream scripts that prepare the visualisations), since they reduced memory usage drastically.
This PR integrates modules from nf-core/epitopeprediction into Metapep. My last PR (#2) is included in this one, since I assumed it would be closed by now; however, most of its changes have been deleted or reimplemented here anyway, so I left it in.

Since the data model of Metapep doesn't fit well with the modules, most of it was discarded. The tables that can be created directly from the samplesheet (conditions.tsv, conditions_microbiomes.tsv, conditions_alleles.tsv, conditions_weights.tsv, weights.tsv) are still created and written to results/db_tables/. The microbiomes.tsv table still exists as well, but in a modified version. Only microbiomes.tsv and weights.tsv are still used within the pipeline; the others are just written to the results.
Another reason for restructuring the pipeline was its scalability. Running it with a large dataset (a co-assembly of 8 samples from this publication, produced by nf-core/mag) didn't work: the memory (2 TB) ran out at the SPLIT_PRED_TASKS step after a few merges in pandas.
The main change concerns the data model: database tables were replaced by channels with metadata. Since pandas turned out to be slow and memory-inefficient, most scripts using pandas were replaced or removed. The ones that were removed were made obsolete by the new data model, which uses channels instead of tables.
All new processes except for SORT_PEPTIDES, REMOVE_DUPLICATE_PEPTIDES and PREPARE_PLOTS were taken from nf-core/epitopeprediction. We should write a subworkflow that can be shared by metapep and epitopeprediction.
Here is a description of the new processes:
SORT_PEPTIDES: takes a CSV containing peptides and sorts them alphabetically by their sequence; a sketch follows below.
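For illustration, a minimal sketch of what such a per-file sort could look like; the `sequence` column name, the comma-separated layout, and the function name are assumptions for this example, not taken from the PR:

```python
# Sketch of the SORT_PEPTIDES idea: sort one peptide CSV by sequence.
# The "sequence" column name and comma delimiter are assumptions.
import csv
import sys

def sort_peptides(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        # Each input file is assumed small enough to sort in memory.
        rows = sorted(reader, key=lambda row: row["sequence"])

    with open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    sort_peptides(sys.argv[1], sys.argv[2])
```

Only each individual file needs to fit in memory here; the global order across all files is established by the k-way merge in the next process.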
REMOVE_DUPLICATE_PEPTIDES: implements a k-way merge in order to sort the whole list of peptides without having all of it in memory at once, which enables removal of duplicate peptides in a memory-efficient way. An additional function of this process is to write metadata into the TSV that is then given to the epitope prediction; this is necessary because there is no way to attach per-peptide information (e.g. entity weight) to the meta information of the channel. The PEPTIDE_PREDICTION process takes arbitrary columns of the input table and writes them to the output table as well.
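A minimal sketch of the k-way merge idea, assuming the already-sorted per-file CSVs from above and a single `sequence` column (both assumptions): `heapq.merge` holds only one pending row per input file in memory, and because duplicates become adjacent in the merged stream, a single look-behind value suffices to drop them.

```python
# Sketch of the k-way merge behind REMOVE_DUPLICATE_PEPTIDES.
# Inputs are per-file CSVs that are already sorted by sequence;
# the file layout ("sequence" column, header row) is an assumption.
import csv
import heapq
from contextlib import ExitStack

def merge_unique(sorted_paths, out_path):
    with ExitStack() as stack:
        readers = [
            csv.DictReader(stack.enter_context(open(p, newline="")))
            for p in sorted_paths
        ]
        fout = stack.enter_context(open(out_path, "w", newline=""))
        writer = csv.DictWriter(fout, fieldnames=["sequence"])
        writer.writeheader()

        # heapq.merge streams a globally sorted sequence while keeping
        # only one row per input file in memory at any time.
        merged = heapq.merge(*readers, key=lambda row: row["sequence"])
        previous = None
        for row in merged:
            # Duplicates are adjacent in the sorted stream, so one
            # look-behind value is enough to skip them.
            if row["sequence"] != previous:
                writer.writerow({"sequence": row["sequence"]})
                previous = row["sequence"]
```

The real process additionally carries metadata columns through to the output TSV, which this sketch omits.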
PREPARE_PLOTS: combines the functionality of PREPARE_SCORE_DISTRIBUTION and PREPARE_ENTITY_BINDING_RATIOS. It streams through the list of peptides without having everything in memory at once and prepares the tables needed for the plots.
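As a sketch of the streaming idea only (not the actual script): an aggregate such as a score histogram can be accumulated line by line, so the memory footprint is bounded by the number of bins rather than the number of peptides. The `score` column, tab delimiter, and bin width are assumptions:

```python
# Sketch of streaming aggregation for a plot table: one pass, one row
# in memory at a time. Column name "score" and bin width are assumptions.
import csv
from collections import Counter

def score_histogram(in_path: str, out_path: str, bin_width: float = 0.05) -> None:
    counts: Counter = Counter()
    with open(in_path, newline="") as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            # Map each score to the start of its bin; round to avoid
            # float jitter in the dictionary keys.
            bin_start = int(float(row["score"]) / bin_width) * bin_width
            counts[round(bin_start, 10)] += 1

    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t")
        writer.writerow(["bin_start", "count"])
        for bin_start in sorted(counts):
            writer.writerow([bin_start, counts[bin_start]])
```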
Things that are still work in progress:
PR checklist

- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).