Closed: AntoniaSchuster closed this pull request 1 year ago
nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit b31dbf5

- ✅ 147 tests passed
- ❗ 12 tests had warnings
Hi @AntoniaSchuster,

Thanks for the implementation and this PR! I did some tests and decided in the end against integrating this PR, for the following reasons:

- CSVTK_CONCAT failed also on the (relatively small) full-size test dataset, so we would have to implement or include further adjustments to avoid this problem.
- I optimised the corresponding scripts (see also #21 - #44). This reduced the memory usage such that, for the above-mentioned Chung et al. data, the peak memory usage was ~150 GB for the process with the highest memory usage.
I compared the two different pipeline versions (ignoring the failed CSVTK_CONCAT process and the corresponding downstream processes of this PR):
If memory becomes a problem again, we will keep some of the proposed changes of this PR in mind (e.g. for the downstream scripts that prepare the visualisations), since they reduced memory usage drastically.
This PR integrates modules from nf-core/epitopeprediction into Metapep. My last PR (#2) is included in this one, since I assumed it would be closed by now; however, most of its changes have been deleted or reimplemented here anyway, so I left it in.

Since the data model of Metapep doesn't fit well with the modules, most of it was discarded. The tables that can be created directly from the samplesheet (conditions.tsv, conditions_microbiomes.tsv, conditions_alleles.tsv, conditions_weights.tsv, weights.tsv) are still created and written to results/db_tables/. The microbiomes.tsv table still exists as well, but in a modified version. Only microbiomes.tsv and weights.tsv are still used within the pipeline; the others are just written to the results.
Another reason for restructuring the pipeline was its scalability. Running it with a large dataset (a co-assembly of 8 samples from this publication, produced by nf-core/mag) didn't work: the memory (2 TB) ran out at the SPLIT_PRED_TASKS step after a few merges in pandas.
The main change concerns the data model: database tables were replaced by channels with metadata. Since pandas turned out to be slow and memory-inefficient, most scripts using pandas were replaced or removed. The ones that were removed were made obsolete by the new data model, which uses channels instead of tables.
All new processes except for SORT_PEPTIDES, REMOVE_DUPLICATE_PEPTIDES and PREPARE_PLOTS were taken from nf-core/epitopeprediction. We should write a subworkflow that can be shared by metapep and epitopeprediction.
Here is a description of the new processes:
SORT_PEPTIDES: takes a CSV containing peptides and sorts them alphabetically by their sequence; a sketch follows below.
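For illustration, a minimal sketch of what such a per-file sort could look like; the `sequence` column name, the comma-separated layout, and the function name are assumptions for this example, not taken from the PR:

```python
# Sketch of the SORT_PEPTIDES idea: sort one peptide CSV by sequence.
# The "sequence" column name and comma delimiter are assumptions.
import csv
import sys

def sort_peptides(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        # Each input file is assumed small enough to sort in memory.
        rows = sorted(reader, key=lambda row: row["sequence"])

    with open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    sort_peptides(sys.argv[1], sys.argv[2])
```

Only each individual file needs to fit in memory here; the global order across all files is established by the k-way merge in the next process.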
REMOVE_DUPLICATE_PEPTIDES: implements a k-way merge in order to sort the whole list of peptides without having all of it in memory at once, which enables removal of duplicate peptides in a memory-efficient way. An additional function of this process is to write metadata into the TSV that is then given to the epitope prediction; this is necessary because there is no way to attach per-peptide information (e.g. entity weight) to the meta information of the channel. The PEPTIDE_PREDICTION process takes arbitrary columns of the input table and writes them to the output table as well.
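A minimal sketch of the k-way merge idea, assuming the already-sorted per-file CSVs from above and a single `sequence` column (both assumptions): `heapq.merge` holds only one pending row per input file in memory, and because duplicates become adjacent in the merged stream, a single look-behind value suffices to drop them.

```python
# Sketch of the k-way merge behind REMOVE_DUPLICATE_PEPTIDES.
# Inputs are per-file CSVs that are already sorted by sequence;
# the file layout ("sequence" column, header row) is an assumption.
import csv
import heapq
from contextlib import ExitStack

def merge_unique(sorted_paths, out_path):
    with ExitStack() as stack:
        readers = [
            csv.DictReader(stack.enter_context(open(p, newline="")))
            for p in sorted_paths
        ]
        fout = stack.enter_context(open(out_path, "w", newline=""))
        writer = csv.DictWriter(fout, fieldnames=["sequence"])
        writer.writeheader()

        # heapq.merge streams a globally sorted sequence while keeping
        # only one row per input file in memory at any time.
        merged = heapq.merge(*readers, key=lambda row: row["sequence"])
        previous = None
        for row in merged:
            # Duplicates are adjacent in the sorted stream, so one
            # look-behind value is enough to skip them.
            if row["sequence"] != previous:
                writer.writerow({"sequence": row["sequence"]})
                previous = row["sequence"]
```

The real process additionally carries metadata columns through to the output TSV, which this sketch omits.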
PREPARE_PLOTS: combines the functionality of PREPARE_SCORE_DISTRIBUTION and PREPARE_ENTITY_BINDING_RATIOS. It streams through the list of peptides without having everything in memory at once and prepares the tables needed for the plots.
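As a sketch of the streaming idea only (not the actual script): an aggregate such as a score histogram can be accumulated line by line, so the memory footprint is bounded by the number of bins rather than the number of peptides. The `score` column, tab delimiter, and bin width are assumptions:

```python
# Sketch of streaming aggregation for a plot table: one pass, one row
# in memory at a time. Column name "score" and bin width are assumptions.
import csv
from collections import Counter

def score_histogram(in_path: str, out_path: str, bin_width: float = 0.05) -> None:
    counts: Counter = Counter()
    with open(in_path, newline="") as fin:
        for row in csv.DictReader(fin, delimiter="\t"):
            # Map each score to the start of its bin; round to avoid
            # float jitter in the dictionary keys.
            bin_start = int(float(row["score"]) / bin_width) * bin_width
            counts[round(bin_start, 10)] += 1

    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter="\t")
        writer.writerow(["bin_start", "count"])
        for bin_start in sorted(counts):
            writer.writerow([bin_start, counts[bin_start]])
```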
Things that are still work in progress:
PR checklist

- Make sure your code lints (`nf-core lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).