nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

New module: Kraken2/Bracken on Unaligned Sequences for Contamination Detection #1388

Closed egreenberg7 closed 2 months ago

egreenberg7 commented 2 months ago

Replaces https://github.com/nf-core/rnaseq/pull/1351

Closes https://github.com/nf-core/rnaseq/issues/271. This contribution adds Kraken2/Bracken as an optional quality control step to the rnaseq pipeline for the HISAT2 and STAR/Salmon aligners. Contamination is a widely reported issue in RNA-sequencing data, and metagenomics tools can be used to detect it. Kraken2 is particularly strong at detecting low levels of pathogens, which makes it appropriate for this task. This PR adds the option of providing a Kraken2 database to perform taxonomic classification on unaligned reads.

Note: If the --bracken_precision parameter is set to something other than 'S', the current MultiQC version does not work properly. This will not be an issue in future versions of MultiQC (see this MultiQC bug fix).
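
For orientation, the step described above amounts to running Kraken2 on the unaligned reads and then Bracken on the resulting report. The following is a minimal Python sketch that only assembles the two command lines. It is not the module code from this PR: the function names and file names are hypothetical, the default read length is an illustrative assumption, and it relies only on the tools' standard command-line options (`kraken2 --db/--threads/--report/--paired`, `bracken -d/-i/-o/-l/-r`).

```python
# Main rank codes that can be passed to Bracken's -l option.
BRACKEN_LEVELS = {
    "D": "domain", "P": "phylum", "C": "class", "O": "order",
    "F": "family", "G": "genus", "S": "species",
}


def kraken2_command(db, reads, report, threads=1):
    """Build a kraken2 call for one sample (one or two FASTQ files)."""
    cmd = ["kraken2", "--db", db, "--threads", str(threads), "--report", report]
    if len(reads) == 2:
        cmd.append("--paired")  # paired-end input: two FASTQ files
    cmd.extend(reads)
    return cmd


def bracken_command(db, report, output, level="S", read_length=100):
    """Build a bracken call that re-estimates abundances at one rank."""
    if level not in BRACKEN_LEVELS:
        raise ValueError(f"unsupported level: {level!r}")
    return ["bracken", "-d", db, "-i", report, "-o", output,
            "-l", level, "-r", str(read_length)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(kraken2_command(
        "k2_db", ["s1.unmapped_1.fastq.gz", "s1.unmapped_2.fastq.gz"],
        "s1.kraken2.report.txt", threads=4)))
    print(" ".join(bracken_command(
        "k2_db", "s1.kraken2.report.txt", "s1.bracken.tsv", level="S")))
```

The `level` argument plays the role that the precision parameter in the note above plays in the pipeline; 'S' requests species-level estimates.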

PR checklist

EDIT: by @maxulysse adding link to previous PR

github-actions[bot] commented 2 months ago

nf-core lint overall result: Passed :white_check_mark: :warning:

Posted for pipeline commit 02f65ab

+| ✅ 174 tests passed       |+
#| ❔   9 tests were ignored |#
!| ❗   7 tests had warnings |!
### :heavy_exclamation_mark: Test warnings:

* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found: `assets/multiqc_config.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found: `.github/workflows/awstest.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found: `.github/workflows/awsfulltest.yml`
* [pipeline_todos](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/pipeline_todos) - TODO string in `main.nf`: _Optionally add in-text citation tools to this list._
* [pipeline_todos](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/pipeline_todos) - TODO string in `main.nf`: _Optionally add bibliographic entries to this list._
* [pipeline_todos](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/pipeline_todos) - TODO string in `main.nf`: _Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!_
* [pipeline_todos](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/pipeline_todos) - TODO string in `methods_description_template.yml`: _#Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline_

### :grey_question: Tests ignored:

* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File is ignored: `conf/modules.config`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default ignored: params.ribo_database_manifest
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - File ignored due to lint config: `assets/email_template.html`
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - File ignored due to lint config: `assets/email_template.txt`
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - File ignored due to lint config: `.gitignore` or `.prettierignore`
* [actions_ci](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_ci) - actions_ci
* [actions_awstest](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_awstest) - 'awstest.yml' workflow not found: `/home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml`
* [multiqc_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/multiqc_config) - multiqc_config
* [modules_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/modules_config) - modules_config

### :white_check_mark: Tests passed:

* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.gitattributes`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.gitignore`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.nf-core.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.editorconfig`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.prettierignore`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.prettierrc.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `CHANGELOG.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `CITATIONS.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `CODE_OF_CONDUCT.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `LICENSE` or `LICENSE.md` or `LICENCE` or `LICENCE.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `nextflow_schema.json`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `nextflow.config`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `README.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/.dockstore.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/CONTRIBUTING.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/ISSUE_TEMPLATE/bug_report.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/ISSUE_TEMPLATE/config.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/ISSUE_TEMPLATE/feature_request.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/PULL_REQUEST_TEMPLATE.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/workflows/branch.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/workflows/ci.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/workflows/linting_comment.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `.github/workflows/linting.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `assets/email_template.html`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `assets/email_template.txt`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `assets/sendmail_template.txt`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `assets/nf-core-rnaseq_logo_light.png`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `conf/test.config`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `conf/test_full.config`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `docs/images/nf-core-rnaseq_logo_light.png`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `docs/images/nf-core-rnaseq_logo_dark.png`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `docs/output.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `docs/README.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `docs/README.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `docs/usage.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `main.nf`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `conf/base.config`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `conf/igenomes.config`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File found: `modules.json`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.github/ISSUE_TEMPLATE/bug_report.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.github/ISSUE_TEMPLATE/feature_request.md`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.github/workflows/push_dockerhub.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.markdownlint.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.nf-core.yaml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.yamllint.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `bin/markdown_to_html.r`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `conf/aws.config`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `docs/images/nf-core-rnaseq_logo.png`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/Checks.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/Completion.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/NfcoreTemplate.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/Utils.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/Workflow.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/WorkflowMain.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/WorkflowRnaseq.groovy`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `parameters.settings.json`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `pipeline_template.yml`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `Singularity`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `lib/nfcore_external_java_deps.jar`
* [files_exist](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_exist) - File not found check: `.travis.yml`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `manifest.name`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `manifest.nextflowVersion`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `manifest.description`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `manifest.version`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `manifest.homePage`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `timeline.enabled`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `trace.enabled`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `report.enabled`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `dag.enabled`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `process.cpus`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `process.memory`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `process.time`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `params.outdir`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `params.input`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `params.validationShowHiddenParams`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `params.validationSchemaIgnoreParams`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `manifest.mainScript`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `timeline.file`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `trace.file`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `report.file`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable found: `dag.file`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable (correctly) not found: `params.nf_required_version`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable (correctly) not found: `params.container`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable (correctly) not found: `params.singleEnd`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable (correctly) not found: `params.igenomesIgnore`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable (correctly) not found: `params.name`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable (correctly) not found: `params.enable_conda`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``timeline.enabled`` had correct value: ``true``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``report.enabled`` had correct value: ``true``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``trace.enabled`` had correct value: ``true``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``dag.enabled`` had correct value: ``true``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``manifest.name`` began with ``nf-core/``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable ``manifest.homePage`` began with https://github.com/nf-core/
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``dag.file`` ended with ``.html``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config variable ``manifest.nextflowVersion`` started with >= or !>=
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config ``manifest.version`` ends in ``dev``: ``3.16.0dev``
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config `params.custom_config_version` is set to `master`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config `params.custom_config_base` is set to `https://raw.githubusercontent.com/nf-core/configs/master`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Lines for loading custom profiles found
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - nextflow.config contains configuration profile `test`
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.hisat2_build_memory= 200.GB
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.gtf_extra_attributes= gene_name
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.gtf_group_features= gene_id
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.featurecounts_group_type= gene_biotype
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.featurecounts_feature_type= exon
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.igenomes_base= s3://ngi-igenomes/igenomes/
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.trimmer= trimgalore
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.min_trimmed_reads= 10000
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.umitools_extract_method= string
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.umitools_grouping_method= directional
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.aligner= star_salmon
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.pseudo_aligner_kmer_size= 31
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.min_mapped_reads= 5.0
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.kallisto_quant_fraglen= 200
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.kallisto_quant_fraglen_sd= 200
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.stranded_threshold= 0.8
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.unstranded_threshold= 0.1
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.deseq2_vst= true
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.rseqc_modules= bam_stat,inner_distance,infer_experiment,junction_annotation,junction_saturation,read_distribution,read_duplication
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.bracken_precision= S
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.skip_bbsplit= true
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.skip_preseq= true
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.custom_config_version= master
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.custom_config_base= https://raw.githubusercontent.com/nf-core/configs/master
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.max_cpus= 16
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.max_memory= 128.GB
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.max_time= 240.h
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.publish_dir_mode= copy
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.max_multiqc_email_size= 25.MB
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.validate_params= true
* [nextflow_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nextflow_config) - Config default value correct: params.pipelines_testdata_base_path= https://raw.githubusercontent.com/nf-core/test-datasets/7f1614baeb0ddf66e60be78c3d9fa55440465ac8/
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.gitattributes` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.prettierrc.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `CODE_OF_CONDUCT.md` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `LICENSE` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/.dockstore.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/CONTRIBUTING.md` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/ISSUE_TEMPLATE/bug_report.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/ISSUE_TEMPLATE/config.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/ISSUE_TEMPLATE/feature_request.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/PULL_REQUEST_TEMPLATE.md` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/workflows/branch.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/workflows/linting_comment.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `.github/workflows/linting.yml` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `assets/sendmail_template.txt` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `assets/nf-core-rnaseq_logo_light.png` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `docs/images/nf-core-rnaseq_logo_light.png` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `docs/images/nf-core-rnaseq_logo_dark.png` matches the template
* [files_unchanged](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/files_unchanged) - `docs/README.md` matches the template
* [readme](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/readme) - README Nextflow minimum version badge matched config. Badge: `23.04.0`, Config: `23.04.0`
* [readme](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/readme) - README Zenodo placeholder was replaced with DOI.
* [pipeline_name_conventions](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/pipeline_name_conventions) - Name adheres to nf-core convention
* [template_strings](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/template_strings) - Did not find any Jinja template strings (589 files)
* [schema_lint](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/schema_lint) - Schema lint passed
* [schema_lint](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/schema_lint) - Schema title + description lint passed
* [schema_lint](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/schema_lint) - Input mimetype lint passed: 'text/csv'
* [schema_params](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/schema_params) - Schema matched params returned from nextflow config
* [system_exit](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/system_exit) - No `System.exit` calls found
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: ci.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: cloud_tests_full.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: linting_comment.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: release-announcements.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: linting.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: download_pipeline.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: clean-up.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: cloud_tests_small.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: fix-linting.yml
* [actions_schema_validation](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/actions_schema_validation) - Workflow validation passed: branch.yml
* [merge_markers](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/merge_markers) - No merge markers found in pipeline files
* [modules_json](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/modules_json) - Only installed modules found in `modules.json`
* [modules_structure](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/modules_structure) - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
* [base_config](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/base_config) - `conf/base.config` found and not ignored.
* [nfcore_yml](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nfcore_yml) - Repository type in `.nf-core.yml` is valid: `pipeline`
* [nfcore_yml](https://nf-co.re/tools/docs/2.14.1/pipeline_lint_tests/nfcore_yml) - nf-core version in `.nf-core.yml` is set to the latest version: `2.14.1`

### Run details

* nf-core/tools version 2.14.1
* Run at `2024-09-19 16:16:49`

egreenberg7 commented 2 months ago

Reset of PR #1351 due to a merge issue. I have copied the main discussion of its functionality here. Other comments addressed some of the smaller details of the code; you can see them there.

From @MatthiasZepper

Impressive work! I think that some users will indeed like that feature, but it may also produce false positive hits.

Without doing an in-depth review, I am therefore already missing some usage documentation on this feature. Since the quality of the reference database is paramount, pipeline users should be given some guidance on how to generate an appropriate reference database:

> The TCGA read data were analyzed with the Kraken program (13), a very fast algorithm that assigns reads to a taxon using exact matches of 31 base pairs (bp) or longer. The Kraken program is highly accurate, but it depends critically on the database of genomes to which it compares each read. Poore et al. used a database containing 59,974 microbial genomes, of which 5,503 were viruses and 54,471 were bacteria or archaea, including many draft genomes. Notably, their Kraken database did not include the human genome, nor did it include common vector sequences. This dramatically increased the odds for human DNA sequences present in the TCGA reads to be falsely reported as matching microbial genomes. This problem can be mitigated by including the human genome and using only complete bacterial genomes in the Kraken database.

(from Major data analysis errors invalidate cancer microbiome findings)
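
To make the quoted point concrete, here is a toy Python sketch of exact k-mer classification. It is a deliberate simplification and not Kraken's actual algorithm: the sequences, the taxon labels `host` and `microbe`, and k = 5 are made up for illustration, and k-mers shared between taxa are simply skipped, whereas Kraken assigns them to the lowest common ancestor. It shows how a host read that overlaps a shared region is reported as microbial when the host genome is missing from the database.

```python
from collections import Counter

K = 5  # toy k-mer length; the quoted passage describes 31 bp matches

# Made-up sequences that share the region "GGCTTAACC".
HOST = "GATTACAGGCTTAACCGGTT"
MICROBE = "CCGTAGGCTTAACCATGCAA"
HOST_READ = "ACAGGCTTAACCGG"  # a read that truly comes from the host


def kmers(seq, k=K):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k=K):
    """Map each k-mer to the set of taxa whose genome contains it."""
    index = {}
    for taxon, genome in genomes.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k=K):
    """Assign a read to the taxon with the most unambiguous k-mer hits."""
    hits = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer, set())
        if len(taxa) == 1:  # skip k-mers shared by several taxa
            hits[next(iter(taxa))] += 1
    if not hits:
        return "unclassified"
    return hits.most_common(1)[0][0]


if __name__ == "__main__":
    microbes_only = build_index({"microbe": MICROBE})
    with_host = build_index({"host": HOST, "microbe": MICROBE})
    print(classify(HOST_READ, microbes_only))  # microbe (false positive)
    print(classify(HOST_READ, with_host))      # host
```

With only the microbial genome indexed, the six k-mers from the shared region are enough to label the host read as microbial; once the host genome is indexed, those k-mers become ambiguous and the four host-specific k-mers decide the call.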

Also, I wonder if another tool like Sylph would have been sufficiently accurate for a fast screening, since a Kraken2/Bracken run is of course computationally heavy and this is not a metatranscriptomic pipeline.

I think this is an interesting functionality, but Kraken2/Bracken would generally not be my first choice of tools here, because it is computationally heavy and very dependent on an appropriate reference database.

Therefore, I suggest a contaminant screening parameter that would allow choosing other tools as well in the future and would have the default value 'off'. This parameter would make it possible to enable this functionality independently of the save_unaligned parameter, which should only decide whether the files are published.

Apart from the remarks in the code, I think this PR also needs better documentation in usage.md and an update to the metro map.
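
The decoupling suggested above can be sketched as a small decision function. This is only an illustration in Python, not pipeline code: the parameter name `contaminant_screening`, the value `kraken2_bracken`, and the keys of the returned dictionary are hypothetical; only `save_unaligned` and the default 'off' come from the discussion.

```python
# Hypothetical set of allowed values; more tools could be added later.
SCREENING_TOOLS = {"off", "kraken2_bracken"}


def plan_unaligned_handling(contaminant_screening="off", save_unaligned=False):
    """Decide what to do with unaligned reads for one pipeline run.

    Screening is switched on by its own parameter, while save_unaligned
    only controls whether the unaligned read files are published.
    """
    if contaminant_screening not in SCREENING_TOOLS:
        raise ValueError(f"unknown screening tool: {contaminant_screening!r}")
    run_screening = contaminant_screening != "off"
    return {
        # Reads must be extracted if either consumer needs them.
        "extract_unaligned": run_screening or save_unaligned,
        "run_screening": run_screening,
        "publish_unaligned": save_unaligned,
    }


if __name__ == "__main__":
    # Screening enabled, but unaligned reads are not published.
    print(plan_unaligned_handling("kraken2_bracken", save_unaligned=False))
```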

But generally speaking, I support this addition to the pipeline.

From @egreenberg7

I had not heard of Sylph specifically before, but while Kraken2/Bracken is one of the more computationally expensive options, it appears from the literature that it will have the best level of recall for contaminating species and is the fastest (amortized).

See:

* Ye et al., 2019, where Kraken2 is shown to be the best-performing k-mer method (and Metaphlan4 the best-performing marker-based method)
* Kibegwa et al., 2020, which shows that Kraken2 performs better than MG-RAST specifically
* Lindgreen et al., 2016, a somewhat older article in which Kraken was the best of the tools available at the time
* And lastly, the article that I mentioned in my first comment, which shows that Kraken2 performs much better than Metaphlan for detecting low levels of pathogens (Pereira-Marques et al., 2024).

As for Sylph, while the bioRxiv paper does look promising, it is such a new tool that it would require more development to incorporate into the pipeline (we would have to code a new MultiQC module and a new nf-core module for it), and it may be worth waiting until there are more benchmarks of it against other tools and it is more widely established. Alternatively, in the long run, we could offer two options for users to choose from: Kraken2 as a more expensive but higher-recall tool, and Sylph or something else as a less expensive tool. Still, Kraken2's computational expense is largely determined by the database size, and there are various size-restricted databases available for those with that concern (see the pre-constructed indices).

In terms of the database, I agree that it can have a significant impact. I was presuming that people would primarily use the standard database or the PlusPF database from the pre-constructed databases mentioned above. Both of these include human and vector genomes, which would avoid some of the issues mentioned in "Major data analysis errors invalidate cancer microbiome findings". I considered adding an option to build the Kraken2 standard database within the pipeline (primarily for an indexing-only run), but it seemed sort of silly given the computational expense and the availability of pre-constructed databases.

Another source for database construction could be the genomes in the OpenContami database. Since this consists only of known contaminants from a statistically rigorous contamination-detection procedure, it would probably avoid most of the bacteria from extreme environments (though one would have to make sure to also add in the human and vector genomes). (If you haven't seen OpenContami before, it is based on doing Bowtie2 alignments followed by BLASTing unknown sequences. As a result, it is very computationally expensive, and the code for it isn't available on GitHub.)

For some other discussions of database construction, see:

* In Smith et al., they discuss that making sure to include pertinent genomes increases Kraken2's accuracy in rumen. This, however, matters primarily when uncommon species are present, as in metagenomics. I did some profiling of Kraken2 on common contaminants mentioned on the OpenContami website with datasets generated by InSilicoSeq, and the PlusPF database did a good job of classifying almost all of them at the genus level.
* In Baud & Kennedy, 2024, they present an algorithm, Moonbase, that constructs a Kraken2 database based on an initial run of Metaphlan3 in order to improve results. This, however, would raise two issues in our situation. First, using Metaphlan to construct the database would likely void some of the advantages of Kraken2 over Metaphlan in terms of recall. Second, having to build the database rather than take a pre-made one as input would increase computational expense.

Some questions: If we added a section on the database to be used, would that go in the usage.md file, output.md, or elsewhere?

Lastly, for false positives, I was of the opinion that we should use a higher-recall but lower-precision tool like Kraken2 to err on the side of caution and leave it to the researchers to use their judgement when interpreting results. I think it may be worth simply putting a warning in the documentation that very small numbers of non-host reads should not necessarily be immediately accepted as truth. In the InSilicoSeq profiling I mentioned, there were generally (though not always) significant differences in the number of reads between actual contaminants and false positives (and for pure human samples, at least with in silico data, there were no false positives when HISAT2 followed by Kraken2 was performed).
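To make the point about small read counts concrete, here is a minimal sketch of how someone might triage a Kraken2 report after the run. It assumes the standard six-column, tab-separated report (percentage, clade reads, direct reads, rank code, taxid, indented name), without the extra minimizer columns. The function names, the `min_reads=100` cutoff, the host genus default, and the example numbers are all made up for illustration and are not part of this PR.

```python
from typing import List, NamedTuple


class TaxonHit(NamedTuple):
    name: str
    rank: str
    clade_reads: int
    percent: float


def parse_kraken2_report(text: str) -> List[TaxonHit]:
    """Parse a standard six-column, tab-separated Kraken2 report."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, clade_reads, _direct, rank, _taxid, name = fields[:6]
        hits.append(
            TaxonHit(name.strip(), rank.strip(), int(clade_reads), float(percent))
        )
    return hits


def flag_candidates(hits, rank="G", min_reads=100, host_genus="Homo"):
    """Keep non-host taxa at one rank with at least `min_reads` clade reads.

    The cutoff is a hypothetical starting point, not a validated threshold.
    """
    return [
        h
        for h in hits
        if h.rank == rank and h.name != host_genus and h.clade_reads >= min_reads
    ]


# Made-up report content, for illustration only.
EXAMPLE_REPORT = "\n".join(
    [
        "  5.00\t500\t500\tU\t0\tunclassified",
        " 95.00\t9500\t0\tR\t1\troot",
        " 90.00\t9000\t0\tG\t9605\t          Homo",
        "  4.00\t400\t0\tG\t561\t          Escherichia",
        "  0.03\t3\t0\tG\t1279\t          Staphylococcus",
    ]
)

if __name__ == "__main__":
    for hit in flag_candidates(parse_kraken2_report(EXAMPLE_REPORT)):
        print(f"{hit.name}: {hit.clade_reads} reads ({hit.percent}%)")
```

With the made-up data above, only the genus with hundreds of reads is kept, and the three-read genus is dropped as a likely false positive. A real cutoff would need to be chosen per dataset and sequencing depth.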

One other note: I chose --confidence 0.05 based on Pereira-Marques et al., 2024 and --minimum-hit-groups 3 based on Lu et al., 2022, a protocol paper for finding pathogens with Kraken2.
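For anyone who wants to try these settings outside the pipeline, below is a rough sketch that assembles a Kraken2 command line with them. Kraken2's own flag for the confidence threshold is `--confidence`. The helper name, the database path, and the file names are hypothetical, and this does not necessarily mirror how the nf-core module invokes Kraken2.

```python
import shlex
from typing import List, Sequence


def build_kraken2_command(
    db: str,
    reads: Sequence[str],
    report_path: str,
    confidence: float = 0.05,
    min_hit_groups: int = 3,
    threads: int = 4,
) -> List[str]:
    """Assemble a Kraken2 argument list for single- or paired-end reads."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if len(reads) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) read files")

    cmd = [
        "kraken2",
        "--db", db,
        "--threads", str(threads),
        "--confidence", str(confidence),
        "--minimum-hit-groups", str(min_hit_groups),
        "--report", report_path,
    ]
    if len(reads) == 2:
        cmd.append("--paired")
    if all(r.endswith(".gz") for r in reads):
        cmd.append("--gzip-compressed")
    cmd.extend(reads)
    return cmd


if __name__ == "__main__":
    # Hypothetical database directory and unaligned-read file names.
    command = build_kraken2_command(
        "k2_pluspf",
        ["sample.unmapped_1.fastq.gz", "sample.unmapped_2.fastq.gz"],
        "sample.kraken2.report.txt",
    )
    print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run` without shell quoting issues; `shlex.join` is used here only to print a copy-pasteable command.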

From @Shaun-Regenbaum

Just adding my two cents. I helped Ezra in person, so many of my comments are not here, but I think what is here is fantastic and pretty close to being ready for merging. I totally agree on the default being not to run this, but I think its inclusion is a positive change.

In my eyes, this is the start of supporting alternative QC checks in the pipeline that have been largely ignored in the bioinformatics community up until now. Even if it is computationally heavy right now to run these, simply adding them as optional points can set a good precedent and gold standard for how to add further custom QC to the pipeline (both for the OS community and industry).

In fact, I think there is room for work to be done by the community to optimize these kinds of pipelines to make them light enough to work as a default without incurring too much cost/time.

As a side note, we are going to start using this internally in our university lab (and company) as an additional sanity check. I am also considering using this to conduct a large-scale meta-analysis on the state of RNA-Seq data (especially human) to estimate contamination across the field, as I don't believe it has been done. Starting to write some potential grants for this now.

Overall, just want to write massive props to @egreenberg7 for his work. He has a bright future, and hopefully we can get his name on a couple cool papers in the near future :)

From @davidecarlson

While this is possibly not the right place to say it, I just wanted to note that I love this potential addition to the pipeline. I have to check RNA-seq data for contaminants more often than I would like (currently I use nf-core taxprofiler for this purpose), but having this option within the rnaseq pipeline would be fabulous!

egreenberg7 commented 2 months ago

Before the next release, the new metro map will need to be animated

maxulysse commented 2 months ago

Thanks for updating the subway map, I'll update the animated subway map in a separate PR

maxulysse commented 2 months ago

It looks good to me, you already have approval from @MatthiasZepper, but I'd like a confirmation from @pinin4fjords as well before going forward

maxulysse commented 2 months ago

I would have liked a confirmation by @pinin4fjords as I said in my comment, but that ship has sailed :rocket:.

@Shaun-Regenbaum Thanks for approval and merging.

Shaun-Regenbaum commented 2 months ago

Sorry I didn't see your comment before, my bad, was a bit too trigger happy. ~~Edit: Actually after looking at the timestamps, I actually merged before your comments.~~

maxulysse commented 2 months ago

I commented 2 hours before you approved and merged

maxulysse commented 2 months ago

We've had merging cowboys before, so you're just following the tradition from the best of us cc @apeltzer

Shaun-Regenbaum commented 2 months ago

Ah you are right, I apologize again. 😅