nf-core / demultiplex

Demultiplexing pipeline for sequencing data
MIT License

Samplesheet validator #234

Closed nschcolnicov closed 1 month ago

nschcolnicov commented 1 month ago

PR created to address this issue:

PR checklist

nschcolnicov commented 1 month ago

A few things to discuss:

  1. I added a README with some guidelines on how to create the validator schema JSON file for samshee. I'm not sure where else I could put this; the best solution would probably be to open a PR against the samshee repository and add it there.
  2. The samplesheet validator doesn't seem to work with any of the samplesheets from our test profiles besides the one in the test_two_lanes.config profile. I'm not sure whether this is the expected behaviour or whether there is something to fix in the tool. More info here:

github-actions[bot] commented 1 month ago

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 8587df1

+| ✅ 190 tests passed       |+
#| ❔   3 tests were ignored |#
!| ❗   6 tests had warnings |!
### ❗ Test warnings:

* [pipeline_todos] - TODO string in ``: _Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets._
* [pipeline_todos] - TODO string in ``: _Optionally add in-text citation tools to this list._
* [pipeline_todos] - TODO string in ``: _Optionally add bibliographic entries to this list._
* [pipeline_todos] - TODO string in ``: _Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!_
* [pipeline_todos] - TODO string in `methods_description_template.yml`: _#Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline_
* [pipeline_todos] - TODO string in `awsfulltest.yml`: _You can customise AWS full pipeline tests as required_

### ❔ Tests ignored:

* [files_unchanged] - File ignored due to lint config: `.github/ISSUE_TEMPLATE/bug_report.yml`
* [files_unchanged] - File ignored due to lint config: `.github/workflows/linting.yml`
* [actions_ci] - actions_ci

### ✅ Tests passed:

* [files_exist] - File found: `.gitattributes`
* [files_exist] - File found: `.gitignore`
* [files_exist] - File found: `.nf-core.yml`
* [files_exist] - File found: `.editorconfig`
* [files_exist] - File found: `.prettierignore`
* [files_exist] - File found: `.prettierrc.yml`
* [files_exist] - File found: ``
* [files_exist] - File found: ``
* [files_exist] - File found: ``
* [files_exist] - File found: `LICENSE` or `` or `LICENCE` or ``
* [files_exist] - File found: `nextflow_schema.json`
* [files_exist] - File found: `nextflow.config`
* [files_exist] - File found: ``
* [files_exist] - File found: `.github/.dockstore.yml`
* [files_exist] - File found: `.github/`
* [files_exist] - File found: `.github/ISSUE_TEMPLATE/bug_report.yml`
* [files_exist] - File found: `.github/ISSUE_TEMPLATE/config.yml`
* [files_exist] - File found: `.github/ISSUE_TEMPLATE/feature_request.yml`
* [files_exist] - File found: `.github/`
* [files_exist] - File found: `.github/workflows/branch.yml`
* [files_exist] - File found: `.github/workflows/ci.yml`
* [files_exist] - File found: `.github/workflows/linting_comment.yml`
* [files_exist] - File found: `.github/workflows/linting.yml`
* [files_exist] - File found: `assets/email_template.html`
* [files_exist] - File found: `assets/email_template.txt`
* [files_exist] - File found: `assets/sendmail_template.txt`
* [files_exist] - File found: `assets/nf-core-demultiplex_logo_light.png`
* [files_exist] - File found: `conf/modules.config`
* [files_exist] - File found: `conf/test.config`
* [files_exist] - File found: `conf/test_full.config`
* [files_exist] - File found: `docs/images/nf-core-demultiplex_logo_light.png`
* [files_exist] - File found: `docs/images/nf-core-demultiplex_logo_dark.png`
* [files_exist] - File found: `docs/`
* [files_exist] - File found: `docs/`
* [files_exist] - File found: `docs/`
* [files_exist] - File found: `docs/`
* [files_exist] - File found: ``
* [files_exist] - File found: `assets/multiqc_config.yml`
* [files_exist] - File found: `conf/base.config`
* [files_exist] - File found: `conf/igenomes.config`
* [files_exist] - File found: `.github/workflows/awstest.yml`
* [files_exist] - File found: `.github/workflows/awsfulltest.yml`
* [files_exist] - File found: `modules.json`
* [files_exist] - File not found check: `.github/ISSUE_TEMPLATE/`
* [files_exist] - File not found check: `.github/ISSUE_TEMPLATE/`
* [files_exist] - File not found check: `.github/workflows/push_dockerhub.yml`
* [files_exist] - File not found check: `.markdownlint.yml`
* [files_exist] - File not found check: `.nf-core.yaml`
* [files_exist] - File not found check: `.yamllint.yml`
* [files_exist] - File not found check: `bin/markdown_to_html.r`
* [files_exist] - File not found check: `conf/aws.config`
* [files_exist] - File not found check: `docs/images/nf-core-demultiplex_logo.png`
* [files_exist] - File not found check: `lib/Checks.groovy`
* [files_exist] - File not found check: `lib/Completion.groovy`
* [files_exist] - File not found check: `lib/NfcoreTemplate.groovy`
* [files_exist] - File not found check: `lib/Utils.groovy`
* [files_exist] - File not found check: `lib/Workflow.groovy`
* [files_exist] - File not found check: `lib/WorkflowMain.groovy`
* [files_exist] - File not found check: `lib/WorkflowDemultiplex.groovy`
* [files_exist] - File not found check: `parameters.settings.json`
* [files_exist] - File not found check: `pipeline_template.yml`
* [files_exist] - File not found check: `Singularity`
* [files_exist] - File not found check: `lib/nfcore_external_java_deps.jar`
* [files_exist] - File not found check: `.travis.yml`
* [nextflow_config] - Config variable found: ``
* [nextflow_config] - Config variable found: `manifest.nextflowVersion`
* [nextflow_config] - Config variable found: `manifest.description`
* [nextflow_config] - Config variable found: `manifest.version`
* [nextflow_config] - Config variable found: `manifest.homePage`
* [nextflow_config] - Config variable found: `timeline.enabled`
* [nextflow_config] - Config variable found: `trace.enabled`
* [nextflow_config] - Config variable found: `report.enabled`
* [nextflow_config] - Config variable found: `dag.enabled`
* [nextflow_config] - Config variable found: `process.cpus`
* [nextflow_config] - Config variable found: `process.memory`
* [nextflow_config] - Config variable found: `process.time`
* [nextflow_config] - Config variable found: `params.outdir`
* [nextflow_config] - Config variable found: `params.input`
* [nextflow_config] - Config variable found: `params.validationShowHiddenParams`
* [nextflow_config] - Config variable found: `params.validationSchemaIgnoreParams`
* [nextflow_config] - Config variable found: `manifest.mainScript`
* [nextflow_config] - Config variable found: `timeline.file`
* [nextflow_config] - Config variable found: `trace.file`
* [nextflow_config] - Config variable found: `report.file`
* [nextflow_config] - Config variable found: `dag.file`
* [nextflow_config] - Config variable (correctly) not found: `params.nf_required_version`
* [nextflow_config] - Config variable (correctly) not found: `params.container`
* [nextflow_config] - Config variable (correctly) not found: `params.singleEnd`
* [nextflow_config] - Config variable (correctly) not found: `params.igenomesIgnore`
* [nextflow_config] - Config variable (correctly) not found: ``
* [nextflow_config] - Config variable (correctly) not found: `params.enable_conda`
* [nextflow_config] - Config ``timeline.enabled`` had correct value: ``true``
* [nextflow_config] - Config ``report.enabled`` had correct value: ``true``
* [nextflow_config] - Config ``trace.enabled`` had correct value: ``true``
* [nextflow_config] - Config ``dag.enabled`` had correct value: ``true``
* [nextflow_config] - Config ```` began with ``nf-core/``
* [nextflow_config] - Config variable ``manifest.homePage`` began with
* [nextflow_config] - Config ``dag.file`` ended with ``.html``
* [nextflow_config] - Config variable ``manifest.nextflowVersion`` started with >= or !>=
* [nextflow_config] - Config ``manifest.version`` ends in ``dev``: ``1.5.0dev``
* [nextflow_config] - Config `params.custom_config_version` is set to `master`
* [nextflow_config] - Config `params.custom_config_base` is set to ``
* [nextflow_config] - Lines for loading custom profiles found
* [nextflow_config] - nextflow.config contains configuration profile `test`
* [nextflow_config] - Config default value correct: params.trim_fastq= true
* [nextflow_config] - Config default value correct: params.skip_tools= []
* [nextflow_config] - Config default value correct: params.sample_size= 100000
* [nextflow_config] - Config default value correct: params.demultiplexer= bclconvert
* [nextflow_config] - Config default value correct: params.custom_config_version= master
* [nextflow_config] - Config default value correct: params.custom_config_base=
* [nextflow_config] - Config default value correct: params.checkqc_config= []
* [nextflow_config] - Config default value correct: params.max_cpus= 16
* [nextflow_config] - Config default value correct: params.max_memory= 128.GB
* [nextflow_config] - Config default value correct: params.max_time= 240.h
* [nextflow_config] - Config default value correct: params.publish_dir_mode= copy
* [nextflow_config] - Config default value correct: params.max_multiqc_email_size= 25.MB
* [nextflow_config] - Config default value correct: params.remove_adapter= true
* [nextflow_config] - Config default value correct: params.validate_params= true
* [nextflow_config] - Config default value correct: params.pipelines_testdata_base_path=
* [files_unchanged] - `.gitattributes` matches the template
* [files_unchanged] - `.prettierrc.yml` matches the template
* [files_unchanged] - `` matches the template
* [files_unchanged] - `LICENSE` matches the template
* [files_unchanged] - `.github/.dockstore.yml` matches the template
* [files_unchanged] - `.github/` matches the template
* [files_unchanged] - `.github/ISSUE_TEMPLATE/config.yml` matches the template
* [files_unchanged] - `.github/ISSUE_TEMPLATE/feature_request.yml` matches the template
* [files_unchanged] - `.github/` matches the template
* [files_unchanged] - `.github/workflows/branch.yml` matches the template
* [files_unchanged] - `.github/workflows/linting_comment.yml` matches the template
* [files_unchanged] - `assets/email_template.html` matches the template
* [files_unchanged] - `assets/email_template.txt` matches the template
* [files_unchanged] - `assets/sendmail_template.txt` matches the template
* [files_unchanged] - `assets/nf-core-demultiplex_logo_light.png` matches the template
* [files_unchanged] - `docs/images/nf-core-demultiplex_logo_light.png` matches the template
* [files_unchanged] - `docs/images/nf-core-demultiplex_logo_dark.png` matches the template
* [files_unchanged] - `docs/` matches the template
* [files_unchanged] - `.gitignore` matches the template
* [files_unchanged] - `.prettierignore` matches the template
* [actions_awstest] - '.github/workflows/awstest.yml' is triggered correctly
* [actions_awsfulltest] - `.github/workflows/awsfulltest.yml` is triggered correctly
* [actions_awsfulltest] - `.github/workflows/awsfulltest.yml` does not use `-profile test`
* [readme] - README Nextflow minimum version badge matched config. Badge: `23.04.0`, Config: `23.04.0`
* [readme] - README Zenodo placeholder was replaced with DOI.
* [pipeline_name_conventions] - Name adheres to nf-core convention
* [template_strings] - Did not find any Jinja template strings (249 files)
* [schema_lint] - Schema lint passed
* [schema_lint] - Schema title + description lint passed
* [schema_lint] - Input mimetype lint passed: 'text/csv'
* [schema_params] - Schema matched params returned from nextflow config
* [system_exit] - No `System.exit` calls found
* [actions_schema_validation] - Workflow validation passed: awstest.yml
* [actions_schema_validation] - Workflow validation passed: branch.yml
* [actions_schema_validation] - Workflow validation passed: fix-linting.yml
* [actions_schema_validation] - Workflow validation passed: linting.yml
* [actions_schema_validation] - Workflow validation passed: clean-up.yml
* [actions_schema_validation] - Workflow validation passed: ci.yml
* [actions_schema_validation] - Workflow validation passed: linting_comment.yml
* [actions_schema_validation] - Workflow validation passed: awsfulltest.yml
* [actions_schema_validation] - Workflow validation passed: download_pipeline.yml
* [actions_schema_validation] - Workflow validation passed: release-announcements.yml
* [merge_markers] - No merge markers found in pipeline files
* [modules_json] - Only installed modules found in `modules.json`
* [multiqc_config] - `assets/multiqc_config.yml` found and not ignored.
* [multiqc_config] - `assets/multiqc_config.yml` contains `report_section_order`
* [multiqc_config] - `assets/multiqc_config.yml` contains `export_plots`
* [multiqc_config] - `assets/multiqc_config.yml` contains `report_comment`
* [multiqc_config] - `assets/multiqc_config.yml` follows the ordering scheme of the minimally required plugins.
* [multiqc_config] - `assets/multiqc_config.yml` contains a matching 'report_comment'.
* [multiqc_config] - `assets/multiqc_config.yml` contains 'export_plots: true'.
* [modules_structure] - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'
* [base_config] - `conf/base.config` found and not ignored.
* [modules_config] - `conf/modules.config` found and not ignored.
* [modules_config] - `UNTAR` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `BCLCONVERT` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `BCL2FASTQ` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `BASES2FASTQ` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `FASTP` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `FALCO` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `KRAKEN2` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `SEQTK_SAMPLE` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `MD5SUM` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `CUSTOM_DUMPSOFTWAREVERSIONS` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `SGDEMUX` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `FQTK` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `CELLRANGER_MKFASTQ` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `MULTIQC` found in `conf/modules.config` and Nextflow scripts.
* [modules_config] - `CHECKQC` found in `conf/modules.config` and Nextflow scripts.
* [nfcore_yml] - Repository type in `.nf-core.yml` is valid: `pipeline`
* [nfcore_yml] - nf-core version in `.nf-core.yml` is set to the latest version: `2.14.1`

### Run details

* nf-core/tools version 2.14.1
* Run at `2024-08-09 19:26:37`

edmundmiller commented 1 month ago

A few questions for sustainability (not trying to just tear down your hard work):

  1. Is a custom Docker image required here? Could it just be an environment.yml, with the image then built by Wave? Is samshee on bioconda?
  2. Could this be an nf-core module?
  3. Any possibility that nf-validation or nf-schema could be used instead? Is this just a fancy way to call a JSON schema against the CSV?

nschcolnicov commented 1 month ago

> A few questions for sustainability (not trying to just tear down your hard work):
>
>   1. Is a custom Docker image required here? Could it just be an environment.yml, with the image then built by Wave? Is samshee on bioconda?
>   2. Could this be an nf-core module?
>   3. Any possibility that nf-validation or nf-schema could be used instead? Is this just a fancy way to call a JSON schema against the CSV?

Hi @edmundmiller, no worries! Thank you for reviewing these changes.

  1. As far as I can tell, samshee is not available in bioconda. Is this a requirement for building the image with Wave? I'm not familiar with this approach. UPDATE: I replaced the Docker image with Wave.
  2. This could be an nf-core module, which is why I already added some of the additional files required, in case we decide to do so.
  3. What would be the benefit of having it inside nf-validation or nf-schema? I didn't develop this tool; would it require considerable effort to do this?

nschcolnicov commented 1 month ago

@nf-core-bot fix linting

apeltzer commented 1 month ago

Also chiming in here @edmundmiller - Nicolas is working with us to get this in, as we saw that several rules need to be enforced to verify that Illumina samplesheets are accepted by, at least, Illumina bcl2fastq/bclconvert. Until now, we have repeatedly had issues where people's samplesheets deviated slightly from the standard (e.g. uM or a (DNA/RNA)), and it took a while before the pipeline failed.

Samshee (though new) looks great and validates the Illumina samplesheet very nicely. This is technically maybe also feasible via nf-validation, but that might take a while, and the rules would more or less duplicate what samshee already does. We will still use nf-validation to validate the pipeline samplesheet, though.

edmundmiller commented 1 month ago

> As far as I can tell, samshee is not available in bioconda. Is this a requirement for building the image with Wave? I'm not familiar with this approach. UPDATE: I replaced the Docker image with Wave.

Awesome, I'll add an environment.yml with what I'm talking about, update the container, and add a Singularity image.

> This could be an nf-core module, which is why I already added some of the additional files required, in case we decide to do so.

Sweet, just checking because others might find it useful.

> What would be the benefit of having it inside nf-validation or nf-schema? I didn't develop this tool; would it require considerable effort to do this?

You can talk to @maxulysse more about this one 😆 Essentially, spinning up a process just to validate is slow when this could be done within the main Nextflow process. It might also be more maintainable: instead of a custom script, it could be a few lines of Groovy.

@apeltzer That all sounds good, just wanted to make sure the alternatives were thought of so when someone who's a stickler for the rules comes along, I can point to a follow-up issue.

grst commented 1 month ago

> Any possibility that nf-validation or nf-schema could be used instead? Is this just a fancy way to call a JSON schema against the CSV?

samshee also has some custom Python code for additional validation; it does not just use a JSON schema. See e.g.

Also, an Illumina samplesheet contains multiple [Sections]. Samshee first parses the different sections and then applies different schemas to each of them.
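To illustrate the parse-then-validate-per-section idea, here is a minimal standard-library sketch. It is not samshee's code or API: the helper names (`split_sections`, `check_header`, `check_data`, `validate`) and the two rules are hypothetical, and it assumes a simple comma-separated sheet without quoted commas.

```python
import re

# A section starts with a line such as "[Header]" (possibly padded with commas).
SECTION_HEADER = re.compile(r"^\[(?P<name>[^\]]+)\]")


def split_sections(text):
    """Split samplesheet text into {section_name: rows}; each row is a list of cells."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.strip(","):
            continue  # skip blank lines and comma-only padding rows
        match = SECTION_HEADER.match(line)
        if match:
            current = match.group("name")
            sections[current] = []
        elif current is not None:
            sections[current].append([cell.strip() for cell in line.split(",")])
    return sections


def check_header(rows):
    """Hypothetical rule: every [Header] row is a non-empty key with a non-empty value."""
    return [
        f"[Header] row {n} is not a key/value pair"
        for n, row in enumerate(rows, start=1)
        if len(row) < 2 or not row[0] or not row[1]
    ]


def check_data(rows):
    """Hypothetical rule: [Data] is a table whose Sample_ID column is never empty."""
    if not rows:
        return ["[Data] is empty"]
    columns, records = rows[0], rows[1:]
    if "Sample_ID" not in columns:
        return ["[Data] has no Sample_ID column"]
    idx = columns.index("Sample_ID")
    return [
        f"[Data] row {n} has an empty Sample_ID"
        for n, rec in enumerate(records, start=1)
        if idx >= len(rec) or not rec[idx]
    ]


# Each section gets its own check, mirroring "different schemas per section".
VALIDATORS = {"Header": check_header, "Data": check_data}


def validate(text):
    """Return a list of error messages; an empty list means the sheet passed."""
    sections = split_sections(text)
    errors = []
    for name, check in VALIDATORS.items():
        if name not in sections:
            errors.append(f"missing section [{name}]")
        else:
            errors.extend(check(sections[name]))
    return errors


SHEET = """[Header]
FileFormatVersion,2
[Data]
Sample_ID,index
S1,ACGT
,TTGA
"""

print(validate(SHEET))  # ['[Data] row 2 has an empty Sample_ID']
```

The point of the sketch is only the dispatch structure: a single flat CSV schema cannot express this, because the key/value sections and the tabular sections need different rules.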

nschcolnicov commented 1 month ago

All comments have been addressed. @edmundmiller, could you give it a final look and confirm that your requested changes are covered?

apeltzer commented 1 month ago

@nf-core-bot fix linting

apeltzer commented 1 month ago

Hey folks - please continue in this one here: as this is editable by anyone here on demux (I wanted to fix linting and some smaller bits myself, but found that your PR was from your fork, so I couldn't edit anything...) :)