Workflow inputs definition and schema

Spun off from #2723

Params can currently be defined in config files (including profiles), params files, CLI options, and the pipeline code itself. This creates the potential for much confusion around how these various sources are resolved (see #2662). Additionally, params are not typed, and while the CLI can cast command line params based on regular expressions, it can also backfire when e.g. a string param is given a value that "looks" like a number.

Instead, params should be defined in a single place with metadata such as type, default value, description, etc. Benefits are:

single source of truth
less ambiguity of how params are resolved
ability to validate params based on type definition instead of regex guessing

The nf-core parameter schema (nextflow_schema.json) as well as the nf-validation plugin are excellent steps in this direction, and the solution may be to simply incorporate them into Nextflow.

For backwards compatibility, we may allow params to be set in config files and pipeline code, but this would essentially be overriding the default value rather than "defining" the param, and it should be discouraged in favor of putting everything in the parameter schema. That being said, it can be useful to set params from a profile, such as a test profile that provides some test data, so this use case should be supported.

The main question that I see is whether the schema should be in a separate JSON/YAML file (as it currently is in nf-core) or in the pipeline code as part of the top-level workflow definition.

I would like the latter approach because it makes the top-level workflow more of a "unit" and makes it easier for IDE tooling to validate param references in the pipeline code. It would likely be less verbose than a JSON schema.
On the other hand, a JSON schema can be parsed by external tools written in other languages whereas Nextflow scripts can only be parsed by Nextflow (and any IDE tooling)
For what it's worth, Nextflow could export the workflow inputs definition to a JSON file for use with other tools, but then we have to keep it in sync with the pipeline code somehow

Hello there! I wanted to bring this project to the attention! https://pkl-lang.org

It was released by apple so I will have some support and I feel like it addresses really well the needs for schema validation and progressive amendment (+java library).

let me know what you think!

https://github.com/nextflow-io/nextflow/issues/2723#issuecomment-1896242139

This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools

I think its common to keep the params bundled with the config profiles; a lot of params are ultimately just paths to reference files, and you will need different paths to the files if you are on HPC vs cloud.

It seems like this would break that.

fwiw so far the typing of "default values" for the params has been one of the recurring headaches I have had lately, need to have a way to support

value typing (e.g. int, string, bool)
different default values for different profiles
differentiate between an unset value, a value set with a default, and a value set with a user-supplied value

its been my experience that nextflow_schema.json has severe issues especially with the latter two. Here is a common example;

I have an R script where "NA" is a recognized "unset" cli arg. I want my users to be able to pass in an arg from the Nextflow pipeline params. If the user submits an arg, it must be an int value. However if a value is not passed, I need to pass "NA" to the script instead, which is a string. If my params.Rscript_val default is null, it does not pass to the R script correctly and I have to hack in Groovy to fix it. If my params.Rscript_val default is "NA", it works, but the nextflow_schema.json does not support it because, due to the user-input requirement of only int, I can only express the compatible input value in terms of int from nextflow_schema.json

this is just one example but it highlights a relatively common type of situation; the limitations with nextflow_schema.json have also bled over into things like Nextflow Tower / Seqera platform which iirc use it for parsing the input fields for the pipeline run UI.

similarly, trying to have e.g. SLURM default paths vs. AWS S3 defaults paths, based on profile, supported in the nextflow_schema.json, is something I still have not figured out.

What I've been thinking is that the nextflow_schema.json should be the source of truth instead of the config file, but config profiles should still be able to override the default value. I think that would give the best of both worls.

As for your R example, I think the best practice is to encode that convention in the process that calls the R script like so:

"""
Rscript script.R --val ${params.Rscript_val ?: 'NA'}
"""

Thinking more on this, and with some inspiration from the output DSL prototype, here's a sketch for an input DSL for fetchngs:

inputs {
    input {
        type Path // type: 'file'
        required true
        mimetype 'text/csv'
        pattern '^\\S+\\.(csv|tsv|txt)$'
        schema 'assets/schema_input.json'
        description 'File containing SRA/ENA/GEO/DDBJ identifiers one per line to download their associated metadata and FastQ files.'
    }

    // ...

    nf_core_pipeline {
        type String
        description '''Name of supported nf-core pipeline e.g. 'rnaseq'. A samplesheet for direct use with the pipeline will be created with the appropriate columns.'''
        enum 'rnaseq', 'atacseq', 'viralrecon', 'taxprofiler'
    }

    nf_core_rnaseq_strandedness {
        type String
        description '''Value for 'strandedness' entry added to samplesheet created when using '--nf_core_pipeline rnaseq'.'''
        help '''The default is 'auto' which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.'''
        defaultValue 'auto'
    }

    // ...
}

Notes:

JSON schema can be generated with a CLI command e.g. in CI before each release, JSON file should never be edited directly
can also model param sections, additional metadata, but ideally this basic syntax should be supported
could use CLI options to define extra fields like title, description
names and types are used to detect undefined params, validate param type, cast CLI params to appropriate type
schema (e.g. for input) can be used to validate structure of an input file
- can also be used to map file contents to in-memory data structure, though maybe should be separate, e.g. add schema option to splitCsv
params still accessed via params but only allowed in anonymous workflow and output DSL
params usage in subworkflows should be replaced with explicit inputs in take: block
config file can use params and override them, but only config profiles should override, top-level params block should be deprecated
but config is executed before script, so how can config use params?
- config could use the params schema if it exists, otherwise fallback to "no params allowed"
- 4744 could load the config but not evaluate params until after script execution when they are fully resolved
- 4510 is loaded on module inclusion, params are fully resolved
- main config could just be loaded after the script? most config options are not used until workflow starts, though params from config profiles are needed

I really like the input DSL, but the circular dependency with the config is a problem. I have listed a few ideas to address this, though none of them are complete IMO. Maybe some combination of them will do the trick.

Need to think further on the relationship between params, config, and script

One way to solve the circular dependency might be to restrict the scope of params to only things that are actually workflow inputs. In other words, don't allow the config to reference params at all. Then you could define params in the pipeline code (like the output definition) and generate a YAML schema for use by external tools.

The config file should still be able to set params. Nextflow would be able to validate them at runtime because it could evaluate the params definition before it evaluates the entry workflow and output definition (the only two places where params can be used).

The params that are typically used in the config file tend to be external to the workflow itself, for example:

outdir, publish_dir_mode: should become config settings e.g. workflow.output.directory and workflow.output.mode
max_cpus, max_memory, max_time: should be replaced by resourceLabels directive

The main consequence is that you would only be able to use params to control workflow inputs and not config settings. Things that might previously be an additional CLI option:

$ nextflow run nf-core/rnaseq --max_cpus 24

Would become config:

$ cat extra.config
process.resourceLimits.cpus = 24

$ nextflow run nf-core/rnaseq -c extra.config

I think I life this tradeoff, though I can appreciate why many people might like the power and convenience of params as they currently work. Maybe this would be a good long-term goal to work towards. First we focus on incorporating the param schema, then we can think about adding a params definition alongside the entry workflow.

nextflow-io / nextflow

Workflow inputs definition and schema #4669

4744 could load the config but not evaluate params until after script execution when they are fully resolved

4510 is loaded on module inclusion, params are fully resolved