nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.79k stars 633 forks source link

Workflow inputs definition and schema #4669

Open bentsherman opened 10 months ago

bentsherman commented 10 months ago

Spun off from #2723

Params can currently be defined in config files (including profiles), params files, CLI options, and the pipeline code itself. This creates the potential for much confusion around how these various sources are resolved (see #2662). Additionally, params are not typed, and while the CLI can cast command line params based on regular expressions, it can also backfire when e.g. a string param is given a value that "looks" like a number.

Instead, params should be defined in a single place with metadata such as type, default value, description, etc. Benefits are:

The nf-core parameter schema (nextflow_schema.json) as well as the nf-validation plugin are excellent steps in this direction, and the solution may be to simply incorporate them into Nextflow.

For backwards compatibility, we may allow params to be set in config files and pipeline code, but this would essentially be overriding the default value rather than "defining" the param, and it should be discouraged in favor of putting everything in the parameter schema. That being said, it can be useful to set params from a profile, such as a test profile that provides some test data, so this use case should be supported.

The main question that I see is whether the schema should be in a separate JSON/YAML file (as it currently is in nf-core) or in the pipeline code as part of the top-level workflow definition.

jspaezp commented 9 months ago

Hello there! I wanted to bring this project to the attention! https://pkl-lang.org

It was released by apple so I will have some support and I feel like it addresses really well the needs for schema validation and progressive amendment (+java library).

let me know what you think!

stevekm commented 9 months ago

https://github.com/nextflow-io/nextflow/issues/2723#issuecomment-1896242139

This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools

I think its common to keep the params bundled with the config profiles; a lot of params are ultimately just paths to reference files, and you will need different paths to the files if you are on HPC vs cloud.

It seems like this would break that.

fwiw so far the typing of "default values" for the params has been one of the recurring headaches I have had lately, need to have a way to support

its been my experience that nextflow_schema.json has severe issues especially with the latter two. Here is a common example;

I have an R script where "NA" is a recognized "unset" cli arg. I want my users to be able to pass in an arg from the Nextflow pipeline params. If the user submits an arg, it must be an int value. However if a value is not passed, I need to pass "NA" to the script instead, which is a string. If my params.Rscript_val default is null, it does not pass to the R script correctly and I have to hack in Groovy to fix it. If my params.Rscript_val default is "NA", it works, but the nextflow_schema.json does not support it because, due to the user-input requirement of only int, I can only express the compatible input value in terms of int from nextflow_schema.json

this is just one example but it highlights a relatively common type of situation; the limitations with nextflow_schema.json have also bled over into things like Nextflow Tower / Seqera platform which iirc use it for parsing the input fields for the pipeline run UI.

similarly, trying to have e.g. SLURM default paths vs. AWS S3 defaults paths, based on profile, supported in the nextflow_schema.json, is something I still have not figured out.

bentsherman commented 9 months ago

What I've been thinking is that the nextflow_schema.json should be the source of truth instead of the config file, but config profiles should still be able to override the default value. I think that would give the best of both worls.

As for your R example, I think the best practice is to encode that convention in the process that calls the R script like so:

"""
Rscript script.R --val ${params.Rscript_val ?: 'NA'}
"""
bentsherman commented 8 months ago

Thinking more on this, and with some inspiration from the output DSL prototype, here's a sketch for an input DSL for fetchngs:

inputs {
    input {
        type Path // type: 'file'
        required true
        mimetype 'text/csv'
        pattern '^\\S+\\.(csv|tsv|txt)$'
        schema 'assets/schema_input.json'
        description 'File containing SRA/ENA/GEO/DDBJ identifiers one per line to download their associated metadata and FastQ files.'
    }

    // ...

    nf_core_pipeline {
        type String
        description '''Name of supported nf-core pipeline e.g. 'rnaseq'. A samplesheet for direct use with the pipeline will be created with the appropriate columns.'''
        enum 'rnaseq', 'atacseq', 'viralrecon', 'taxprofiler'
    }

    nf_core_rnaseq_strandedness {
        type String
        description '''Value for 'strandedness' entry added to samplesheet created when using '--nf_core_pipeline rnaseq'.'''
        help '''The default is 'auto' which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.'''
        defaultValue 'auto'
    }

    // ...
}

Notes:

I really like the input DSL, but the circular dependency with the config is a problem. I have listed a few ideas to address this, though none of them are complete IMO. Maybe some combination of them will do the trick.

Need to think further on the relationship between params, config, and script

bentsherman commented 5 months ago

One way to solve the circular dependency might be to restrict the scope of params to only things that are actually workflow inputs. In other words, don't allow the config to reference params at all. Then you could define params in the pipeline code (like the output definition) and generate a YAML schema for use by external tools.

The config file should still be able to set params. Nextflow would be able to validate them at runtime because it could evaluate the params definition before it evaluates the entry workflow and output definition (the only two places where params can be used).

The params that are typically used in the config file tend to be external to the workflow itself, for example:

The main consequence is that you would only be able to use params to control workflow inputs and not config settings. Things that might previously be an additional CLI option:

$ nextflow run nf-core/rnaseq --max_cpus 24

Would become config:

$ cat extra.config
process.resourceLimits.cpus = 24

$ nextflow run nf-core/rnaseq -c extra.config

I think I life this tradeoff, though I can appreciate why many people might like the power and convenience of params as they currently work. Maybe this would be a good long-term goal to work towards. First we focus on incorporating the param schema, then we can think about adding a params definition alongside the entry workflow.