Open bentsherman opened 10 months ago
Hello there! I wanted to bring this project to your attention: https://pkl-lang.org
It was released by Apple, so it will have some support, and I feel it addresses the needs for schema validation and progressive amendment really well (plus it has a Java library).
Let me know what you think!
https://github.com/nextflow-io/nextflow/issues/2723#issuecomment-1896242139
This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools
I think it's common to keep the params bundled with the config profiles; a lot of params are ultimately just paths to reference files, and you will need different paths to the files if you are on HPC vs cloud.
It seems like this would break that.
fwiw so far the typing of "default values" for the params has been one of the recurring headaches I have had lately, need to have a way to support
it's been my experience that nextflow_schema.json has severe issues, especially with the latter two. Here is a common example:
I have an R script where "NA" is a recognized "unset" CLI arg. I want my users to be able to pass in an arg from the Nextflow pipeline params. If the user submits an arg, it must be an int value. However, if a value is not passed, I need to pass "NA" to the script instead, which is a string. If my params.Rscript_val default is null, it does not pass to the R script correctly and I have to hack in Groovy to fix it. If my params.Rscript_val default is "NA", it works, but the nextflow_schema.json does not support it because, due to the user-input requirement of only int, I can only express the compatible input value in terms of int from nextflow_schema.json
this is just one example but it highlights a relatively common type of situation; the limitations with nextflow_schema.json have also bled over into things like Nextflow Tower / Seqera platform, which iirc use it for parsing the input fields for the pipeline run UI.
similarly, trying to have e.g. SLURM default paths vs. AWS S3 default paths, based on profile, supported in the nextflow_schema.json, is something I still have not figured out.
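The int-or-"NA" situation described above can be modelled by keeping the param typed as "int or unset" and converting to the string "NA" only at the point where the command line is built. This is a minimal Python sketch, not Nextflow code; the function name rscript_command is hypothetical.

```python
def rscript_command(val=None):
    """Build the Rscript command line, substituting 'NA' only when val is unset."""
    # Accept only real integers (bool is a subclass of int in Python, so reject it).
    if val is not None and (isinstance(val, bool) or not isinstance(val, int)):
        raise TypeError("val must be an int or None")
    # The string 'NA' exists only at the CLI boundary, never as a param default.
    arg = "NA" if val is None else str(val)
    return f"Rscript script.R --val {arg}"
```

Checking explicitly for an unset value matters here: a fallback based on truthiness (for example Groovy's Elvis operator) would also replace a legitimate 0 with "NA".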
What I've been thinking is that the nextflow_schema.json should be the source of truth instead of the config file, but config profiles should still be able to override the default value. I think that would give the best of both worlds.
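To make the "schema as source of truth, profiles as overrides" idea concrete, here is a small Python sketch of the resolution order. Everything in it is an assumption for illustration: the param names, the paths, the bucket name, and the rule that CLI values are applied after profile values.

```python
# Hypothetical schema: the only place where names, types and defaults are defined.
SCHEMA = {
    "genome_fasta": {"type": str, "default": None},
    "min_reads": {"type": int, "default": 1000},
}

# Hypothetical profiles: they may override values but cannot define new params.
PROFILES = {
    "slurm": {"genome_fasta": "/shared/ref/genome.fa"},
    "aws": {"genome_fasta": "s3://example-bucket/ref/genome.fa"},
}


def resolve_params(profile=None, cli=None):
    """Start from schema defaults, then apply profile values, then CLI values."""
    resolved = {name: spec["default"] for name, spec in SCHEMA.items()}
    layers = [PROFILES[profile] if profile else {}, cli or {}]
    for layer in layers:
        for name, value in layer.items():
            if name not in SCHEMA:
                raise KeyError(f"unknown param: {name}")
            expected = SCHEMA[name]["type"]
            if value is not None and not isinstance(value, expected):
                raise TypeError(f"{name} must be {expected.__name__}")
            resolved[name] = value
    return resolved
```

With this shape, resolve_params(profile="slurm") and resolve_params(profile="aws") yield different reference paths while the schema still owns the type of genome_fasta.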
As for your R example, I think the best practice is to encode that convention in the process that calls the R script like so:
"""
Rscript script.R --val ${params.Rscript_val ?: 'NA'}
"""
Thinking more on this, and with some inspiration from the output DSL prototype, here's a sketch for an input DSL for fetchngs:
inputs {
    input {
        type Path // type: 'file'
        required true
        mimetype 'text/csv'
        pattern '^\\S+\\.(csv|tsv|txt)$'
        schema 'assets/schema_input.json'
        description 'File containing SRA/ENA/GEO/DDBJ identifiers one per line to download their associated metadata and FastQ files.'
    }
    // ...
    nf_core_pipeline {
        type String
        description '''Name of supported nf-core pipeline e.g. 'rnaseq'. A samplesheet for direct use with the pipeline will be created with the appropriate columns.'''
        enum 'rnaseq', 'atacseq', 'viralrecon', 'taxprofiler'
    }
    nf_core_rnaseq_strandedness {
        type String
        description '''Value for 'strandedness' entry added to samplesheet created when using '--nf_core_pipeline rnaseq'.'''
        help '''The default is 'auto' which can be used with nf-core/rnaseq v3.10 onwards to auto-detect strandedness during the pipeline execution.'''
        defaultValue 'auto'
    }
    // ...
}
Notes:
input
) can be used to validate structure of an input file
schema
option to splitCsv
params
but only allowed in anonymous workflow and output DSL
take: block
I really like the input DSL, but the circular dependency with the config is a problem. I have listed a few ideas to address this, though none of them are complete IMO. Maybe some combination of them will do the trick.
Need to think further on the relationship between params, config, and script
One way to solve the circular dependency might be to restrict the scope of params to only things that are actually workflow inputs. In other words, don't allow the config to reference params at all. Then you could define params in the pipeline code (like the output definition) and generate a YAML schema for use by external tools.
The config file should still be able to set params. Nextflow would be able to validate them at runtime because it could evaluate the params definition before it evaluates the entry workflow and output definition (the only two places where params can be used).
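As a rough illustration of "define params in code, generate a schema for external tools, validate at runtime", here is a Python sketch that turns an in-code definition into a JSON-Schema-style document. The PARAMS structure and both function names are hypothetical, the descriptions are shortened, JSON is used instead of YAML only because it is in the standard library, and the validator covers just 'required' and 'enum'.

```python
import json

# Hypothetical in-code definition, loosely mirroring the fetchngs sketch.
PARAMS = [
    {"name": "input", "type": "string", "required": True,
     "pattern": r"^\S+\.(csv|tsv|txt)$"},
    {"name": "nf_core_pipeline", "type": "string",
     "enum": ["rnaseq", "atacseq", "viralrecon", "taxprofiler"]},
    {"name": "nf_core_rnaseq_strandedness", "type": "string", "default": "auto"},
]


def to_json_schema(params):
    """Convert the definition list into a JSON-Schema-style dict."""
    properties, required = {}, []
    for p in params:
        # Everything except the name and the required flag becomes a keyword.
        properties[p["name"]] = {k: v for k, v in p.items()
                                 if k not in ("name", "required")}
        if p.get("required"):
            required.append(p["name"])
    schema = {"type": "object", "properties": properties}
    if required:
        schema["required"] = required
    return schema


def validate(schema, values):
    """Tiny checker for 'required' and 'enum' only; not a full validator."""
    errors = []
    for name in schema.get("required", []):
        if values.get(name) is None:
            errors.append(f"missing required param: {name}")
    for name, value in values.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown param: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name} must be one of {spec['enum']}")
    return errors


# Export for external tools.
print(json.dumps(to_json_schema(PARAMS), indent=2))
```

Because the definition is plain data, the same object can drive both the exported file and the runtime check, which is the property the comment above relies on.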
The params that are typically used in the config file tend to be external to the workflow itself, for example:
outdir, publish_dir_mode: should become config settings e.g. workflow.output.directory and workflow.output.mode
max_cpus, max_memory, max_time: should be replaced by the resourceLimits directive
The main consequence is that you would only be able to use params to control workflow inputs and not config settings. Things that might previously be an additional CLI option:
$ nextflow run nf-core/rnaseq --max_cpus 24
Would become config:
$ cat extra.config
process.resourceLimits.cpus = 24
$ nextflow run nf-core/rnaseq -c extra.config
I think I like this tradeoff, though I can appreciate why many people might like the power and convenience of params as they currently work. Maybe this would be a good long-term goal to work towards. First we focus on incorporating the param schema, then we can think about adding a params definition alongside the entry workflow.
Spun off from #2723
Params can currently be defined in config files (including profiles), params files, CLI options, and the pipeline code itself. This creates the potential for much confusion around how these various sources are resolved (see #2662). Additionally, params are not typed, and while the CLI can cast command line params based on regular expressions, it can also backfire when e.g. a string param is given a value that "looks" like a number.
Instead, params should be defined in a single place with metadata such as type, default value, description, etc. Benefits are:
The nf-core parameter schema (nextflow_schema.json) as well as the nf-validation plugin are excellent steps in this direction, and the solution may be to simply incorporate them into Nextflow.
For backwards compatibility, we may allow params to be set in config files and pipeline code, but this would essentially be overriding the default value rather than "defining" the param, and it should be discouraged in favor of putting everything in the parameter schema. That being said, it can be useful to set params from a profile, such as a test profile that provides some test data, so this use case should be supported.
The main question that I see is whether the schema should be in a separate JSON/YAML file (as it currently is in nf-core) or in the pipeline code as part of the top-level workflow definition.
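Wherever the schema ends up living, the casting problem mentioned above (a string param whose value "looks" like a number) shows why declared types help. The Python sketch below contrasts regex-based guessing with casting driven by a declared type. It is not Nextflow's actual casting logic; guess_cast, typed_cast, and the type names are assumptions.

```python
import re


def guess_cast(raw):
    """Guess a type from the text alone, as an untyped CLI has to."""
    if re.fullmatch(r"-?\d+", raw):
        return int(raw)
    if re.fullmatch(r"-?\d+\.\d+", raw):
        return float(raw)
    if raw.lower() in ("true", "false"):
        return raw.lower() == "true"
    return raw


# Hypothetical mapping from schema type names to converters.
CASTERS = {"string": str, "integer": int, "number": float}


def typed_cast(raw, declared_type):
    """Cast using the type declared in the schema; raises ValueError on bad input."""
    return CASTERS[declared_type](raw)
```

A sample ID such as "0123" illustrates the difference: guess_cast("0123") turns it into the integer 123 and loses the leading zero, while typed_cast("0123", "string") keeps it intact, and typed_cast("abc", "integer") fails early with a ValueError.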