nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Optional inputs for DSL2 #1694

Open illusional opened 4 years ago

illusional commented 4 years ago

New feature

Pinging @rsuchecki @pditommaso (I couldn't find an issue with this, I hope it's okay that I open a new one).

Based on a short conversation on Gitter (1 - primary | 2), there's interest (a lot of it from me) in more direct support for optional inputs. This seems in line with DSL2's goal of producing reusable tool modules / interfaces.

Other workflow specifications have the concept of tool wrappers, which aim to be "write once, use in all of your workflows". This means the tool wrapper contains most (if not all) of the tool's available configuration options, from which the command line is then dynamically constructed. This allows the community to build and contribute high-quality tool wrappers, for example Common Workflow Library (CWLibrary#fastqc) and BioWDL (BioWDL#fastqc), which other users can reuse or upload to stores like Dockstore or the Galaxy toolshed.

Projects like aCLImatise aim to generate tool wrappers, as writing them is usually a significant, time-consuming part of building workflows.

The DSL2 makes good strides towards this, and a stronger concept for optional inputs would take this further.

Relevant discussion:

Command line construction sidenote

I think it would be a bad idea to create a new syntax for building or interpolating command lines, but tool developers could use the groovy environment to build strings for each command option.
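
A minimal sketch of what that string building could look like in plain Groovy. The helper names `optionalFlag` and `buildCommand` are made up for illustration and are not part of Nextflow; the idea is only that an option contributes text to the command line when its value is present.

```groovy
// Hypothetical helper (not part of Nextflow): render "flag value" only
// when a value was supplied, otherwise return an empty string.
def optionalFlag(String flag, Object value) {
    value ? "${flag} ${value}".toString() : ''
}

// Hypothetical helper: join the tool name, the non-empty options and the
// positional arguments into a single command line.
def buildCommand(String tool, List options, List positional) {
    ([tool] + options.findAll { it } + positional).join(' ')
}

// With a contaminant file the -c option is rendered; with null it is dropped.
def withFile = buildCommand('fastqc', [optionalFlag('-c', 'contaminants.txt')], ['a.fastq', 'b.fastq'])
def withoutFile = buildCommand('fastqc', [optionalFlag('-c', null)], ['a.fastq', 'b.fastq'])
```

Filtering out empty strings before joining keeps stray double spaces out of the command when several options are absent.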

Usage scenario

Consider fastqc (eg: nf-core module definition), which might have the (simplified) command structure:

fastqc \
    [-c contaminant file] \
    [ ... other config options ] \
    seqfile1 .. seqfileN

I could build a process definition to encapsulate these ways to optionally configure the tool.

This process definition is purely hypothetical; it is just one way I could think of to do it.

process FASTQC {
    input:
        tuple val(name),
        Optional[path(contaminant)],
        path(reads)

    output:
        path("*.zip"), emit: zip

    script:
    contaminant_script = (contaminant != null) ? "--contaminant ${contaminant}" : ""
    reads_script = reads.join(' ')
    """
    fastqc \
        ${contaminant_script} \
        ${reads_script}
    """
}

But calling an imported module from a DSL2 workflow requires positional arguments, so you would have something like:

include { FASTQC as fastqc } from './tools/fastqc'

workflow {
    fastqc(params.name, null, params.reads)
}

Suggested implementation

As @rsuchecki noted in gitter:

Things are very flexible for val inputs, but understandably get more complex when files/paths are involved as they need to be staged. Tuples are nice and keep things organised but are still an extension of the same idea of positional inputs.

I'd hope to avoid the use of positional arguments, because the call site gives you no context for what each value means.

Puumanamana commented 4 years ago

There are also some tools that can have multiple types of input files (actually any combination of those inputs). As such, none of them are mandatory, but you need at least one. For instance, if we look at read assemblers such as megahit, you can do either:

# Case 1: paired-end reads
megahit -1 sample1_R1.fastq.gz,sample2_R1.fastq.gz -2 sample1_R2.fastq.gz,sample2_R2.fastq.gz

# Case 2: paired-end, interleaved reads
megahit --12 sample1.fastq.gz,sample2.fastq.gz

# Case 3: single-end reads
megahit -r reads_single.fastq.gz 

# Case 4: multiple input types combined
megahit -1 sample1_paired_R1.fastq.gz,sample2_paired_R1.fastq.gz \
        -2 sample1_paired_R2.fastq.gz,sample2_paired_R2.fastq.gz \
        -r sample1_unpaired.fastq.gz,sample2_unpaired.fastq.gz

# And more...

Lately I had trouble handling this case with the DSL2 syntax in a clean way.

maxulysse commented 4 years ago

I managed to find a solution (not as clean as I would have hoped). https://github.com/nf-core/sarek/blob/a7679b9b5c178351b1e96a3ffe7ee81ddf9aad06/main.nf#L226

Which I later use in a clean manner in a process: https://github.com/nf-core/sarek/blob/dsl2/modules/nf-core/software/qualimap_bamqc.nf

drpatelh commented 4 years ago

Yep, this would be really nice. Using NO_FILE as suggested here doesn't work for optional inputs on AWS as @apeltzer found.

Another solution is to have a dummy file in the pipeline repo that you can stage if the actual file isn't required in the process e.g. initiated here and used here.

This also means you won't have to write anything to the results directory as suggested by @MaxUlysse.

illusional commented 3 years ago

It looks like there are a couple of common workarounds:

But maybe also recognising a few common patterns of arguments which tools may require to better wrap a "tool interface":

Just nudging @rsuchecki and @pditommaso to see if you guys have any thoughts.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

maxulysse commented 3 years ago

I would like to see such a feature. @drpatelh did you find any hack to make it work?

bioinfomagician commented 3 years ago

I haven't explored some of the workarounds listed above, but I also agree that implementing some form of optional input syntax for DSL2 would be very useful.

drpatelh commented 3 years ago

I haven't, I'm afraid. I have resorted to staging "dummy" files to bypass this. See discussion here. Maybe there is a better solution.

mjhipp commented 3 years ago

Not ideal, but another workaround to use an optional input without having to stage a dummy file is to pass an empty list as the input path.

This script worked on AWS Batch:

nextflow.enable.dsl=2

process CAT_FILES {
  input:
    path files_to_cat // list of paths
    path optional // optional file

  output:
    path 'out.txt'

  script:
    def args = ['cat']
    files_to_cat.each { args.add(it) }
    if (optional) args.add(optional[0]) // or optional.each { args.add(it) }
    args.add("> out.txt")
    args.join(' ')
}

workflow {
  CAT_FILES(['file1.txt', 'file2.txt'], [])
}

An optional path is just a list of paths with size 1 or 0.

CharlotteAnne commented 3 years ago

Wanting to bump this - having clear syntax for optional inputs would be really helpful.

pditommaso commented 2 years ago

Bump

pditommaso commented 2 years ago

Related https://github.com/nextflow-io/nextflow/pull/2710

CharlotteAnne commented 1 year ago

Coming back to bump again ;)

DariiaVyshenska commented 5 months ago

I just encountered this in the kallisto quant module and had to change the module's main.nf (which I'd rather avoid). Totally support this issue!

dombraccia commented 3 months ago

Encountering the same issue when trying to use the nf-core/spaceranger_count module. I am modifying the main.nf to get it to work, which seems to defeat the purpose of having nf-core!

brandenjlynch commented 3 months ago

Also seeing this issue with the Sarek pipeline from nf-core, in which a missing (technically optional) input prevents execution of part of the workflow. See https://github.com/nf-core/sarek/issues/1546

vinjana commented 3 weeks ago

I would also like to see this feature. Until now I used the NO_FILE approach, but that breaks down if there are multiple optional files:

input file name collision -- There are multiple input files for each of the following file names: NO_FILE

Of course, multiple differently named NO_FILE files could be used (which makes the code more complicated).
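
A rough sketch of that variant, assuming DSL2 is enabled and assuming two empty placeholder files are committed under the pipeline's assets directory (along the lines of the dummy-file approach mentioned earlier in the thread). The process name, the placeholder names `NO_FILE_A` / `NO_FILE_B`, `params.reads`, and `some_tool` with its flags are all hypothetical.

```groovy
// Sketch only: ANNOTATE, NO_FILE_A / NO_FILE_B and `some_tool` are
// hypothetical names used for illustration.
process ANNOTATE {
    input:
    path reads
    path opt_a   // a real file, or the NO_FILE_A placeholder
    path opt_b   // a real file, or the NO_FILE_B placeholder

    output:
    path 'out.txt'

    script:
    // Emit each flag only when something other than its placeholder was staged.
    def arg_a = opt_a.name != 'NO_FILE_A' ? "--first ${opt_a}" : ''
    def arg_b = opt_b.name != 'NO_FILE_B' ? "--second ${opt_b}" : ''
    """
    some_tool ${arg_a} ${arg_b} ${reads} > out.txt
    """
}

workflow {
    // The placeholders need distinct file names so that staging them into
    // the same task directory does not trigger the collision error above.
    ANNOTATE(
        Channel.fromPath(params.reads),
        file("${projectDir}/assets/NO_FILE_A"),
        file("${projectDir}/assets/NO_FILE_B")
    )
}
```

The cost is the one already noted: every optional input needs its own placeholder name, and the process has to know that name.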

bentsherman commented 3 weeks ago

Worth giving an update here. We were working on optional paths a while ago as part of the path arity option, but ultimately we didn't move forward with it because it revealed some tricky edge cases. We need to make some deeper changes to the process inputs/outputs handling in order to support optional inputs/outputs properly.

Until then you'll have to use some hack. My recommendation is to always create a file, but make it empty, or otherwise mark it in a way to make it clear that it is "null" without it actually being null.