NextFlowPlatform 2.0

I created a repository, viash_nxf_poc, to work out a proof of concept (POC) for a rewrite of NextFlowPlatform. I'm proposing this rewrite to fix some of my annoyances with the current way of working, to reduce code complexity, and to add more checks in order to avoid bugs (which currently occur quite regularly).

Channel Interface

A Viash+Nextflow module generated by Viash has the interface:

Input channel:  [id, inputs, ...passthrough...]
Output channel: [id, outputs, ...passthrough...]

These fields are defined as follows:

id (String) is a unique identifier for the event in the Channel.

inputs (Map[String, Object] or File) is a named map containing the component's input parameters. Examples of the class types associated with different Viash component arguments:

config.vsh.yaml	Input map in Nextflow	Class
`{ name: foo, type: string, direction: input }`	`[ foo: "bar" ]`	`String`
`{ name: int, type: integer, direction: input, multiple: true }`	`[ int: [ 1, 2, 3 ] ]`	`List[Integer]`
`{ name: bool, type: boolean, direction: input, required: false }`	`[ bool: null ]`	`Boolean` or `null`
`{ name: bool, type: boolean, direction: input, required: false }`	`[ bool: true ]`	`Boolean` or `null`
`{ name: in, type: file, direction: input }`	`[ in: file("in.h5ad") ]`	`File`
`{ name: out, type: file, direction: output }`	`[ out: "proposed_path.h5ad" ]`	`String`
`{ name: out, type: file, direction: output, multiple: true }`	`[ out: "proposed_path_*.h5ad" ]`	`String`

If you only want to specify a single input file, you can simply pass a File instead of a Map[String, Object].

...passthrough... (Object*) are objects that are simply passed through to the output. This is a practical way to read a set of parameters at the start of a workflow and move them into the inputs slot whenever they need to be consumed. An event in the channel can therefore have length N, where N >= 2.

outputs (Map[String, File] or File) is a named map containing the component's output files. If the component outputs only a single File, the outputs will be a File rather than a named map.
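
The tuple shape described above can be checked mechanically. The following is a minimal sketch in plain Groovy, not part of Viash: `checkEvent` is a hypothetical helper, and it assumes that files arrive as `java.nio.file.Path` (which is what Nextflow's `file()` returns) or `java.io.File`.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper (not part of Viash): checks that an event has the
// shape [id, data, ...passthrough...] described above.
def checkEvent(List event) {
  assert event.size() >= 2 : "an event needs at least an id and a data element"
  def (id, data) = event
  assert id instanceof CharSequence : "id must be a string"
  assert data instanceof Map || data instanceof Path || data instanceof File : "data must be a named map or a single file"
  if (data instanceof Map) {
    assert data.keySet().every { it instanceof CharSequence } : "map keys must be strings"
  }
  return event
}

// Valid: id, input map, and one passthrough element.
checkEvent(["foo", [input: Paths.get("in.h5ad"), k: 10], "extra"])
// Valid: a single file instead of a map.
checkEvent(["foo", Paths.get("in.h5ad")])
```

A check like this could run at the start of every module, so that a malformed tuple fails with a readable message instead of somewhere deep inside the process.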

Module usage

Given a Viash component named poc ( src/poc/config.vsh.yaml ), importing the module yields a Nextflow Workflow which can be used as follows:

nextflow.enable.dsl=2

include { poc } from "./target/nextflow/poc/main.nf" params(params)

workflow {
  Channel.value( [
    "foo", 
    [
      input_one: file("data/pbmc_1k_protein_v3.normalize.output_rna.h5ad"),
      input_multi: file("data/*.h5ad"),
      string: "foo",
      integer: 10
    ]
  ])
  | poc
}

Viash+Nextflow modules are flexible

The strength of the new Viash+Nextflow modules lies in how flexibly they can be used. Arguments, directives and helper options can all be set at the point where the module is called:

nextflow.enable.dsl=2

include { poc } from "./target/nextflow/poc/main.nf"

workflow {
  Channel.value( [
    "foo", 
    [
      input_one: file("data/pbmc_1k_protein_v3.normalize.output_rna.h5ad"),
      input_multi: file("data/*.h5ad")
    ]
  ])
  | poc.run(
    args: [ string: "foo", integer: 10 ],
    directives: [
      cache: "lenient",
      label: [ "bigmem", "bigcpu" ]
    ],
    auto: [
      simplifyInput: true,
      simplifyOutput: true,
      publish: false,
      transcript: false
    ]
  )
}

directives: One-to-one mapping with the Nextflow process directives. NOTE: You can pass closures, but they need to be quoted (see the saveAs example below). Examples:
- container: "bash:4.2"
- label: ["bigmem", "bigcpu"]
- publishDir: [ path: "output/", mode: "copy", saveAs: "{ 'prefix_' + it }" ] (← saveAs is a quoted closure)
auto: Helper arguments provided by Viash.
- simplifyInput: If true, an input tuple containing only a single File (e.g. ["foo", file("in.h5ad")]) is automatically transformed to a map (i.e. ["foo", [ input: file("in.h5ad") ] ])
- simplifyOutput: If true, an output tuple containing a map with a single File (e.g. ["foo", [ output: file("out.h5ad") ] ]) is automatically transformed to a File (i.e. ["foo", file("out.h5ad")])
- publish: If true, the module's outputs are automatically published to params.publishDir. Will throw an error if params.publishDir is not defined.
- transcript: If true, the module's transcripts are automatically published to params.transcriptDir. If that is not defined, params.publishDir + "/_transcripts" will be used. Will throw an error if neither is defined.
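
To make the simplifyInput and simplifyOutput behaviour concrete, here is a rough sketch in plain Groovy. It is not the Viash implementation: the function names are hypothetical, the key `input` is taken from the example above (the real module may derive it from the component config), and unwrapping only single-entry maps is an assumption.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Assumption: a bare file is wrapped under the key "input", as in the
// simplifyInput example above.
def simplifyInput(List tuple) {
  def data = tuple[1]
  def isFile = data instanceof Path || data instanceof File
  return isFile ? [tuple[0], [input: data]] + tuple.drop(2) : tuple
}

// Assumption: a map is unwrapped only when it holds exactly one entry.
def simplifyOutput(List tuple) {
  def data = tuple[1]
  def isSingleton = data instanceof Map && data.size() == 1
  return isSingleton ? [tuple[0], data.values().first()] + tuple.drop(2) : tuple
}

def inFile = Paths.get("in.h5ad")
def outFile = Paths.get("out.h5ad")

// A bare file becomes a named map; passthrough elements are kept.
assert simplifyInput(["foo", inFile, "pt"]) == ["foo", [input: inFile], "pt"]
// A single-entry output map becomes a bare file.
assert simplifyOutput(["foo", [output: outFile]]) == ["foo", outFile]
```

Together these two transformations are what allow single-file modules to be chained with a plain pipe, since the File produced by one step is a valid input for the next.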

Chaining multiple modules

If each module only has one input file and output file:

nextflow.enable.dsl=2

include { poc1 } from "./target/nextflow/poc1/main.nf"
include { poc2 } from "./target/nextflow/poc2/main.nf"
include { poc3 } from "./target/nextflow/poc3/main.nf"

workflow {
  Channel.value( [ "foo", file("data/pbmc_1k_protein_v3.normalize.output_rna.h5ad") ] )
  | poc1
  | poc2
  | poc3
}

If the modules have multiple input / output files per step:

nextflow.enable.dsl=2

include { poc1 } from "./target/nextflow/poc1/main.nf"
include { poc2 } from "./target/nextflow/poc2/main.nf"
include { poc3 } from "./target/nextflow/poc3/main.nf"

workflow {
  Channel.value( [
    "foo", 
    [
      input_one: file("data/pbmc_1k_protein_v3.normalize.output_rna.h5ad"),
      input_multi: file("data/*.h5ad")
    ]
  ])
  | poc1.run(
    args: [ string: "foo", integer: 10 ]
  )
  | poc2.run(
    renameKeys: ["input_one": "output_one", "input_multi": "output_multi"]
  )
  | poc3.run(
    mapData: { [input_one: it.output_one, input_multi: it.output_multi ] }
  )
}

map: Apply a map over the incoming tuple. Example: { tup -> [ tup[0], [input: tup[1].output], tup[2] ] }.
mapId: Apply a map over the ID element of a tuple (i.e. the first element). Example: { id -> id + "_foo" }
mapData: Apply a map over the data element of a tuple (i.e. the second element). Example: { data -> [ input: data.output ] }
mapPassthrough: Apply a map over the passthrough elements of a tuple (i.e. the tuple excl. the first two elements). Example: { pt -> pt.drop(1) }
renameKeys: Rename keys in the data field of the tuple (i.e. the second element). Example: [ "new_key": "old_key" ]
debug: Whether or not to print debug messages. Example: true
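
The effect of these arguments on a single tuple can be sketched in plain Groovy as follows. This is an illustration only: `renameKeys` and `applyArgs` are hypothetical helpers, and the order in which the arguments are applied (mapId, mapData, mapPassthrough, then renameKeys) is an assumption that the text above does not specify.

```groovy
// Hypothetical sketch of the argument semantics; not the Viash implementation.
// Renames follow the convention above: [ "new_key": "old_key" ].
def renameKeys(Map data, Map renames) {
  def result = new LinkedHashMap(data)
  renames.each { newKey, oldKey ->
    if (result.containsKey(oldKey)) {
      result[newKey] = result.remove(oldKey)
    }
  }
  return result
}

// Assumption: arguments are applied in this fixed order.
def applyArgs(List tuple, Map args) {
  def id = tuple[0]
  def data = tuple[1]
  def passthrough = tuple.drop(2)
  if (args.mapId) id = args.mapId.call(id)
  if (args.mapData) data = args.mapData.call(data)
  if (args.mapPassthrough) passthrough = args.mapPassthrough.call(passthrough)
  if (args.renameKeys) data = renameKeys(data, args.renameKeys)
  return [id, data] + passthrough
}

def out = applyArgs(
  ["foo", [output_one: "a.h5ad", output_multi: "b.h5ad"], "pt"],
  [
    mapId: { id -> id + "_foo" },
    renameKeys: [input_one: "output_one", input_multi: "output_multi"]
  ]
)
assert out == ["foo_foo", [input_one: "a.h5ad", input_multi: "b.h5ad"], "pt"]
```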

Reuse same module

You can run the same component multiple times. Each use of the module must be given a unique key, so that Nextflow can tell the invocations apart.

nextflow.enable.dsl=2

include { poc } from "./target/nextflow/poc/main.nf"

workflow {
  Channel.value( [ "foo", file("data/pbmc_1k_protein_v3.normalize.output_rna.h5ad") ] )
  | poc.run(key: "step1")
  | poc.run(key: "step2")
  | poc.run(key: "step3")
}

The sections below go over the Nextflow directives to determine how each of them should be managed.

Layout

This format in Nextflow DSL:

process foo_process {
  <nextflow dsl>
}

is equivalent to the following in the viash config:

platforms:
  - type: nextflow
    directives:
      <viash config>

and is also equivalent to the following in viash + nextflow DSL:

foo_process(
  directives: [
    <viash + nextflow dsl>
  ]
)

Note: Should closures in the viash+nxf dsl be interpreted? E.g. directives: [ "cache": { ... }, "label": "foo" ]?

Order of importance

The order in which directives get resolved (in order of decreasing priority):

values defined in the function call (i.e. foo_process(directives: ...))
values defined in the viash config (i.e. - { type: nextflow, directives: ... })
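
The priority order above amounts to a map merge in which call-site values win. Here is a minimal sketch in plain Groovy, where `resolveDirectives` is a hypothetical helper and a shallow, per-directive merge is assumed:

```groovy
// Hypothetical helper: with Groovy's map addition the right-hand map wins,
// so values from the function call override values from the viash config.
def resolveDirectives(Map configDirectives, Map callDirectives) {
  return configDirectives + callDirectives
}

def fromConfig = [cache: "deep", cpus: 8]
def fromCall = [cache: "lenient", label: ["bigmem", "bigcpu"]]

assert resolveDirectives(fromConfig, fromCall) ==
  [cache: "lenient", cpus: 8, label: ["bigmem", "bigcpu"]]
```

Because the merge is shallow, a directive that is itself a map (such as publishDir) would be replaced as a whole rather than merged key by key. Whether that is the desired behaviour is left open here.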

accelerator

type	code
Nextflow DSL	`accelerator 4, type: 'nvidia-tesla-k80'`
Viash config	`accelerator: "4, type: 'nvidia-tesla-k80'"`
Viash + Nextflow DSL	`"accelerator": "4, type: 'nvidia-tesla-k80'"`

afterScript

type	code
Nextflow DSL	`afterScript "source /foo/bar/script"`
Viash config	`afterScript: "source /foo/bar/script"`
Viash + Nextflow DSL	`"afterScript": "source /foo/bar/script"`

beforeScript

type	code
Nextflow DSL	`beforeScript "source /foo/bar/script"`
Viash config	`beforeScript: "source /foo/bar/script"`
Viash + Nextflow DSL	`"beforeScript": "source /foo/bar/script"`

cache

type	code
Nextflow DSL	`cache false`
Viash config	`cache: false`
Viash + Nextflow DSL	`"cache": false`
--	--
Nextflow DSL	`cache "deep"`
Viash config	`cache: deep`
Viash + Nextflow DSL	`"cache": "deep"`

Possible values: false / true / "deep" / "lenient"

Note that Viash might need to convert yaml booleans into strings during parsing.

cpus

type	code
Nextflow DSL	`cpus 8`
Viash config	`cpus: 8`
Viash + Nextflow DSL	`"cpus": 8`

clusterOptions

type	code
Nextflow DSL	`clusterOptions "xxxx"`
Viash config	`clusterOptions: xxxx`
Viash + Nextflow DSL	`"clusterOptions": "xxxx"`

disk

type	code
Nextflow DSL	`disk '2 GB'`
Viash config	`disk: "2 GB"`
Viash + Nextflow DSL	`"disk": "2 GB"`

Must match <decimal> [KMGT]?B
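
A check for this format could look like the sketch below. The helper name `isValidDisk` is hypothetical, and two points are assumptions rather than something stated above: `<decimal>` is read as digits with an optional fractional part, and the whitespace between number and unit is treated as optional.

```groovy
// Assumption: "<decimal>" means digits with an optional fractional part,
// and whitespace between the number and the unit is optional.
def isValidDisk(String value) {
  return value ==~ /\d+(\.\d+)?\s*[KMGT]?B/
}

assert isValidDisk("2 GB")
assert isValidDisk("512MB")
assert isValidDisk("100 B")
assert !isValidDisk("2 gigabytes")
assert !isValidDisk("GB")
```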

echo

type	code
Nextflow DSL	`echo true`
Viash config	`echo: true`
Viash + Nextflow DSL	`"echo": true`

errorStrategy

type	code
Nextflow DSL	`errorStrategy "terminate"`
Viash config	`errorStrategy: terminate`
Viash + Nextflow DSL	`"errorStrategy": "terminate"`

Possible values are 'terminate', 'finish', 'ignore', 'retry'
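
Directives with a closed set of values, such as cache and errorStrategy, lend themselves to an early check. The sketch below is one possible shape for it in plain Groovy: `validateDirectives` is a hypothetical helper, the allowed values are copied from the cache and errorStrategy sections above, and directives without a listed value set are simply passed over.

```groovy
// Hypothetical validator; returns a list of error messages (empty if valid).
def validateDirectives(Map directives) {
  // Allowed values copied from the cache and errorStrategy sections above.
  def allowed = [
    cache: [false, true, "deep", "lenient"],
    errorStrategy: ["terminate", "finish", "ignore", "retry"]
  ]
  def errors = []
  directives.each { name, value ->
    if (allowed.containsKey(name) && !(value in allowed[name])) {
      errors << "${name}: unsupported value '${value}'".toString()
    }
  }
  return errors
}

assert validateDirectives([cache: "lenient", errorStrategy: "retry", cpus: 8]) == []
assert validateDirectives([errorStrategy: "explode"]) ==
  ["errorStrategy: unsupported value 'explode'"]
```

Returning a list of messages instead of throwing on the first problem lets all invalid directives be reported in one go.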

viash-io / viash

NextFlowPlatform 2.0 #82