Niflows is an organizational structure aimed at making neuroimaging tools and analyses FAIR (findable, accessible, interoperable, and reusable), with strong assurances of compatibility across environments.
Niflows builds on the lessons of the Nipype ecosystem to support user-contributed Workflows as packages. These workflows can be written in any language: Niflows does not require a package to be written in Python, but it provides additional tooling if it is. Niflows combines a specific structure for data, code, and tests with a comprehensive test suite, which allows better validation of each Workflow and easier reuse of Workflows in containerized form.
Niflows is intended to provide replicable workflows that quantify their variability across datasets and operating environments (i.e., different operating systems, versions of libraries and software).
niflow-manager, which provides the nfm command-line tool, aims to support niflow creation, testing, and packaging. It provides the following sub-commands:

- nfm init - Create a stub workflow, with templates for desired languages
- nfm install - Install niflows from an online registry or source
- nfm build - Package a workflow into containers (including Docker and Singularity)
- nfm test - Comprehensive testing across ranges of environments and dependency versions, uses TestKraken

When the niflow is initialized with nfm init, a language-specific template from the templates is used. At this moment, only Python has full support.
The user has to fill in the template, including spec.yml, the specification file that is required for all workflows. The specification has two main parts: build and test.

The build part is used to create an image when nfm build is run. The main part of the build specification is required_env, which specifies the environment needed to run the workflow. Since Neurodocker is used to create the Dockerfile, required_env uses the same fields as the Neurodocker specification, with one exception: the base part should contain image and pkg-manager in one dictionary. The full specification is given in the Neurodocker documentation. The specification in required_env is also used as an additional environment during testing with nfm test.
build:
  required_env:
    base:
      image: debian:stretch
      pkg-manager: apt
    miniconda:
      conda_install: [python=3.7, nipype]
    fsl:
      version: 5.0.10
    afni:
      version: latest
An optional field, entrypoint, is used to set an entrypoint for the container (niflow-{ORGANIZATION}-{WORKFLOW} is the default value). This allows each Workflow to be used from the shell without any additional programming.
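The two rules above (a single base dictionary holding both image and pkg-manager, and the default entrypoint name) can be sketched in a few lines of Python. This is only an illustration and not code from niflow-manager: the function names check_base and default_entrypoint, and the organization name myorg, are hypothetical, and the specification is written as a plain dictionary instead of being loaded from spec.yml.

```python
def check_base(required_env):
    """Return a list of problems found in the `base` entry (empty if none)."""
    base = required_env.get("base")
    if not isinstance(base, dict):
        return ["`base` must be a single dictionary"]
    # Both keys have to live in the same dictionary.
    return [f"`base` is missing `{key}`"
            for key in ("image", "pkg-manager") if key not in base]


def default_entrypoint(organization, workflow):
    """Build the default entrypoint name, niflow-{ORGANIZATION}-{WORKFLOW}."""
    return f"niflow-{organization}-{workflow}"


# The build part of the example specification, as a plain dictionary.
spec = {
    "build": {
        "required_env": {
            "base": {"image": "debian:stretch", "pkg-manager": "apt"},
            "miniconda": {"conda_install": ["python=3.7", "nipype"]},
        }
    }
}

print(check_base(spec["build"]["required_env"]))  # prints []
print(default_entrypoint("myorg", "sorting"))     # prints niflow-myorg-sorting
```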
The test part is used to test the workflow when nfm test is run, and it follows the TestKraken specification.
Testing can be performed in several computational environments. These environments can be described in env or fixed_env (at least one of the two has to be present in the specification). As in the build part, the environment specification uses components from the Neurodocker specification, with one difference: the base part should contain image and pkg-manager in one dictionary.
env and fixed_env elements

Both env and fixed_env are used to specify multiple environments. In the env part, each Neurodocker key (e.g., base, miniconda, fsl) can be a list, and TestKraken will create all combinations of the listed values. The fixed_env part can provide one additional complete environment specification or a list of complete specifications. The Neurodocker keys must be the same for env and all elements of the fixed_env part.
This is an example of an environment specification that makes use of the env and fixed_env elements:
# List all desired combinations of environment specifications. This
# configuration, for example, will produce four different Docker images:
# 1. ubuntu:16.04 + python=3.5, numpy
# 2. ubuntu:16.04 + python=2.7, numpy
# 3. debian:stretch + python=3.5, numpy
# 4. debian:stretch + python=2.7, numpy
env:
  base:
    - {image: ubuntu:16.04, pkg-manager: apt}
    - {image: debian:stretch, pkg-manager: apt}
  miniconda:
    - {conda_install: [python=3.5, numpy]}
    - {conda_install: [python=2.7, numpy]}
# One or more fixed environments to test. These environments are built as defined
# and are not combined in any way. This configuration, for example, will
# produce one Docker image.
fixed_env:
  base: {image: debian:stretch, pkg-manager: apt}
  miniconda: {conda_install: [python=3.7, numpy]}
An example that uses this concept can be found here.
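To make the combination rule concrete, here is a minimal Python sketch of how an env block could be expanded into individual environments and joined with fixed_env. It is not TestKraken's actual implementation; the function names expand_env and all_environments are hypothetical, and the specification is given as plain dictionaries.

```python
from itertools import product


def expand_env(env):
    """Return every combination of the options listed under each key."""
    keys = list(env)
    # Wrap single dictionaries so that every key maps to a list of options.
    options = [v if isinstance(v, list) else [v] for v in env.values()]
    return [dict(zip(keys, combo)) for combo in product(*options)]


def all_environments(env=None, fixed_env=None):
    """Combine the expanded env with the fixed environments, taken as given."""
    environments = expand_env(env) if env else []
    if fixed_env:
        environments += fixed_env if isinstance(fixed_env, list) else [fixed_env]
    return environments


env = {
    "base": [
        {"image": "ubuntu:16.04", "pkg-manager": "apt"},
        {"image": "debian:stretch", "pkg-manager": "apt"},
    ],
    "miniconda": [
        {"conda_install": ["python=3.5", "numpy"]},
        {"conda_install": ["python=2.7", "numpy"]},
    ],
}
fixed_env = {
    "base": {"image": "debian:stretch", "pkg-manager": "apt"},
    "miniconda": {"conda_install": ["python=3.7", "numpy"]},
}

environments = all_environments(env, fixed_env)
print(len(environments))  # prints 5 (four combinations plus one fixed environment)
```

The four combinations come out in the same order as the numbered comment in the example specification, because the first key (base) varies slowest.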
common and varied parts

To eliminate repetition in the env part, each Neurodocker key can use an additional structure that describes its common and varied parts. The previous example could also look like this:
env:
  base:
    - {image: ubuntu:16.04, pkg-manager: apt}
    - {image: debian:stretch, pkg-manager: apt}
  miniconda:
    common: {pip_install: [numpy]}
    varied:
      - {conda_install: [python=3.5]}
      - {conda_install: [python=2.7]}
An example that uses this concept can be found here.
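One way to read the common and varied structure is that the common dictionary is merged into each varied entry, which yields an ordinary list of options for that key. The sketch below assumes exactly that reading; it is not taken from TestKraken, and the function name expand_common_varied is hypothetical.

```python
def expand_common_varied(entry):
    """Turn a common/varied entry into a plain list of option dictionaries."""
    if isinstance(entry, dict) and "varied" in entry:
        common = entry.get("common", {})
        # Each variant is merged on top of the shared settings.
        return [{**common, **variant} for variant in entry["varied"]]
    # Entries without a varied part are returned as a list, unchanged.
    return entry if isinstance(entry, list) else [entry]


miniconda = {
    "common": {"pip_install": ["numpy"]},
    "varied": [
        {"conda_install": ["python=3.5"]},
        {"conda_install": ["python=2.7"]},
    ],
}

for option in expand_common_varied(miniconda):
    print(option)
```

Each printed option contains both the pip_install list from common and one of the conda_install lists from varied.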
By default, TestKraken looks for all data files and all script files in the root directory of the tested workflow. However, these default locations can be changed.
data element

To specify how to get the data, the data entry has to have two keys: type and location. For now, only one type is implemented, workflow_path, but in the future this might be used to specify external repositories. For type=workflow_path, the location is simply the directory path relative to the workflow path. An example can look like this:
data:
  type: workflow_path
  location: my_data
scripts element

The scripts entry requires only the directory path relative to the workflow path. An example can look like this:
scripts: my_scripts
An example that uses this concept can be found here.
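The lookup rules for the data and scripts entries can be summarized in a short Python sketch. It only illustrates the behavior described above (default to the workflow root, otherwise a path relative to it) and is not TestKraken code; the function name resolve_dirs and the workflow path /path/to/workflow are hypothetical.

```python
from pathlib import Path


def resolve_dirs(workflow_path, spec):
    """Return (data_dir, scripts_dir) for a workflow specification."""
    root = Path(workflow_path)

    data = spec.get("data")
    if data is None:
        data_dir = root  # default: the workflow root
    elif data.get("type") == "workflow_path":
        data_dir = root / data["location"]
    else:
        # Only workflow_path is described as implemented.
        raise ValueError(f"unsupported data type: {data.get('type')!r}")

    scripts_dir = root / spec["scripts"] if "scripts" in spec else root
    return data_dir, scripts_dir


spec = {
    "data": {"type": "workflow_path", "location": "my_data"},
    "scripts": "my_scripts",
}
data_dir, scripts_dir = resolve_dirs("/path/to/workflow", spec)
print(data_dir)
print(scripts_dir)
```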
The analysis element contains all the information required to run the workflow with the analysis. There is one required element, command, and two optional elements, script and inputs. These are assembled as command script input1 input2 ... When the command is a shell or interpreter (e.g., "bash", "python"), the script is needed. However, the command can also be an executable (e.g., "ssh", "bc"), in which case the script option is not required. The inputs part contains all the inputs needed to complete the command that runs the analysis. Each element of the inputs entry should have type, argstr (if a flag is needed), and value, and might have additional metadata that can be used by pydra (a dataflow engine used by TestKraken). If type is File, the file is assumed to be relative to the data directory location. If script is provided, the script file is expected to be in the scripts directory. An example can look like this:
# The analysis part: inputs to the analysis script,
# the command to run the script and the script with the analysis.
analysis:
  inputs:
    - {type: File, argstr: -f, value: list2sort.json}
  command: python
  script: sorting.py
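The assembly rule (command, then script, then inputs) can be illustrated with a small Python sketch. TestKraken itself builds and runs the command through pydra, so the code below is only an approximation of the described ordering; the function name build_command and the directory names my_data and my_scripts are assumptions made for the example.

```python
from pathlib import PurePosixPath


def build_command(analysis, data_dir=".", scripts_dir="."):
    """Assemble the argument list as: command script input1 input2 ..."""
    cmd = [analysis["command"]]
    if "script" in analysis:
        # The script is looked up in the scripts directory.
        cmd.append(str(PurePosixPath(scripts_dir) / analysis["script"]))
    for inp in analysis.get("inputs", []):
        if "argstr" in inp:
            cmd.append(inp["argstr"])  # optional flag, e.g. -f
        value = inp["value"]
        if inp["type"] == "File":
            # File inputs are relative to the data directory.
            value = PurePosixPath(data_dir) / value
        cmd.append(str(value))
    return cmd


analysis = {
    "inputs": [{"type": "File", "argstr": "-f", "value": "list2sort.json"}],
    "command": "python",
    "script": "sorting.py",
}
print(build_command(analysis, data_dir="my_data", scripts_dir="my_scripts"))
# prints ['python', 'my_scripts/sorting.py', '-f', 'my_data/list2sort.json']
```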
The tests part contains all information regarding testing the analysis output. It is assumed that each output file is compared to a reference file with the same name that is available in the data directory. If the tests part is not present or is empty, no tests will be run after the analysis. There can be multiple entries in tests, but each entry has to contain file with the name of the output file, name with the user-defined name of the test, and script with the name of the script that should be used for running the test. The script can be saved in the scripts directory (checked first), or it can be an existing test from the TestKraken testing_functions directory. Any user-provided tests have to follow the same template as the tests from TestKraken and define a command-line interface.
Example:
# Tests to compare the output of the script to reference data.
# The scripts are available under the user defined `script` subdirectory
# or the `testkraken/testing_functions` directory.
tests:
  - {file: list_sorted.json, name: regr1, script: test_obj_eq.py}
  - {file: list_sorted.json, name: regr1a, script: my_test_obj_eq.py}
  - {file: avg_list.json, name: regr2, script: test_obj_eq.py}
An example that uses this concept can be found here.
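The lookup order for test scripts (the user's scripts directory first, then the TestKraken testing_functions directory) and the three required keys of a test entry can be sketched as follows. This is an illustration of the rules described above, not TestKraken's own code; the function names check_test_entry and find_test_script are hypothetical, and both directories are passed in explicitly.

```python
from pathlib import Path

REQUIRED_KEYS = ("file", "name", "script")


def check_test_entry(entry):
    """Return the required keys that are missing from one tests entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


def find_test_script(name, scripts_dir, builtin_dir):
    """Return the path of a test script, preferring the user's scripts directory."""
    for directory in (scripts_dir, builtin_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"test script {name!r} not found")


entry = {"file": "list_sorted.json", "name": "regr1", "script": "test_obj_eq.py"}
print(check_test_entry(entry))  # prints []
```

A script that exists in both places resolves to the user's copy, which matches the "checked first" rule.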