vntasis / stan-nf

Nextflow pipeline for performing statistical analysis with Stan
GNU General Public License v3.0

Stan-NF

A nextflow pipeline for performing statistical analysis with Stan.

Introduction | Requirements | Pipeline summary | Quickstart | Pipeline parameters | Pipeline input | Pipeline output | Running the pipeline

Introduction

Stan-NF uses CmdStan to draw samples from a posterior distribution.

Stan is a state-of-the-art platform for statistical modeling and high-performance statistical computation. It uses Markov chain Monte Carlo (MCMC) sampling to perform full Bayesian statistical inference. For more information, see Stan's documentation.

CmdStan is the command-line interface to Stan. It takes as input a statistical model written in the Stan probabilistic programming language and compiles it to a C++ executable, which can then be used to draw samples from the posterior. It also offers tools for generating quantities of interest from an existing estimate, as well as for evaluating and summarizing the produced outputs.

Stan-NF uses Nextflow as its execution backend. Nextflow provides scalability and automation, and makes it straightforward to deploy a pipeline in a high-performance computing or cloud environment. See the Nextflow documentation for more information.

The user may provide multiple Stan models and/or datasets. Stan-NF runs separate processes in parallel to compile the models, and then samples from the posterior of each model for every dataset. The number of output files therefore scales with M x D, where M is the number of model files and D is the number of data files provided.
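The M x D fan-out can be illustrated with a short shell sketch. It only enumerates model and dataset pairs; it does not call Stan-NF, and the file names are hypothetical placeholders chosen for the example.

```shell
# Hypothetical input files, used only to illustrate the pairing.
models="models/model_a.stan models/model_b.stan"
datasets="data/sample1.json data/sample2.json data/sample3.json"

# Pair every model with every dataset, as the pipeline does when sampling.
count=0
for model in $models; do
  for data in $datasets; do
    echo "$(basename "$model" .stan) x $(basename "$data" .json)"
    count=$((count + 1))
  done
done

echo "total sampling runs: $count"
```

With M = 2 models and D = 3 datasets, the loop lists 6 combinations, each of which would get its own set of output files.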

Requirements

Pipeline summary

  1. Compile Stan model(s) into executable(s)
  2. Run MCMC in order to sample from the posterior distribution
  3. Summarize the results per sample (and per model)
  4. Calculate basic diagnostic metrics for the MCMC run(s)
  5. Generate quantities of interest from a fitted model as a standalone step
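For orientation, the steps above map roughly onto the following CmdStan command-line calls. This is a dry-run sketch that only prints the commands; this document does not show how Stan-NF invokes CmdStan internally, so the exact arguments, the install location, and all file names here are assumptions.

```shell
# Dry run: print the CmdStan commands that correspond to each step, without executing them.
CMDSTAN_HOME="${CMDSTAN_HOME:-/opt/cmdstan}"   # hypothetical install location
model="$PWD/models/model"                      # hypothetical model path, without the .stan suffix
data="data/sample1.json"                       # hypothetical data file

plan_steps() {
  # 1. Compile the model into an executable (CmdStan builds models through its makefile).
  echo "make -C $CMDSTAN_HOME $model"
  # 2. Run MCMC sampling for one chain.
  echo "$model sample num_samples=1000 num_warmup=1000 data file=$data output file=samples_1.csv random seed=135"
  # 3. Summarize the draws.
  echo "$CMDSTAN_HOME/bin/stansummary samples_1.csv"
  # 4. Compute basic diagnostics for the run.
  echo "$CMDSTAN_HOME/bin/diagnose samples_1.csv"
  # 5. Generate quantities of interest from existing draws.
  echo "$model generate_quantities fitted_params=samples_1.csv data file=$data output file=genquan_1.csv"
}

plan_steps
```

Stan-NF wraps calls of this kind in Nextflow processes, so in normal use none of them needs to be typed by hand.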

Quickstart

  1. Install Nextflow by using the following command:

    curl -s https://get.nextflow.io | bash
  2. Fetch the pipeline and print help information about it:

    ./nextflow run vntasis/stan-nf --help

Pipeline parameters

General

The following parameters are required for every run of the pipeline, but all of them have default values. In most cases, there is no reason to change them.

--data DATA_PATH

--outdir OUTPUT_PATH

--steps STEPS_STR

--model MODEL_PATH

--chains CHAIN_NUMBER

--seed SEED

--cmdStanHome STAN_HOME_PATH

Building a model

--buildModelParams PARAM_STR

Sampling

--numSamples SAMPLES_NUMBER

--numWarmup WARMUP_NUMBER

--sampleParams PARAM_STR

Summarize results

--summaryParams PARAM_STR

Generating quantities

--fittedParams SAMPLES_PATH

--seedToGenQuan

Other

--multithreading

--threads THREAD_NUMBER

--help
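As an alternative to passing many flags on the command line, Nextflow can read pipeline parameters from a YAML or JSON file through its -params-file option. The sketch below writes such a file; the file name and the values are illustrative, and the keys are assumed to match the parameter names listed above without the leading dashes.

```shell
# Write illustrative parameter values to a YAML file (the file name is arbitrary).
cat > stan-nf-params.yml <<'EOF'
chains: 2
seed: 135
model: "models/*.stan"
data: "data/*.json"
numSamples: 2000
EOF

# Launch the pipeline with the file. Note the single dash: -params-file is a
# Nextflow option, not a pipeline parameter.
#   nextflow run vntasis/stan-nf -params-file stan-nf-params.yml
```

Keeping the parameters in a file makes a run easier to repeat and to keep under version control alongside the models and data.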

Pipeline input

In order to sample from a posterior, the user needs to provide:

For standalone generating quantities of interest, the user needs to provide:

Pipeline output

By default, output is saved in a directory with the name results located in the current working directory.

Stan-NF extracts the dataset name from the name of each input JSON file and uses it to create a dataset-specific directory inside the results directory. For instance, if the inputs are sample1.json and sample2.json, the pipeline creates the directories results/sample1 and results/sample2.
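The naming rule can be mimicked in the shell as follows. This is only a sketch of the mapping described above, not the pipeline's own code, and the helper name dataset_dir is hypothetical.

```shell
# Map an input data file to its per-dataset results directory.
# dataset_dir is a hypothetical helper written for this illustration.
dataset_dir() {
  outdir=$1
  data_file=$2
  # Strip the directory part and the .json suffix to get the dataset name.
  printf '%s/%s\n' "$outdir" "$(basename "$data_file" .json)"
}

dataset_dir results data/sample1.json
dataset_dir results data/sample2.json
```

The two calls print results/sample1 and results/sample2, matching the directories named in the example.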

In each of those directories, the following will be saved:

The names of all produced files are based on the name of the input data file and the name of the model file. The names of the sample files and of the generated quantities files also include the chain number.

Running the pipeline

Here is a simple example of running the pipeline:

nextflow run vntasis/stan-nf --chains 2 --seed 135 --model 'models/*.stan' --data 'data/*.json' --numSamples 2000

This would compile every model file inside the models/ directory and use every data file from the data/ directory to sample from the posterior of the models. It would use 2 chains, each one generating 2000 samples. It would also generate summaries of the results and diagnostic reports.
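The glob patterns in the command above are wrapped in single quotes. The likely reason is that the pattern has to reach the pipeline as a single string rather than being expanded by the shell first; that Nextflow then resolves the pattern itself is an assumption here, not something this document states. The sketch below shows only the shell side of that difference, using a scratch directory and a hypothetical helper named count_args.

```shell
demo=$(mktemp -d)                              # scratch directory for the demonstration
mkdir -p "$demo/models"
touch "$demo/models/a.stan" "$demo/models/b.stan"

count_args() { echo "$#"; }                    # prints how many arguments it received

unquoted=$(count_args "$demo"/models/*.stan)   # the shell expands the glob before the command runs
quoted=$(count_args "$demo/models/*.stan")     # the pattern is passed through as one literal string

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

Without quotes the command would receive one argument per matching file, whereas with quotes it receives the single pattern, which is the form the example command passes to --model and --data.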

The first time Stan-NF is used with the default configuration, the run takes longer because the required Docker image has to be downloaded.

In another usage scenario, the user may have already generated samples from the posterior and wish to generate some quantities of interest (e.g. log-likelihood). In that case, the user needs to write a new model file (e.g. 'model_genquan.stan') that includes a generated quantities section with the required code.

First, the user needs to compile the new model:

nextflow run vntasis/stan-nf --steps 'build-model' --model 'models/model_genquan.stan'

Then, the generated quantities files can be produced:

nextflow run vntasis/stan-nf --chains 2 --model 'results/models/model_genquan' --steps 'generate-quantities' --fittedParams 'results/*/samples/*.csv'

This would use the newly compiled model and the generated samples to produce the quantities of interest.

Stan version

The current CmdStan version built inside the Docker image is 2.28.0.