marcodelapierre / toy-shpc-nf

Toy RNAseq pipeline, to test interplay of Nextflow with SHPC container modules
Mozilla Public License 2.0

Formerly called marcodelapierre/demo-shpc-nf.

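The pipeline requires Nextflow to run, plus some additional tools that depend on the profile:

The Singularity example requires Singularity.
The SHPC example requires Singularity, SHPC, and a module system (Environment Modules or Lmod).
The Conda example requires only Miniconda.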

Credits: the backbone of the pipeline (main.nf) and sample input data come from nextflow-io/rnaseq-nf.

1. Setup

Install Modules (if needed)

Most clusters will already have a module system installed, but if needed, here are basic steps to install environment modules. For this tutorial Environment Modules were installed on a development server using these steps and sudo powers.

curl -LJO https://github.com/cea-hpc/modules/releases/download/v4.7.1/modules-4.7.1.tar.gz
tar xfz modules-4.7.1.tar.gz
cd modules-4.7.1
./configure
make
sudo make install

To ensure they are available at shell startup:

sudo ln -s /usr/local/Modules/init/profile.sh /etc/profile.d/modules.sh
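
To confirm that the link took effect, a new login shell should define the `module` command. The following is a minimal sketch, assuming the default install prefix `/usr/local/Modules` used above; the helper name `have_cmd` is hypothetical and not part of Environment Modules.

```shell
# Hypothetical helper: succeed if a command or shell function with this name exists.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd module; then
  echo "module command is available"
else
  # Assumption: modules were installed under the default prefix shown above.
  echo "module command not found; try: . /usr/local/Modules/init/profile.sh"
fi
```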

The singularity-hpc repository has Dockerfiles and GitHub workflows that show additional examples.

Install Singularity (if needed)

You can install Singularity in many different ways! I usually build from source. Note that a full installation requires sudo powers.

$ git clone https://github.com/sylabs/singularity
$ cd singularity
$ ./mconfig
$ make -C builddir
$ sudo make -C builddir install

Install Singularity-HPC (SHPC)

SHPC can be easily installed with a clone, provided you have Python and Pip available:

$ git clone https://github.com/singularityhub/singularity-hpc
$ cd singularity-hpc
$ pip install -e .

Make sure to configure it for your module system (the default is Lmod; the command below switches to Environment Modules):

$ shpc config set module_sys:tcl

Install Nextflow

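Next (har har), you'll want to install Nextflow, which needs a Java runtime. The command below downloads the nextflow launcher into the current directory; move it to a directory on your PATH so that the commands in the following sections can find it.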

$ curl -s https://get.nextflow.io | bash

2. Clone workflow

And now, you need the workflow!

$ git clone https://github.com/marcodelapierre/toy-shpc-nf
$ cd toy-shpc-nf

3. Run with Singularity

First, let's test running the workflow with Singularity alone, which is probably our best bet since it needs only two dependencies: Nextflow and Singularity. Make sure singularity is in the PATH and that any variables your site requires, such as SINGULARITY_BINDPATH, are set. Then run:

$ nextflow run main.nf -profile singularity
Done! Open the following report in your browser --> results/multiqc_report.html

Completed at: 05-Feb-2022 15:13:48
Duration    : 8m 51s
CPU hours   : (a few seconds)
Succeeded   : 4
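
The prerequisites for this profile can be checked before launching. The sketch below is only an illustration: the function name `check_singularity_env` is hypothetical, and whether SINGULARITY_BINDPATH is needed at all depends on your site, so it is reported as a note rather than an error.

```shell
# Hypothetical helper: report tools missing from PATH for the singularity profile.
# Returns 0 if both tools are found, 1 otherwise.
check_singularity_env() {
  missing=0
  for tool in nextflow singularity; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing from PATH: $tool"
      missing=1
    fi
  done
  # Assumption: an unset bind path is only worth a note, not a failure.
  if [ -z "${SINGULARITY_BINDPATH:-}" ]; then
    echo "note: SINGULARITY_BINDPATH is not set"
  fi
  return "$missing"
}

check_singularity_env || echo "fix the items above before running the workflow"
```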

4. Run with Singularity-HPC (SHPC)

So why would you want to use shpc if it's one more dependency? Because shpc installs the containers as modules, users on a shared HPC module system can reuse the same containers across workflows. If it's just you, you get the same benefit. First, as a one-off step, install the container modules the workflow needs:

shpc install quay.io/biocontainers/salmon:1.6.0--h84f40af_0
shpc install quay.io/biocontainers/fastqc:0.11.9--0
shpc install quay.io/biocontainers/multiqc:1.11--pyhdfd78af_0

Since you just pulled these containers with singularity, they should be cached and the install should be quick! You should still have singularity in the PATH, with all required variables, such as SINGULARITY_BINDPATH. Finally, tell Environment Modules about your module directory.

$ module use ~/Documents/Code/singularity-hpc/modules

You should be able to see the modules available!

$ module avail
----------------------------------- /home/vanessa/Documents/Code/singularity-hpc/modules -----------------------------------
quay.io/biocontainers/fastqc/0.11.9--0/module.tcl            quay.io/biocontainers/salmon/1.6.0--h84f40af_0/module.tcl  
quay.io/biocontainers/multiqc/1.11--pyhdfd78af_0/module.tcl  

---------------------------------------------- /usr/local/Modules/modulefiles ----------------------------------------------
dot  module-git  module-info  modules  null  use.own  

Finally, let's run the workflow!

$ nextflow run main.nf -profile shpc

5. Run with Conda

Here is how to run the same workflow with Conda! You'll need to have the conda executable on your PATH, and to add the correct channels:

conda config --add channels cctbx202112
conda config --add channels conda-forge

Then:

nextflow run main.nf -profile conda

Expected output

The run takes under a minute to complete (excluding the one-off container download time).
On successful completion of the four tasks, the following message is displayed:

Done! Open the following report in your browser --> results/multiqc_report.html
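
In a script, the same outcome can be checked by looking for the report file named in that message. This is a minimal sketch; the helper name `report_ready` is hypothetical, and it only tests that the file exists and is not empty, not that its contents are valid.

```shell
# Hypothetical helper: succeed only if the given file exists and is not empty.
report_ready() {
  [ -s "$1" ]
}

if report_ready results/multiqc_report.html; then
  echo "report is ready"
else
  echo "report missing or empty"
fi
```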

Implementation note

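The key configuration difference between using Singularity, SHPC modules, or Conda is the keyword: process.container, process.module, or process.conda. The package names are almost a copy/paste, with only small caveats in the separators: a container uses ':' between name and version, a module uses '/', and Conda uses '=' between name, version, and build, together with the bioconda:: channel prefix in place of the quay.io/biocontainers/ registry path.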

Here is the relevant snippet from the profiles section of the configuration file nextflow.config:

  singularity {
    process {
      withName: 'index|quant' { container = 'quay.io/biocontainers/salmon:1.6.0--h84f40af_0' }
      withName: 'fastqc'      { container = 'quay.io/biocontainers/fastqc:0.11.9--0' }
      withName: 'multiqc'     { container = 'quay.io/biocontainers/multiqc:1.11--pyhdfd78af_0' }
    }
    singularity.enabled = true
    singularity.autoMounts = true
  }

  shpc {
    process {
      withName: 'index|quant' { module = 'quay.io/biocontainers/salmon/1.6.0--h84f40af_0' }
      withName: 'fastqc'      { module = 'quay.io/biocontainers/fastqc/0.11.9--0' }
      withName: 'multiqc'     { module = 'quay.io/biocontainers/multiqc/1.11--pyhdfd78af_0' }
    }
  }

  conda {
    process {
      withName: 'index|quant' { conda = 'bioconda::salmon=1.6.0=h84f40af_0' }
      withName: 'fastqc'      { conda = 'bioconda::fastqc=0.11.9=0' }
      withName: 'multiqc'     { conda = 'bioconda::multiqc=1.11=pyhdfd78af_0' }
    }
  }
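
The naming pattern across the three profiles can be written down as two small conversions. This is only a sketch: the function names are hypothetical, and it assumes image references of the form `quay.io/biocontainers/<name>:<version>--<build>`, as in the snippet above.

```shell
# Hypothetical helper: container reference -> SHPC module name.
# The tag separator ':' becomes '/'.
container_to_module() {
  printf '%s\n' "$1" | sed 's|:|/|'
}

# Hypothetical helper: container reference -> Conda package spec.
# ':' and '--' both become '=', and the registry path is swapped
# for the bioconda channel prefix.
container_to_conda() {
  printf '%s\n' "$1" | sed -e 's|:|=|' -e 's|--|=|' \
    -e 's|^quay\.io/biocontainers/|bioconda::|'
}

container_to_module 'quay.io/biocontainers/salmon:1.6.0--h84f40af_0'
# prints quay.io/biocontainers/salmon/1.6.0--h84f40af_0
container_to_conda 'quay.io/biocontainers/salmon:1.6.0--h84f40af_0'
# prints bioconda::salmon=1.6.0=h84f40af_0
```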