VDB Shotgun Pipeline

Prerequisites

Snakemake: we currently recommend installing version 7.31.1, as later versions may be incompatible with this pipeline.
Apptainer/Singularity: although we provide conda environments in many cases, the only method of execution we support is via containers.
(optional) A Snakemake Profile: this coordinates the execution of jobs on whatever hardware you are using.

Recommendations

Set up a fresh conda environment, install Snakemake version 7.31.1 and Python 3.10.9, and use this environment to run all analyses.
Set the $SNAKEMAKE_PROFILE variable in your .bashrc file to use the vdblab-profile (a private repo for vdblab members) or to point to whichever Snakemake profile you will be using. In the same file, define a $TMPDIR environment variable to set where temporary files are written. On lilac, we recommend pointing this to a location in your /data/ directory.
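
The two recommendations above could be set up roughly as follows. This is a sketch, not the lab's official setup: the environment name snakemake-7.31.1, the conda channels, and both paths are assumptions to be replaced with your own values. The conda commands are left as comments because they download packages.

```shell
# One-time environment setup (hypothetical environment name; run manually):
#   conda create -n snakemake-7.31.1 -c conda-forge -c bioconda \
#       python=3.10.9 snakemake=7.31.1
#   conda activate snakemake-7.31.1

# Lines to add to ~/.bashrc. Both paths are placeholders; an existing
# value is kept if the variable is already set.
export SNAKEMAKE_PROFILE="${SNAKEMAKE_PROFILE:-$HOME/path/to/your/profile}"
export TMPDIR="${TMPDIR:-$HOME/tmp}"

# Make sure the temporary directory exists before any job tries to use it.
mkdir -p "$TMPDIR"
```

On lilac, the TMPDIR placeholder would be swapped for a location under your /data/ directory.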

Important Notes:

Set the $SNAKEMAKE_PROFILE environment variable to the location of your profile (e.g. export SNAKEMAKE_PROFILE=/path/to/your/profile/). We recommend adding this to the .bashrc file in your home directory so that the variable is set at startup.
The examples include the --dry-run flag so that you can preview the rules to be executed. Remove this flag to execute the commands.
All database paths are configured in config/config.yaml. Change the paths to reflect where the databases can be found on your machine. For a uniform way to fetch and build all the databases, see https://github.com/vdblab/resources
If running the analysis on SRA files, set dedup_platform=SRA in the --config section of your command. If this option is not set, the pipeline will hang indefinitely at the dedup stage.
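
As an illustration of the SRA note, the command can be assembled and printed for review before running it. This sketch assumes the config key is spelled dedup_platform, as in the Preprocessing usage; the accession SRR0000000, the file paths, and the tmpsra/ output directory are hypothetical placeholders.

```shell
# Placeholder accession and paths; replace with your own SRA-derived FASTQs.
sample="SRR0000000"
r1="/path/to/${sample}_1.fastq.gz"
r2="/path/to/${sample}_2.fastq.gz"

# Build the command as a string so it can be inspected before it is run.
cmd="snakemake --directory tmpsra/ --config sample=${sample} R1=[${r1}] R2=[${r2}] nshards=4 dedup_platform=SRA stage=preprocess --dry-run"

echo "$cmd"
```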

Simulating test data:

snakemake --snakefile .test/Snakefile --directory .test/simulated/

Main Pipeline

Usage

snakemake \
  --directory tmpout/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    stage=all \
  --dry-run

Outputs

MultiQC-ready reports
Microbe relative abundances (MetaPhlAn3, Kraken2)
Metabolic pathway relative abundances (HUMAnN3)
Metagenome assemblies (MetaSPAdes)
AMR profiles with Abricate and RGI
MAGs with MetaWRAP (Metabat2, CONCOCT, Maxbin2)
Gene prediction and annotation (MetaErg)
Secondary metabolite gene clusters (antiSMASH)
Antimicrobial resistance and virulence genes (ABRicate, AMRFinderPlus)
Carbohydrate active enzyme (CAZyme) annotation (dbCAN3)

Workflow

The rule DAG for a single sample looks like this:

Different modules of the workflow can be run independently using the stage config entry.
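
To show how the stage entry selects a module, the sketch below prints one dry-run command per stage. The stage names are taken from the usage examples in this README; the read paths are placeholders, and stage-specific options such as nshards, dedup_platform, or assembly are left out for brevity.

```shell
# Stage names as they appear in the usage examples of this README.
stages="preprocess biobakery kraken"
sample=473
r1=/path/to/sample_R1.fastq.gz   # placeholder
r2=/path/to/sample_R2.fastq.gz   # placeholder

# Print one dry-run command per stage, each with its own working directory.
print_stage_commands() {
  for stage in $stages; do
    echo "snakemake --directory tmp${stage}/ --config sample=${sample} R1=[${r1}] R2=[${r2}] stage=${stage} --dry-run"
  done
}

print_stage_commands
```

Printing rather than executing keeps the sketch safe to run anywhere; each printed line can be pasted into a shell once the paths are real.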

MultiQC

MultiQC can be run directly on a directory of reports; there is no need to use Snakemake.


cp -r tmppre/reports tmpreports
cp tmpassembly/quast/quast_473/report.tsv ./tmpreports/
ver="v1.12"
docker run -v "$PWD:$PWD" -w "$PWD" ewels/multiqc:${ver} multiqc \
    --config vdb_shotgun/multiqc_config.yaml --force \
    --title "a multiqc report for some test data" \
    -b "generated by ${ver}" --filename multiqc_report.html \
    tmpreports/ --interactive

Preprocessing

snakemake \
  --directory tmppreprocess/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    dedup_platform=NovaSeq \
    stage=preprocess \
  --dry-run

Tools used

BBTools (site | paper)
SeqKit (site | paper)
Bowtie2 (site | paper)
Snap (site | paper)
SortMeRNA (site | paper)
FastQC (site)
2-step host removal described here, extended to use both human and mouse genomes

Biobakery

snakemake \
  --directory tmpbiobakery/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=biobakery \
  --dry-run

Tools used

MetaPhlAn3 (site | paper)
HUMAnN3 (site | paper)

Kraken2/Bracken

snakemake \
  --directory tmpkraken/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    dedup_platform=NovaSeq \
    stage=kraken \
  --dry-run

Tools used

Kraken2 (site | paper)

Assembly

snakemake \
  --directory tmpassembly/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=assembly \
  --dry-run

Tools used

MetaSPAdes (site | paper)
MetaQUAST (site | paper)

Annotation

snakemake \
  --directory tmpannotate/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=annotate \
  --dry-run
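
Because the annotate stage reads the contigs named in the assembly config entry, it can help to confirm that the file exists before launching. The guard below is a sketch; check_assembly is a hypothetical helper, not part of the pipeline, and it assumes an uncompressed FASTA file.

```shell
# Path used in the example above; adjust to your own assembly output.
assembly="tmpassembly/473.contigs.fasta"

# Succeeds only if the file exists, is non-empty, and starts with a FASTA header.
check_assembly() {
  [ -s "$1" ] && [ "$(head -c 1 "$1")" = ">" ]
}

if check_assembly "$assembly"; then
  echo "assembly found: $assembly"
else
  echo "assembly missing or not FASTA: $assembly" >&2
fi
```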

Tools used

MetaErg (site | paper)
antiSMASH (site | paper)
ABRicate (site)
AMRFinderPlus (site | paper)
dbCAN (site | paper)

Binning

snakemake \
  --directory tmpbinning/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=binning \
  --dry-run

RGI

snakemake \
  --directory tmprgi/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=rgi \
  --dry-run

Tools used

RGI (site | paper)

Strainphlan Pipeline

This pipeline runs StrainPhlAn for each specified species. StrainPhlAn requires two inputs: sample-level marker pickle files, and strain-level markers extracted from the main database. These are stored in a central subdirectory of the MetaPhlAn database directory to aid re-running. If you provide the .sam.bz2 file for a sample that has already been processed into a .pkl file, the pregenerated result is used.

This workflow accepts as input a list of samples' MetaPhlAn .sam.bz2 alignment files and a list of species of interest. The config argument strainphlan_markers_dir serves as a central place for storing both the species-level and the sample-level marker files; these are specific to a version of the MetaPhlAn database, so we recommend placing that directory within the MetaPhlAn database directory.

Usage

snakemake \
  --snakefile workflow/strainphlan.smk \
  --directory tmpstrain/ \
  --config \
    sams=[path/to/sample1.sam.bz2,path/to/sample2.sam.bz2] \
    strainphlan_markers_dir=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs/ \
    metaphlan_db=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/ \
    marker_in_n_samples=2 \
  --dry-run
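
With many samples, typing the sams list by hand becomes error-prone. The sketch below builds the sams=[...] argument from every .sam.bz2 file in a directory; build_sams_arg is a hypothetical helper and path/to/alignments is a placeholder.

```shell
# Join every *.sam.bz2 file in a directory into the sams=[a,b,...] form.
build_sams_arg() {
  list=""
  for f in "$1"/*.sam.bz2; do
    [ -e "$f" ] || continue        # skip the unexpanded pattern when nothing matches
    list="${list:+$list,}$f"       # add a comma only between entries
  done
  printf 'sams=[%s]\n' "$list"
}

build_sams_arg path/to/alignments
```

The printed value can be passed after --config; quoting it (for example "$(build_sams_arg path/to/alignments)") keeps the shell from treating the square brackets as a glob pattern.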

Outputs

For each input species:

Multiple sequence alignment of strains detected in samples
Phylogenetic tree of strains detected in samples

Workflow

The rule DAG for two example input species looks like this:

Testing and Development

Please see development.md.
