vdblab / vdblab-shotgun

Shotgun metagenomic sequencing processing pipeline
MIT License

VDB Shotgun Pipeline

Prerequisites

Recommendations

Important Notes:

Simulating test data:

snakemake --snakefile .test/Snakefile --directory .test/simulated/

Main Pipeline

Usage

snakemake \
  --directory tmpout/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    stage=all \
  --dry-run

Outputs

Workflow

The rule DAG for a single sample looks like this:

Main Shotgun Pipeline DAG

Different modules of the workflow can be run independently by setting the stage config entry.
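As a rough sketch of how the stage entry selects a module, the helper below prints (but does not run) a dry-run command for a given stage. The stage names and the tmp<stage>/ directories follow the examples in this README. The helper name print_stage_cmd and the short read paths are illustrative assumptions, not part of the pipeline.

```shell
# Sketch only: print a dry-run command for one pipeline stage.
# print_stage_cmd is a hypothetical helper, not shipped with the pipeline.
print_stage_cmd() {
  # $1=stage  $2=sample  $3=R1 path  $4=R2 path
  # Any further arguments are extra key=value config entries,
  # e.g. assembly=... for the annotate and binning stages.
  cmd_stage="$1"; cmd_sample="$2"; cmd_r1="$3"; cmd_r2="$4"
  shift 4
  cmd_extra=""
  for kv in "$@"; do
    cmd_extra="${cmd_extra} ${kv}"
  done
  # R1/R2 are single-quoted in the printed command so that the shell
  # does not try to expand [...] as a glob pattern.
  printf "snakemake --directory tmp%s/ --config sample=%s 'R1=[%s]' 'R2=[%s]'%s stage=%s --dry-run\n" \
    "$cmd_stage" "$cmd_sample" "$cmd_r1" "$cmd_r2" "$cmd_extra" "$cmd_stage"
}

# Placeholder read paths; substitute real FASTQ files.
for stage in preprocess biobakery kraken assembly rgi; do
  print_stage_cmd "$stage" 473 reads/473_R1.fastq.gz reads/473_R2.fastq.gz
done

# Stages that work on an existing assembly take an extra config entry.
print_stage_cmd annotate 473 reads/473_R1.fastq.gz reads/473_R2.fastq.gz \
  assembly=tmpassembly/473.contigs.fasta
```

Printing the commands first makes it easy to review them before running any of them, and keeping --dry-run in place lets Snakemake report the planned jobs without executing them.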

MultiQC

MultiQC can be run directly on a directory of reports; there is no need to use Snakemake.


cp -r tmppre/reports tmpreports
cp tmpassembly/quast/quast_473/report.tsv ./tmpreports/
ver="v1.12"
docker run -v "$PWD:$PWD" -w "$PWD" ewels/multiqc:${ver} multiqc \
    --config vdb_shotgun/multiqc_config.yaml --force \
    --title "a multiqc report for some test data" \
    -b "generated by ${ver}" --filename multiqc_report.html \
    tmpreports/ --interactive

Preprocessing

Shotgun Preprocessing Pipeline DAG

snakemake \
  --directory tmppreprocess/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    nshards=4 \
    dedup_platform=NovaSeq \
    stage=preprocess \
  --dry-run

Tools used

Biobakery

Shotgun Biobakery Profiling Pipeline DAG

snakemake \
  --directory tmpbiobakery/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=biobakery \
  --dry-run

Tools used

Kraken2/Bracken

Shotgun Kraken/Bracken Pipeline DAG

snakemake \
  --directory tmpkraken/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    dedup_platform=NovaSeq \
    stage=kraken \
  --dry-run

Tools used

Assembly

Shotgun Assembly Pipeline DAG

snakemake \
  --directory tmpassembly/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=assembly \
  --dry-run

Tools used

Annotation

Shotgun Assembly Annotation DAG

snakemake \
  --directory tmpannotate/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=annotate \
  --dry-run

Tools used

Binning

Shotgun Assembly Binning Pipeline DAG

snakemake \
  --directory tmpbinning/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    assembly=tmpassembly/473.contigs.fasta \
    stage=binning \
  --dry-run

RGI

Shotgun RGI Pipeline DAG

snakemake \
  --directory tmprgi/ \
  --config \
    sample=473 \
    R1=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R1_001.fastq.gz] \
    R2=[/data/brinkvd/data/shotgun/test/473/473_IGO_12587_1_S132_L003_R2_001.fastq.gz] \
    stage=rgi \
  --dry-run

Tools used

StrainPhlAn Pipeline

This pipeline runs StrainPhlAn for each specified species. StrainPhlAn requires two inputs: sample-level marker pickle files, and strain-level markers extracted from the main database. Both are stored in a central subdirectory of the MetaPhlAn database directory so that they can be reused on re-runs. If you provide the .sam.bz2 file for a sample that has already been processed into a pkl file, the pipeline uses the pregenerated result.
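To illustrate the reuse behaviour described above, here is a minimal sketch that checks whether a sample already has a marker file. It rests on a loud assumption: that the sample-level pickle is named <sample>.pkl and sits directly in the markers directory. The pipeline's actual file layout may differ, and has_cached_markers is a hypothetical helper, not part of the workflow.

```shell
# Sketch only: succeed if a pregenerated marker file exists for a sample.
# ASSUMPTION: markers are stored as <markers_dir>/<sample>.pkl, where
# <sample> is the alignment file name without its .sam.bz2 suffix.
has_cached_markers() {
  sam_path="$1"
  markers_dir="$2"
  sample_name=$(basename "$sam_path" .sam.bz2)
  [ -f "${markers_dir}/${sample_name}.pkl" ]
}
```

For example, `has_cached_markers path/to/sample1.sam.bz2 "$markers_dir" && echo reuse` prints `reuse` only when the assumed file is present.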

This workflow accepts as input a list of per-sample MetaPhlAn .sam.bz2 alignment files and a list of species of interest. The config argument strainphlan_markers_dir serves as a central place for storing both the species-level and the sample-level marker files. These files are specific to a version of the MetaPhlAn database, so we recommend placing that directory within the MetaPhlAn database directory.
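When there are many samples, typing the sams list by hand is error-prone. The sketch below builds the bracketed, comma-separated value from a directory of alignment files. The helper name make_sams_arg is hypothetical, and it assumes all .sam.bz2 files sit directly in one directory with no commas or spaces in their paths.

```shell
# Sketch only: build the sams=[...] config value from a directory.
# make_sams_arg is a hypothetical helper, not part of the workflow.
make_sams_arg() {
  sams_dir="$1"
  sams_list=""
  for f in "$sams_dir"/*.sam.bz2; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$f" ] || continue
    if [ -z "$sams_list" ]; then
      sams_list="$f"
    else
      sams_list="${sams_list},${f}"
    fi
  done
  printf 'sams=[%s]\n' "$sams_list"
}
```

The result can be passed as a single quoted argument after --config, for example `"$(make_sams_arg path/to/alignments)"`, alongside the other config entries.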

Usage

snakemake \
  --snakefile workflow/strainphlan.smk \
  --directory tmpstrain/ \
  --config \
    sams=[path/to/sample1.sam.bz2,path/to/sample2.sam.bz2] \
    strainphlan_markers_dir=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/marker_outputs/ \
    metaphlan_db=/data/brinkvd/resources/dbs/metaphlan/mpa_vJan21_CHOCOPhlAnSGB_202103/ \
    marker_in_n_samples=2 \
  --dry-run

Outputs

For each input species:

Workflow

The rule DAG for two example input species looks like this:

StrainPhlAn Shotgun Pipeline DAG

Testing and Development

Please see development.md.