jordangumm / metago

Experimental metagenomic workflow

METAGenOmic Analysis Workflow CLI

A CLI that aims to automate common metagenome pipeline steps, with a subsequent focus on downstream viral analysis. The base steps are read quality control, assembly, binning, and gene calling; they should work for almost any metagenomic analysis with minor parameter tuning.

Workflow Overview

General Steps and Opinionated Software List

  1. Quality Control: BBDuk
  2. Assembly: MEGAHIT
  3. Binning: MaxBin 2.0
  4. Gene Calling: Prodigal | EMIRGE (ribosomal reconstruction)
  5. Viral Analysis: VirSorter
  6. Annotation: Prokka

Resources

  1. Microbe-Phage Interaction Database: MVP

Setup

The following steps assume you have Python and pip installed. If you don't, or if you are working on a server without privileged access, consider a user-level install of Anaconda. Some rationale and steps for this can be found here in the Personal Environment section.

GitHub Install

$ git clone https://github.com/jordangumm/metago.git
$ cd metago && ./build.sh
$ pip install -e .

Singularity Install

In Development!

Usage

Base Command

The metago command is your interface to the workflow commands. It requires fastq files to be organized in the Illumina style, that is, as Run_[RUNID]/Sample_[SAMPLEID]/[SAMPLEID].fastq. Run-based commands target a run directory and process every sample in it automatically. Sample-based commands target a single sample or fastq file. The quality control step interleaves fastqs, so run your sample fastq files through that step first if you want to use the downstream commands.
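To check that a run directory matches the expected layout before launching anything, a small script along these lines may help. This is only a sketch: the helper name `find_sample_fastqs` is hypothetical and not part of metago, and it assumes fastq files sit directly inside each Sample_[SAMPLEID] directory with a `.fastq` extension.

```python
from pathlib import Path


def find_sample_fastqs(run_dp):
    """Map each sample ID in a Run_[RUNID] directory to its fastq file names.

    Hypothetical helper for illustration only; not part of metago.
    """
    run_dp = Path(run_dp)
    samples = {}
    for sample_dp in sorted(run_dp.glob("Sample_*")):
        if not sample_dp.is_dir():
            continue
        # Strip the "Sample_" prefix to recover the sample ID.
        sample_id = sample_dp.name[len("Sample_"):]
        fastqs = sorted(p.name for p in sample_dp.glob("*.fastq"))
        if fastqs:
            # Only report samples that actually contain fastq files.
            samples[sample_id] = fastqs
    return samples
```

Sample directories that contain no fastq files are omitted from the result, which makes it easy to spot samples a run-based command would have nothing to process for.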

$ metago --help

Usage: metago [OPTIONS] COMMAND [ARGS]...

  Metago Command Line Interface

  Note: Use absolute paths to all files and directories

Options:
  -o, --output TEXT
  --flux / --no-flux
  -a, --account TEXT
  -p, --ppn INTEGER
  -m, --mem TEXT
  -w, --walltime TEXT
  --help               Show this message and exit.

Commands:
  run_assembly        Assemble run sample reads
  run_mapping         Read mapping of run to reference
  run_minhash         Minhash compare fastqs in run
  run_pseudoalign     Read assignment of run to reference
  run_qc              Quality control of Illumina run
  sample_assembly     Assembly sample reads
  sample_mapping      Read mapping of reference to sample
  sample_pseudoalign  Read assignment of sample to reference
  sample_qc           Quality control of Illumina sample

Secondary Commands

The workflow commands require their own arguments and options.

$ metago sample_qc --help

Usage: metago sample_qc [OPTIONS] SAMPLE_DP

  Quality control of Illumina sample

Options:
  --help  Show this message and exit.

Make sure to provide arguments and options at the appropriate command level. metago expects resource and output information, while secondary commands require more data-specific information, such as which fastq to process or analyze. Below is an example command that specifies an output path and a 4-hour walltime limit for running quality control on a sample.

$ metago -o /scratch/analysis/Sample_1234 -w 4:00:00 sample_qc /nfs/longterm_storage/Sample_1234
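If you launch metago from a script, the same ordering applies: top-level options first, then the secondary command, then its arguments. The sketch below assembles such an argument list. The helper `build_metago_command` is hypothetical, and it uses only the `-o` and `-w` options shown in the help output above; paths are converted to absolute form, as the CLI note requests.

```python
import os
import shlex


def build_metago_command(output_dp, walltime, subcommand, target_dp):
    """Assemble a metago argument list with options at the right level.

    Hypothetical helper for illustration only; not part of metago.
    """
    return [
        "metago",
        "-o", os.path.abspath(output_dp),  # top-level option: output path
        "-w", walltime,                    # top-level option: walltime limit
        subcommand,                        # secondary command, e.g. "sample_qc"
        os.path.abspath(target_dp),        # argument of the secondary command
    ]


cmd = build_metago_command(
    "/scratch/analysis/Sample_1234", "4:00:00",
    "sample_qc", "/nfs/longterm_storage/Sample_1234",
)
# Print the command as a shell-quoted string for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once metago is installed and on your PATH.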