This is a metabarcoding pipeline for Illumina sequencing data, developed at the Thuenen Institute of Biodiversity, Braunschweig, by Wiebke Sickel & Lasse Krueger, with contributions from Brandon Seah.
The pipeline is managed by Snakemake and uses various tools, which are specified in Conda environment files in the `workflow/envs` directory. All that is needed to start the pipeline is a current version of Conda. When the pipeline is run for the first time, the required software is automatically installed in Conda environments. Mamba can be used as an alternative to Conda.
Input data should be short-read metabarcoding sequence data (paired- or single-end) in Fastq or gzipped Fastq format. File paths and pipeline options/parameters are specified in config files (see below).
A reference database in Fasta format is required for taxonomic classification. The Fasta headers should contain the taxonomy string in SINTAX format, as described in the Vsearch manual.
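For illustration only, a SINTAX-style header carries the taxonomy as comma-separated rank:name pairs after a `tax=` key; the sequence ID and lineage below are invented placeholders:

```
>ref_seq_001;tax=d:Eukaryota,p:Arthropoda,c:Insecta,o:Diptera,f:Syrphidae,g:Episyrphus,s:Episyrphus_balteatus;
ACGTACGTACGTACGTACGT
```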
Depending on your familiarity with Snakemake, there are several ways to set up this pipeline for use with your own data.
Clone the pipeline directly from GitHub:

```
git clone git@github.com:monagrland/MB_Pipeline.git
```

To clone a specific version, specify its tag with the `--branch` option.
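For example, to clone the v3.0.2 release tag (the same version used in the deployment example below):

```
git clone --branch v3.0.2 git@github.com:monagrland/MB_Pipeline.git
```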
Alternatively, download a zipped archive of the source code from the GitHub releases page.
(For pipeline versions v3.0.2 and above only.)
Snakedeploy is a tool to deploy Snakemake pipelines that have been published on platforms like GitHub or GitLab.
You can deploy either a specific version with the `--tag` option, or a specific branch with `--branch`. The following deploys version v3.0.2 of the pipeline to a new folder `my_deployment/`:
```
snakedeploy deploy-workflow https://github.com/monagrland/MB_Pipeline \
    ./my_deployment --tag v3.0.2
cd my_deployment
```
The template configuration will be copied to the `config/` subfolder, and should be modified with your desired configuration.
This usage pattern is relevant when you wish to apply the same pipeline to different sets of input data, each requiring its own configuration. In this way, each separate deployment can be run and put under version control independently of the others and of the pipeline itself.
To rerun under a new pipeline version, simply change the version specified in the `module` directive in the Snakefile. The config may also have to be updated if the config schema has changed.
This is similar to option 2 above, except that you wish to incorporate MB_Pipeline into an existing Snakemake pipeline. Instead of using the Snakedeploy command line tool to set up a new template workflow, you should directly add a `module` declaration to the existing workflow; see the Snakemake documentation on modules for details.
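As a rough sketch only (the module name, rule prefix, and config handling are placeholders to adapt; the Snakefile path assumes the standard `workflow/Snakefile` layout), such a declaration pinning a released version might look like:

```
# Placeholder module declaration in an existing Snakefile
module mb_pipeline:
    snakefile:
        github("monagrland/MB_Pipeline", path="workflow/Snakefile", tag="v3.0.2")
    config:
        config

# Import all rules from the module, prefixed to avoid name clashes
use rule * from mb_pipeline as mb_*
```

To switch to a newer pipeline release later, only the `tag` value needs to be changed.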
The pipeline options and input/output paths are declared in a configuration file `config/config.yaml` and a table `config/reads.tsv`, respectively. These should be modified to match the actual data and analysis you want to perform. The sample files provided are configured to run the test dataset in the `testdata/` subfolder without further modification, and can be used to test the pipeline. Configuration options are described in the `config/README.md` file.
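Purely as a hypothetical illustration of the kind of sample table meant here (the actual column names and layout required are documented in `config/README.md` and the provided example table), a paired-end `reads.tsv` might resemble:

```
sample      fwd                             rev
sample01    reads/sample01_R1.fastq.gz      reads/sample01_R2.fastq.gz
sample02    reads/sample02_R1.fastq.gz      reads/sample02_R2.fastq.gz
```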
The pipeline is executed with `snakemake`, specifying the path to the config file with the `--configfile` parameter, and other options. `mamba` or `conda` should also be in your path, and the Python package `pandas` should be available in the environment.
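A typical invocation might look like the following sketch; the flags are standard Snakemake options, and `run_pipeline.sh` described below encodes the project's usual defaults:

```
snakemake --cores 8 --use-conda --conda-frontend mamba \
    --configfile config/config.yaml
```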
Output will be written to the subfolder `results/`, and log files to `logs/`, which will be created if they do not already exist.
Examine and/or modify the bash script `run_pipeline.sh` for typical defaults. If a Conda environment does not already exist at `./snakemake_8`, the script first creates one there containing Snakemake and the other dependencies required to run the pipeline; this environment is activated before the pipeline is run.
```
bash run_pipeline.sh # uses config file at config/config.yaml by default
```
Performance-related options include:

- `--cores` - Change the total number of CPU cores available to the pipeline
- `--dryrun` - Only show the rules that would be run, do not actually execute them
- `--notemp` - Keep temporary files (trimmed, merged, dereplicated reads); remove this flag to discard them after the run
- `--conda-frontend conda` - Specify `conda` or `mamba` (recommended) to manage environments

See the Snakemake documentation for the complete list.
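For example, to preview which rules would run for a given configuration without executing anything:

```
snakemake --dryrun --cores 4 --use-conda --conda-frontend mamba \
    --configfile config/config.yaml
```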
Snakemake by default locks the directories in which the results are saved. If a run fails, the directories remain locked, and if you attempt to rerun the pipeline with the same output directory, you get an error message. To unlock the directory, simply run the following script:

```
./unlock.sh
```
Raw sequencing reads are first trimmed of adapter sequences with Cutadapt. Paired-end reads are merged into the complete amplicon sequence with VSEARCH. For each sample, reads are then filtered for quality, and exactly identical sequences are deduplicated, both steps with VSEARCH. Reads are then pooled across all samples and deduplicated again to obtain the pooled, dereplicated amplicon sequences.
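As an illustrative sketch of these steps outside the workflow (file names, primer sequences and the quality threshold are placeholders; the pipeline takes the actual values from the config):

```
# Trim primers/adapters from the raw reads
cutadapt -g FWD_PRIMER -G REV_PRIMER \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz raw_R1.fastq.gz raw_R2.fastq.gz
# Merge read pairs into full-length amplicons
vsearch --fastq_mergepairs trimmed_R1.fastq.gz --reverse trimmed_R2.fastq.gz \
    --fastqout merged.fastq
# Quality filtering by expected errors
vsearch --fastq_filter merged.fastq --fastq_maxee 1.0 --fastaout filtered.fasta
# Per-sample dereplication, keeping abundance annotations
vsearch --derep_fulllength filtered.fasta --sizeout --output derep.fasta
```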
If the barcoding marker sequence is not from a protein-coding gene, the pooled, dereplicated sequences are first clustered with the UNOISE algorithm as implemented in VSEARCH. The resulting amplicon sequence variants (ASVs) are then screened for PCR chimeras with the UCHIME3 de novo algorithm as implemented in VSEARCH, to produce the set of denoised, artefact-filtered ASVs.
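A sketch of this non-coding branch (the minimum cluster size is a placeholder; the actual value is set in the config):

```
# Denoise the pooled, dereplicated reads into ASVs
vsearch --cluster_unoise pooled_derep.fasta --minsize 8 --centroids asv_unoise.fasta
# Remove de novo chimeras
vsearch --uchime3_denovo asv_unoise.fasta --nonchimeras asv_denoised.fasta
```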
If the marker is protein-coding, the sequences are instead denoised with the entropy ratio-based denoising method DnoisE. They are then translated in all forward reading frames with pytransaln and compared against an HMM of the target protein sequence (using pyhmmer and HMMER internally) to screen out pseudogenes and to find the correct reading frame. PCR chimeras should also be indirectly removed at this step.
The per-sample abundance of sequences represented by each final ASV is calculated by mapping the pooled reads back to the ASVs with VSEARCH's global alignment option.
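A sketch of this mapping step (the identity threshold is a placeholder):

```
vsearch --usearch_global pooled_reads.fasta --db asvs.fasta \
    --id 0.97 --otutabout community_table.tsv
```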
Denoised, screened ASVs are aligned with MAFFT, and used to build a phylogenetic tree, from which the phylogenetic diversity per sample is calculated.
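A sketch of the alignment and tree-building steps (the FastTree options shown are common choices for nucleotide data, not necessarily the pipeline's defaults):

```
mafft --auto asvs.fasta > asvs_aligned.fasta
fasttree -nt -gtr asvs_aligned.fasta > asvs.nwk
```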
A consensus taxonomy per representative ASV is also reported based on the user-supplied reference database, using either the SINTAX algorithm as implemented in VSEARCH, or an experimental two-step heuristic method.
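A sketch of the SINTAX step (the bootstrap cutoff is a placeholder):

```
vsearch --sintax asvs.fasta --db reference_db.fasta \
    --sintax_cutoff 0.8 --tabbedout taxonomy.sintax.tsv
```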
For full parameter details please see the `config/README.md` file.
```mermaid
flowchart TB
seq_data(Sequencing Data)--> |cutadapt| trimmed(Trimmed reads)
trimmed --> |vsearch --fastq_mergepairs| merged(Merged reads)
merged --> |vsearch --fastq_filter| filtered(Quality-filtered reads)
filtered --> |vsearch --derep_fulllength| derep1(Dereplicated reads)
derep1 --> |cat| concat(Pooled reads)
concat --> |vsearch --derep_fulllength| derep2(Pooled, dereplicated reads)
derep2 --> coding_or_noncoding{Protein-coding?}
coding_or_noncoding --> coding(Coding)
coding --> |dnoise| dnoise_denoised(Entropy-denoised ASVs)
dnoise_denoised --> |pytransaln| pseudogene_screened(Screened for pseudogenes, min cluster size)
coding_or_noncoding --> noncoding(Non-coding)
noncoding --> |vsearch --cluster_unoise| unoise_denoised(Unoise-denoised ASVs)
unoise_denoised --> |vsearch --uchime3_denovo| uchime_screened(Screened for chimeras, min cluster size)
uchime_screened --> asvs(Denoised, screened ASVs)
pseudogene_screened --> asvs
asvs --> comtab(Community table)
concat --> |vsearch --usearch_global| comtab
asvs --> |mafft| align(Aligned ASVs)
align --> |fasttree| phylogeny(Phylogenetic tree of ASVs)
phylogeny --> faithpd(Phylogenetic diversity per sample)
comtab --> faithpd
```