leylabmpi / Struo2

Scalable creating/updating of metagenome profiling databases
MIT License
58 stars 8 forks source link
database metagenomics ngs pipeline

Struo2 Snakemake

Struo2

logo

Struo2: a pipeline for building custom databases for common metagenome profilers

"Struo" --> from the Latin: “I build” or “I gather”

WARNING: the pipeline is still in development and could change substantially without warning!

TOC

Use the GitHub auto-generated TOC (top-left above the README).

Citation

Struo2

Youngblut, Nicholas D., and Ruth E. Ley. 2021. “Struo2: Efficient Metagenome Profiling Database Construction for Ever-Expanding Microbial Genome Datasets.” PeerJ 9 (September): e12198. https://doi.org/10.7717/peerj.12198

Struo (original)

Cuesta-Zuluaga, Jacobo de la, Ruth E. Ley, and Nicholas D. Youngblut. 2019. "Struo: A Pipeline for Building Custom Databases for Common Metagenome Profilers." Bioinformatics , November. https://doi.org/10.1093/bioinformatics/btz899

Overview

Efficiently create/update custom databases for the following metagenome profilers:

You can also just use Struo2 for efficiently clustering genes via mmseqs and generating gene & gene-cluster databases that can be efficiently updated via mmseqs2.

Changes from Version 1

Pre-built custom databases

Custom GTDB databases available at the Struo2 data ftp server

GTDB releases available:

Setup for installing Struo2

Only Unix/Linux OS supported

Download

To download the pipeline, clone the Git repository:

git clone --recurse-submodules https://github.com/leylabmpi/Struo2.git 

Note the use of submodules. If needed, you can update the submodule(s) via:

cd Struo2
git submodule update --remote --init --recursive

conda env setup

See the miniconda docs for installing conda.

To install the main conda environment for running the snakemake pipeline and utility scripts:

conda env create --name struo2 -f conda_env.yaml

The only real dependencies for running the Struo2 snakemake pipeline are:

If you want email notifications upon pipeline success/failure, then you need mutt installed on your OS (e.g., sudo apt-get install mutt).

All other dependencies are installed automatically in conda environments via snakemake when running the pipeline.

Setting a location for necessary files

The pipeline requires a few publicly available database files (see below). Some files are large and require many hours to download.

# Create a directory to hold all of the necessary Struo2 data files
# The directory location can be anywhere on your file system
OUTDIR=./data/
mkdir -p $OUTDIR

taxdump files

The taxdump files are used for creating/updating the Kraken2 database.

GTDB taxdump

By default, the pipeline uses custom GTDB taxIDs generated with gtdb_to_taxdump from the GTDB taxonomy. To download the custom taxdump files:

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release95/taxdump/names.dmp
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release95/taxdump/nodes.dmp

NCBI taxdump

If you would rather use NCBI taxonomy instead of the GTDB taxonomy:

wget --directory-prefix $OUTDIR https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -pzxvf $OUTDIR/taxdump.tar.gz --directory $OUTDIR

UniRef

UniRef databases are used for annotating genes. UniRef IDs are required for HUMANnN3. You do not need any UniRef databases if just creating Kraken2/Bracken databases.

IF using mmseqs search for gene annotation

mmseqs UniRef database(s)

See the mmseqs2 wiki on database downloading

# you must have mmseqs2 installed
# Example of downloading UniRef50 (WARNING: slow!)
mmseqs databases --remove-tmp-files 1 --threads 4 UniRef50 $OUTDIR/mmseqs2/UniRef50 data/mmseqs2_TMP

IF using diamond blastp for gene annotation

HUMAnN3 UniRef diamond database(s)

See the "Download a translated search database" section of the humann3 docs.

# Example download of UniRef50 DIAMOND database
wget --directory-prefix $OUTDIR http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref50_annotated_v201901.tar.gz
tar -pzxvf $OUTDIR/uniref50_annotated_v201901.tar.gz --directory $OUTDIR

UniRef50-90 index

Optional, but recommended

This is needed to map annotations from UniRef90 clusters to UniRef50 clusters. This allows for just annotating against UniRef90 and then mapping those annotations to UniRef50 cluster IDs. You then do not have to annotate against UniRef90 clusters and UniRef50 clusters, which requires a lot more querying of genes against UniRef.

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/install/uniref_2019.01/uniref50-90.pkl

Getting reference genomes for the custom databases

Downloading genomes

A toy dataset

A collection of 10 genomes from the GTDB:

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n10.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n10.tar.gz --directory $OUTDIR

To test database updating with new genomes, get this collection of 5 genomes:

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n5.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n5.tar.gz --directory $OUTDIR

To test database updating with indivdual genes, get this collection:

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genes/UniRef50_n5.tar.gz
tar -pzxvf $OUTDIR/UniRef50_n5.tar.gz --directory $OUTDIR

Example of download from the GTDB

# Filtering GTDB metadata to certain genomes
./GTDB_metadata_filter.R -o gtdb-r95_bac-arc.tsv https://data.gtdb.ecogenomic.org/releases/release95/95.0/ar122_metadata_r95.tar.gz https://data.gtdb.ecogenomic.org/releases/release95/95.0/bac120_metadata_r95.tar.gz

# Downloading all genomes (& creating tab-delim table of genome info)
./genome_download.R -o genomes -p 8 gtdb-r95_bac-arc.tsv > genomes.txt

# Note: the output of ./genome_download.R can be directly used for running the `Struo2` pipeline (see below)
# Note: genome fasta files can be compressed (gzip or bzip2) or uncompress for input to Struo2

Input genome/gene data

The genomes and/or individual genes used for creating/updating databases

Genome data

The table of input files/data can be created using the helper scripts described above if you downloaded the genomes from the GTDB or NCBI.

We recommend to use the highest quality genome assemblies possible. See the Struo2 manuscript for data on how genome assembly quality affects database accuracy.

We also recommend to de-replicate genomes to 95% ANI, since within-species variation can result in multi-mapping of reads and lower sensitivity. Also, de-replicating to 95% ANI reduces the dataset to more manageable numbers.

The samples.txt file lists all reference genomes to be included in the custom databases:

Other columns in the file will be ignored. See this table for an example. The path to the samples table file should be specified in the config.yaml file (see below).

Gene data

This can only be used for updating existing databases (e.g., existing custom GTDB Kraken2 or HUMAnN databases).

The gene data must include:

Running the pipeline

This applies to creating and updating the databases

Edit the config file

Running locally

This is only recommended for test runs

It is always good to run snakemake in dryrun mode first:

snakemake --use-conda -j 1 -Fqn

For an actual run with 2 cores:

snakemake --use-conda -j 2

Running on a cluster

If SGE, then you can use the snakemake_sge.sh script. You can create a similar bash script for other cluster architectures. See the following resources for help:

General info on using snakemake

Snakemake allows for easy re-running of the pipeline on just genomes that have not yet been processed. You can just add more genomes to the input table and re-run the pipeline (test first with --dryrun). Snakemake should just process the new genomes and then re-create the combined dataset files (this must be done each time). Make sure to not mess with the files in the nuc_filtered and prot_filtered directories! Otherwise, snakemake may try to run all genomes again through the computationally expensive gene annotation process.

Using the resulting databases

Set the database paths in humann3, kraken2, etc. to the new, custom database files.

Example of a HUMANnN3 run

Run HUMAnN3 with custom databases created by Struo2. Change that PATHs as necessary.

STRUO2_OUT_DIR=./struo2_output/
NUC_DB=$STRUO2_OUT_DIR/humann3/uniref50/genome_reps_filt_annot.fna.gz
PROT_DB=$STRUO2_OUT_DIR/humann3/uniref50/protein_database/uniref50_201901.dmnd
MP_DB=$STRUO2_OUT_DIR/mpa_v30_CHOCOPhlAn_201901_marker_info.txt

READS=/path/to/example/read/files/CHANGE/THIS/reads.fq

# metaphlan database not actually used
touch $MP_DB

# humann3 run   
humann3 --bypass-nucleotide-index \
  --nucleotide-database $NUC_DB \
  --protein-database $PROT_DB \
  --metaphlan-options "--bowtie2db $MP_DB" \
  --output-basename humann \
  --input $READS \
  --output humann_output

Utilities

database_download.py

Helper script for downloading existing custom Struo2-generated GTDB databases.

Example for downloading Kraken2/Bracken database (and associated files):

# requires `requests` and `bs4` python packages
# using 4 threads in this example
./util_scripts/database_download.py -t 4 -r 202 -d kraken2 metadata taxdump phylogeny -- custom_dbs

GTDB_metadata_filter.R

This tool is useful for selecting which GTDB genomes to include in a custom database.

Filter >=1 genome assembly metadata file (e.g., bac120_metadata_r89.tsv) by assembly quality or other parameters.

genome_download.R

This tool is useful for downloading genomes from NCBI.

Download a set of genomes based on NCBI assembly accessions provided in a table. The file paths of the downloaded genome fasta files will be appended to the input table.

tree_prune.py

This tool is useful for creating a GTDB archaea/bacteria phylogeny of all genomes in your custom database. The phylogeny can be used for phylogenetic analyses of metagenomes (e.g., Faith's PD or Unifrac).

Prune >=1 phylogeny to just certain taxa. If >1 phylogeny provided, then the phylogenies are merged.

genome_mis-asmbl_sim.py

A simple script for simulating mis-assemblies among a set of microbial genomes. See the script help docs for more information.

gtdb_to_taxdump

This is located in a separate repo.

This is useful for creating an NCBI taxdump (names.dmp and nodes.dmp) from the GTDB taxonomy. Note that the taxIDs are arbitrary and don't match anything in the NCBI!

nbci-gtdb_map.py

This is located in a separate repo.

This is useful for mapping between GTDB and NCBI taxonomies. The mapping is based on the GTDB archaeal and bacterial metadata tables, which contain both GTDB and NCBI taxonomic lineages.

Tutorials

See the Struo2 wiki for tutorials on database generation and updating.

FAQ

Why is MetaPhlAn not included in Struo2?

MetaPhlAn requires intra-clade information in regards to which genes are core and variable within the clade. Inter-clade information is also needed in order to specify which "core" genes for a clade are also present in other clades.

Struo2 is designed to process one genome representative of each species, which is all that is required for generating the Kraken2/Bracken and HUMAnN databases. Inclusion of MetaPhlAn would require that users provide all genomes per clade (species), would greatly add to the total computation of the pipeline, and would greatly increase the total complexity of the pipeline.

We note that Kraken2 + Bracken generally has the highest scores of any species-level taxonomic profilier tested in the LEMMI evaluations.