# Struo: a pipeline for building custom databases for common metagenome profilers

"Struo" --> from the Latin: “I build” or “I gather”

## Citation

Cuesta-Zuluaga, Jacobo de la, Ruth E. Ley, and Nicholas D. Youngblut. 2019. "Struo: A Pipeline for Building Custom Databases for Common Metagenome Profilers." Bioinformatics, November. https://doi.org/10.1093/bioinformatics/btz899
## Pre-built databases

Custom GTDB databases are available at the Struo data FTP server.

GTDB releases available:

- `ar122_r89.tree` & `bac120_r89.tree`, grafted together
- `ar122_r95.tree` & `bac120_r95.tree`, grafted together

Use `ncbi-gtdb_map.py` from the gtdb_to_taxdump repo to convert between NCBI and GTDB taxonomies.

For a step-by-step example of how to prepare and execute Struo, see the notebook in the `./tutorial/` folder.
Struo's workflow encompasses the steps from genome download to database construction
To download the pipeline, clone the Git repository:

```shell
git clone git@github.com:leylabmpi/Struo.git
```
Versions listed are those that have been tested.

You will need a UniRef DIAMOND database (e.g., UniRef90) for the humann2 database construction. See the "Download a translated search database" section of the humann2 docs.
Two helper scripts are provided:

- `GTDB_metadata_filter.R` to select genomes
- `genome_download.R` to download genomes
Example:

```shell
# Filtering GTDB metadata to certain genomes
./GTDB_metadata_filter.R -o gtdb-r89_bac-arc.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_metadata_r89.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/ar122_metadata_r89.tsv

# Downloading all genomes (& creating a tab-delimited table of genome info)
./genome_download.R -o genomes -p 8 gtdb-r89_bac-arc.tsv > genomes.txt

# Note: the output of ./genome_download.R can be used directly as input
# for running the `Struo` pipeline (see below)
```
Users can also provide genomes as compressed fasta files (`.fna.gz`). This also requires adding the corresponding information to the `samples.txt` file (see below).
## Input: the samples table (samples.txt file)

The table of input files/data can be created using the helper scripts described above. The taxonomy column should be formatted as:

- `g__<genus>;s__<species>`
- `g__unclassified` (if no genus)
- `s__unclassified` (if no species)

Other columns in the file will be ignored. The path to the samples file should be specified in the `config.yaml` file (see below).
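For illustration, a minimal samples table could be built as follows. The column names and the genome row are hypothetical examples, not Struo defaults; match them to the columns declared in your `config.yaml`.

```shell
# Build a minimal tab-delimited samples table.
# NOTE: column names and values below are illustrative only.
printf 'ncbi_organism_name\tfasta_file_path\ttaxonomy\n' > samples.txt
printf 'my_isolate_1\tgenomes/my_isolate_1.fna.gz\tg__Escherichia;s__Escherichia coli\n' >> samples.txt
cat samples.txt
```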
## Using the GTDB taxonomy instead of NCBI taxIDs

kraken2 & humann2 databases use NCBI taxIDs, and thus Struo uses the NCBI taxonomy by default. You can instead create custom taxIDs from the GTDB taxonomy with gtdb_to_taxdump. Given the resulting `names.dmp` and `nodes.dmp` files, along with a genome metadata file that includes the gtdb_taxids, you can modify the Struo pipeline to fully use the GTDB taxonomy & taxIDs. You will need to modify the `config.yaml` file (see "If using GTDB taxIDs" below).
## Editing the config.yaml

- `samples_col:` the column specified should contain unique values. Note that the `ncbi_organism_name` column in the GTDB metadata does not contain unique values (e.g., `GB_GCA_001784635.1_Candidatus Micrarchaeota archaeon RBG_16_49_10`).
- `temp_folder:` change if needed
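A hypothetical fragment of the relevant `config.yaml` entries (the key names are the ones discussed above; the values are placeholders):

```yaml
# config.yaml (fragment; illustrative values only)
samples_col: 'ncbi_organism_name'   # must contain unique values
temp_folder: /tmp/struo_tmp         # change if needed
```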
If you have followed "Using the GTDB taxonomy instead of NCBI taxIDs" above, then make the following modifications to the `config.yaml` file:

```yaml
## column names in samples table
taxID_col: 'gtdb_taxid'
taxonomy_col: 'gtdb_taxonomy'

#-- if custom NCBI taxdump files --#
names_dmp: /YOUR/PATH/TO/names.dmp
nodes_dmp: /YOUR/PATH/TO/nodes.dmp
```
## Running the pipeline

```shell
snakemake --use-conda
```

If using SGE, you can use the `snakemake_sge.sh` script. You can create a similar bash script for other cluster architectures; see the snakemake documentation for help.
Snakemake allows for easy re-running of the pipeline on just the genomes that have not yet been processed. You can simply add more genomes to the input table and re-run the pipeline (test first with `--dryrun`). Snakemake should process only the new genomes and then re-create the combined database files (this last step must be done each time). Make sure not to alter the files in the `nuc_filtered` and `prot_filtered` directories! Otherwise, snakemake may try to re-run all genomes through the computationally expensive gene annotation process.
## Using the resulting databases

Set the database paths in humann2, kraken2, etc. to the new, custom database files:

- `all_genes_annot.fna.gz` (humann2 nucleotide database)
- `all_genes_annot.dmnd` (humann2 protein DIAMOND database)
- `database*mers.kraken` (kraken database files)
Example of running humann2 with the custom databases created by Struo (change the paths as necessary):

```shell
STRUO_OUT_DIR=./struo_output/
NUC_DB=$(dirname $STRUO_OUT_DIR"/all_genes_annot.fna.gz")
PROT_DB=$(dirname $STRUO_OUT_DIR"/all_genes_annot.dmnd")
MTPHLN_BT2_DB=$(dirname ./metaphlan2_db/mpa_v20_m200/mpa_v20_m200.1.bt2)
MTPHLN_PKL_DB=./metaphlan2_db/mpa_v20_m200/mpa_v20_m200.pkl

humann2 --gap-fill on --bypass-nucleotide-index \
  --nucleotide-database $NUC_DB \
  --protein-database $PROT_DB \
  --metaphlan-options "--mpa_pkl $MTPHLN_PKL_DB --bowtie2db $MTPHLN_BT2_DB" \
  --tmp-dir /dev/shm/humann2_temp/ \
  --threads 12 \
  --input-format fastq \
  --output-basename SRS018656 \
  --input SRS018656_R1.fq
```
If you set `keep_intermediate: True` for your initial run, the intermediate files from the computationally intensive steps are kept, so those genomes don't have to be reprocessed. Only new genomes will be processed, and then the database(s) will be re-created with the old + new genomes.
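The corresponding `config.yaml` entry (the key name is taken from the text above; set it before the initial run):

```yaml
# config.yaml (fragment): keep per-genome intermediate files so that
# re-runs with added genomes skip the expensive annotation steps
keep_intermediate: True
```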
To create a database with more genomes:

- change the `db_name:` parameter in the `config.yaml` file
- use `--dryrun` to see what snakemake is going to do before actually running the pipeline
- set `use_ancient: True` in order to have snakemake skip the diamond mapping for humann2

## Adding existing gene sequences

If you have gene sequences already formatted for creating a humann2 custom DB,
and you'd like to include them with the gene sequences generated from the
input genomes, then just provide the file paths to the nuc/prot fasta files
(`humann2_nuc_seqs` and `humann2_prot_seqs` in the `config.yaml` file).
All genes (from genomes & user-provided) will be clustered together with `vsearch`. See the `vsearch_all:` setting in the `config.yaml` for the default clustering parameters. You can set `vsearch_all: Skip` to skip the clustering; all sequences will then be combined without removing redundancies.
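As a sketch, the two modes described above look like this in `config.yaml` (the parameter string on the first line is only an illustrative vsearch option, not the Struo default — check your copy of `config.yaml`):

```yaml
# config.yaml (fragment): vsearch clustering of all genes
vsearch_all: --id 0.97    # illustrative clustering parameters, not the Struo default
# vsearch_all: Skip       # or: combine sequences without removing redundancies
```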
## Utilities

### GTDB_metadata_filter.R

This tool is useful for selecting which GTDB genomes to include in a custom database. It filters >=1 genome assembly metadata file (e.g., `bac120_metadata_r89.tsv`) by assembly quality or other parameters.

### genome_download.R

This tool is useful for downloading genomes from NCBI. It downloads a set of genomes based on NCBI assembly accessions provided in a table; the file paths of the downloaded genome fasta files are appended to the input table.

### tree_prune.py

This tool is useful for creating a GTDB archaea/bacteria phylogeny of all genomes in your custom database. The phylogeny can be used for phylogenetic analyses of metagenomes (e.g., Faith's PD or UniFrac). It prunes >=1 phylogeny to just certain taxa; if >1 phylogeny is provided, the phylogenies are merged.

### gtdb_to_taxdump

This is a separate repo. It is useful for creating an NCBI taxdump (`names.dmp` and `nodes.dmp`) from the GTDB taxonomy. Note that the taxIDs are arbitrary and don't match anything in NCBI!
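For context, NCBI-style taxdump files are pipe-delimited. The sketch below writes a minimal, two-taxon `names.dmp`/`nodes.dmp` pair in that format; the taxIDs are arbitrary examples, just as they are in gtdb_to_taxdump output, and real nodes.dmp files carry additional trailing fields.

```shell
# Write a minimal names.dmp / nodes.dmp pair in the NCBI taxdump format
# (fields delimited by "\t|\t"). TaxIDs here are arbitrary examples.
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n' >  names.dmp
printf '2\t|\td__Bacteria\t|\t\t|\tscientific name\t|\n' >> names.dmp
printf '1\t|\t1\t|\tno rank\t|\n'      >  nodes.dmp
printf '2\t|\t1\t|\tsuperkingdom\t|\n' >> nodes.dmp
wc -l names.dmp nodes.dmp
```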
Struo uses `diamond` >=0.9 so that users can run humann2 with the most up-to-date version of diamond.