Vitor C. Piro (vitorpiro@gmail.com)
Piro, V. C., Matschkowski, M., & Renard, B. Y. (2017). MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling. Microbiome, 5(1), 101. http://doi.org/10.1186/s40168-017-0318-y
Miniconda:
# Download conda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Set permissions to execute
chmod +x Miniconda3-latest-Linux-x86_64.sh
# Execute. Make sure to answer "yes" to add conda to your PATH
./Miniconda3-latest-Linux-x86_64.sh
# Add channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
MetaMeta:
conda install metameta=1.2.0
MetaMeta should then be run with the --use-conda parameter active. Alternatively, install MetaMeta in a separate environment (named "metametaenv") with the command:
conda create -n metametaenv metameta=1.2.0
source activate metametaenv # Command to activate the environment. To deactivate use "source deactivate"
Create a configuration file (yourconfig.yaml) with the required fields (workdir, dbdir and samples):
workdir: "/home/user/folder/results/"
dbdir: "/home/user/folder/databases/"
samples:
  sample_name_1:
     fq1: "/home/user/folder/reads/file.1.fq"
     fq2: "/home/user/folder/reads/file.2.fq"
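For projects with many samples, the configuration file can be generated rather than typed by hand. The following is a minimal sketch, not part of MetaMeta: the function name `make_config` and the `<name>.1.fq` / `<name>.2.fq` naming scheme are assumptions made for this example; only the keys `workdir`, `dbdir`, `samples`, `fq1` and `fq2` come from the configuration shown above.

```shell
#!/usr/bin/env bash
# Sketch: print a MetaMeta configuration with one sample per read pair.
# Assumption: paired files are named <name>.1.fq and <name>.2.fq.
make_config() {
    local workdir="$1" dbdir="$2" readsdir="$3"
    local fq1 fq2 name

    printf 'workdir: "%s"\n' "$workdir"
    printf 'dbdir: "%s"\n' "$dbdir"
    printf 'samples:\n'

    for fq1 in "$readsdir"/*.1.fq; do
        [ -e "$fq1" ] || continue              # glob matched nothing
        name="$(basename "$fq1" .1.fq)"
        fq2="$readsdir/$name.2.fq"
        if [ ! -e "$fq2" ]; then
            # Skip files without a mate and report them on stderr
            printf 'skipping %s: no matching .2.fq file\n' "$name" >&2
            continue
        fi
        printf '  %s:\n' "$name"
        printf '     fq1: "%s"\n' "$fq1"
        printf '     fq2: "%s"\n' "$fq2"
    done
}
```

Usage, with the paths from the example above: `make_config /home/user/folder/results/ /home/user/folder/databases/ /home/user/folder/reads > yourconfig.yaml`. Sample names are taken from the file names, so they should not contain characters that need quoting in YAML.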
Check rules and output files:
metameta --configfile yourconfig.yaml -np
Run MetaMeta:
metameta --configfile yourconfig.yaml --use-conda --keep-going --cores 24
cp ~/miniconda3/opt/metameta/config/example_complete.yaml yourconfig.yaml
--cores is the total number of cores available to the pipeline. The number of threads for each tool should be set in the configuration file (yourconfig.yaml) with the parameter threads.
MetaMeta uses the pre-configured databases (archaea_bacteria_201503 by default - see below) necessary for each tool.
Available databases:
Info | Date | metameta database name |
---|---|---|
Archaea + Bacteria - RefSeq Complete Genomes | 2015-03 | archaea_bacteria_201503 |
Fungal + Viral - RefSeq Complete Genomes | 2017-09 | fungi_viral_201709 |
Database availability per tool:
database | clark | dudes | gottcha | kaiju | kraken | motus |
---|---|---|---|---|---|---|
archaea_bacteria_201503 | Yes | Yes | Yes | Yes | Yes | Yes |
fungi_viral_201709 | Yes | Yes | No | Yes | Yes | No |
cd ~/miniconda3/opt/metameta/
Pre-configured Archaea and Bacteria database:
./metameta --configfile sampledata/sample_data_archaea_bacteria.yaml --use-conda --keep-going --cores 6
Custom database (some viral reference genomes):
./metameta --configfile sampledata/sample_data_custom_viral.yaml --use-conda --keep-going --cores 6
Results:
cd sampledata/results/
Running MetaMeta on a cluster environment:
Make a copy of cluster configuration file:
cp ~/miniconda3/opt/metameta/config/cluster.json yourcluster.json
Edit the file with your cluster specifications (threads, partitions, cpu/memory, etc) for each rule.
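As an illustration of what such a file can look like, the sketch below writes a small cluster configuration in the layout Snakemake expects for `--cluster-config`: a `__default__` entry whose keys are used for every rule, plus optional per-rule entries that override it. The keys match the `{cluster.…}` placeholders of the slurm example in this section. All values, the file name `example_cluster.json`, and the rule name `kraken_run_1` are hypothetical; the actual contents of the cluster.json shipped with MetaMeta may differ.

```shell
#!/usr/bin/env bash
# Sketch only: values and the rule name "kraken_run_1" are placeholders,
# not taken from the cluster.json distributed with MetaMeta.
cat > example_cluster.json <<'EOF'
{
    "__default__": {
        "job-name": "metameta",
        "output": "slurm-%j.out",
        "partition": "main",
        "nodes": 1,
        "cpus-per-task": 1,
        "mem": "8G",
        "time": "01:00:00"
    },
    "kraken_run_1": {
        "cpus-per-task": 12,
        "mem": "100G",
        "time": "12:00:00"
    }
}
EOF
```

With a file like this, a job for the hypothetical rule `kraken_run_1` would be submitted with 12 CPUs per task, while every other rule falls back to the `__default__` values.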
Run MetaMeta (slurm example):
metameta --configfile yourconfig.yaml --keep-going --use-conda -j 999 --cluster-config yourcluster.json --cluster "sbatch --job-name {cluster.job-name} --output {cluster.output} --partition {cluster.partition} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus-per-task} --mem {cluster.mem} --time {cluster.time}"
Check the parameters of the submission command (sbatch in the example above) and adapt them to your cluster system.
By default, MetaMeta uses Archaea and Bacteria sequences as the reference database (archaea_bacteria_201503, listed under "Available databases"). Additionally, MetaMeta allows the creation of custom databases.
First, select which databases should be used in the configuration file:
databases:
- archaea_bacteria_201503
- custom_db
Second, create an entry with the path to the sequences that should be added to the custom database:
custom_db:
    clark: "sampledata/database/"
    dudes: "sampledata/database/"
    kaiju: "sampledata/database/"
    kraken: "sampledata/database/"
MetaMeta will compile the "custom_db" on the first run and use it as a database. Once this has finished, the database definition can be deleted from the configuration file for subsequent runs.
It is possible to create a custom database based on a set of genomes from NCBI:
Download the genome_updater script:
git clone https://github.com/pirovc/genome_updater
Download the desired database. Example: all fungal genomes available in RefSeq, in FASTA and GenBank formats, using 6 threads:
./genome_updater.sh -d "refseq" -g "fungi" -f "genomic.fna.gz,genomic.gbff.gz" -t 6 -o fungi_genomes/
mkdir -p custom_fungi_db/clark_dudes/ custom_fungi_db/kaiju/ custom_fungi_db/kraken/
Extract files: clark and dudes:
zcat fungi_genomes/files/*.fna.gz > custom_fungi_db/clark_dudes/fungi_genomes.fna
kaiju:
zcat fungi_genomes/files/*.gbff.gz > custom_fungi_db/kaiju/fungi_genomes.gbff
kraken (with header conversion to GI, old NCBI style):
zcat fungi_genomes/files/*.fna.gz | awk '{if(substr($0, 1, 1)==">"){sep=index($0," ");acc=substr($0,2,sep-2);header=substr($0,sep+1); cmd="wget -qO - \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id="acc"&rettype=gi\""; cmd | getline gi; close(cmd); print ">gi|" gi "|ref|" acc "| " header }else{ print $0 }}' > custom_fungi_db/kraken/fungi_genomes.fna
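The header conversion can be tried offline before running it against NCBI. The sketch below reproduces only the string handling of the kraken command (accession and description are split at the first space and rewrapped in the old `>gi|…|ref|…|` style); the GI is passed in as an argument instead of being fetched with wget. The function name `rewrite_headers` and the GI value `12345` are placeholders for this example. It tests the first character with `substr($0, 1, 1)`, since awk strings are indexed from 1.

```shell
#!/usr/bin/env bash
# Sketch: rewrite FASTA headers to the old GI style with a fixed placeholder GI.
# Usage: rewrite_headers <gi> < input.fna
rewrite_headers() {
    awk -v gi="$1" '
        substr($0, 1, 1) == ">" {
            sep = index($0, " ")             # position of the first space
            acc = substr($0, 2, sep - 2)     # accession.version, without ">"
            header = substr($0, sep + 1)     # rest of the description
            print ">gi|" gi "|ref|" acc "| " header
            next
        }
        { print }                            # sequence lines pass through unchanged
    '
}

# 12345 is a made-up GI used only to show the output layout
printf '>NC_001133.9 Saccharomyces cerevisiae S288C chromosome I\nACGT\n' |
    rewrite_headers 12345
```

The first output line is `>gi|12345|ref|NC_001133.9| Saccharomyces cerevisiae S288C chromosome I`, followed by the unchanged sequence line. Like the original command, this assumes every header contains a space after the accession.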
Add an entry to the configuration file:
databases:
- new_custom_fungi_db
Finally, add the path for each set of reference sequences to the configuration file:
new_custom_fungi_db:
    clark: "custom_fungi_db/clark_dudes/"
    dudes: "custom_fungi_db/clark_dudes/"
    kaiju: "custom_fungi_db/kaiju/"
    kraken: "custom_fungi_db/kraken/"
On the first run, MetaMeta will compile the "new_custom_fungi_db" database for each configured tool. Once this has finished, the database definition can be deleted from the configuration file for subsequent runs.
wget https://raw.githubusercontent.com/pirovc/metameta/master/envs/metameta_complete.yaml
conda env create -f metameta_complete.yaml
source activate metametaenv_complete
To merge final results from many samples into one final tabular file:
~/miniconda3/opt/metameta/scripts/merge_final_profiles.sh workdir/samples_*/metametamerge/database/final.metametamerge.profile.out
MetaMeta can run several tools with several samples against several databases. The files in the working directory and database directory are organized in the structure below:
WORKDIR:
    SAMPLE_1/
        TOOL_1/ (*)
            DB_1/
            DB_2/
            ...
        TOOL_2/ (*)
        ...
        PROFILES/
            DB_1/
                TOOL_1.profile.out
                TOOL_2.profile.out
                ...
            DB_2/
            ...
        METAMETAMERGE/
            DB_1/
                FINAL_PROFILE.out
                FINAL_PROFILE_KRONA.html
            DB_2/
            ...
        LOG/
            DB_1/
            DB_2/
            ...
        READS/ (*)
            TOOL_1.1.fq
            TOOL_1.2.fq
            TOOL_2.1.fq
            TOOL_2.2.fq
            ...
    SAMPLE_2/
    ...
    CLUSTERLOG/ (**)
DBDIR:
    DB_1/
        TOOL_1_DB/
        TOOL_2_DB/
        ...
        TOOL_1.dbprofile.out
        TOOL_2.dbprofile.out
        ...
        LOG/
    DB_2/
    ...
    TAXONOMY/
    LOG/
(*) removed when keepfiles=0
(**) only when running in cluster mode
MetaMeta integrates profiling and binning tools and ships with 6 pre-configured tools (clark, dudes, gottcha, kaiju, kraken and motus). To be added to the pipeline, a new tool must use the NCBI Taxonomy structure and nomenclature/identifiers. MetaMeta accepts the BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a .tsv file in the following format:
Example:
genus Methanospirillum 0.0029
genus Thermus 0.0029
genus 568394 0.0029
species Arthrobacter sp. FB24 0.0835
species 195 0.0582
species Mycoplasma gallisepticum 0.0536
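Before wiring a new tool into the pipeline, its output can be checked against this layout. The sketch below assumes, based on the example above and the .tsv extension, that each line holds three tab-separated fields (rank, taxon name or NCBI taxid, abundance); tabs matter because names such as "Arthrobacter sp. FB24" contain spaces. The function name `check_profile` is made up for this example and is not part of MetaMeta.

```shell
#!/usr/bin/env bash
# Sketch: report lines that do not look like "rank<TAB>name-or-taxid<TAB>abundance".
# Returns a non-zero exit status if any line fails the check.
check_profile() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $3 !~ /^[0-9]+(\.[0-9]+)?$/ {
            printf "line %d: abundance is not a plain number: %s\n", NR, $3
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Usage: `check_profile newtool.profile.out && echo "format looks fine"`. The check is deliberately loose: it does not verify that ranks or taxids are valid in the NCBI Taxonomy.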
Example:
M2|S1|R140 354 201
M2|S1|R142 195 201
M2|S1|R145 457425 201
M2|S1|R146 562 201
M2|S1|R147 1245471 201
M2|S1|R150 354 201
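A quick summary of such a per-read file can help when checking a new tool's output. The sketch below assumes, from the example above, that the fields are tab-separated and that the second field is the NCBI taxid assigned to the read named in the first field; the third field is not interpreted. The function name `reads_per_taxid` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: count how many reads were assigned to each taxid (second field),
# printing "taxid<TAB>count" sorted by decreasing count, then by taxid.
reads_per_taxid() {
    awk -F '\t' '
        { count[$2]++ }
        END { for (taxid in count) printf "%s\t%d\n", taxid, count[taxid] }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1n
}
```

For the six example lines above, taxid 354 appears twice and is listed first with a count of 2; the other four taxids follow with a count of 1 each.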
The MetaMeta pipeline uses Snakemake. To add a new tool to the pipeline, two main files must be created, as described below. Replace 'newtool' with the tool identifier (lower case, no spaces, no special chars):
tools/newtool.sm -> specifies how to execute the tool
Rules:
- newtool_run_1[..n] -> one or more rules necessary to run the tool
- newtool_rpt -> final rule that should output a file newtool.profile.out in an accepted output format (described above)
tools/newtool_db_custom.sm -> specifies how to download/compile the database/references
Rules:
- newtool_db_custom_1[..n] -> one or more rules necessary to compile the database.
- newtool_db_custom_profile -> this rule automatically generates the database profile. Its output should be a file (newtool.dbaccession.out) with the accession version identifier for all sequences used in the database.
- newtool_db_custom_check -> rule to check the required database files. Its input should be all mandatory files that must be present for the database to work properly.
v1.2.0)
v1.1.1) Bug fixes parsing output files for kraken and kaiju
v1.1) Support single and paired-end reads, multiple and custom databases, krona integration