MetaMeta: Integrating metagenome analysis tools to improve taxonomic profiling

Vitor C. Piro (vitorpiro@gmail.com)

Piro, V. C., Matschkowski, M., & Renard, B. Y. (2017). MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling. Microbiome, 5(1), 101. http://doi.org/10.1186/s40168-017-0318-y

Install:

Miniconda:

# Download conda installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Set permissions to execute
chmod +x Miniconda3-latest-Linux-x86_64.sh  

# Execute. Make sure to "yes" to add the conda to your PATH
./Miniconda3-latest-Linux-x86_64.sh         

# Add channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

MetaMeta:

conda install metameta=1.2.0

All other tools and dependencies are installed in their own environment automatically on the first run (with --use-conda parameter active).

Alternatively, install MetaMeta in a separated environment (named "metametaenv") with the command:

conda create -n metametaenv metameta=1.2.0  
source activate metametaenv # Command to activate the environment. To deactivate use "source deactivate"

Run:

Create a configuration file (yourconfig.yaml) with the required fields (workdir, dbdir and samples):

workdir: "/home/user/folder/results/"
dbdir: "/home/user/folder/databases/"
samples:
  sample_name_1:
     fq1: "/home/user/folder/reads/file.1.fq"
     fq2: "/home/user/folder/reads/file.2.fq"

All paths set on this file are relative to the workdir (if not absolute)

Check rules and output files:

metameta --configfile yourconfig.yaml -np

Run MetaMeta:

metameta --configfile yourconfig.yaml --use-conda --keep-going --cores 24

Alternatively, make a copy of the configuration file for the complete set of parameters cp ~/miniconda3/opt/metameta/config/example_complete.yaml yourconfig.yaml
The number of --cores is the total amount avaiable for the pipeline. Number of specific threads for the tools should be set on the configuration file (yourconfig.yaml) with the parameter threads
On the first run MetaMeta will download and install the configured tools as well as the database files (archaea_bacteria_201503 by default - see below) necessary for each tool.

Pre-configured databases:

Available databases:

Info	Date	metameta database name
Archaea + Bacteria - RefSeq Complete Genomes	2015-03	`archaea_bacteria_201503`
Fungal + Viral - RefSeq Complete Genomes	2017-09	`fungi_viral_201709`

Database availability per tool:

database	clark	dudes	gottcha	kaiju	kraken	motus
`archaea_bacteria_201503`	Yes	Yes	Yes	Yes	Yes	Yes
`fungi_viral_201709`	Yes	Yes	No	Yes	Yes	No

Running sample data:

cd ~/miniconda3/opt/metameta/

Pre-configured Archaea and Bacteria database:

./metameta --configfile sampledata/sample_data_archaea_bacteria.yaml --use-conda --keep-going --cores 6

Custom database (some viral reference genomes):

./metameta --configfile sampledata/sample_data_custom_viral.yaml --use-conda --keep-going --cores 6

Results:

cd sampledata/results/

Running MetaMeta on a cluster environment:

Make a copy of cluster configuration file:

cp ~/miniconda3/opt/metameta/config/cluster.json yourcluster.json

Edit the file with your cluster specifications (threads, partitions, cpu/memory, etc) for each rule.

Run MetaMeta (slurm example):

metameta --configfile yourconfig.yaml --keep-going --use-conda -j 999 --cluster-config yourcluster.json --cluster "sbatch --job-name {cluster.job-name} --output {cluster.output} --partition {cluster.partition} --nodes {cluster.nodes} --cpus-per-task {cluster.cpus-per-task} --mem {cluster.mem} --time {cluster.time}"

you can change the cluster command (sbatch) and adapt them to your cluster system.

Custom databases:

MetaMeta uses by default Archaea and Bacteria sequences as reference database (archaea_bacteria_201503 - see below). Additionaly MetaMeta allows the creation of custom database.

First select which databses should be used on the configuration file:

databases:
  - archaea_bacteria_201503
  - custom_db

all samples will run agains the "archaea_bacteria_201503" and the new "custom_db" databases

Second, create an entry with the path to the sequences that should be added to the custom database:

custom_db:
    clark: "sampledata/database/"
    dudes: "sampledata/database/"
    kaiju: "sampledata/database/"
    kraken: "sampledata/database/"

clark and dudes require one or more fasta files (extension .fna) with the accession.version identifier after the header ">" (e.g. ">NC_001998.1 Guinea pig Chlamydia phage, complete genome")
kaiju requires one or more GenBank flat file (extension .gbff)
kraken requires one or more fasta files (extension .fna) with the gi identifier on the header (e.g. ">gi|9632287|ref|NC_001998.1| Guinea pig Chlamydia phage, complete genome")

MetaMeta will compile the "custom_db" on the first run and use it as a database. After finished it is possible to delete de database definition from the configuration file for the following runs.

Creating a custom database based on NCBI genomes:

It is possible to create a custom database based on the set of genomes from NCBI

Download the genome_updater script:

git clone https://github.com/pirovc/genome_updater

Download the desired database: Example -> All fungi genomes available on refseq, fasta and GenBank formats with 6 threads:

./genome_updater.sh -d "refseq" -g "fungi" -f "genomic.fna.gz,genomic.gbff.gz" -t 6 -o fungi_genomes/
mkdir -p custom_fungi_db/clark_dudes/ custom_fungi_db/kaiju/ custom_fungi_db/kraken/

Extract files: clark and dudes:

zcat fungi_genomes/files/*.fna.gz > custom_fungi_db/clark_dudes/fungi_genomes.fna

kaiju:

zcat fungi_genomes/files/*.gbff.gz > custom_fungi_db/kaiju/fungi_genomes.gbff

kraken (with header conversion to GI, old NCBI style):

zcat fungi_genomes/files/*.fna.gz | awk '{if(substr($0, 0, 1)==">"){sep=index($0," ");acc=substr($0,2,sep-2);header=substr($0,sep+1); cmd="wget -qO - \"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id="acc"&rettype=gi\""; cmd | getline gi; close(cmd); print ">gi|" gi "|ref|" acc "| " header }else{ print $0 }}' > custom_fungi_db/kraken/fungi_genomes.fna

Add entry on the configuration file:

databases:
  - new_custom_fungi_db

Finally, add the path for each set of reference sequences on the configuration file:

new_custom_fungi_db:
    clark: "custom_fungi_db/clark_dudes/"
    dudes: "custom_fungi_db/clark_dudes/"
    kaiju: "custom_fungi_db/kaiju/"
    kraken: "custom_fungi_db/kraken/"

On the first run MetaMeta will compile the "new_custom_fungi_db" database for each configured tool. After finished it is possible to delete de database definition from the configuration file for the following runs.

Pre-install a complete environment:

wget https://raw.githubusercontent.com/pirovc/metameta/master/envs/metameta_complete.yaml
conda env create -f metameta_complete.yaml
source activate metametaenv_complete

Merging final results:

To merge final results from many samples into one final tabular file:

~/miniconda3/opt/metameta/scripts/merge_final_profiles.sh workdir/samples_*/metametamerge/database/final.metametamerge.profile.out

Folder structure:

MetaMeta can run several tools with several samples against several databases. The files on the working directory and database directory are organized in the structure below:

WORKDIR:
    SAMPLE_1/
        TOOL_1/ (*)
            DB_1/
            DB_2/
            ...
        TOOL_2/ (*)
            ...
        PROFILES/
            DB_1/
                TOOL_1.profile.out
                TOOL_2.profile.out
                ...
            DB_2/
                ...
        METAMETAMERGE/
            DB_1/
                FINAL_PROFILE.out
                FINAL_PROFILE_KRONA.html
            DB_2/
                ...
        LOG/
            DB_1/
            DB_2/
            ...
        READS/ (*)
            TOOL_1.1.fq
            TOOL_1.2.fq
            TOOL_2.1.fq
            TOOL_2.2.fq
            ...
    SAMPLE_2/
        ...
    CLUSTERLOG/ (**)

DBDIR:
    DB_1/
        TOOL_1_DB/
        TOOL_2_DB/
        ...
        TOOL_1.dbprofile.out
        TOOL_2.dbprofile.out
        ...
        LOG/
    DB_2/
        ...
    TAXONOMY/
        LOG/

(*) removed when keepfiles=0 (**) only when running on cluster mode

Adding a new tool:

MetaMeta integrates profiling and binning tools and it has 6 pre-configured tools (clark, dudes, gottcha, kaiju, kraken and motus). New tools are required to use the NCBI Taxonomy structure and nomenclature/identifiers to be added to the pipeline. MetaMeta accepts BioBoxes format directly (https://github.com/bioboxes/rfc/tree/master/data-format) or a .tsv file in the following format:

Profiling: rank, taxon name or taxid, abundance

Example:

genus   Methanospirillum        0.0029
genus   Thermus 0.0029
genus   568394      0.0029
species Arthrobacter sp. FB24   0.0835
species 195      0.0582
species Mycoplasma gallisepticum        0.0536

Binning: readid, taxon name or taxid, lenght of sequence assigned

Example:

M2|S1|R140      354     201
M2|S1|R142      195     201
M2|S1|R145      457425  201
M2|S1|R146      562     201
M2|S1|R147      1245471 201
M2|S1|R150      354     201

MetaMeta pipeline uses Snakemake. To add a new tool to the pipeline it is necessary to create two main files described below. Replace 'newtool' with the tool identifier (lower case, no spaces, no special chars):

tools/newtool.sm -> specifies how to execute the tool
    Rules:
    - newtool_run_1[..n] -> one or more rules necessary to run the tool
    - newtool_rpt -> final rule that should output a file newtool.profile.out in an accepted output format (described above)

tools/newtool_db_custom.sm -> specifies how to download/compile the database/references
    Rules:
    - newtool_db_custom_1[..n] -> one or more rules necessary to compile the database.
    - newtool_db_custom_profile -> this rule generates automatically the database profile. It should have as an output a file (newtool.dbaccession.out) with the accession version identifier for all sequences used in the database.
    - newtool_db_custom_check -> rule to check the required database files. It should have as an input all mandatory files that should be present to the database work properly.

Template files can be found inside the folder tools/template. Once the two files are inside the tools folder, it is necessary to add the tool identifier to the YAML configuration file.

Changelog:

v1.2.0)

Updated to Snakemake 4.3.0 (from 3.9.1)
Bug fixes on custom database creation and database profile generation
Centralized taxonomy download (once for all tools, kept on dbdir:taxonomy/)
Updated tools: kaiju 1.0 -> 1.4.5, dudes 0.07 -> 0.08, spades 3.9.0 -> 3.11.1
Addition of new pre-configured databases: fungal_viral_201709
Multiple pre-configured databases support
Several fixes on custom database creation

v1.1.1) Bug fixes parsing output files for kraken and kaiju

v1.1) Support single and paired-end reads, multiple and custom databases, krona integration

pirovc / metameta

readme