mdsufz / MuDoGeR

MuDoGeR makes the recovery of genomes from prokaryotes, viruses, and eukaryotes from metagenomes easy.
GNU General Public License v3.0

Multi-Domain Genome Recovery v1.0.1 (MuDoGeR v1.0.1)

ScreenShot

The Multi-Domain Genome Recovery v1.0 (MuDoGeR v1.0) framework (Figure 1) is a tool developed to help users to recover Metagenome-Assembled Genomes (MAGs as defined by Parks et al. (2018)) and Uncultivated Viral Genomes (UViGs as defined by Roux (2019)) from whole-genome sequence (WGS) samples simultaneously. The MuDoGeR v1.0 framework acts as a wrapper for several tools. It was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files.

You should be able to run 1 simple command for each module. Therefore, you only need 5 commands to completely run MuDoGeR. After a successful run of MuDoGeR, you should have the outputs summarized in Figure 2. Please find a comprehensive description of the main outputs in the understand outputs file.

ScreenShot

Using the tools individually

MuDoGeR also sets up an individual working conda environment for each of the integrated tools. Consequently, if you want to customize the use of any tool, you can use MuDoGeR to configure your machine and follow the instructions here to activate the relevant environments.

Reading this GitHub

This GitHub repository should help you install and run the complete MuDoGeR pipeline, as well as understand all its outputs. Consequently, we suggest the following reading strategy:

MuDoGeR Overview

MuDoGeR was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files. You should be able to run 1 simple command for each of the following modules.

MuDoGeR starts with Module 1: Pre-Processing, which covers Raw Read Quality Control, Resources calculation, and Assembly. The assembled libraries are used in all the other modules.

After the data pre-processing, MuDoGeR is divided into 3 different branches:

- Module 2: Recovery of Prokaryotic Metagenome-Assembled Genomes (pMAGs)
- Module 3: Recovery of Uncultivated Viral Genomes (UViGs)
- Module 4: Recovery of Eukaryotic Metagenome-Assembled Bins (eMABs)

Furthermore, in Module 5: Relative Abundance, users can automatically calculate the coverage and relative abundance tables from the recovered pMAGs/UViGs/eMABs. The users can also calculate the coverage and relative abundance tables from the prokaryotic genes annotated in assembled libraries.

System requirements

MuDoGeR makes it easy to install and run a group of complex software and dependencies. More than 20 bioinformatics tools and several complex software dependencies are present in MuDoGeR. Hopefully, you won't have to worry too much about it. We designed an installation script that should take care of every dependency for you. You can find the MuDoGeR installation tutorial here.

Keep in mind that the MuDoGeR pipeline requires substantial computing power, and you probably won't be able to run it on a laptop. The complete software installation requires approximately 170 GB, but MAKER2, from Module 4, uses 99 GB of that space since it requires the database to be installed in a specific manner. See Module 4 setup. The complete database requirements, considering all tools, are around 439.9 GB. However, you don't need to install all of MuDoGeR's Modules to use it.

MuDoGeR is designed to support only Linux x64 systems. As for the resource requirements, the MuDoGeR framework uses software that requires a large amount of RAM (e.g. GTDB-Tk, MetaWRAP). Specific resource requirements vary depending on your data and its sequencing depth. We recommend that the user provide at least 180 GB of RAM. For the assembly process, MuDoGeR attempts to calculate the amount of memory necessary for metaSPAdes (in step 1.b). The user should be aware that samples with higher expected diversity require more memory.
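Before installing, it may help to confirm that the machine matches these requirements. The following is a minimal sketch, not part of MuDoGeR: the function names `kb_to_gb` and `ram_status` are hypothetical, the 180 GB threshold is the recommendation stated above, and the RAM lookup assumes a Linux system that exposes /proc/meminfo.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of MuDoGeR itself.

# Convert a kB value (as reported in /proc/meminfo) to whole GB (rounded down).
kb_to_gb() {
    echo $(( $1 / 1024 / 1024 ))
}

# Print "ok" if the given RAM in GB meets the threshold, "low" otherwise.
# The threshold defaults to the 180 GB recommended in this README.
ram_status() {
    local ram_gb=$1 min_gb=${2:-180}
    if [ "$ram_gb" -ge "$min_gb" ]; then
        echo "ok"
    else
        echo "low"
    fi
}

# Report the platform and the total RAM of the current machine.
echo "Platform: $(uname -s) $(uname -m)"
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    total_gb=$(kb_to_gb "$total_kb")
    echo "Total RAM: ${total_gb} GB ($(ram_status "$total_gb"))"
fi
```

On an HPC cluster, run the check on a compute node rather than the login node, since the two often differ in available memory.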

Consequently, we suggest you install and run MuDoGeR on an available high-performance computer or in cloud services such as AWS, Google Cloud, or, for researchers in Germany, de.NBI.

The software dependencies used during the pipeline are described here: Dependencies description.

Installation

Installation using Singularity (now called Apptainer) - Recommended

0 - Install Singularity

Most HPC administrators already make Singularity available to their users. Check whether that is your case and, if so, skip this step. Otherwise, please follow the instructions in the official Singularity installation guide.

1 - Download MuDoGeR ready-to-use container

Once you have Singularity installed, you can simply download the MuDoGeR container. Remember that the container's usage is slightly different. Please refer to the Singularity container usage.

Click HERE to be redirected to the download page.

Next, you can click on Direct Download, or right-click it and select "copy link". With the copied link, you can use wget to download the container directly on your platform.

2 - Download MuDoGeR Singularity usage scripts

To download the mudoger_singularity.sh script, which runs the MuDoGeR singularity container, run:

wget https://github.com/mdsufz/MuDoGeR/raw/master/installation/dependencies/mudoger_singularity.sh

If you plan to use the script in a SLURM-based HPC, perhaps you will find the mudoger_singularity_slurm.sh script useful. This script simply wraps the mudoger command as a SLURM job and submits it for you. You can download the script by running:

wget https://github.com/mdsufz/MuDoGeR/raw/master/installation/dependencies/mudoger_singularity_slurm.sh

3 - Database installation

The databases required by MuDoGeR vary depending on which modules you plan to use. The databases can require significant storage and are not included in the MuDoGeR container. It is recommended that the user follow the instructions from each tool developer to install and update the desired database. The only requirement is that all the databases share the same base folder and that each is installed in a subfolder named after its tool, as follows: buscodbs/ checkm/ checkv/ eukccdb/ gtdbtk/ vibrant/ wish/. Therefore, your database installation folder should look like this:

mudoger_dbs/
├── buscodbs
├── checkm
├── checkv
├── eukccdb
├── gtdbtk
├── vibrant
└── wish

Additionally, the user can find useful guidance by reading the automated database configuration script here. An additional tutorial on how to install the databases will be available shortly.
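The folder layout above can be prepared and checked before installing the databases themselves. This is only a sketch: the helper names `create_db_layout` and `missing_dbs` are hypothetical, and the functions create or inspect empty folders only. They do not download any database.

```shell
# Tool folder names taken from the layout shown above.
DB_TOOLS="buscodbs checkm checkv eukccdb gtdbtk vibrant wish"

# Create one empty subfolder per tool under the given base folder.
create_db_layout() {
    local base=$1 tool
    for tool in $DB_TOOLS; do
        mkdir -p "$base/$tool"
    done
}

# Print the name of every expected subfolder that is missing.
# Prints nothing when the layout is complete.
missing_dbs() {
    local base=$1 tool
    for tool in $DB_TOOLS; do
        [ -d "$base/$tool" ] || echo "$tool"
    done
}

# Example usage (the path is a placeholder):
#   create_db_layout /path/to/mudoger_dbs
#   missing_dbs /path/to/mudoger_dbs
```

If you only plan to use some modules, `missing_dbs` will still list the folders of the tools you skipped, so read its output as a reminder rather than an error.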

4 - Configure Genemark License if you will use Module 4

  1. ACCESS GENEMARK WEBPAGE

    http://opal.biology.gatech.edu/GeneMark/license_download.cgi

  2. SELECT OPTIONS

    GeneMark-ES/ET/EP ver *_lic and LINUX 64

  3. FILL IN THE CREDENTIALS WITH YOUR NAME, E-MAIL, INSTITUTION, ETC...

  4. CLICK ON 'I agree the terms of this license agreement'

  5. DOWNLOAD THE 64_bit key files provided

    It should look something like the following:

    ```console
    $ wget http://topaz.gatech.edu/GeneMark/tmp/GMtool_HZzc0/gm_key_64.gz
    ```

    You should have the following file: gm_key_64.gz

  6. DECOMPRESS THE KEY FILE

    $ gunzip gm_key_64.gz

  7. COPY AND RENAME KEY FILE TO A FOLDER

The folder to which you copy the renamed key file will be used as your home directory during the execution of Module 4 in singularity. Please see here to run Module 4 using the singularity container.

$ cp gm_key_64 /path/to/folder/.gm_key
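The decompress and copy steps can also be combined into one helper. The sketch below is an assumption-laden convenience, not part of MuDoGeR: `install_gm_key` is a hypothetical name, and it uses `gunzip -c` to write the key straight to its destination instead of running gunzip and cp separately, which also leaves the downloaded archive in place.

```shell
# Decompress a GeneMark key archive and store it as .gm_key in a folder.
# Arguments: <path to gm_key_64.gz> <destination folder>
install_gm_key() {
    local key_gz=$1 dest=$2
    mkdir -p "$dest"
    # gunzip -c writes the decompressed key to stdout and keeps the archive
    gunzip -c "$key_gz" > "$dest/.gm_key"
}

# Example usage (both paths are placeholders):
#   install_gm_key gm_key_64.gz /path/to/folder
```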

Installation using conda environments

Installing MuDoGeR via conda can help the user to utilize only part of the workflow. However, it is recommended for those with a deeper understanding of how conda environments work, as manual adjustments may need to be made. If that is your case, please follow the instructions here: MuDoGeR conda installation.

Modules Overview

Module 1: Pre-Processing

Screenshot

The steps of Module 1 are shown in Figure 3. A detailed description of its execution and outputs is found here: Pre-Processing description.

When you use MuDoGeR Module 1, it will perform the following tasks:

Module 2: Recovery of Prokaryotic Metagenome-Assembled Genomes (pMAGs)

screenshot

The Module 2 workflow is shown in Figure 4. A detailed description of its execution and outputs is found here: Pipeline for recovery of Prokaryotic Metagenome-Assembled Genomes.

When you use MuDoGeR Module 2, it will perform the following tasks:

Module 3: Recovery of Uncultivated Viral Genomes (UViGs)

screenshot

The steps of Module 3 are shown in Figure 5. A detailed description of its execution and outputs is found here: Pipelines for viral genomes recovery.

When you use MuDoGeR Module 3, it will perform the following tasks:

Module 4: Recovery of Eukaryotic Metagenome-Assembled Bins (eMABs)

screenshot

The steps of Module 4 are shown in Figure 6. A detailed description of its execution and outputs is found here: Pipelines for eukaryotic bins recovery.

When you use MuDoGeR Module 4, it will perform the following tasks:

Module 5: Relative Abundance

screenshot

The steps of Module 5 are shown in Figure 7. A detailed description of its execution and outputs is found here: Pipelines for abundance calculation. Essentially, Module 5 maps the quality-controlled reads of your sample onto the recovered pMAGs/UViGs/eMABs or annotated prokaryotic genes. We designed three possible mapping types to calculate abundance: reduced, complete, or genes. A detailed description of their differences can be found here.

The steps of Module 5 can be summarized as follows. If you select reduced or complete, the pipeline will run steps 5.a and 5.b. If you select genes, the pipeline will run 5.c:

MuDoGeR simplified usage - with Singularity installation

Currently, MuDoGeR v1.0 only works with paired-end Illumina sequences. Future updates will add tools to work with long-read sequencing samples. MuDoGeR was designed to work module by module, starting from pre-processing (Module 1). Additional modularity will be added in future updates to allow the user to run specific parts of the pipeline. However, you can always use the tools independently through the conda environments created by MuDoGeR. You can follow the instructions here.

MuDoGeR is an easy-to-use wrapper of several tools organized within modules. The individual modules can be called independently.

The pipeline requires, as input, a metadata table in tsv format containing the samples to be processed and the path to its raw sequence reads. The metadata file should have the sample name and the path to the forward reads file from the sample in one line, followed by the same sample name and the path to the reverse reads from the sample.

One additional point of attention: the input data is mounted in the MuDoGeR singularity container at /tools/data_input/. Do not change the /tools/data_input prefix in the paths, as this is the folder used inside the container. Therefore, if your sample sequence files are in /path/to/input/sampleID/sampleID_1.fastq and /path/to/input/sampleID/sampleID_2.fastq, your metadata.tsv file should look like:

#Show the content of the metadata.tsv file
$ cat metadata.tsv

sampleID   /tools/data_input/sampleID/sampleID_1.fastq
sampleID   /tools/data_input/sampleID/sampleID_2.fastq

Please note that the forward sequencing reads file must end in "_1.fastq" and the reverse in "_2.fastq"!
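For many samples, the metadata file can be generated instead of typed by hand. The sketch below rests on assumptions: `make_metadata` is a hypothetical helper, it expects the layout /path/to/input/sampleID/sampleID_1.fastq and sampleID_2.fastq used in the example above, and it separates the two columns with a tab because the file is described as tsv.

```shell
# Print metadata lines for every sample folder under <input_dir>.
# Assumes <input_dir>/<sampleID>/<sampleID>_1.fastq and ..._2.fastq.
make_metadata() {
    local input_dir=$1 sample_dir sample
    for sample_dir in "$input_dir"/*/; do
        [ -d "$sample_dir" ] || continue
        sample=$(basename "$sample_dir")
        # Skip folders that lack either the forward or the reverse reads
        [ -f "$sample_dir/${sample}_1.fastq" ] || continue
        [ -f "$sample_dir/${sample}_2.fastq" ] || continue
        # Paths use the container mount point, not the host path
        printf '%s\t%s\n' "$sample" "/tools/data_input/$sample/${sample}_1.fastq"
        printf '%s\t%s\n' "$sample" "/tools/data_input/$sample/${sample}_2.fastq"
    done
}

# Example usage (the path is a placeholder):
#   make_metadata /path/to/input > /path/to/input/metadata.tsv
```

Samples missing either read file are skipped silently, so compare the number of lines in the result with twice the number of samples you expect.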

MuDoGeR is designed to run all multi-domain genome recovery pipelines entirely. In order for MuDoGeR to work automatically, from start to finish, we use a specific folder structure. Please read the Manual_MuDoGeR if you would like to manipulate MuDoGeR.

MuDoGeR Singularity Execution

When using the MuDoGeR singularity container, you have all the complex dependencies and software environments from MuDoGeR already configured. To use the recommended singularity installation, please keep your metadata.tsv file in the same folder as your sample reads.

We have developed a MuDoGeR singularity usage script. You should have it available if you followed the installation guide.

Once you have the mudoger_singularity.sh script available you can see the help information by typing:

$ /path/to/mudoger_singularity.sh -h

All options are required.
Usage: ./mudoger_singularity.sh <module_name> -s <singularity_file_path> -o <output_path> -i <input_data_path> -d <databases_path> -c <home_path> -m <memory> -t <threads> -f <metadata_file> [abundance_tables options]
  <module_name>         Module name (e.g., preprocess, prokaryotes, viruses, abundance_tables, eukaryotes)
Options:
  -s  Path to Singularity file (.sif file)
  -o  Path to output folder
  -i  Path to input data (Note: the metadata file should be located in this folder)
  -d  Path to databases folder
  -c  Path for Singularity home directory (required for eukaryotes module)
  -m  Memory size (for preprocess module)
  -t  Number of threads
  -f  Name of the metadata file (including .tsv extension)
Abundance Tables Options (only for abundance_tables module):
  --reduced (default), --complete, or --genes (exclusive options)
  --absolute-values (default), --coverage, and/or --relative-abundance (can be combined)

Therefore, if your metadata.tsv and your sample reads are in the test_data folder, your output folder is test_out, and your databases are in the mudoger_dbs folder, your usage commands should be:


/path/to/mudoger_singularity.sh preprocess -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -m 100 -t 25 -f metadata.tsv

/path/to/mudoger_singularity.sh prokaryotes -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -t 25 -f metadata.tsv

/path/to/mudoger_singularity.sh viruses -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -t 25 -f metadata.tsv

/path/to/mudoger_singularity.sh abundance_tables -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -t 25 -f metadata.tsv --reduced --coverage --relative-abundance
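The four commands above can be chained so that a failure in one module stops the run. The following is a sketch under stated assumptions: `module_command`, `run_modules`, and the `DRY_RUN` variable are hypothetical names, the paths are the same placeholders as in the examples, and commands are only printed unless `DRY_RUN` is set to something other than 1.

```shell
# Placeholder paths and values, matching the example commands above.
MUDOGER=/path/to/mudoger_singularity.sh
PATHS="-s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs"
TAIL="-t 25 -f metadata.tsv"

# Print the full command line for one module.
module_command() {
    case $1 in
        preprocess)
            echo "$MUDOGER $1 $PATHS -m 100 $TAIL" ;;
        prokaryotes|viruses)
            echo "$MUDOGER $1 $PATHS $TAIL" ;;
        abundance_tables)
            echo "$MUDOGER $1 $PATHS $TAIL --reduced --coverage --relative-abundance" ;;
        *)
            echo "unknown module: $1" >&2
            return 1 ;;
    esac
}

# Print each command in order; execute it only when DRY_RUN is not 1.
# Stops at the first module that fails.
run_modules() {
    local module cmd
    for module in preprocess prokaryotes viruses abundance_tables; do
        cmd=$(module_command "$module") || return 1
        echo "$cmd"
        if [ "${DRY_RUN:-1}" != "1" ]; then
            # Unquoted on purpose so the string splits into arguments;
            # this breaks on paths that contain spaces.
            $cmd || return 1
        fi
    done
}

# Example usage:
#   run_modules              # dry run, prints the four commands
#   DRY_RUN=0 run_modules    # actually runs them
```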

Module 4 (Eukaryotes recovery) has one particularity. GeneMark requires each user to agree to a license and place the license key in their home folder. You can obtain this license by following the instructions here. Once the license is configured, you have to specify its location with the -c parameter when using the MuDoGeR singularity container. For instance, if you saved the GeneMark license key in /path/to/tmp_home, your Module 4 singularity command will be:


/path/to/mudoger_singularity.sh eukaryotes -s /path/to/mudogerV1.sif -o /path/to/test_out/ -i /path/to/test_data/ -d /path/to/mudoger_dbs -c /path/to/tmp_home/ -t 25 -f metadata.tsv

The resulting two-level folder structure after a successful run of all MuDoGeR modules is as follows:

.
├── sample_1
│   ├── assembly
│   ├── eukaryotes
│   ├── khmer
│   ├── prokaryotes
│   ├── qc
│   └── viruses
├── sample_2
│   ├── assembly
│   ├── eukaryotes
│   ├── khmer
│   ├── prokaryotes
│   ├── qc
│   └── viruses
└── mapping_results
    ├── assembly_gene_map
    ├── euk_mabs_mapping
    ├── gOTUpick_results
    ├── merged_reads
    ├── pmags_otu_mapping
    └── uvigs_mapping
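A quick way to compare a finished run against this structure is to list what is missing. This is a sketch only: `missing_outputs` is a hypothetical helper, it checks only the six per-sample folders shown above, and it treats every other top-level folder as a sample.

```shell
# Per-sample folders taken from the structure shown above.
EXPECTED_DIRS="assembly eukaryotes khmer prokaryotes qc viruses"

# Print "<sample>/<folder>" for each expected folder missing under <output_dir>.
missing_outputs() {
    local out_dir=$1 sample_dir sample sub
    for sample_dir in "$out_dir"/*/; do
        [ -d "$sample_dir" ] || continue
        sample=$(basename "$sample_dir")
        # mapping_results is shared across samples, so it is not checked here
        [ "$sample" = "mapping_results" ] && continue
        for sub in $EXPECTED_DIRS; do
            [ -d "$sample_dir/$sub" ] || echo "$sample/$sub"
        done
    done
}

# Example usage (the path is a placeholder):
#   missing_outputs /path/to/test_out
```

If you did not run every module (for example, you skipped Module 4), the corresponding folders will be reported as missing even though the run succeeded.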

If you want to use the conda environment installation, a more detailed tutorial for MuDoGeR can be found in Manual_MuDoGeR.

MuDoGeR as Wrapper and its Critical Use

MuDoGeR is a wrapper designed to streamline the genome assembly process from metagenome samples across multiple domains. While MuDoGeR accelerates metagenomics analysis, it's crucial to understand the inherent limitations of any metagenomic approach. The tools and parameters integrated into MuDoGeR are based on benchmark studies, but users should understand their dataset and tools' limitations to adapt the workflow.

Citing

Rocha, U., Coelho Kasmanas, J., Kallies, R., Saraiva, J. P., Toscan, R. B., Štefanič, P., Bicalho, M. F., Borim Correa, F., Baştürk, M. N., Fousekis, E., Viana Barbosa, L. M., Plewka, J., Probst, A. J., Baldrian, P., Stadler, P. F., & (2023). MuDoGeR: Multi-Domain Genome recovery from metagenomes made easy. Molecular Ecology Resources, 00, 1–12. https://doi.org/10.1111/1755-0998.13904

Acknowledgements