Multi-Domain Genome Recovery v1.0.1 (MuDoGeR v1.0.1)

[Figure 1: Overview of the MuDoGeR v1.0 framework]

The Multi-Domain Genome Recovery v1.0 (MuDoGeR v1.0) framework (Figure 1) helps users recover Metagenome-Assembled Genomes (MAGs, as defined by Parks et al. (2018)) and Uncultivated Viral Genomes (UViGs, as defined by Roux (2019)) simultaneously from whole-genome sequence (WGS) samples. The MuDoGeR v1.0 framework acts as a wrapper for several tools. It was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files.

You should be able to run 1 simple command for each module. Therefore, you only need 5 commands to completely run MuDoGeR. After a successful run of MuDoGeR, you should have the outputs summarized in Figure 2. Please find a comprehensive description of the main outputs in the understand outputs file.

[Figure 2: Summary of the MuDoGeR outputs]

Using the tools individually

In addition, MuDoGeR sets up an individual working conda environment for each of the integrated tools. Consequently, if you want to customize the use of any tool, you can use MuDoGeR to configure your machine and follow the instructions here to activate the relevant environments.

Reading this GitHub

This GitHub repository should help you install and run the complete MuDoGeR pipeline, as well as understand all its outputs. Consequently, we suggest the following reading strategy:

First, read the MuDoGeR overview and define which modules you are interested in using.
Secondly, read the System requirements and make sure you have the resources for the modules you want to use.
Then, read the Installation and follow its steps.
Read the overview descriptions of the MuDoGeR modules you intend to use.
If you want a quick run, read the MuDoGeR simplified usage.
Read the understanding main outputs file to understand and find relevant outputs created by MuDoGeR.
Read the MuDoGeR Manual for a more detailed description of the used modules and their output files.

MuDoGeR Overview

MuDoGeR was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files. You should be able to run 1 simple command for each of the following modules.

The MuDoGeR starts with Module 1: Pre-Processing, which covers the Raw Read Quality Control, Resources calculation, and Assembly. The assembled libraries should be used in all the other modules.

After the data pre-processing, MuDoGeR is divided into 3 different branches:
- Module 2: Recovery of Prokaryotic Metagenome-Assembled Genomes (pMAGs)
- Module 3: Recovery of Uncultivated Viral Genomes (UViGs)
- Module 4: Recovery of Eukaryotic Metagenome-Assembled Bins (eMABs)

Furthermore, in Module 5: Relative Abundance, users can automatically calculate the coverage and relative abundance tables from the recovered pMAGs/UViGs/eMABs. The users can also calculate the coverage and relative abundance tables from the prokaryotic genes annotated in assembled libraries.

Please find a comprehensive description of the main outputs in the understand outputs file.
Instructions for using the MuDoGeR can be found in the following hyperlink: Manual MuDoGeR.
Information about the system requirements of the MuDoGeR can be found in the following hyperlink: System requirements.
Detailed instructions for the installation of the MuDoGeR tools can be found in the following hyperlink: Installation.
The simplified usage of the MuDoGeR can be found in the following hyperlink: MuDoGeR simplified usage.
To use the individual working conda environments created by MuDoGeR for each of the used tools, go here.

System requirements

MuDoGeR makes it easy to install and run a group of complex software and dependencies. More than 20 bioinformatics tools and several complex software dependencies are present in MuDoGeR. Hopefully, you won't have to worry too much about it. We designed an installation script that should take care of every dependency for you. You can find the MuDoGeR installation tutorial here.

Keep in mind that the MuDoGeR pipeline requires considerable computing power, and you probably won't be able to run it on a laptop. The complete software installation requires approximately 170 GB, of which MAKER2, from Module 4, uses 99 GB because it requires its database to be installed in a specific manner. See Module 4 setup. The complete database requirement, considering all tools, is around 439.9 GB. However, you don't need to install all of MuDoGeR's Modules to use it.

MuDoGeR is designed to support only Linux x64 systems. As for the resource requirements, the MuDoGeR framework uses software that requires a large amount of RAM (e.g. GTDB-Tk, MetaWRAP). Specific resource requirements vary depending on your data and its sequencing depth, so we recommend that the user provide at least 180 GB of RAM. For the assembly process, MuDoGeR attempts to calculate the amount of memory necessary for metaSPAdes (step 1.b). The user should be aware that samples with higher expected diversity require more memory.
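As a rough pre-flight check, the sketch below compares a machine's total RAM with the 180 GB recommendation above. It assumes a Linux system that exposes /proc/meminfo. The function names mem_total_gb and has_recommended_ram are illustrative and are not part of MuDoGeR.

```shell
#!/bin/sh
# Print total RAM in whole GB (MemTotal is reported in kB), read from a
# meminfo-style file. The optional argument only exists so the function
# can be tried on a sample file.
mem_total_gb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# Return 0 if the machine meets the recommended minimum (default 180 GB).
has_recommended_ram() {
    required_gb="${1:-180}"
    meminfo="${2:-/proc/meminfo}"
    total_gb=$(mem_total_gb "$meminfo")
    [ "${total_gb:-0}" -ge "$required_gb" ]
}

if has_recommended_ram 180; then
    echo "RAM check passed: $(mem_total_gb) GB available"
else
    echo "Warning: $(mem_total_gb) GB RAM is below the recommended 180 GB"
fi
```

On a shared HPC, the memory that matters is usually the amount granted to your job rather than the node total, so treat this only as a first sanity check.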

Consequently, we suggest you install and run MuDoGeR on an available high-performance computer or in cloud services such as AWS, Google Cloud, or, for researchers in Germany, the de.NBI.

The software dependencies used during the pipeline are described here: Dependencies description.

Installation

Installation using Singularity (now called Apptainer) - Recommended

0 - Install Singularity

Most HPC administrators already make Singularity available to their users. Check whether that is your case and, if so, skip this step. Otherwise, please follow the instructions in the official Singularity installation guide.

1 - Download MuDoGeR ready-to-use container

Once you have Singularity installed, you can simply download the MuDoGeR container. Remember that the container's usage is slightly different. Please refer to the Singularity container usage.

Click HERE to be redirected to the download page.

Next, you can click on Direct Download, or right-click it and select "copy link". With the copied link, you can use wget on your platform to download the container.

2 - Download MuDoGeR Singularity usage scripts

To download the mudoger_singularity.sh script, which runs the MuDoGeR Singularity container, run:

wget https://github.com/mdsufz/MuDoGeR/raw/master/installation/dependencies/mudoger_singularity.sh

If you plan to use the script on a SLURM-based HPC, you may find the mudoger_singularity_slurm.sh script useful. This script simply wraps the mudoger command as a SLURM job and submits it for you. You can download the script by running:

wget https://github.com/mdsufz/MuDoGeR/raw/master/installation/dependencies/mudoger_singularity_slurm.sh

3 - Database installation

The MuDoGeR required databases can vary depending on which module you plan to use. Naturally, the databases can require significant storage and are not included in the MuDoGeR container. It is recommended that the user follow the instructions from the tool developer to install and update the desired database. The only requirement is that all the databases use the same base folder and are installed using the name of the tool as follows: buscodbs/ checkm/ checkv/ eukccdb/ gtdbtk/ vibrant/ wish/. Therefore, your database installation folder should look like this:

mudoger_dbs/
├── buscodbs
├── checkm
├── checkv
├── eukccdb
├── gtdbtk
├── vibrant
└── wish

In addition, the user can find useful guidance by reading the automated database configuration script here. An additional tutorial on how to install the databases will be available shortly.
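The expected layout above can be verified before a run with a short check such as the following. This is a minimal sketch: the function name check_mudoger_dbs and the MUDOGER_DBS variable are hypothetical, and the check only confirms that the folders exist, not that the databases inside them are complete.

```shell
#!/bin/sh
# Folder names taken from the layout above.
MUDOGER_DB_DIRS="buscodbs checkm checkv eukccdb gtdbtk vibrant wish"

# Print each expected subfolder that is missing under the given base folder;
# return 0 only when none is missing.
check_mudoger_dbs() {
    base="$1"
    missing=0
    for name in $MUDOGER_DB_DIRS; do
        if [ ! -d "$base/$name" ]; then
            echo "missing: $base/$name"
            missing=1
        fi
    done
    return "$missing"
}

# Example: report on a mudoger_dbs folder (placeholder path).
check_mudoger_dbs "${MUDOGER_DBS:-mudoger_dbs}" || echo "Database folder is incomplete."
```

If you only plan to use some modules, a missing folder for a tool you do not need may be acceptable.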

4 - Configure Genemark License if you will use Module 4

ACCESS GENEMARK WEBPAGE

http://opal.biology.gatech.edu/GeneMark/license_download.cgi
SELECT OPTIONS

GeneMark-ES/ET/EP ver *_lic and LINUX 64
FILL IN THE CREDENTIALS WITH YOUR NAME, E-MAIL, INSTITUTION, ETC...
CLICK ON 'I agree the terms of this license agreement'
DOWNLOAD THE 64_bit key files provided

It should look something like the following:
```console
$ wget http://topaz.gatech.edu/GeneMark/tmp/GMtool_HZzc0/gm_key_64.gz
```
You should have the following file: gm_key_64.gz
DECOMPRESS THE KEY FILE

$ gunzip gm_key_64.gz

COPY AND RENAME KEY FILE TO A FOLDER

The folder to which you copy the renamed key file will be used as your home directory during the execution of Module 4 in Singularity. Please see here to run Module 4 using the Singularity container.

$ cp gm_key_64 /path/to/folder/.gm_key
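The decompress-and-copy steps above can also be wrapped in one small helper. This is a sketch under the assumption that you already downloaded gm_key_64.gz; the function name install_gm_key is hypothetical and the paths are placeholders.

```shell
#!/bin/sh
# Decompress a downloaded GeneMark key and store it as .gm_key in the folder
# that will later be used as the home directory for Module 4.
install_gm_key() {
    key_gz="$1"      # e.g. gm_key_64.gz
    home_dir="$2"    # e.g. /path/to/folder
    [ -f "$key_gz" ] || { echo "key file not found: $key_gz" >&2; return 1; }
    mkdir -p "$home_dir" || return 1
    # gunzip -c writes to stdout, leaving the downloaded archive untouched.
    gunzip -c "$key_gz" > "$home_dir/.gm_key"
}

# Usage (paths are placeholders):
#   install_gm_key gm_key_64.gz /path/to/folder
```

Unlike the plain gunzip call, this variant keeps the original .gz file, which can be convenient if the key has to be installed in more than one folder.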

Installation using conda environments

Installing MuDoGeR via conda can help the user to utilize only part of the workflow. However, it is recommended for those with a deeper understanding of how conda environments work, as manual adjustments may need to be made. If that is your case, please follow the instructions here: MuDoGeR conda installation.

Modules Overview

Module 1: Pre-Processing

[Figure 3: Steps of Module 1]

The steps of Module 1 are shown in Figure 3. A detailed description of its execution and outputs is found here: Pre-Processing description.

When you use MuDoGeR Module 1, it will perform the following tasks:

1.a: Raw Read Quality Control.
1.b: Calculation of memory requirements for the assembly process.
- (1.b.1) The k-mer (33-mer and 55-mer) of the quality-controlled reads produced in 1.a is calculated.
- (1.b.2) The calculated k-mer is used in a trained machine learning model to estimate the amount of memory that metaSPAdes uses to assemble the reads.
1.c: Assembly of the quality-controlled reads.

Module 2: Recovery of Prokaryotic Metagenome-Assembled Genomes (pMAGs)

[Figure 4: Module 2 workflow]

The Module 2 workflow is shown in Figure 4. A detailed description of its execution and outputs is found here: Pipeline for recovery of Prokaryotic Metagenome-Assembled Genomes.

When you use MuDoGeR Module 2, it will perform the following tasks:

2.a: Binning and bin refinement of the Prokaryotic bins.
- (2.a.1) Binning with MetaBAT2, MaxBin2, and CONCOCT.
- (2.a.2) Bacterial bins refinement and archaea bins refinement.
- (2.a.3) Dereplication of the recovered prokaryotic bins.
2.b: Taxonomic classification, quality estimation, and gene annotation.
- (2.b.1) Taxonomic classification of the prokaryotic bins produced in (2.a.3) using GTDB-Tk.
- (2.b.2) Generation of quality matrix of the prokaryotic bins produced in (2.a.3) using CheckM.
- (2.b.3) Prokaryotic MAGs gene annotation with Prokka.
2.c: Sequence metrics calculation and selection of Prokaryotic MAGs.
- (2.c.1) Sequence metric calculation from the selected MAGs.
- (2.c.2) Selection of Prokaryotic MAGs

Module 3: Recovery of Uncultivated Viral Genomes (UViGs)

[Figure 5: Steps of Module 3]

The steps of Module 3 are shown in Figure 5. A detailed description of its execution and outputs is found here: Pipelines for viral genomes recovery.

When you use MuDoGeR Module 3, it will perform the following tasks:

3.a: Recovery of Putative Viral Contigs
- (3.a.1) Recovery of putative viral contigs using VirSorter2, VirFinder, and VIBRANT.
- (3.a.2) Filtering of the putative viral contigs.
- (3.a.3) Dereplication of the putative viral contigs.
3.b: Taxonomic and Quality estimation of potential viral contigs
- (3.b.1) Taxonomic classification of the dereplicated putative viral contigs with vConTACT2.
- (3.b.2) Checking the quality of the dereplicated contigs with CheckV.
3.c: Viral-Host pair estimation using WIsH. This step is only done automatically if you generate the prokaryotic MAGs using MuDoGeR as well.
3.d: Selection of UViGs
- (3.d.1) Selection of all viruses that yielded taxonomy when using vConTACT2 plus those larger than 15 Kb.
- (3.d.2) Selection based on the quality determined by CheckV.

Module 4: Recovery of Eukaryotic Metagenome-Assembled Bins (eMABs)

[Figure 6: Steps of Module 4]

The steps of Module 4 are shown in Figure 6. A detailed description of its execution and outputs is found here: Pipelines for eukaryotic bins recovery.

When you use MuDoGeR Module 4, it will perform the following tasks:

4.a: Recovery and binning of Eukaryotic assemblies.
- (4.a.1) Classification of Eukaryotic assemblies and removal of prokaryotic assemblies with EukRep.
- (4.a.2) Use of CONCOCT for binning the Eukaryotic assemblies.
- (4.a.3) Filtering the Eukaryotic bins, produced from CONCOCT, by size. Bins with size < 1.5 Mb are removed.
4.b: Completeness and contamination estimation and annotation of Eukaryotic bins
- (4.b.1) In the filtered bins produced in 4.a, genes are predicted using GeneMark.
- (4.b.2) Completeness and contamination estimation of the Eukaryotic filtered bins produced in 4.a using EukCC.
- (4.b.3) MAKER2 annotates the predicted genes produced by GeneMark.
- (4.b.4) BUSCO is applied to the annotated genes from MAKER2, for detection of single-copy orthologous genes (SCGs) and estimation of completeness of Eukaryotic contigs.
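To make the size filter of step 4.a.3 concrete, the sketch below lists the bins in a folder that would pass a 1.5 Mb threshold. It is an illustration, not MuDoGeR's own code: it assumes bins are FASTA files ending in .fa and that 1.5 Mb means 1,500,000 bp, and the function names are hypothetical.

```shell
#!/bin/sh
# Total number of bases in a FASTA file (header lines starting with '>' are skipped).
fasta_size_bp() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Print the paths of bins in a folder whose size is at least min_bp.
# Assumptions: bins are *.fa files and 1.5 Mb means 1,500,000 bp.
filter_bins_by_size() {
    bin_dir="$1"
    min_bp="${2:-1500000}"
    for bin in "$bin_dir"/*.fa; do
        [ -f "$bin" ] || continue
        if [ "$(fasta_size_bp "$bin")" -ge "$min_bp" ]; then
            echo "$bin"
        fi
    done
}

# Usage (placeholder path): filter_bins_by_size /path/to/concoct_bins
```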

Module 5: Relative Abundance

[Figure 7: Steps of Module 5]

The steps of Module 5 are shown in Figure 7. A detailed description of its execution and outputs is found here: Pipelines for abundance calculation. Essentially, Module 5 maps the quality-controlled reads of your samples to the recovered pMAGs/UViGs/eMABs or to the annotated prokaryotic genes. We designed three possible mapping types to calculate abundance: reduced, complete, or genes. A detailed description of their differences can be found here.

The steps of Module 5 can be summarized as follows. If you select reduced or complete, the pipeline will run steps 5.a and 5.b. If you select genes, the pipeline will run step 5.c:

5.a: Select representative pMAGs from each created OTU
- (5.a.1) Copy recovered pMAGs from all samples within the provided metadata table.
- (5.a.2) Group recovered pMAGs of all samples within the provided metadata table using gOTUpick.
- (5.a.3) Select the highest quality pMAG within the gOTUpick groups as the group's representative MAG.
5.b: pMAGs/UViGs/eMABs mapping and abundance calculation
- (5.b.1) Copy representative pMAGs from step 5.a and the selected UViGs and eMABs recovered in Module 3 and Module 4, respectively.
- (5.b.2) Index selected pMAGs/UViGs/eMABs.
- (5.b.3) If --coverage is selected, calculate the pMAGs/UViGs/eMABs sizes and the average read length of all samples to be mapped. If --relative-abundance is selected, calculate the total number of reads from all samples. This information is used further in the pipeline.
- (5.b.4) If --reduced is selected, map reads from the samples where the pMAGs/UViGs/eMABs were found to those pMAGs/UViGs/eMABs. If --complete is selected, map reads from all samples to the pMAGs/UViGs/eMABs.
- (5.b.5) Calculate the absolute number of hits, relative abundance, and coverage tables, if the respective flag is selected.
5.c: Gene relative abundance calculation from the sample assemblies. This currently works for prokaryotic genes only.
- (5.c.1) Index assemblies from given samples.
- (5.c.2) Map sample reads on the respective assembly.
- (5.c.3) Annotate genes on the assembly with Prokka.
- (5.c.4) Convert the .gff file from Prokka gene annotation into .gtf.
- (5.c.5) Count mapped reads on each gene.
- (5.c.6) Calculate the average read length of all mapped samples.
- (5.c.7) Calculate gene length from the Prokka .gtf file.
- (5.c.8) Calculate genes' absolute number of hits, relative abundance, coverage, and TPM tables for each sample.
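The steps above name the resulting tables but not the formulas behind them. The sketch below shows how such values are commonly derived from per-gene read counts. The definitions are the usual conventions and are an assumption here, not confirmed against MuDoGeR's code; the input format (gene id, gene length, hits, tab-separated) and the function name gene_abundance_table are hypothetical.

```shell
#!/bin/sh
# Input (hypothetical format): tab-separated lines of gene_id, gene_length_bp, hits.
# Output: gene_id, hits, relative_abundance, coverage, TPM.
# Assumed definitions (common conventions, not confirmed against MuDoGeR's code):
#   relative_abundance = hits / total_reads
#   coverage           = hits * avg_read_length / gene_length
#   TPM                = (hits / gene_length) / sum(hits / gene_length) * 1e6
gene_abundance_table() {
    counts_tsv="$1"
    total_reads="$2"
    avg_read_len="$3"
    awk -F '\t' -v total="$total_reads" -v readlen="$avg_read_len" '
        { id[NR] = $1; len[NR] = $2; hits[NR] = $3; rate_sum += $3 / $2 }
        END {
            for (i = 1; i <= NR; i++) {
                tpm = (rate_sum > 0) ? (hits[i] / len[i]) / rate_sum * 1e6 : 0
                printf "%s\t%d\t%.6f\t%.4f\t%.2f\n",
                    id[i], hits[i], hits[i] / total, hits[i] * readlen / len[i], tpm
            }
        }
    ' "$counts_tsv"
}

# Usage (placeholder values): gene_abundance_table counts.tsv 1000000 150
```

The same coverage and relative abundance definitions would apply to step 5.b if the gene length is replaced by the pMAG/UViG/eMAB size.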

MuDoGeR simplified usage - with Singularity installation

Currently, MuDoGeR v1.0 only works with paired-end ILLUMINA sequences. Future updates will add tools to work with long-read sequencing samples. MuDoGeR was designed to work module by module, starting from pre-processing (Module 1). Additional modularity will be added in future updates to allow the user to run specific parts of the pipeline. However, you can always use the tools independently through the conda environments created by MuDoGeR. You can follow the instructions here.

MuDoGeR is an easy-to-use wrapper of several tools organized within modules. The individual modules can be called independently.

The pipeline requires, as input, a metadata table in tsv format containing the samples to be processed and the path to its raw sequence reads. The metadata file should have the sample name and the path to the forward reads file from the sample in one line, followed by the same sample name and the path to the reverse reads from the sample.

Note that the input data is mounted in the MuDoGeR Singularity container at /tools/data_input/. Do not change the /tools/data_input prefix, as this is the folder used inside the container. Therefore, if your sample sequence files are in /path/to/input/sampleID/sampleID_1.fastq and /path/to/input/sampleID/sampleID_2.fastq, your metadata.tsv file should look like:

#Show the content of the metadata.tsv file
$ cat metadata.tsv

sampleID   /tools/data_input/sampleID/sampleID_1.fastq
sampleID   /tools/data_input/sampleID/sampleID_2.fastq

Please note that the forward sequencing reads file must end in "_1.fastq" and the reverse in "_2.fastq"!
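For many samples, the metadata file can be generated rather than typed. The sketch below assumes the layout used in the example above (one folder per sample containing sampleID_1.fastq and sampleID_2.fastq) and tab-separated columns; the function name make_metadata is hypothetical.

```shell
#!/bin/sh
# Write metadata lines for every sample folder under input_dir that contains
# <sample>_1.fastq and <sample>_2.fastq.
# Paths are rewritten to /tools/data_input, the mount point inside the container.
make_metadata() {
    input_dir="$1"
    for fwd in "$input_dir"/*/*_1.fastq; do
        [ -f "$fwd" ] || continue
        sample=$(basename "$fwd" _1.fastq)
        sample_dir=$(basename "$(dirname "$fwd")")
        rev="$(dirname "$fwd")/${sample}_2.fastq"
        if [ ! -f "$rev" ]; then
            echo "skipping $sample: no reverse reads at $rev" >&2
            continue
        fi
        printf '%s\t/tools/data_input/%s/%s_1.fastq\n' "$sample" "$sample_dir" "$sample"
        printf '%s\t/tools/data_input/%s/%s_2.fastq\n' "$sample" "$sample_dir" "$sample"
    done
}

# Usage (placeholder path):
#   make_metadata /path/to/input > /path/to/input/metadata.tsv
```

Samples without a matching reverse file are skipped with a message on stderr, so the generated file only lists complete pairs.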

MuDoGeR is designed to run all multi-domain genome recovery pipelines entirely. In order for MuDoGeR to work automatically, from start to finish, we use a specific folder structure. Please read the Manual_MuDoGeR if you would like to manipulate MuDoGeR.

MuDoGeR Singularity Execution

When using the MuDoGeR Singularity container, you have all the complex dependencies and software environments from MuDoGeR already configured. With the recommended Singularity installation, please keep your metadata.tsv file in the same folder as your sample reads.

We have developed a MuDoGeR Singularity usage script. You should have it available if you followed the installation guide.

Once you have the mudoger_singularity.sh script available you can see the help information by typing:

$ /path/to/mudoger_singularity.sh -h

All options are required.
Usage: ./mudoger_singularity.sh <module_name> -s <singularity_file_path> -o <output_path> -i <input_data_path> -d <databases_path> -h <home_path> -m <memory> -t <threads> -f <metadata_file> [abundance_tables options]
  <module_name>         Module name (e.g., preprocess, prokaryotes, viruses, abundance_tables, eukaryotes)
Options:
  -s  Path to Singularity file (.sif file)
  -o  Path to output folder
  -i  Path to input data (Note: the metadata file should be located in this folder)
  -d  Path to databases folder
  -c  Path for Singularity home directory (required for eukaryotes module)
  -m  Memory size (for preprocess module)
  -t  Number of threads
  -f  Name of the metadata file (including .tsv extension)
Abundance Tables Options (only for abundance_tables module):
  --reduced (default), --complete, or --genes (exclusive options)
  --absolute-values (default), --coverage, and/or --relative-abundance (can be combined)

Therefore, if your metadata.tsv and your sample reads are in the test_data folder, your output folder is test_out, and your databases are in the mudoger_dbs folder, your usage commands should be:


/path/to/mudoger_singularity.sh preprocess -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -m 100 -t 25 -f metadata.tsv

/path/to/mudoger_singularity.sh prokaryotes -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -t 25 -f metadata.tsv

/path/to/mudoger_singularity.sh viruses -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -t 25 -f metadata.tsv

/path/to/mudoger_singularity.sh abundance_tables -s /path/to/mudogerV1.sif -o /path/to/test_out -i /path/to/test_data -d /path/to/mudoger_dbs -t 25 -f metadata.tsv --reduced --coverage --relative-abundance

Module 4 (Eukaryotes recovery) has one particularity. GeneMark requires each user to agree to a license and place the license key in their home folder. You can obtain this license by following the instructions here. Once the license is configured, you have to specify its location with the -c parameter when using the Singularity version of MuDoGeR. For instance, if you saved the GeneMark license key in /path/to/tmp_home, your Module 4 Singularity command will be:


/path/to/mudoger_singularity.sh eukaryotes -s /path/to/mudogerV1.sif -o /path/to/test_out/ -i /path/to/test_data/ -d /path/to/mudoger_dbs -c /path/to/tmp_home/ -t 25 -f metadata.tsv
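The five commands above can be chained in one script so that a failure in any module stops the run. The sketch below uses only the flags shown in the help text. All paths are the placeholders from the examples, the helper names run_module and run_all_modules and the DRY_RUN switch are hypothetical, and running abundance_tables last is an assumption based on Module 5 using the eMABs from Module 4.

```shell
#!/bin/sh
# Placeholder paths; adjust to your system.
MUDOGER="${MUDOGER:-/path/to/mudoger_singularity.sh}"
SIF="${SIF:-/path/to/mudogerV1.sif}"
OUT="${OUT:-/path/to/test_out}"
IN="${IN:-/path/to/test_data}"
DBS="${DBS:-/path/to/mudoger_dbs}"
GM_HOME="${GM_HOME:-/path/to/tmp_home}"
THREADS="${THREADS:-25}"
MEMORY="${MEMORY:-100}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = execute them

# Run (or print) one module with the shared options plus any extra arguments.
run_module() {
    module="$1"; shift
    set -- "$MUDOGER" "$module" -s "$SIF" -o "$OUT" -i "$IN" -d "$DBS" \
        -t "$THREADS" -f metadata.tsv "$@"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run all five modules in order, stopping at the first failure.
run_all_modules() {
    run_module preprocess -m "$MEMORY" &&
    run_module prokaryotes &&
    run_module viruses &&
    run_module eukaryotes -c "$GM_HOME" &&
    run_module abundance_tables --reduced --coverage --relative-abundance
}

run_all_modules
```

By default the script only prints the commands, which makes it easy to review them first; set DRY_RUN=0 to execute them.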

The resulting two-level folder structure after a successful run of all MuDoGeR modules is as follows:

.
├── sample_1
│   ├── assembly
│   ├── eukaryotes
│   ├── khmer
│   ├── prokaryotes
│   ├── qc
│   └── viruses
├── sample_2
│   ├── assembly
│   ├── eukaryotes
│   ├── khmer
│   ├── prokaryotes
│   ├── qc
│   └── viruses
└── mapping_results
    ├── assembly_gene_map
    ├── euk_mabs_mapping
    ├── gOTUpick_results
    ├── merged_reads
    ├── pmags_otu_mapping
    └── uvigs_mapping
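To see at a glance which parts of this structure exist for each sample, a short summary such as the following can help. It is a sketch that only checks for the folder names shown in the tree above; the function name summarize_outputs is hypothetical, and the presence of a folder does not prove that the module finished successfully.

```shell
#!/bin/sh
# Per-sample folders taken from the tree above.
SAMPLE_DIRS="assembly eukaryotes khmer prokaryotes qc viruses"

# For each sample folder under out_dir, print which expected subfolders exist.
# mapping_results is skipped because it is not a sample.
summarize_outputs() {
    out_dir="$1"
    for sample_path in "$out_dir"/*/; do
        [ -d "$sample_path" ] || continue
        sample=$(basename "$sample_path")
        if [ "$sample" = "mapping_results" ]; then
            continue
        fi
        present=""
        for name in $SAMPLE_DIRS; do
            if [ -d "$sample_path$name" ]; then
                present="$present $name"
            fi
        done
        echo "$sample:$present"
    done
}

# Usage (placeholder path): summarize_outputs /path/to/test_out
```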

If you want to use the conda environment installation, a more detailed tutorial for the MuDoGeR can be found in Manual_MuDoGeR.

MuDoGeR as Wrapper and its Critical Use

MuDoGeR is a wrapper designed to streamline the genome assembly process from metagenome samples across multiple domains. While MuDoGeR accelerates metagenomics analysis, it's crucial to understand the inherent limitations of any metagenomic approach. The tools and parameters integrated into MuDoGeR are based on benchmark studies, but users should understand their dataset and tools' limitations to adapt the workflow.

MuDoGeR generates progress reports, intermediate files, and error logs at various stages of the assembly process, which are detailed in the MuDoGeR Manual. Users should regularly check these reports to ensure the tool works as expected for their dataset.
MuDoGeR aims to help provide a holistic view of all three domains simultaneously, reducing cross-domain recovery bias by initiating all genome recovery from the same assembly. However, the genome recovery approaches from different domains are at different stages of technological progress and complexity of analysis.
Users should be aware of potential recovery bias from a particular dataset.
For experienced users, MuDoGeR allows the activation of each tool separately, enabling users to adapt the process to their specific needs. To do so, follow the instructions here.
Users are strongly recommended to consult and check the direct links from the tools used within the wrapper for a deeper understanding of the underlying processes and optimization of parameters for their specific needs. The software used during the pipeline is detailed and described here: Dependencies description.

Citing

Rocha, U., Coelho Kasmanas, J., Kallies, R., Saraiva, J. P., Toscan, R. B., Štefanič, P., Bicalho, M. F., Borim Correa, F., Baştürk, M. N., Fousekis, E., Viana Barbosa, L. M., Plewka, J., Probst, A. J., Baldrian, P., Stadler, P. F., & (2023). MuDoGeR: Multi-Domain Genome recovery from metagenomes made easy. Molecular Ecology Resources, 00, 1–12. https://doi.org/10.1111/1755-0998.13904
