jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0
346 stars 81 forks source link
metagenomes metagenomics pipeline taxonomic-assignment

SqueezeMeta: a fully automated metagenomics pipeline, from reads to bins

1. What is SqueezeMeta?

SqueezeMeta is a full automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support allowing the co-assembly of related metagenomes and the retrieval of individual genomes via binning procedures. Thus, SqueezeMeta features several unique characteristics:

1) Co-assembly procedure with read mapping for estimation of the abundances of genes in each metagenome 2) Co-assembly of a large number of metagenomes via merging of individual metagenomes 3) Includes binning and bin checking, for retrieving individual genomes 4) The results are stored in a database, where they can be easily exported and shared, and can be inspected anywhere using a web interface. 5) Internal checks for the assembly and binning steps inform about the consistency of contigs and bins, allowing to spot potential chimeras. 6) Metatranscriptomic support via mapping of cDNA reads against reference metagenomes

SqueezeMeta supports different assembly strategies (co-assembly, sequential, assembly merging, and sequential-merging) and different assemblers (see below for details). SqueezeMeta uses a combination of custom scripts and external software packages for the different steps of the analysis:

1) Assembly 2) RNA prediction and classification 3) ORF (CDS) prediction 4) Homology searching against taxonomic and functional databases 5) Hmmer searching against Pfam database 6) Taxonomic assignment of genes 7) Functional assignment of genes (OPTIONAL) 8) Blastx on parts of the contigs with no gene prediction or no hits 9) Taxonomic assignment of contigs, and check for taxonomic disparities 10) Coverage and abundance estimation for genes and contigs 11) Estimation of taxa abundances 12) Estimation of function abundances 13) Merging of previous results to obtain the ORF table 14) Binning with different methods 15) Binning integration with DAS tool 16) Taxonomic assignment of bins, and check for taxonomic disparities 17) Checking of bins with CheckM 18) Merging of previous results to obtain the bin table 19) Merging of previous results to obtain the contig table 20) Prediction of kegg and metacyc patwhays for each bin 21) Final statistics for the run

Detailed information about the different steps of the pipeline can be found in the PDF manual.

2. Installation

SqueezeMeta is intended to be run in a x86-64 Linux OS (tested in Ubuntu and CentOS). The easiest way to install it is by using conda. The default conda solver might however be slow solving the dependencies, so it's better to first set up the libmamba solver with

conda update -n base conda # if your conda version is below 22.11
conda install -n base conda-libmamba-solver
conda config --set solver libmamba

and then use conda to install SqueezeMeta

conda create -n SqueezeMeta -c conda-forge -c bioconda -c anaconda -c fpusan squeezemeta=1.6 --no-channel-priority

This will create a new conda environment named SqueezeMeta, which must then be activated.

conda activate SqueezeMeta

When using conda, all the scripts from the SqueezeMeta distribution will be available on $PATH.

Alternatively, you can download the latest release from the GitHub repository and uncompress the tarball in a suitable directory. The tarball includes the SqueezeMeta scripts as well as the third-party software redistributed with SqueezeMeta. The INSTALL files contain detailed installation instructions, including all the external libraries required to make SqueezeMeta run in a vanilla Ubuntu 20.04. Note that you may need different libraries and potentially recompiling some binaries from source in order for the manual install to work in other Ubuntu versions or other distributions. The conda method is now the recommended way to install SqueezeMeta, and we may not be able to support issues regarding manual installation.

The test_install.pl script can be run in order to check whether the required dependencies are available in your environment.

/path/to/SqueezeMeta/utils/install_utils/test_install.pl

3. Downloading or building databases

SqueezeMeta uses several databases. GenBank nr for taxonomic assignment, and eggnog, KEGG and Pfam for functional assignment. The script download_databases.pl can be run to download a pre-formatted version of all the databases required by SqueezeMeta.

/path/to/SqueezeMeta/utils/install_utils/download_databases.pl /download/path

, where /download/path is the destination folder. This is the recommended option, but the files are hosted in our institutional server, which can at times be unreachable.

Alternatively, the script make_databases.pl can be run to download from source and format the latest version of the databases.

/path/to/SqueezeMeta/utils/install_utils/make_databases.pl /download/path

Generally, donwload_databases.pl is the safest choice for getting your databases set up. When running make_databases.pl, data download (e.g. from the NCBI server) can be interrupted, leading to a corrupted database. Always run test_install.pl to check that the database was properly created. Otherwise, you can try re-running make_databases.pl, or just run download_databases.pl instead.

The databases occupy 200Gb, but we recommend having at least 350Gb free disk space during the building process.

Two directories will be generated after running either make_databases.pl or download_databases.pl.

If download_databases.pl or make_databases.pl can't find our server, you can instead run make_databases_alt.pl (same syntax as make_databases.pl) which will try to download the data from an alternative site.

If the SqueezeMeta databases are already built in another location in the system, a different copy of SqueezeMeta can be configured to use them with

/path/to/SqueezeMeta/utils/install_utils/configure_nodb.pl /path/to/db

, where /path/to/db is the route to the db folder that was generated by either make_databases.pl or download_databases.pl.

After configuring the databases, the test_install.pl can be run in order to check that SqueezeMeta is ready to work (see previous section).

4. Choosing an assembly strategy

SqueezeMeta can be run in four different modes, depending of the type of multi-metagenome support. These modes are:

Note that the merged and seqmerge modes work well as a substitute of coassembly for running small datasets in computers with low memory (e.g. 16 Gb) but are very slow for analising large datasets (>10 samples) even in workstations with plenty of resources. Still, setting -contiglen to 1000 or higher can make seqmerge a viable strategy even in those cases. Otherwise, we recommend to use either the sequential or the co-assembly modes.

Regarding the choice of assembler, MEGAHIT and SPAdes work better with short Illumina reads, while Canu and Flye support long reads from PacBio or ONT-Minion. MEGAHIT (the default in SqueezeMeta) is more resource-efficient than SPAdes, consuming less memory, but SPAdes supports more analysis modes and produces slightly better assembly statistics. SqueezeMeta can call SPAdes in three different ways. The option -a spades is meant for metagenomic datasets, and will automatically add the flags --meta -k 21,33,55,77,99,127 to the spades.py call. Conversely, -a rnaspades will add the flags --rna -k 21,33,55,77,99,127. Finally, the option -a spades_base will add no additional flags to the spades.py call. This can be used in conjunction with --assembly options when one wants to fully customize the call to SPAdes, e.g. for assembling single cell genomes.

5. Execution, restart and running scripts

Scripts location

The scripts composing the SqueezeMeta pipeline can be found in the /path/to/SqueezeMeta/scripts directory. Other utility scripts can be found in the /path/to/SqueezeMeta/utils directory. See the PDF manual for more information on utility scripts.

Execution

The command for running SqueezeMeta has the following syntax:

SqueezeMeta.pl -m <mode> -p <projectname> -s <equivfile> -f <raw fastq dir> <options>

Arguments Mandatory parameters

Restarting

Filtering

Assembly

Annotation

Mapping

Binning

Performance

Other

Information

Example SqueezeMeta call: SqueezeMeta.pl -m coassembly -p test -s test.samples -f mydir --nopfam -miniden 50

This will create a project "test" for co-assembling the samples specified in the file "test.samples", using a minimum identity of 50% for taxonomic and functional assignment, and skipping Pfam annotation. The -p parameter indicates the name under which all results and data files will be saved. This is not required for sequential mode, where the name will be taken from the samples file instead. The -f parameter indicates the directory where the read files specified in the sample file are stored.

The samples file:

The samples file specifies the samples, the names of their corresponding raw read files and the sequencing pair represented in those files, separated by tabulators.

It has the format: <Sample> <filename> <pair1|pair2>

An example would be

Sample1 readfileA_1.fastq   pair1
Sample1 readfileA_2.fastq   pair2
Sample1 readfileB_1.fastq   pair1
Sample1 readfileB_2.fastq   pair2
Sample2 readfileC_1.fastq.gz    pair1
Sample2 readfileC_2.fastq.gz    pair2
Sample3 readfileD_1.fastq   pair1   noassembly
Sample3 readfileD_2.fastq   pair2   noassembly

The first column indicates the sample id (this will be the project name in sequential mode), the second contains the file names of the sequences, and the third specifies the pair number of the reads. A fourth optional column can take the noassembly value, indicating that these sample must not be assembled with the rest (but will be mapped against the assembly to get abundances). This is the case for RNAseq reads that can hamper the assembly but we want them mapped to get transcript abundance of the genes in the assembly. Similarly, an extra column with the nobinning value can be included in order to avoid using those samples for binning. Notice that a sample can have more than one set of paired reads. The sequence files can be in fastq or fasta format, and can be gzipped. If a sample contains paired libraries, it is the user's responsability to make sure that the forward and reverse files are truly paired (i.e. they contain the same number of reads in the same order). Some quality filtering / trimming tools may produce unpaired filtered fastq files from paired input files (particularly if run without the right parameters). This may result in SqueezeMeta failing or producing incorrect results.

Restart

Any interrupted SqueezeMeta run can be restarted using the program the flag --restart. It has the syntax:

SqueezeMeta.pl -p <projectname> --restart

This command will restart the run of that project by reading the progress.txt file to find out the point where the run stopped.

Alternatively, the run can be restarted from a specific step by issuing the command:

SqueezeMeta.pl -p <projectname> --restart -step <step_to_restart_from>

By default, already completed steps will not be repeated when restarting, even if requested with -step. In order to repeat already completed steps you must also provide the flag --force_overwrite.

e.g. SqueezeMeta.pl --restart -p <projectname> -step 6 --force_overwrite would restart the pipeline from the taxonomic assignment of genes. The different steps of the pipeline are listed in section 1.

Running scripts

Also, any individual script of the pipeline can be run using the same syntax:

script <projectname> (for instance, 04.rundiamond.pl <projectname> to repeat the DIAMOND run for the project)

6. Analizing an user-supplied assembly

An user-supplied assembly can be passed to SqueezeMeta with the flag -extassembly . The contigs in that fasta file will be analyzed by the SqueezeMeta pipeline starting from step 2.

7. Using external databases for functional annotation

Version 1.0 implements the possibility of using one or several user-provided databases for functional annotation. This is invoked using the -extdb option. Please refer to the manual for details.

8. Extra sensitive detection of ORFs

Version 1.0 implements the --D option (doublepass), that attempts to provide a more sensitive ORF detection by combining the Prodigal prediction with a BlastX search on parts of the contigs where no ORFs were predicted, or where predicted ORFs did not match anything in the taxonomic and functional databases.

9. Testing SqueezeMeta

The download_databases.pl and make_databases.pl scripts also download two datasets for testing that the program is running correctly. Assuming either was run with the directory /download/path as its target the test run can be executed with

cd </download/path/test>
SqueezeMeta.pl -m coassembly -p Hadza -s test.mock.samples -f raw

Alternatively, -m sequential or -m merged can be used.

In addition to this mock dataset, we also provide two real metagenomes. A test run on those can be executed with

SqueezeMeta.pl -m coassembly -p Hadza -s test.samples -f raw

10. Working with Oxford Nanopore MinION and PacBio reads

Since version 0.3.0, SqueezeMeta is able to seamlessly work with single-end reads. In order to obtain better mappings of MinION and PacBio reads against the assembly, we advise to use minimap2 for read counting, by including the -map minimap2-ont (MinION) or -map minimap2-pb (PacBio) flags when calling SqueezeMeta. We also include the Canu and Flye assemblers, which are specially tailored to work with long, noisy reads. They can be selected by including the -a canu or -a flye flag when calling SqueezeMeta. As a shortcut, the --minion flag will use both Canu and minimap2 for Oxford Nanopore MinION reads. As an alternative to assembly, we also provide the sqm_longreads.pl script, which will predict and annotate ORFs within individual long reads.

11. Working in a low-memory environment

In our experience, assembly and DIAMOND alignment against the nr database are the most memory-hungry parts of the pipeline. By default SqueezeMeta will set up the right parameters for DIAMOND and the Canu assembler based on the available memory in the system. DIAMOND memory usage can be manually controlled via the -b parameter (DIAMOND will consume ~5*b Gb of memory according to the documentation, but to be safe we set -b to free_ram/8). Assembly memory usage is trickier, as memory requirements increase with the number of reads in a sample. We have managed to run SqueezeMeta with as much as 42M 2x100 Illumina HiSeq pairs on a virtual machine with only 16Gb of memory. Conceivably, larger samples could be split an assembled in chunks using the merged mode. We include the shortcut flag --lowmem, which will set DIAMOND block size to 3, and Canu memory usage to 15Gb. This is enough to make SqueezeMeta run on 16Gb of memory, and allows the in situ analysis of Oxford Nanopore MinION reads. Under such computational limitations, we have been able to coassemble and analyze 10 MinION metagenomes (taken from SRA project SRP163045) in less than 4 hours.

12. Tips for working in a computing cluster

SqueezeMeta will work fine inside a computing cluster, but there are some extra things that must be taken into account. Here is a list in progress based on frequent issues that have been reported.

13. Updating SqueezeMeta

Assuming your databases are not inside the SqueezeMeta directory, just remove it, download the new version and configure it with

/path/to/SqueezeMeta/utils/install_utils/configure_nodb.pl /path/to/db

14. Downstream analysis of SqueezeMeta results

SqueezeMeta comes with a variety of options to explore the results and generate different plots. These are fully described in the PDF manual and in the wiki. Briefly, the three main ways to analyze the output of SqueezeMeta are the following:

1) Integration with R: We provide the SQMtools R package, which allows to easily load a whole SqueezeMeta project and expose the results into R. The package includes functions to select particular taxa or functions and generate plots. The package also makes the different tables generated by SqueezeMeta easily available for third-party R packages such as vegan (for multivariate analysis), DESeq2 (for differential abundance testing) or for custom analysis pipelines. See examples here. SQMtools can also be used in Mac or Windows, meaning that you can run SqueezeMeta in your Linux server and then move the results to your own computer and analyze them there. See advice for this below.

2) Integration with the anvi'o analysis pipeline: We provide a compatibility layer for loading SqueezeMeta results into the anvi'o analysis and visualization platform (http://merenlab.org/software/anvio/). This includes a built-in query language for selecting the contigs to be visualized in the anvi'o interactive interface. See examples here.

We also include utility scripts for generating itol and pavian -compatible outputs.

15. Analyzing SqueezeMeta results in your desktop computer

Many users run SqueezeMeta remotely (e.g. in a computing cluster). However it is easier to explore the results interactively from your own computer. Since version 1.6.2, we provide an easy way to achieve this. 1) In the system in which you ran SqueezeMeta, run the utility script sqm2zip.py /path/to/my_project /output/dir, where /path/to/my_project is the path to the output of SqueezeMeta, and /output/dir an arbitrary output directory. 2) This will generate a file in /output/dir named my_project.zip, which contains the essential files needed to load your project into SQMtools. Transfer this file to your desktop computer. 3) Assuming R is present in your desktop computer, you can install SQMtools with if (!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager")}; BiocManager::install("SQMtools"). This will work seamlessly in Windows and Mac computers, for Linux you may need to previously install the libcurl development library. 4) You can load the project directly from the zip file (no need for decompressing) with import(SQMtools); SQM = loadSQM("/path/to/my_project.zip").

16. Alternative analysis modes

In addition to the main SqueezeMeta pipeline, we provide two extra modes that enable the analysis of individual reads.

1) sqm_reads.pl: This script performs taxonomic and functional assignments on individual reads rather than contigs. This can be useful when the assembly quality is low, or when looking for low abundance functions that might not have enough coverage to be assembled.

2) sqm_longreads.pl: This script performs taxonomic and functional assignments on individual reads rather than contigs, assuming that more than one ORF can be found in the same read (e.g. as happens in PacBio or MinION reads).

3) sqm_hmm_reads.pl: This script provides a wrapper to the Short-Pair software, which allows to screen the reads for particular functions using an ultra-sensitive HMM algorithm.

4) sqm_mapper.pl: This script maps reads to a given reference using one of the included sequence aligners (Bowtie2, BWA), and provides estimation of the abundance of the contigs and ORFs in the reference. Alternatively, it can be used to filter out the reads mapping to a given reference.

5) sqm_annot.pl: This script performs functional and taxonomic annotation for a set of genes, for instance these encoded in a genome (or sets of contigs).

17. Adding new binners and assemblers

With some extra scripting, you can integrate other assembly and binning programs into the SqueezeMeta pipeline. See the PDF manual for details.

18. License and third-party software

SqueezeMeta is distributed under a GPL-3 license. Additionally, SqueezeMeta redistributes the following third-party software:

19. About

SqueezeMeta is developed by Javier Tamames and Fernando Puente-Sánchez. Feel free to contact us for support (jtamames@cnb.csic.es, fernando.puente.sanchez@slu.se).