The study of viral communities in complex environmental samples, such as soil, can provide valuable insights into the diversity and functions of viruses in the ecosystem. However, processing and analyzing virome data is challenging and requires the integration of various computational tools and techniques.
To address these challenges, we have developed the SOVAP pipeline, which uses a suite of state-of-the-art tools for processing, analyzing, and annotating viromics and metagenomics data.
It uses Fastp and Centrifuge for preprocessing and contamination removal, Megahit and CD-HIT for assembling and clustering contigs, and geNomad, Diamond and Megan for identifying and annotating viral contigs. The pipeline also estimates the abundance of viral contigs, giving a more complete picture of the virome within the sample. Together, these tools support taxonomic classification and annotation of viral contigs, helping researchers understand the composition and function of the virome in the analyzed sample.
By integrating the SOVAP pipeline with IMG/VR and geNomad, it is possible to identify a wider range of viruses, including those that were previously unknown.
The batch-mode script allows for the processing of multiple datasets using the SOVAP pipeline. This feature is particularly useful for large-scale analyses, such as those involving multiple environmental samples or large sequencing datasets.
The following features are coming soon:
A new approach that analyzes with both Diamond databases in sequence (Genbank, then IMG/VR)
Recentrifuge, to analyze and visualize Centrifuge outputs
Spades in the assembly step, so the user can choose between Spades and Megahit
Removal of specific host reads before Centrifuge (using BBDuk)
To run the SOVAP pipeline, the required software must be installed and accessible through the system's PATH.
First, add these channels to the conda config file (if not already added):
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
The easiest way to install the required tools and libraries is to clone the repository and create separate environments from the conda YAML files:
git clone https://github.com/poursalavati/SOVAP.git
cd SOVAP
conda env create -f centrifuge.yml
conda env create -f megan.yml
conda env create -f genomad.yml
conda env create -f SOVAP.yml
Alternatively, create the environments directly with conda create:
conda create -n centrifuge centrifuge=1.0.4 recentrifuge=1.10.0
conda create -n megan megan
conda create -n genomad genomad
conda create -n SOVAP python=3.8 seqkit samtools=1.15.1 fastp=0.23.2 megahit bwa=0.7.17 diamond cd-hit=4.8.1
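Once the environments are created and activated, a quick way to confirm that the tools are reachable through PATH is a small check like the one below. This is a minimal sketch and not part of SOVAP: the executable names are assumptions derived from the package names in the install commands (for example, `cd-hit-est` is one of the binaries shipped by the cd-hit package), so adjust the list to the binaries your run actually needs.

```python
import shutil

# Executable names are ASSUMPTIONS based on the conda packages listed above;
# they are not taken from the SOVAP source code.
REQUIRED_TOOLS = [
    "fastp", "centrifuge", "megahit", "genomad",
    "diamond", "cd-hit-est", "bwa", "samtools", "seqkit",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it after activating the environments; any name it reports is either not installed or lives in an environment that is not currently active.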
This section will be added soon
To run the SOVAP pipeline, use the provided Run_SOVAP.py script.
Here are the steps to run the pipeline:
conda activate centrifuge
conda activate --stack megan
conda activate --stack genomad
conda activate --stack SOVAP
python ./Run_SOVAP.py
Example command - 1 (IMG/VR database):
python ./Run_SOVAP.py -r1 Virome_R1_001.fastq.gz -r2 Virome_R2_001.fastq.gz -t 25 -x centrifuge/p_compressed/p_compressed -g genomad_db/ -d Diamond/IMGVR.dmnd --end_to_end
Example command - 2 (NCBI database + DIAMOND-MEGAN):
python ./Run_SOVAP.py -r1 Virome_R1_001.fastq.gz -r2 Virome_R2_001.fastq.gz -t 25 -x centrifuge/p_compressed/p_compressed -g genomad_db/ -d Diamond/DIAMOND_viral_database_GB.dmnd -md megan-map/megan-map-Feb2022.db --megan --end_to_end
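If you launch the pipeline from your own Python code, the two example commands can be assembled programmatically. The sketch below is a hypothetical helper, not part of SOVAP; it uses only the flags that appear in the examples above (`-r1`, `-r2`, `-t`, `-x`, `-g`, `-d`, `-md`, `--megan`, `--end_to_end`), and the function and parameter names are made up for illustration.

```python
import shlex


def build_sovap_command(r1, r2, centrifuge_index, genomad_db, diamond_db,
                        threads=25, megan_map=None, script="./Run_SOVAP.py"):
    """Assemble a Run_SOVAP.py argument list (hypothetical helper)."""
    cmd = [
        "python", script,
        "-r1", str(r1), "-r2", str(r2),
        "-t", str(threads),
        "-x", str(centrifuge_index),
        "-g", str(genomad_db),
        "-d", str(diamond_db),
    ]
    if megan_map is not None:
        # Mirrors example 2: NCBI database + DIAMOND-MEGAN.
        cmd += ["-md", str(megan_map), "--megan"]
    cmd.append("--end_to_end")
    return cmd


if __name__ == "__main__":
    cmd = build_sovap_command(
        "Virome_R1_001.fastq.gz", "Virome_R2_001.fastq.gz",
        "centrifuge/p_compressed/p_compressed", "genomad_db/",
        "Diamond/IMGVR.dmnd",
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.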
Batch mode in SOVAP allows for efficient analysis of large datasets by running the pipeline on multiple paired-end fastq files. The user can provide a directory containing all the files to be processed. The pipeline will automatically loop through all files and run the selected options for each dataset, saving the results in separate output directories.
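Before starting a batch run, it can help to preview which files in the input directory would be treated as pairs. The sketch below is illustrative only and is not the logic used by SOVAP_BatchMode.py: it assumes the `_R1`/`_R2` naming seen in the example file names (e.g. Virome_R1_001.fastq.gz and Virome_R2_001.fastq.gz), and the function name is hypothetical.

```python
import re
from pathlib import Path

# ASSUMED naming convention, taken from the example file names in this README.
R1_PATTERN = re.compile(r"^(?P<prefix>.*)_R1(?P<suffix>.*\.fastq(\.gz)?)$")


def find_read_pairs(input_dir):
    """Return (sample_name, r1_path, r2_path) tuples for complete pairs."""
    pairs = []
    for r1 in sorted(Path(input_dir).iterdir()):
        match = R1_PATTERN.match(r1.name)
        if match is None:
            continue  # not an R1 fastq file
        # The mate is expected to differ only in the R1/R2 token.
        r2 = r1.with_name(f"{match['prefix']}_R2{match['suffix']}")
        if r2.is_file():
            pairs.append((match["prefix"], r1, r2))
    return pairs


if __name__ == "__main__":
    for sample, r1, r2 in find_read_pairs("."):
        print(sample, r1.name, r2.name)
```

R1 files without a matching R2 file are skipped silently here; a stricter version might report them instead.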
To run the SOVAP pipeline in batch mode, use the SOVAP_BatchMode.py script.
Note: To run in batch mode, both scripts must be in the same directory.
python ./SOVAP_BatchMode.py
Example command - 1 (IMG/VR database - Assuming that the current folder contains multiple paired-end fastq files):
python ./SOVAP_BatchMode.py -i . -o . -t 25 -x centrifuge/p_compressed/p_compressed -g genomad_db/ -d Diamond/IMGVR.dmnd --end_to_end
Example command - 2 (NCBI database + DIAMOND-MEGAN - Assuming that the current folder contains multiple paired-end fastq files):
python ./SOVAP_BatchMode.py -i . -o . -t 25 -x centrifuge/p_compressed/p_compressed -g genomad_db/ -d Diamond/DIAMOND_viral_database_GB.dmnd -md megan-map/megan-map-Feb2022.db --megan --end_to_end
To prepare the IMG/VR database for use with the Diamond analysis step in SOVAP, follow these steps:
Download the latest IMG/VR database (version 4) from the official website (https://img.jgi.doe.gov/vr/).
Extract the database files to a directory.
Use the Diamond software to build a Diamond database from the IMG/VR FASTA files. This can be done using the following command:
diamond makedb --in /path/to/imgvr/fasta/files --db /path/to/output/database --threads 32
This command creates a new Diamond database file at the specified output path using the IMG/VR FASTA files as input. You can adjust the number of threads by passing a different value to --threads.
Once the Diamond database is built, you can use it with SOVAP by passing the path to the database file to the -d / --diamond_db argument when running the pipeline.
Note: Due to the size of the IMG/VR database, building the Diamond database may take several hours or more. It is recommended to use a computer with high processing power and memory for this step (final db size will be more than 65 GB).
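Because the finished database is large, it may be worth confirming that the target disk has enough room before starting the build. The following is a minimal sketch, not part of SOVAP; the 65 GiB default comes from the size mentioned in the note above and should be treated as a lower bound, and the function name is hypothetical.

```python
import shutil

GIB = 1024 ** 3


def has_free_space(directory, required_gib=65):
    """Return True if `directory` has at least `required_gib` GiB free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gib * GIB


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory that will hold the .dmnd file.
    if not has_free_space("."):
        print("Less than 65 GiB free; the Diamond database may not fit here.")
```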
While users have the option to use a Genbank virus database, it is strongly recommended to use the IMG/VR database for the best results. The IMG/VR database is a comprehensive and regularly updated database of curated viral genomes that covers a wide range of viral diversity.
The Centrifuge database is a pre-indexed collection of bacterial and archaeal reference genomes. The pipeline uses it to classify reads to bacterial taxa and to remove bacterial contamination from metagenomic datasets, which improves the accuracy of downstream analysis. Indexing makes this classification fast and allows large datasets to be handled efficiently.
This is a compressed database built from RefSeq genomes of Bacteria and Archaea. The following commands download and extract the Centrifuge database:
mkdir centrifuge
wget https://genome-idx.s3.amazonaws.com/centrifuge/p_compressed_2018_4_15.tar.gz
tar -xzvf p_compressed_2018_4_15.tar.gz -C centrifuge
rm p_compressed_2018_4_15.tar.gz
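After extraction, you may want to confirm that the prefix you plan to pass to `-x` actually matches index files on disk. The sketch below assumes the usual Centrifuge naming scheme, in which an index prefix is accompanied by files named `<prefix>.1.cf`, `<prefix>.2.cf`, and so on; the function name is hypothetical and the check is not part of SOVAP.

```python
from pathlib import Path


def centrifuge_index_files(index_prefix):
    """List the files matching <prefix>.*.cf for a Centrifuge index prefix.

    ASSUMPTION: the index follows the <prefix>.N.cf naming scheme.
    """
    prefix = Path(index_prefix)
    return sorted(prefix.parent.glob(prefix.name + ".*.cf"))


if __name__ == "__main__":
    # Prefix taken from the example commands in this README.
    files = centrifuge_index_files("centrifuge/p_compressed/p_compressed")
    if files:
        print("Found index files:", ", ".join(f.name for f in files))
    else:
        print("No index files found for this prefix; check the -x path.")
```

An empty result usually means the prefix points at the wrong directory level, so compare it against where the archive was actually extracted.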
The Megan database is optional and only needed if you run the pipeline with the --megan flag.
The --megan flag runs the DIAMOND analysis and the DAA-Meganizer step, and requires both the NCBI virus database (in Diamond format) and the Megan database.
To download the Megan database (megan-map-Feb2022.db.zip), use its official website:
https://software-ab.cs.uni-tuebingen.de/download/megan6/welcome.html
To run the Diamond + Megan analysis, you can download our pre-indexed Genbank database here:
The outputs of the SOVAP pipeline are saved in several subdirectories, each containing the files generated by a specific step of the pipeline. The subdirectories are as follows:
0_Logs: Contains log files generated during the execution of the pipeline for each step.
1_Fastp_Report: Contains a report with statistics about the quality control performed by Fastp.
1_Fastp_Output: Contains a trimmed version of the input fastq files generated by Fastp.
2_Centrifuge_Output: Contains the output of Centrifuge classification for bacterial reads, used for filtering and analysis of bacterial contamination.
2_Clean_Reads: Contains a cleaned and trimmed version of the input fastq files generated by SOVAP after bacterial contamination filtering.
3_Megahit_Output: Contains the contigs generated by Megahit assembly of non-bacterial reads.
4_geNomad_Output: Contains the annotation output generated by geNomad.
5_Clusters_Abundance: Contains the clusters of viral contigs generated using CD-HIT and the estimated abundance of each cluster.
6_Diamond-Taxonomy: Contains the output of Diamond classification for viral reads.
If using the --megan flag:
6_Diamond_Megan: Contains the DAA file generated by Diamond and Meganizer step used for visualization and analysis of viral classification with Megan.
Here is a tree diagram that shows the directory structure of the "outputs" directory:
├── 0_Logs
├── 1_Fastp_Report
├── 1_Fastp_Output
├── 2_Centrifuge_Output
├── 2_Clean_Reads
├── 3_Megahit_Output
├── 4_geNomad_Output
├── 5_Clusters_Abundance
├── 6_Diamond-Taxonomy
└── 6_Diamond_Megan (only when using --megan)
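To check how far a run got, you can compare an output directory against the subdirectory names listed above. This is a minimal sketch, not part of SOVAP: the names are copied from this README, the function name is hypothetical, and the code assumes that a complete run produces every listed directory, with 6_Diamond_Megan expected in addition when --megan is used.

```python
from pathlib import Path

# Subdirectory names as listed in this README.
EXPECTED_SUBDIRS = [
    "0_Logs", "1_Fastp_Report", "1_Fastp_Output",
    "2_Centrifuge_Output", "2_Clean_Reads", "3_Megahit_Output",
    "4_geNomad_Output", "5_Clusters_Abundance", "6_Diamond-Taxonomy",
]
MEGAN_SUBDIRS = ["6_Diamond_Megan"]  # only expected with --megan


def missing_outputs(output_dir, megan=False):
    """Return the expected subdirectories that do not exist in `output_dir`."""
    expected = EXPECTED_SUBDIRS + (MEGAN_SUBDIRS if megan else [])
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_outputs(".", megan=False)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The first missing directory in the list points to the step worth inspecting in 0_Logs.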
Poursalavati A. (2023). SOVAP v.1.3 : Soil Virome Analysis Pipeline (1.3). Zenodo. https://doi.org/10.5281/zenodo.7700081
Tools | Web | Cite |
---|---|---|
geNomad and IMG/VR 4 | Link | Paper |
DIAMOND | Link | Paper |
MEGAN | Link | Paper |
Centrifuge | Link | Paper |
Recentrifuge | Link | Paper |
DIAMOND-MEGAN | Link | Paper |
Fastp | Link | Paper |
Megahit | Link | Paper |
BWA-mem | Link | Paper |
CD-HIT | Link | Paper |
Seqkit | Link | Paper |
Samtools | Link | Paper |