#####################################################################
#### __ __(_) _ __ ___ ___ _ __ | |_ ___ _ __ |___ \ ####
#### \ \ / /| || '__|/ __| / _ \ | '__|| __|/ _ \| '__| __) | ####
#### \ V / | || | \__ \| (_) || | | |_| __/| | / __/ ####
#### \_/ |_||_| |___/ \___/ |_| \__|\___||_| |_____| ####
#####################################################################
VirSorter2 applies a multi-classifier, expert-guided approach to detect diverse DNA and RNA virus genomes. It has made major updates to its previous version:
See more details in the publicaiton.
Mamba is the easiest way to install VirSorter2. If you do not have mamba installed, it can be installed following this link.
mamba create -n vs2 -c conda-forge -c bioconda virsorter=2
mamba activate vs2
The development version is most updated. To install the development version:
mamba create -n vs2 -c conda-forge -c bioconda "python>=3.6,<=3.10" scikit-learn=0.22.1 imbalanced-learn pandas seaborn hmmer==3.3 prodigal screed ruamel.yaml "snakemake>=5.18,<=5.26" click "conda-package-handling<=1.9"
mamba activate vs2
git clone https://github.com/jiarong/VirSorter2.git
cd VirSorter2
pip install -e .
If you have apptainer (formerly known as singularity) installed (typical in HPC clusters), the following is the most convenient. Also use the option if you have issues with conda.
apptainer build virsorter2.sif docker://jiarong/virsorter:latest
You will get a file virsorter2.sif
, which is a singularity image that can be run like a binary executable file. You can use the absolute path of this file to replace virsorter
in commands for the rest of the tutorial. Also this image has the database and dependencies included, so you can skip the Download database and dependencies step below.
Before running VirSorter2, users must download all databases and install dependencies (takes 10+ mins, but this only need to be done once). The following command line downloads databases and dependencies to db
directory, and its location is recorded in the tool configuration as a default, so you do not need to type --db-dir
for other virsorter
subcommands.
Note that this step takes ~ 10 mins. If you feel impatient and cancel the process, make sure to remove the diretory specified by -d/--db-dir
(db
in this case) before running again.
# just in case there is a failed attemp before;
# remove the whole diretory specified by -d
rm -rf db
# run setup
virsorter setup -d db -j 4
If you have issues with setup the database with the above setup
command (e.g. server firewall setting does not allow wget
), you can download the database manually at https://osf.io/v46sc/download
. The name of downloaded file should be db.tgz
if downloaded from a brower. Next move/upload it to the current diretory and then run the following commands to extract the database and setup configuration:
tar -xzf db.tgz
virsorter config --init-source --db-dir=./db
To run viral sequence identification on a test dataset:
# fetch testing data
wget -O test.fa https://raw.githubusercontent.com/jiarong/VirSorter2/master/test/8seq.fa
# run classification with 4 threads (-j) and test-out as output diretory (-w)
virsorter run -w test.out -i test.fa --min-length 1500 -j 4 all
ls test.out
Due to the large HMM database that VirSorter2 uses, this small dataset takes a few mins to finish. In the output directory (test.out), three files are useful:
final-viral-combined.fa
: identified viral sequencesfinal-viral-score.tsv
: table with score of each viral sequences across groups and a few more key features, which can be used for further filteringfinal-viral-boundary.tsv
: table with boundary information; This is a intermediate file that 1) might have extra records compared to other two files and should be ignored; 2) do not include the viral sequences w/ < 2 gene but have >= 1 hallmark gene; 3) the group
and trim_pr
are intermediate results and might not match the max_group
and max_score
respectively in final-viral-score.tsv
More details about each of these output files can be found here.
NOTE
Note that suffix ||full
, ||lt2gene
and ||{i}_partial
({i}
can be numbers starting from 0 to max number of viral fragments found in that contig) have been added to original sequence names to differentiate sub-sequences in case of multiple viral subsequences found in one contig. Partial sequences can be treated as proviruses since they are extracted from longer host sequences. Full sequences, however, can be proviruses or free virus since it can be a short fragment sequenced from a provirus region. Moreover, "full" sequences are just sequences with strong viral signal as a whole ("nearly full" is more accurate). They might be trimmed due to partial gene overhang at ends, duplicate segments from circular genomes, and an end trimming step for all identified viral sequences to find the optimal viral segments (longest within 95% of peak score by default). Again, the "full" sequences trimmed by the end trimming step should not be interpreted as provirus, since genes that have low impact on score, such as unknown gene or genes shared by host and virus, could be trimmed. If you prefer the full sequences (ending with ||full) not to be trimmed and leave it to specialized tools such as checkV, you can use --keep-original-seq
option.
The default score cutoff (0.5) works well known viruses (RefSeq). For the real environmental data, we can expect to get false positives (non-viral) with the default cutoff. Generally, samples with more host (e.g. bulk metaG) and unknown sequences (e.g. soil) tends to have more false positives. We find a score cutoff of 0.9 work well as a cutoff for high confidence hits, but there are also many viral hits with score <0.9. It's difficult to separate the viral and non-viral hits by score alone. So we recommend using the default score cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV. Here is a tutorial of viral identification SOP used in Sullivan Lab.
--include-groups
)The available options are dsDNAphage, NCLDV, RNA, ssDNA virus, and lavidaviridae. The default is dsDNAphage and ssDNA (changed from all groups since version 2.2), suitable for those only interested in phage. If you are only interested in RNA virus, you can run:
rm -rf test.out
virsorter run -w test.out -i test.fa --include-groups RNA -j 4 all
--min-score
and --classify
)VirSorter2 takes one positional argument, all
or classify
. The default is all
, which means running the whole pipeline, including 1) preprocessing, 2) annotation (feature extraction), and 3) classification. The main computational bottleneck is the annotation step, taking about 95% of CPU time. In case you just want to re-run with different score cutoff (--min-score), classify
argument can skip the annotation steps, and only re-run only the classify step.
virsorter run -w test.out -i test.fa --include-groups "dsDNAphage,ssDNA" -j 4 --min-score 0.8 classify
The above overwrites the previous final output files. If you want to keep previous results, you can use --label
to add a prefix to the new final output files.
virsorter run -w test.out -i test.fa --include-groups "dsDNAphage,ssDNA" -j 4 --min-score 0.9 --label rerun classify
--provirus-off
)In case you need to have some results quickly, there are two options: 1) turn off provirus step with --provirus-off
; this reduces sensitivity on sequences that are only partially viral; 2) subsample ORFs from each sequence with --max-orf-per-seq
; This option subsamples ORFs if a sequence has more ORFs than the number provided. Note that this option is only availale when --provirus-off
is used.
rm -rf test.out
virsorter run -w test.out -i test.fa --provirus-off --max-orf-per-seq 20 all
You can run virsorter run -h
to see all options. VirSorter2 is a wrapper around snakemake, a great pipeline management tool designed for reproducibility, and running on computer clusters. All snakemake options still work with VirSorter2, and users can simply append those snakemake option to virsorter options (after all
or classify
). For example, the --forceall
snakemake option can be used to re-run the pipeline.
virsorter run -w test.out -i test.fa --provirus-off --max-orf-per-seq 20 all --forceall
When you re-run any VirSorter2 command, it will pick up at the step (rule in snakemake term) where it stopped last time. It will do nothing if it succeeded last time. The --forceall
option can be used to enforce the re-run.
DRAMv is a tool for annotating viral contigs identified by VirSorter. It needs two input files from VirSorter: 1) viral contigs, 2) affi-contigs.tab
that have info on viral/nonviral and hallmark genes along contigs. In VirSorter2, these files can be generated by --prep-for-dramv
flag.
rm -rf test.out
virsorter run --prep-for-dramv -w test.out -i test.fa -j 4 all
ls test.out/for-dramv
final-viral-combined.fa
identified viral sequences, including two types:
- full sequences identified as viral (identified with suffix
||full
);- partial sequences identified as viral (identified with suffix
||{i}_partial
); here{i}
can be numbers starting from 0 to max number of viral fragments found in that contig;- short (less than two genes) sequences with hallmark genes identified as viral (identified with suffix
||lt2gene
);
final-viral-score.tsv
This table can be used for further screening of results. It includes the following columns:
- sequence name
- score of each viral sequences across groups (multiple columns)
- max score across groups
- max score group
- contig length
- hallmark gene count
- viral gene %
- nonviral gene %
NOTE
Note that classifiers of different viral groups are not exclusive from each other, and may have overlap in their target viral sequence space, which means this information should not be used or considered as reliable taxonomic classification. We limit the purpose of VirSorter2 to viral idenfication only.
final-viral-boundary.tsv
only some of the columns in this file might be useful:
- seqname: original sequence name
- trim_orf_index_start, trim_orf_index_end: start and end ORF index on orignal sequence of identified viral sequence
- trim_bp_start, trim_bp_end: start and end position on orignal sequence of identified viral sequence
- trim_pr: score of final trimmed viral sequence
- partial: full sequence as viral or partial sequence as viral; this is defined when a full sequence has score > score cutoff, it is full (0), or else any viral sequence extracted within it is partial (1)
- pr_full: score of the original sequence
- hallmark_cnt: hallmark gene count
- group: the classifier of viral group that gives high score; this should NOT be used as reliable classification
NOTE
VirSorter2 tends to sometimes overestimate the size of viral sequence during provirus extraction procedure in order to achieve better sensitity. We recommend cleaning these provirus predictions to remove potential host genes on the edge of the predicted viral region, e.g. using a tool like CheckV (https://bitbucket.org/berkeleylab/checkv).
VirSorter2 currently has classifiers of five viral groups (dsDNAphage, NCLDV, RNA, ssNA virus, and lavidaviridae). It's designed for easy addition of more classifiers. The information of classifiers are store in the database (-d
) specified during setup step. For each viral group, it needs four files below:
model
random forest classifier model for identifying viral sequences
customized.hmm (optional)
a collection of viral HMMs for gene annotation; if not specified, the one in
db/hmm/viral/combined.hmm
is used.
hallmark-gene.list (optional)
names of hallmark gene hmm in the above viral hmm database file; These hallmark gene hmms can be collected by literature search or identified by comparing hallmark gene sequences (protein) against HMMs database above with
hmmsearch
; if not specified, no hallmark genes are counted in feature table
rbs-prodigal-train.db (optional)
prodigal RBS (ribosomal binding site) motif training model; this can be produced with
-t
option in prodigal; This is useful feature for NCLDV due to large genome size for training; For other viral groups, it's OK to skip this file.
In this tutorial, I will show how to make model
for the autolykiviridae family.
First, prepare the dataset needed: 1) high quality viral genomes 2) protein sequence of hallmark gene; and install two more dependecies.
# download genome sequences
wget https://github.com/jiarong/small-dataset/raw/master/autolyki/vibrio_autolyki.fna.gz -O autolyki.fna.gz
# download hallmark gene seqs
wget https://raw.githubusercontent.com/jiarong/small-dataset/master/autolyki/DJR.fa -O DJR.fa
# download source code
git clone https://github.com/jiarong/VirSorter2.git
# install two more dependencies
conda install -c bioconda -y screed hmmer
Then identify hallmark gene HMMs by protein sequences of hallmark genes
Note that we will need the VirSorter2 database here. If you skip the tutorial above, you can download the database by virsorter setup -d db -j 4
. This will take 10+ mins.
# compare all HMMs and protein sequences of hallmark gene
# this will take 10+ mins due to large hmm database file
hmmsearch -T 50 --tblout DJR.hmmtbl --cpu 4 -o /dev/null db/hmm/viral/combined.hmm DJR.fa
# get HMMs names that are signicant hits with protein sequences of hallmark gene
python VirSorter2/virsorter/scripts/prepdb-train-get-seq-from-hmm-domtbl.py 50 DJR.hmmtbl > hallmark-gene.list
With hallmark-gene.list
and the high quality genomes autolyki.fna.gz
, you can train the features that are used for the classifier model.
# train feature file
virsorter train-feature --seqfile autolyki.fna.gz --hallmark hallmark-gene.list --hmm db/hmm/viral/combined.hmm --frags-per-genome 5 --jobs 4 -w autolyki-feature.out
# check output
ls autolyki-feature.out
In the output directory (autolyki-feature.out
), all.pdg.ftr
is the feature file needed for next step.
To make the classifier model, we also need a feature file from cellular organisms. This can be done by collecting genomes from cellular organisms and repeat the above step. Note number of cellular genomes are very large (>200K). Here I will re-use the feature file I have prepared before.
# fetch feature file for cellular organisms
wget https://zenodo.org/record/3823805/files/nonviral-common-random-fragments.ftr.gz?download=1 -O nonviral.ftr.gz
gzip -d nonviral.ftr.gz
# train the classifier model
virsorter train-model --viral-ftrfile autolyki-feature.out/all.pdg.ftr --nonviral-ftrfile nonviral.ftr --balanced --jobs 4 -w autolyki-model.out
In autolyki-model.out
, feature-importances.tsv
shows the importance of each feature used. model
is the classifier model we need. Then put the model
and hallmark-gene.list
in database directory as the existing viral groups. Note that only letters are allowed for group directory under db/group/
.
### attention: only letters (both upper and lower case) are allowed in group names
mkdir db/group/autolykiviridae
cp autolyki-model.out/model db/group/autolykiviridae
cp hallmark-gene.list db/group/autolykiviridae/
Now you can try this new classifier on the testing dataset, and compare with dsDNAphage
classifier:
# download the testing dataset
wget -O test.fa https://raw.githubusercontent.com/jiarong/VirSorter2/master/test/8seq.fa
# identify viral sequences in testing dataset; it takes 10+ mins;
virsorter run -w autolyki-model-test.out -i test.fa --include-groups "dsDNAphage,autolykiviridae" -j 4 --min-score 0.8 all
# check the scores in two classifiers
cat autolyki-model-test.out/final-viral-score.tsv
A: Generally, those with score >0.9 are high confidence. Those with score between 0.5 and 0.9 could be a mixture of viral and non-viral. It's hard to find a optimal score separating viral and non-viral since it depends on % of host sequence and unknown sequences. So we recommend using the default cutoff (0.5) for maximal sensitivity and then applying a quality checking step using checkV to for removing false positives (other than predicting completeness). Here is the viral identification SOP in the Sullivan Lab.
No module name screed
)?A: This is usually caused by the incompatibility between two different package/environment managing tools: Modules (module load
) and conda (conda activate
). There are two solutions: 1) install conda on your own (user level) instead of using the system conda, and thus avoiding module load
; 2) Sometimes server system admins discourage users to install conda at user level. If so, you can remove the module load
or module use
in batch scripts, and instead run them interactively in the terminal to load necessary packages before submitting the batch scripts.
A: MacOS is not supported currently. VirSorter2 runs are typically computationally expensive, and should be run in servers (typically Linux). VirSorter2 leverages large viral protein HMM reference databases to achieve its high sensitivity, and the flip side is that its computationally expensive.
A: Here are a few ways: 1) use more cpu cores (-j); 2) filter your contigs on length (>1500 or >5000) with --min-length
; 3) reduce the viral groups in --include-groups
. For most people interestd in phage, only dsDNAphage and ssDNA are needed, which is the default since version 2.2; 4) increase the threads for hmmsearch (the default is 2) by virsorter config --set HMMSEARCH_THREADS=4
. Usually the IO is the bottleneck, not the CPU though.
A: Only partially viral sequences (ending with _partial) can be confirmed as provirus. Fully viral sequences (ending with ||full) in VirSorter2 defined as contigs with significant viral signal (score >=0.8) as a whole sequene. Thus some could be provirus too: 1) they could be a fragment from within a provirus; 2) the whole sequence has strong viral signal in spite of some host genes at ends.
A: The provirus boundary dectection algorithm in VirSorter2 tends to overextend to host regions. VirSorter2 estimate boundaries by looking at the peak score of sub-sequences and then overextend a bit (within 95% of peak score by default). This is a design decision so that predicted viral sequences can be further passed to a more specialized provirus boundary prediction tool.
A: The default score cutoff (0.5) has high sensitivity but also brings in many non-viral sequences. For phages, we recommend using checkV to remove those non-viral sequences following this protocol. See more details in the answer to the how should I pick a score cutoff.
A: There are three situations that a fully viral sequence can be trimmed. 1) VirSorter2 is based on genes called by prodigal. A few bases overhang beyond the first and last gene are ignored by prodigal and also ignored VirSorter2 by default; 2) Circular sequences are usually split in the middle of a gene and have duplicate segments. VirSorter2 trims the duplicate segments and fixes the split gene by moving the partial gene the start to the end. 3) fully viral sequences only means the whole sequence has significant viral signal (score >=0.95 by default), but VirSorter2 still applies an end trimming step (10% of genes on each end) on them to find the optimal viral segments (longest within 95% of peak score by default). Again, the "full" sequences trimmed by the end trimming step should not be interpreted as provirus, since genes that have low impact on score, such as unknown gene or genes shared by host and virus, could be trimmed. If you prefer the full sequences (ending with ||full) not to be trimmed and leave it to specialized tools such as checkV, you can use --keep-original-seq
option.
VirSorter 2 is jointly developed by the Sullivan Lab at Ohio State University (https://u.osu.edu/viruslab/) and the Viral Genomics Group at the DOE Joint Genome Institute (https://jgi.doe.gov/our-science/scientists-jgi/viral-genomics/). Funding was provided by NSF (#OCE1829831, #ABI1758974), the U.S. Department of Energy (#DE-SC0020173), and the Gordon and Betty Moore Foundation (#3790). The work conducted by the U.S. Department of Energy Joint Genome Institute is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.