Vclust is an alignment-based tool for fast and accurate calculation of Average Nucleotide Identity (ANI) between complete or metagenome-assembled viral genomes. The tool also performs ANI-based clustering of genomes according to standards recommended by international virus consortia, including the International Committee on Taxonomy of Viruses (ICTV) and Minimum Information about an Uncultivated Virus Genome (MIUViG).
Vclust uses a Lempel-Ziv-based pairwise sequence aligner (LZ-ANI) for ANI calculation. LZ-ANI achieves high sensitivity in detecting matched and mismatched nucleotides, ensuring accurate ANI determination. Its efficiency comes from a simplified indel handling model, making LZ-ANI orders of magnitude faster than other alignment-based tools (e.g., BLASTn, MegaBLAST) while maintaining accuracy comparable to the most sensitive BLASTn searches.
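To make the idea of ANI concrete, the following minimal Python sketch computes identity over a set of local alignments as matched nucleotides divided by aligned nucleotides. The `LocalAlignment` record, its fields, and the exact formula are illustrative assumptions; they do not describe LZ-ANI's internal data structures or its output format.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LocalAlignment:
    """Hypothetical record for one local alignment between two genomes."""
    matches: int      # aligned positions with identical nucleotides
    mismatches: int   # aligned positions with different nucleotides


def ani(alignments: Iterable[LocalAlignment]) -> float:
    """Return matched / aligned nucleotides summed over all local alignments."""
    matched = 0
    aligned = 0
    for aln in alignments:
        matched += aln.matches
        aligned += aln.matches + aln.mismatches
    # With no aligned positions there is no measurable identity.
    return matched / aligned if aligned else 0.0


if __name__ == "__main__":
    example = [LocalAlignment(950, 50), LocalAlignment(480, 20)]
    print(f"ANI = {ani(example):.4f}")
```

Summing matches and aligned lengths before dividing weights each alignment by its length, so a short, highly similar region cannot dominate the result.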
Vclust offers multiple similarity measures between two genome sequences.
Vclust provides six clustering algorithms tailored to various scenarios, including taxonomic classification and dereplication of viral genomes.
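As a rough illustration of threshold-based clustering on pairwise ANI values, the sketch below groups genomes with a generic single-linkage approach built on union-find. It is a textbook technique chosen for brevity; this section does not name the six algorithms, so the code is not presented as Vclust's implementation. The function name `single_linkage` and the tuple layout `(genome_a, genome_b, ani)` are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage(
    genomes: Iterable[str],
    pairs: Iterable[Tuple[str, str, float]],
    min_ani: float,
) -> List[List[str]]:
    """Group genomes linked by at least one pair with ANI >= min_ani.

    Every genome named in `pairs` must also appear in `genomes`.
    """
    parent: Dict[str, str] = {g: g for g in genomes}

    def find(x: str) -> str:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, value in pairs:
        if value >= min_ani:
            parent[find(a)] = find(b)  # merge the two groups

    clusters: Dict[str, List[str]] = {}
    for g in parent:
        clusters.setdefault(find(g), []).append(g)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())
```

For example, with genomes `A` to `D`, pairs `A-B` at 0.97, `B-C` at 0.96, and `C-D` at 0.80, a threshold of 0.95 places `A`, `B`, and `C` in one cluster and leaves `D` on its own, even though `A` and `C` were never compared directly.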
Vclust uses three efficient C++ tools (Kmer-db, LZ-ANI, and Clusty) for prefiltering, aligning, calculating ANI, and clustering viral genomes. This combination enables the processing of millions of virus genomes within a few hours on a mid-range workstation.
Vclust is also available online at http://www.vclust.org for datasets containing up to 1000 viral genomes.
# Install Vclust (requires Python >= 3.7)
pip install vclust
# Prefilter similar genome sequence pairs before conducting pairwise alignments.
vclust prefilter -i example/multifasta.fna -o fltr.txt
# Align similar genome sequence pairs and calculate pairwise ANI measures.
vclust align -i example/multifasta.fna -o ani.tsv --filter fltr.txt
# Cluster genome sequences based on given ANI measure and minimum threshold.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
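The three steps above can also be driven from a script. The sketch below is one possible Python wrapper, assuming the `vclust` executable is on `PATH`; the helper names `build_commands` and `run_pipeline` are hypothetical, while the subcommands, flags, and file names are copied from the commands above.

```python
import subprocess
from typing import List


def build_commands(fasta: str, min_ani: float = 0.95) -> List[List[str]]:
    """Return the prefilter, align, and cluster commands as argument lists."""
    return [
        ["vclust", "prefilter", "-i", fasta, "-o", "fltr.txt"],
        ["vclust", "align", "-i", fasta, "-o", "ani.tsv",
         "--filter", "fltr.txt"],
        ["vclust", "cluster", "-i", "ani.tsv", "-o", "clusters.tsv",
         "--ids", "ani.ids.tsv", "--metric", "ani", "--ani", str(min_ani)],
    ]


def run_pipeline(fasta: str, min_ani: float = 0.95) -> None:
    """Run each step in order; check=True stops at the first failing step."""
    for cmd in build_commands(fasta, min_ani):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("example/multifasta.fna")` would execute the same sequence as the shell commands. Passing argument lists rather than a single shell string avoids quoting problems when input paths contain spaces.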
The Vclust documentation is available on the GitHub Wiki.
Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. bioRxiv [doi:10.1101/2024.06.27.601020].