refresh-bio / vclust

Fast and accurate tool for calculating Average Nucleotide Identity (ANI) and clustering virus genomes and metagenomes
GNU General Public License v3.0

Min. query coverage of vclust cluster when classifying viruses into species following ICTV standards #17

Open lingyi-owl opened 3 days ago

lingyi-owl commented 3 days ago

Hi, I use the vclust script to classify viruses into species and genera following ICTV standards.

The command for assigning viruses to putative species (tANI ≥ 95%) is: vclust cluster -i ani.tsv -o species.tsv --ids ani.ids.tsv --algorithm complete --metric tani --tani 0.95

What is the minimum query coverage used in this script?

Thanks in advance, Lingyi

aziele commented 3 days ago

Hi,

The tANI measure is equivalent to the intergenomic similarity used by VIRIDIC. Unlike ANI, which is calculated with respect to the alignment length, tANI takes into account the full lengths of the genomes being compared. This means that tANI reflects the nucleotide identity between two genome sequences, assuming both genomes have 100% coverage. Therefore, it is only appropriate to use tANI when working with complete genomes.

The formula for tANI is as follows:

tANI = (idAB + idBA) / (lenA + lenB) × 100

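where idAB and idBA are the numbers of identical nucleotides in the alignment of genome A against genome B and of genome B against genome A, respectively, and lenA and lenB are the lengths of genomes A and B.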
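As a rough illustration of the formula above, here is a minimal Python sketch. It assumes that idAB and idBA are counts of identical nucleotides for each alignment direction and that lenA and lenB are full genome lengths; the function name `tani` and the numbers are hypothetical and are not taken from Vclust itself.

```python
def tani(id_ab, id_ba, len_a, len_b):
    """Return tANI as a fraction in [0, 1] (multiply by 100 for a percentage).

    Assumption: id_ab and id_ba are counts of identical nucleotides for the
    A-vs-B and B-vs-A alignments; len_a and len_b are full genome lengths.
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("genome lengths must be positive")
    return (id_ab + id_ba) / (len_a + len_b)


# Hypothetical pair of complete genomes (illustrative numbers only).
value = tani(id_ab=38_500, id_ba=38_300, len_a=40_000, len_b=40_400)
percent = value * 100          # same value expressed as in the formula above
passes = value >= 0.95         # comparison against a 0.95 threshold
```

Because the denominator is the sum of the full genome lengths, any unaligned region lowers the value in the same way as mismatches do.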

When clustering with --tani 0.95, Vclust will connect genome pairs that have a tANI value ≥ 95%.
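The thresholding step alone could be sketched as follows. This only shows which pairs would be connected at a given cutoff; it does not implement the complete-linkage clustering selected by --algorithm complete. The genome names, values, and the helper `connected_pairs` are hypothetical.

```python
TANI_THRESHOLD = 0.95  # same scale as the --tani option (fraction, not percent)

# Hypothetical (query, reference, tANI) records; not real Vclust output.
pairs = [
    ("virusA", "virusB", 0.97),
    ("virusA", "virusC", 0.91),
    ("virusB", "virusC", 0.95),
]


def connected_pairs(records, threshold):
    """Keep the genome pairs whose tANI is at or above the threshold."""
    return [(q, r) for q, r, t in records if t >= threshold]


edges = connected_pairs(pairs, TANI_THRESHOLD)
```

In this toy example the pair at exactly 0.95 is kept, since the comparison is inclusive (≥).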

Best, Andrzej