Vclust clustering more different than MMseqs compared to anicalc

LanderDC commented 4 months ago

Hi,

I'm trying out your tool because I want to replace our lab's current workflow, which uses megaBLAST + CheckV's anicalc/aniclust, with something faster. Based on your tweet, Vclust looks like a good fit, since it can use the same clustering algorithms as anicalc/aniclust.

However, when I compare the Vclust UCLUST, Vclust Leiden, and MMseqs linclust clusterings to the original anicalc/aniclust clustering, MMseqs linclust comes out closest to the original (as measured by the adjusted Rand index), which is the opposite of what I expected. Do you have any idea why that might be?
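For anyone reproducing this kind of comparison, here is a minimal sketch of the adjusted Rand index computed from two cluster assignments. It is not the script used for the figures in this thread; the function names and the `{genome_id: cluster_id}` input format are assumptions made for illustration, and only genomes present in both assignments are compared.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # nothing to disagree on
    # Pairs of items that are co-clustered in both, in A only, in B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / comb(n, 2)
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every item in its own cluster
    return (both - expected) / (max_index - expected)


def ari_from_assignments(assign_a, assign_b):
    """Compare two {genome_id: cluster_id} dicts on their shared genomes."""
    shared = sorted(set(assign_a) & set(assign_b))
    return adjusted_rand_index(
        [assign_a[g] for g in shared],
        [assign_b[g] for g in shared],
    )


if __name__ == "__main__":
    # Hypothetical toy assignments; cluster names differ but groupings match.
    reference = {"g1": "c1", "g2": "c1", "g3": "c2", "g4": "c2"}
    candidate = {"g1": "x", "g2": "x", "g3": "y", "g4": "y"}
    print(ari_from_assignments(reference, candidate))
```

The index depends only on which genomes are grouped together, not on the cluster names, so clusterings from different tools can be compared directly once they are keyed by genome ID.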

In addition, how does Vclust handle permuted circular genomes?

Thanks in advance!

[image]

The commands I used:

# MegaBLAST + anicalc/aniclust
blastn -query <my_seqs.fna> -db <my_db> -outfmt '6 std qlen slen' -max_target_seqs 10000 -out <my_blast.tsv> -num_threads 72 -perc_identity 90 && \
anicalc.py -i <my_blast.tsv> -o <my_ani.tsv> && \
aniclust.py --fna <my_seqs.fna> --ani <my_ani.tsv> --out <my_clusters.tsv> --min_ani 95 --min_tcov 85 --min_qcov 0

# MMseqs linclust
mmseqs easy-linclust {file} {file}_mmseqs /tmp --max-seq-len 1000000 --threads 72 --wrapped-scoring 1 --cluster-mode 2 --min-seq-id 0.95 --cov-mode 1 -c 0.85 --kmer-per-seq-scale 0.4

# Vclust UCLUST
vclust.py prefilter -t 72 -i {file} -o fltr.txt --min-kmers 30 --min-ident 0.90 && \
vclust.py align -t 72 -i {file} -o ani.tsv --filter fltr.txt --out-ani 0.9 && \
vclust.py cluster -i ani.tsv -o {file}_vclust_uclust.tsv --ids ani.ids.tsv --algorithm uclust --metric ani --ani 0.95 --cov 0.85 --out-repr

# Vclust Leiden
vclust.py prefilter -t 72 -i {file} -o fltr.txt --min-kmers 30 --min-ident 0.90 && \
vclust.py align -t 72 -i {file} -o ani.tsv --filter fltr.txt --out-ani 0.9 && \
vclust.py cluster -i ani.tsv -o {file}_vclust_leiden.tsv --ids ani.ids.tsv --algorithm leiden --metric ani --ani 0.95 --cov 0.85 --out-repr

aziele commented 4 months ago

Hi,

Thanks for reaching out! While the aniclust.py documentation mentions UCLUST-like clustering, it actually performs CD-HIT clustering. In MMseqs2, this corresponds to --cluster-mode 2, which you are using. It would be interesting to compare the tools using the same clustering algorithm (in Vclust, that's --algorithm cd-hit).
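To make the idea of greedy incremental (CD-HIT-style) clustering concrete, here is a minimal sketch. It is not the implementation used by aniclust.py, MMseqs2, or Vclust: the function name and input format are hypothetical, only an ANI threshold is applied (coverage is left out for brevity), and each genome joins the first representative that passes the threshold.

```python
def greedy_incremental_clustering(lengths, ani, min_ani=0.95):
    """Greedy length-sorted clustering (illustrative only).

    lengths: {genome_id: genome_length}
    ani:     {(genome_a, genome_b): ANI as a fraction}; missing pairs count as 0
    Returns  {genome_id: representative_id}
    """

    def similarity(a, b):
        # Accept the pair in either orientation.
        return max(ani.get((a, b), 0.0), ani.get((b, a), 0.0))

    # Longest genomes first; ties broken by ID so the result is deterministic.
    order = sorted(lengths, key=lambda g: (-lengths[g], g))

    representatives = []
    assignment = {}
    for genome in order:
        for rep in representatives:
            if similarity(genome, rep) >= min_ani:
                assignment[genome] = rep  # join an existing cluster
                break
        else:
            representatives.append(genome)  # start a new cluster
            assignment[genome] = genome
    return assignment


if __name__ == "__main__":
    # Hypothetical toy data.
    lengths = {"A": 100, "B": 90, "C": 80, "D": 70}
    ani = {("B", "A"): 0.97, ("C", "B"): 0.96, ("C", "A"): 0.90, ("D", "C"): 0.99}
    print(greedy_incremental_clustering(lengths, ani))
```

In the toy data, C is similar to B but not to the representative A, so C starts its own cluster. Members are only ever compared against representatives, which is one reason different clustering algorithms can disagree on the same ANI table.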

Regarding permuted circular genomes, Vclust should work fine. Like BLAST, it identifies local alignments between two genomes (similar to HSPs in BLAST) and calculates ANI from those local alignments. In the worst case, ANI might be slightly underestimated due to short alignment discontinuities at the breakpoints of circular genomes.
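The reasoning above can be illustrated with a toy example. This is a sketch, not Vclust's code: it assumes ANI is taken as identical bases divided by aligned bases summed over local alignments, and the helper names and the toy sequence are made up for illustration. A rotated copy of a genome splits into two blocks, each of which matches the original exactly, so summing over both local alignments still recovers full identity and coverage.

```python
def rotate(seq, k):
    """Return the circular permutation of seq that starts at position k."""
    k %= len(seq)
    return seq[k:] + seq[:k]


def ani_and_coverage(local_alignments, genome_length):
    """local_alignments: list of (aligned_length, identical_bases) tuples."""
    aligned = sum(length for length, _ in local_alignments)
    identical = sum(ident for _, ident in local_alignments)
    ani = identical / aligned if aligned else 0.0
    coverage = aligned / genome_length
    return ani, coverage


# Hypothetical toy genome and breakpoint.
genome = "ACGTTGCAAGGCTTAACCGTAGGATCCA"
k = 10
permuted = rotate(genome, k)

# The two blocks on either side of the breakpoint appear intact in the
# permuted genome, just in swapped order.
block_a, block_b = genome[k:], genome[:k]
assert permuted.startswith(block_a) and permuted.endswith(block_b)

# Treat each block as one perfect local alignment.
local_alignments = [
    (len(block_a), len(block_a)),
    (len(block_b), len(block_b)),
]
ani, coverage = ani_and_coverage(local_alignments, len(genome))
print(ani, coverage)
```

With real sequences, an aligner may not extend each local alignment right up to the breakpoint, which shortens the alignments slightly and corresponds to the small underestimation mentioned above.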

Andrzej

LanderDC commented 4 months ago

Thanks! You are right, Vclust with the CD-HIT algorithm is much more similar to aniclust.py:

[image]

refresh-bio/vclust, issue #7