simroux / ClusterGenomes

Archive for ClusterGenomes scripts

Increased running time compared to old ClusterGenomes #3

Closed: LoreVE closed this issue 3 years ago

LoreVE commented 3 years ago

Hi Simon

We recently came across these "new" versions of ClusterGenomes (we were still using the stampede version from Bitbucket). We noticed that while the nucmer step is much faster than before, the actual clustering is a lot slower than it used to be, and the total running time has gone up. Also, AF (alignment fraction?) and wgID (whole genome ID?) are now added to the clstr file. So we were wondering whether the clustering method itself has changed, or whether these values are for information only, in which case we could remove their calculation from the scripts to speed things up. I have tried to figure it out from the scripts, but I'm not very familiar with Perl, so I was wondering if you could help us.

Thanks!

simroux commented 3 years ago

Hi Lore,

Right, wgID is "whole genome ID", i.e. id% and AF combined. Typically, version 5.1 is much quicker on both steps (nucmer and the actual clustering), but this depends somewhat on your dataset, specifically on whether many sequences are similar to each other or most sequences are completely distinct.
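For readers who do not read Perl, here is a minimal Python sketch of how an alignment fraction and a combined whole-genome identity could be derived for one genome pair. It is not taken from the ClusterGenomes scripts: the function names are hypothetical, and combining id% and AF as a simple product is an assumption made for illustration only.

```python
def alignment_fraction(aligned_bases: int, genome_length: int) -> float:
    """Fraction of the genome covered by alignments, capped at 1.0."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return min(aligned_bases / genome_length, 1.0)


def whole_genome_identity(percent_identity: float, af: float) -> float:
    """Combine id% and AF as a product (ASSUMPTION, not the script's confirmed formula)."""
    return percent_identity * af


# Hypothetical pair: 9,000 of 10,000 bp aligned at 96% identity.
af = alignment_fraction(aligned_bases=9_000, genome_length=10_000)
wgid = whole_genome_identity(percent_identity=96.0, af=af)
print(f"AF={af:.2f} wgID={wgid:.1f}")
```

Under this assumption, a pair that aligns at high identity over only a small part of the genome gets a low wgID, which is why reporting AF alongside id% is informative.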

Anyway, my recommendation today for scaling up would be to use "anicalc" and "aniclust" from the CheckV package: https://bitbucket.org/berkeleylab/checkv/src/master/. Instructions are available at the bottom of the CheckV readme ("Supporting code: Rapid genome clustering based on pairwise ANI"). I'll add a note to the Readme here :-)
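To give an idea of what clustering genomes from pairwise ANI can look like, here is a small, self-contained Python sketch of a greedy, centroid-based approach. It is not the CheckV code: the function name, the input layout, and the thresholds (95% ANI, 0.85 AF) are illustrative assumptions, and the real scripts should be run as described in the CheckV readme.

```python
from collections import defaultdict


def greedy_cluster(lengths, hits, min_ani=95.0, min_af=0.85):
    """Greedy centroid clustering (illustrative sketch, not CheckV's aniclust).

    lengths: dict mapping sequence name -> length in bp
    hits:    iterable of (seq_a, seq_b, ani_percent, alignment_fraction)
    Returns a dict mapping each centroid to its member list (centroid first).
    """
    # Keep only pairs that pass both thresholds (threshold values are assumptions).
    neighbours = defaultdict(set)
    for a, b, ani, af in hits:
        if a != b and ani >= min_ani and af >= min_af:
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters = {}
    assigned = set()
    # Visit the longest sequences first so they become the centroids.
    for seq in sorted(lengths, key=lambda s: (-lengths[s], s)):
        if seq in assigned:
            continue
        members = [seq] + sorted(n for n in neighbours[seq] if n not in assigned)
        clusters[seq] = members
        assigned.update(members)
    return clusters


# Hypothetical input: four sequences and three pairwise hits.
lengths = {"A": 50_000, "B": 48_000, "C": 30_000, "D": 12_000}
hits = [
    ("B", "A", 97.5, 0.92),  # passes both thresholds
    ("C", "A", 96.0, 0.60),  # AF too low, so C is not joined to A
    ("D", "C", 99.0, 0.95),  # passes both thresholds
]
clusters = greedy_cluster(lengths, hits)
for centroid, members in clusters.items():
    print(centroid, members)
```

With this kind of scheme, the clustering step only walks a precomputed list of passing pairs, so its cost depends mostly on how many sequences are similar to each other.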

Thanks! Best, Simon