Question about the fasta cluster iteratively

sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis

It's because a single clustering pass can overcluster when odd centroids are chosen, and we can use information we already know about the dataset to improve the results.

For example, imagine we have a gene that's 100% identical in every genome, and a similar gene that's 98% identical. Running cd-hit iteratively splits these into two clusters (all the genes that are 100% identical in all genomes go in one, the rest go in the other), which makes sense biologically. If you ran cd-hit just once with a 95% threshold, both genes would be clustered together and you would have to split the cluster manually later.
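To make the example concrete, here is a toy sketch in plain Python. It does not call cd-hit or reproduce Roary's code. The identity measure, the greedy centroid rule, the threshold schedule, and all function names are assumptions made for illustration, and it assumes the second gene is 98% identical to the first and the same in every genome.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions in equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Greedy centroid clustering (simplified stand-in for a cd-hit pass).

    Each (genome, sequence) pair joins the first cluster whose centroid
    (its first member) matches at or above the threshold.
    """
    clusters = []
    for genome, seq in seqs:
        for cluster in clusters:
            if identity(cluster[0][1], seq) >= threshold:
                cluster.append((genome, seq))
                break
        else:
            clusters.append([(genome, seq)])
    return clusters


def iterative_cluster(seqs, genomes, thresholds):
    """Cluster at descending thresholds, setting aside complete clusters.

    A cluster is treated as complete when it has a member from every genome
    (an assumption for this sketch). Complete clusters are kept as they are,
    and only the remaining sequences move on to the next, looser threshold.
    """
    remaining = list(seqs)
    final = []
    for threshold in thresholds:
        clusters = greedy_cluster(remaining, threshold)
        remaining = []
        for cluster in clusters:
            if {genome for genome, _ in cluster} == genomes:
                final.append(cluster)
            else:
                remaining.extend(cluster)
    if remaining:
        # Whatever never became complete is clustered at the loosest threshold.
        final.extend(greedy_cluster(remaining, thresholds[-1]))
    return final


genomes = {"g1", "g2", "g3"}
gene_a = "ACGT" * 25            # 100 bases, identical in every genome
gene_b = "TT" + gene_a[2:]      # differs from gene_a at 2 positions (98%)
seqs = [(g, gene_a) for g in sorted(genomes)] + [(g, gene_b) for g in sorted(genomes)]

iterative = iterative_cluster(seqs, genomes, [1.0, 0.995, 0.99, 0.985, 0.98])
single = greedy_cluster(seqs, 0.95)

print("iterative clusters:", len(iterative))
print("single-pass clusters at 95%:", len(single))
```

In this toy data the iterative run keeps the two genes in separate clusters, while the single 95% pass merges all six sequences into one.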

Question about the fasta cluster iteratively #581