sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Question about the fasta cluster iteratively #581

Closed bucongfan closed 1 year ago

bucongfan commented 1 year ago

I have a question about the first core step: iteratively clustering the FASTA sequences with cd-hit, which is also the key step for reducing the number of proteins.

Why do we need to cluster iteratively instead of clustering once, directly at the expected threshold?

This question has always puzzled me, and I hope to get your reply.

Thanks!

andrewjpage commented 1 year ago

It's because you can get overclustering when odd centroids are chosen, and we can use information we already know about the dataset to improve the results.

For example, imagine we have a gene that's 100% identical in every genome, and a similar gene that's 98% identical. Running cd-hit iteratively splits these into 2 clusters (all the genes that are 100% identical in all genomes go in one, the rest go in the other), which makes sense biologically. If you just ran cd-hit once with a 95% threshold, both genes would be clustered together and you would have to split the cluster manually later.
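
To make the example above concrete, here is a toy Python sketch. It is not Roary's code and does not call cd-hit: `identity`, `greedy_cluster`, and `iterative_cluster` are hypothetical helpers, the identity measure is a simple per-position match fraction rather than cd-hit's alignment-based identity, and the threshold schedule (100%, 99%, 98%) and the "set aside clusters that contain every genome" rule are assumptions chosen to mirror the explanation.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions (not cd-hit's measure)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Greedy centroid clustering, loosely in the spirit of cd-hit.

    seqs is a list of (genome, sequence) pairs. Longest sequences are
    processed first; each sequence joins the first centroid it matches at
    or above the threshold, otherwise it becomes a new centroid.
    """
    clusters = []  # list of (centroid_sequence, members)
    for genome, seq in sorted(seqs, key=lambda gs: -len(gs[1])):
        for centroid, members in clusters:
            if identity(centroid, seq) >= threshold:
                members.append((genome, seq))
                break
        else:
            clusters.append((seq, [(genome, seq)]))
    return [members for _, members in clusters]


def iterative_cluster(seqs, genomes, thresholds):
    """Cluster at decreasing thresholds (assumed schedule).

    After each pass, clusters with a member from every genome are set
    aside as finished; only the remaining sequences are clustered again
    at the next, looser threshold.
    """
    all_genomes = set(genomes)
    remaining = list(seqs)
    finished = []
    for threshold in thresholds:
        leftover = []
        for members in greedy_cluster(remaining, threshold):
            if {genome for genome, _ in members} == all_genomes:
                finished.append(members)
            else:
                leftover.extend(members)
        remaining = leftover
    if remaining:
        # Whatever is left is clustered once more at the loosest threshold.
        finished.extend(greedy_cluster(remaining, thresholds[-1]))
    return finished


# Gene A is identical in every genome; gene B differs from A at 2 of 100
# positions, so it is 98% identical to A under the toy measure.
gene_a = "ACGT" * 25
gene_b = "TT" + gene_a[2:]
genomes = ["g1", "g2", "g3"]
seqs = [(g, gene_a) for g in genomes] + [(g, gene_b) for g in genomes]

single_pass = greedy_cluster(seqs, 0.95)
iterative = iterative_cluster(seqs, genomes, [1.00, 0.99, 0.98])

print(len(single_pass))  # 1: both genes merged into one cluster
print(len(iterative))    # 2: gene A and gene B kept apart
```

In this toy dataset the single 95% pass merges both genes into one cluster of six sequences, while the iterative version sets each gene aside as its own cluster during the 100% pass, so the looser thresholds never get a chance to merge them.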