gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0
90 stars 21 forks source link

Description of connected components clustering algorithm #46

Closed pol-delta closed 1 year ago

pol-delta commented 1 year ago

Apologies if I missed this somewhere, but can you please provide a brief description of the "connected components" clustering algorithm and how it differs from message passing and sphere clustering? I couldn't find anything on it in the documentation or published paper.

By the way, Starcode is great, so thank you very much for all your work on it!

gui11aume commented 1 year ago

The spheres clustering algorithm sorts the sequences from highest to lowest occurrence. The sequence with highest occurrence is made a centroid and all the sequences within a distance dist of it are added to the cluster. All the sequences of the cluster are removed from the list and the process is repeated starting from the sequence with highest occurrence. This mode is called "spheres" because the clusters have radius dist.

The message-passing algorithm sorts the sequences from lowest to highest occurrence. Every sequence is initially labelled as centroid. The sequence with lowest occurrence loses the centroid label and is attached to its closest neighbor if 1) they are within distance dist and 2) the occurrence of the neighbor is at least r times higher. In the case that conditions are not met for any neighbor, the sequence simply keeps its centroid label. When there are multiple closest neighbors, the sequence loses the centroid label and is attached to all the closest neighbors. The process is repeated by processing sequences in order of occurrence. In this mode, the clusters can have a radius larger than dist. It is called "message-passing" because it is inspired from an algorithm with the same name.

The connected-component algorithm merges all sequences that are within a distance dist of each other. This is independent of the order in which sequences are processed, or of their occurrence. The centroid is the sequence with highest occurrence. This mode is called "connected components" because the clusters are the connected components of a graph where edges exist between sequences within a distance dist of each other. It was added later than spheres and message-passing cluster.

pol-delta commented 1 year ago

Thank you, that helps tremendously!