gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0

Overlapping spheres. #23

Closed ezorita closed 6 years ago

ezorita commented 6 years ago

Implements the following updates to the definition of Spheres clustering (requested in Issue #22):

Current behavior:

  1. Sort by sequence count.
  2. By seqcount order: if a sequence has not been claimed, it becomes a centroid. Otherwise continue to the next sequence.
  3. The centroid claims all its hits that haven't been claimed yet by another centroid.

New behavior:

  1. Sort by sequence count.
  2. By seqcount order: if a sequence has not been claimed, it becomes a centroid. Otherwise continue to the next sequence.
  3. The centroid claims all its hits that haven't been claimed yet by another centroid, plus any already-claimed hit whose distance to the current centroid is less than the distance to its canonical.
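
To make the difference between the two behaviors concrete, here is a minimal Python sketch of the steps listed above. It is not starcode's C implementation. The names `sphere_clusters`, `all_pairs_hits` and `allow_steal` are made up for illustration, Hamming distance on equal-length sequences stands in for the distance starcode actually uses, and "sort by sequence count" is assumed to mean most abundant first.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))


def all_pairs_hits(seqs, max_dist, dist=hamming):
    """Hypothetical helper: map each sequence to its neighbors within max_dist."""
    hits = {s: [] for s in seqs}
    for a, b in combinations(seqs, 2):
        if dist(a, b) <= max_dist:
            hits[a].append(b)
            hits[b].append(a)
    return hits


def sphere_clusters(counts, hits, dist=hamming, allow_steal=True):
    """Return a dict mapping every sequence to its canonical (centroid).

    allow_steal=False reproduces the "current behavior" list,
    allow_steal=True the "new behavior" list.
    """
    canonical = {}
    # Step 1: sort by sequence count (assumed descending, ties by sequence).
    order = sorted(counts, key=lambda s: (-counts[s], s))
    for seq in order:
        # Step 2: an already claimed sequence cannot become a centroid.
        if seq in canonical:
            continue
        canonical[seq] = seq
        # Step 3: claim unclaimed hits; optionally steal closer ones.
        for hit in hits.get(seq, ()):
            owner = canonical.get(hit)
            if owner is None:
                canonical[hit] = seq
            elif allow_steal and dist(hit, seq) < dist(hit, owner):
                # A centroid is at distance 0 from itself, so the strict
                # inequality means a centroid can never be stolen.
                canonical[hit] = seq
    return canonical


counts = {"AAAA": 100, "TTTA": 50, "ATTA": 5}
hits = all_pairs_hits(list(counts), max_dist=2)
old = sphere_clusters(counts, hits, allow_steal=False)
new = sphere_clusters(counts, hits, allow_steal=True)
```

In this toy input, `ATTA` is within the threshold of both centroids. With `allow_steal=False` it stays with `AAAA`, which claimed it first. With `allow_steal=True` it moves to `TTTA`, which is at distance 1 rather than 2.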

Remarks:

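The new algorithm is less greedy and produces more meaningful spheres, because a sequence ends up in the sphere of its closest centroid rather than the first centroid to reach it.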

ezorita commented 6 years ago

Additionally, we can define a minimum rate for stealing sequences from other spheres. This would prevent tiny satellites from stealing sequences from a big sphere and thereby creating an independent (and probably wrong) cluster.
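
One possible reading of this "minimum rate", sketched in Python: the stealing centroid must have at least some fraction of the count of the centroid it steals from. The comment above does not define the rate, so the function name `may_steal`, the parameter `min_ratio` and its default value are all assumptions made for illustration, not something starcode provides.

```python
def may_steal(thief_count, owner_count, dist_to_thief, dist_to_owner,
              min_ratio=0.1):
    """Decide whether a centroid may steal a hit from its current canonical.

    Hypothetical rule: the hit must be strictly closer to the thief, and the
    thief's count must be at least min_ratio times the owner's count.
    """
    closer = dist_to_thief < dist_to_owner
    big_enough = thief_count >= min_ratio * owner_count
    return closer and big_enough


# A centroid with count 50 may steal from one with count 100 ...
sizeable = may_steal(50, 100, dist_to_thief=1, dist_to_owner=2)
# ... but a satellite with count 2 may not, even though it is closer.
satellite = may_steal(2, 100, dist_to_thief=1, dist_to_owner=2)
```

Under this rule a low-count satellite can still become a centroid, but it only keeps hits that no larger sphere has claimed.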