gui11aume / starcode

All pairs search and sequence clustering
GNU General Public License v3.0
90 stars 21 forks source link

Transfer sequence ids is inefficient #24

Closed ezorita closed 6 years ago

ezorita commented 6 years ago

Update the useq id algorithm so that:

New functions:

ezorita commented 6 years ago

Message passing:

  1. Matches are from child to parent, but each child is tagged with the cluster centroid.
  2. Sequences are sorted by the count of the centroid (or by centroid alphabetical order if equal counts).
  3. All sequences with same centroid are consecutive in list and are processed in order.
    • Exploit this to merge sequence ids on the fly while processing this list. Ideally, the final list size should have been precomputed previously and it's possible to directly merge-sort the ids.

Spheres clustering:

  1. The centroid has edges to its matches.
  2. Sequences are sorted by count.
  3. Only centroids are printed, the list of matches is explored to produce the cluster sequence list.
    • Exploit the match list to merge-sort ids on the fly. Again, ideally the final size of the id list should have been precomputed.
ezorita commented 6 years ago

Seq ids will not be kept sorted. They are collected in a stack when the clusters are defined and sorted right before printing.