Transferring sequence ids is inefficient

ezorita commented 6 years ago

Update the useq id algorithm so that:

All useqs are stored in a malloc'ed space. Don't use ambiguous pointer/int representation.
Do not transfer useqs one by one. Realloc'ing for each transfer is inefficient.

New functions:

transfer_useq_id_count: this function will be used in the clustering process, when the canonicals are assigned. Instead of transferring the ids, it only computes the final gross count of ids for each centroid. (Note that this requires all useqs to transfer directly to their final centroid, not to intermediates.)
prealloc_useq_ids: realloc the useq buffer in each centroid to the final size.
transfer_useq_ids: merge useq ids to the cluster.
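
A minimal sketch of how the three functions could fit together as count, preallocate, then copy. The `useq_ids_t` struct, its fields, and the `useq_ids_new` helper are assumptions made for illustration; they are not taken from the starcode source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: ids always live in malloc'ed space. */
typedef struct {
    uint32_t *ids;   /* id buffer, NULL when empty */
    size_t    n;     /* ids currently stored */
    size_t    total; /* final id count once all transfers are done */
} useq_ids_t;

/* Hypothetical helper: copy n ids into a fresh buffer (n = 0 on failure). */
useq_ids_t useq_ids_new(const uint32_t *src, size_t n) {
    useq_ids_t u = { NULL, 0, 0 };
    if (n == 0) return u;
    u.ids = malloc(n * sizeof *u.ids);
    if (u.ids == NULL) return u;
    memcpy(u.ids, src, n * sizeof *u.ids);
    u.n = u.total = n;
    return u;
}

/* Step 1: no ids move, only the centroid's final count grows. */
void transfer_useq_id_count(useq_ids_t *centroid, const useq_ids_t *child) {
    centroid->total += child->n;
}

/* Step 2: a single realloc per centroid, to the final size. 0 on success. */
int prealloc_useq_ids(useq_ids_t *centroid) {
    assert(centroid->total >= centroid->n);
    if (centroid->total == 0) return 0;
    uint32_t *p = realloc(centroid->ids, centroid->total * sizeof *p);
    if (p == NULL) return -1;
    centroid->ids = p;
    return 0;
}

/* Step 3: append the child's ids to the preallocated buffer, no realloc. */
void transfer_useq_ids(useq_ids_t *centroid, useq_ids_t *child) {
    assert(centroid->n + child->n <= centroid->total);
    if (child->n > 0) {
        memcpy(centroid->ids + centroid->n, child->ids,
               child->n * sizeof *child->ids);
        centroid->n += child->n;
    }
    free(child->ids);
    child->ids = NULL;
    child->n = 0;
}
```

With this layout each centroid is reallocated once regardless of how many useqs are merged into it, which is the point of separating the count from the transfer.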

ezorita commented 6 years ago

Matches are from child to parent, but each child is tagged with the cluster centroid.
Sequences are sorted by the count of their centroid (or alphabetically by centroid when counts are equal).
All sequences with the same centroid are consecutive in the list and are processed in order.
- Exploit this to merge sequence ids on the fly while processing this list. Ideally, the final list size has already been precomputed, so the ids can be merge-sorted directly.

The centroid has edges to its matches.
Sequences are sorted by count.
Only centroids are printed, the list of matches is explored to produce the cluster sequence list.
- Exploit the match list to merge-sort ids on the fly. Again, ideally the final size of the id list should have been precomputed.
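
One way the on-the-fly merge could look when the final size is known in advance: the centroid's buffer already has room for every id, so each sorted child list is merged in from the back without a temporary array. The function name and signature are hypothetical, and the sketch assumes both id lists are already sorted in ascending order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Merge sorted src[0..n_src) into sorted dst[0..n_dst), in place.
 * dst must have capacity for n_dst + n_src ids. Returns the new length. */
size_t merge_ids_inplace(uint32_t *dst, size_t n_dst,
                         const uint32_t *src, size_t n_src) {
    assert(dst != NULL && (src != NULL || n_src == 0));
    size_t i = n_dst;          /* one past the last unmerged dst id */
    size_t j = n_src;          /* one past the last unmerged src id */
    size_t k = n_dst + n_src;  /* one past the next free output slot */

    /* Fill the output from the largest id down, so that no unread
     * dst id is overwritten. */
    while (j > 0) {
        if (i > 0 && dst[i - 1] > src[j - 1])
            dst[--k] = dst[--i];
        else
            dst[--k] = src[--j];
    }
    /* Whatever remains in dst[0..i) is already in its final position. */
    return n_dst + n_src;
}
```

Each merge is linear in the combined length, so repeatedly merging small child lists into a large centroid list still rescans the centroid's ids every time.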

ezorita commented 6 years ago

Seq ids will not be kept sorted. They are collected in a stack when the clusters are defined and sorted right before printing.
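
A sketch of that approach under assumed names (`id_stack_t`, `id_stack_push`, and `id_stack_sort` are illustrative, not the actual starcode functions): ids are pushed unsorted onto a growable stack, and a single `qsort` runs just before the cluster is printed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable stack of sequence ids. */
typedef struct {
    uint32_t *ids;
    size_t    n;    /* ids stored */
    size_t    cap;  /* allocated slots */
} id_stack_t;

/* Push one id; capacity doubles when full. Returns 0 on success. */
int id_stack_push(id_stack_t *s, uint32_t id) {
    assert(s != NULL);
    if (s->n == s->cap) {
        size_t cap = s->cap ? 2 * s->cap : 16;
        uint32_t *p = realloc(s->ids, cap * sizeof *p);
        if (p == NULL) return -1;
        s->ids = p;
        s->cap = cap;
    }
    s->ids[s->n++] = id;
    return 0;
}

/* Ascending comparison that avoids overflow from subtraction. */
static int cmp_id(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort once, right before the ids are printed. */
void id_stack_sort(id_stack_t *s) {
    if (s->n > 1)
        qsort(s->ids, s->n, sizeof *s->ids, cmp_id);
}
```

Sorting once at the end costs O(n log n) per cluster and removes the need to keep the ids ordered during clustering.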
