Open ctb opened 2 years ago
If we could index and cluster a relatively large number of samples and treat them as core clusters, I could easily assign new samples to the existing clusters. When a new sample is added, I score it against the existing clusters and assign it to the closest cluster(s), provided the similarity is within the user-defined threshold. If not, it becomes a new cluster that can grow over time with more queries. So it's a kind of dynamic clustering approach; it's already implemented in kSpider and will be released soon after resolving #2271
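To make the assignment step concrete, here is a minimal sketch of the idea, not kSpider's actual implementation. It assumes each sample is a plain set of hashes, each cluster is represented by one fixed set (`rep`), similarity is Jaccard, and a sample goes to the single closest cluster only; the names `jaccard`, `assign`, `rep`, and `members` are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of hashes (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(name, hashes, clusters, threshold=0.5):
    """Add a sample to the closest existing cluster, or start a new one.

    `clusters` is a list of dicts with a "rep" set and a "members" list
    (a hypothetical layout chosen for this sketch). Returns the index of
    the cluster the sample ended up in.
    """
    best_score, best_idx = 0.0, None
    for i, cluster in enumerate(clusters):
        score = jaccard(hashes, cluster["rep"])
        if score > best_score:
            best_score, best_idx = score, i

    if best_idx is not None and best_score >= threshold:
        clusters[best_idx]["members"].append(name)
        return best_idx

    # Nothing is close enough: the sample founds a new cluster.
    clusters.append({"rep": set(hashes), "members": [name]})
    return len(clusters) - 1


# Two "core" clusters built ahead of time.
core = [
    {"rep": {1, 2, 3, 4}, "members": ["core_a"]},
    {"rep": {10, 11, 12, 13}, "members": ["core_b"]},
]

first = assign("new_1", {1, 2, 3, 5}, core)   # similarity 3/5 to core_a
second = assign("new_2", {20, 21, 22}, core)  # no overlap with any cluster
```

Here `new_1` joins the first core cluster and `new_2` becomes a third cluster. A real implementation would also need to decide whether the representative is updated as members are added, which this sketch deliberately leaves out.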
Wow, very cool. Thanks for the quick feedback everyone!
question @mr-eyes - is the clustering you're doing here heuristic / progressive / stochastic, or not? (there's a word I'm looking for... those are not it... sigh) that is, if you shuffle the input order, will you get different results?
I am generally a bigger fan of deterministic clustering, which is often incompatible with progressive or online clustering. that's one reason why I like the basic sourmash approach of trying to just make everything really fast, so you can redo clustering from scratch each time :).
but, in this case, I think we just need to be clear about when and where we're applying heuristics and generating clusters that may (subtly or not so subtly) change depending on input order etc.
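A toy illustration of the input-order concern, under assumptions: it uses a simple greedy scheme in which each sample joins the first cluster whose founding sample is within the threshold, and otherwise founds its own cluster. This is not meant to describe kSpider or sourmash; `greedy_cluster` and the sample data are made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


def greedy_cluster(samples, order, threshold):
    """Greedy single-pass clustering, processing samples in `order`.

    Each cluster is (founder_name, [member names]); a sample is compared
    only against cluster founders, never against other members.
    """
    clusters = []
    for name in order:
        for founder, members in clusters:
            if jaccard(samples[name], samples[founder]) >= threshold:
                members.append(name)
                break
        else:
            # No founder was close enough, so this sample founds a cluster.
            clusters.append((name, [name]))
    # Normalize so results can be compared regardless of processing order.
    return sorted(sorted(members) for _, members in clusters)


# A overlaps B, B overlaps C, but A and C share nothing.
samples = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5, 6},
    "C": {5, 6, 7, 8},
}

abc = greedy_cluster(samples, ["A", "B", "C"], threshold=0.3)
bac = greedy_cluster(samples, ["B", "A", "C"], threshold=0.3)
```

With order A, B, C the founder is A, so C (which shares nothing with A) ends up alone: `[["A", "B"], ["C"]]`. With order B, A, C the founder is B, which overlaps both neighbours, so everything lands in one cluster: `[["A", "B", "C"]]`. Same data, same threshold, different clusters.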
we're starting to develop more options for large-scale clustering of sketches over in https://github.com/sourmash-bio/sourmash/issues/2271 (GTDB and SRA scale, even!).
one feature that's been requested and discussed in the µbioinfo slack is updating of clustering as new samples come out, which I think falls into the "online clustering" category -
@boulund -
(my emphasis in bold ;)
I don't know how amenable the work that @mr-eyes is doing with kSpider is to this kind of approach, but it is definitely something we should think about. Any thoughts, Mo?
I also like the point about it being stored in an effective way - touched on by @bluegenes over here.