sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
476 stars 79 forks source link

can we update clustering results with new signatures? #2272

Open ctb opened 2 years ago

ctb commented 2 years ago

we're starting to develop more options for large-scale clustering of sketches over in https://github.com/sourmash-bio/sourmash/issues/2271 (GTDB and SRA scale, even!).

one feature that's been requested and discussed in the µbioinfo slack is updating of clustering as new samples come out, which I think falls into the "online clustering" category -

@boulund -

Does anyone know of a good method to quickly compare metagenomic samples, just to get some kind of overall similarity or to be able to cluster them into sample types (gut microbiome, vaginal, skin, oral, etc). I want to be able to continuously add new samples so it would be best if it didn't require recomputation with all samples. It would be even more awesome if it could be stored in an effective way so it's easy to pull out subsets of samples and cluster ad-hoc somehow.

(my emphasis in bold ;)

I don't know how amenable the work that @mr-eyes is doing with kSpider is to this kind of approach, but it is definitely something we should think about. Any thoughts, Mo?

I also like the point about it being stored in an effective way - touched on by @bluegenes over here.

mr-eyes commented 2 years ago

If we could index and cluster a relatively large number of samples and consider them core clusters, I could easily assign new samples to the current clusters. So, by adding a new sample, I will score it against the existing clusters and choose the closest cluster(s) to be assigned to if within the user-defined similarity threshold. If not, it can be assigned as a new cluster that could grow over time with more queries. So, it's a kind of a dynamic clustering approach, and it's already implemented in kSpider and to be released soon after resolving #2271

boulund commented 2 years ago

Wow, very cool. Thanks for the quick feedback everyone!

ctb commented 2 years ago

If we could index and cluster a relatively large number of samples and consider them core clusters, I could easily assign new samples to the current clusters. So, by adding a new sample, I will score it against the existing clusters and choose the closest cluster(s) to be assigned to if within the user-defined similarity threshold. If not, it can be assigned as a new cluster that could grow over time with more queries. So, it's a kind of a dynamic clustering approach, and it's already implemented in kSpider and to be released soon after resolving #2271

question @mr-eyes - is the clustering you're doing here heuristic / progressive / stochastic, or not? (there's a word I'm looking for... those are not it... sigh) that is, if you shuffle the input order, will you get different results?

I am generally a bigger fan of deterministic clustering, which often is incompatible with progressive or online clustering. that's one reason why I like the basic sourmash approach of trying to just make everything really fast, so you can redo clustering from scratch each time :).

but, in this case, I think we just need to be clear about when and where we're applying heuristics and generating clusters that may (subtly or not so subtly) change depending on input order etc.