scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.76k stars 497 forks

[question] is it possible to merge results from 2 clusterings ? #473

Open erwanlenagard opened 3 years ago

erwanlenagard commented 3 years ago

Hello,

Sorry if my question is irrelevant. I'm quite a newbie. I have read the HDBSCAN documentation, but I still need some help...

I have a dataset of news articles that I aggregate on a daily basis. I already have a 3-month archive, and it will continue to grow over the following months. I'm using HDBSCAN to group similar articles into topics in order to analyse the media coverage of those events over time. An event may be covered by the media for several days, but the list of topics should be updated frequently, because previously unseen events will be covered in later periods.

It seems that I get better clustering results on short periods (a few days or a week of data) than on the full dataset. My question: is it possible to compute clusters on a rolling period and merge the results (= merge common topics of consecutive periods)? How should I proceed?

I understand that hdbscan.approximate_predict() helps me assign topics to new data points, but it will not create new topics. So it doesn't fit my needs...

Thanks for your help

lmcinnes commented 3 years ago

There isn't really anything built in to handle this, unfortunately. The best thing to do would be to effectively build your own solution. You can certainly cluster over a rolling period -- and rolling is important for the next step: merging common topics. As long as you have rolling periods (so there is overlap among consecutive periods), you can "score" how similar two clusters are by looking at their Jaccard similarity (i.e. the ratio of the size of the intersection of the clusters -- the documents they have in common -- to the size of the union of the clusters -- the total number of unique documents in the two clusters). Worst case, you can then set thresholds and merge according to that. That may be enough for your purposes.
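As a rough illustration of the Jaccard-based matching described above, here is a minimal pure-Python sketch. The helper names (`jaccard`, `clusters_as_sets`, `match_clusters`), the 0.3 threshold, and the toy document ids and labels are all assumptions made for the example and are not part of hdbscan. It also assumes each window's labels come from a separate HDBSCAN run, with -1 marking noise.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def clusters_as_sets(doc_ids, labels):
    """Group document ids by cluster label, skipping noise points (label -1)."""
    clusters = {}
    for doc_id, label in zip(doc_ids, labels):
        if label != -1:
            clusters.setdefault(label, set()).add(doc_id)
    return clusters


def match_clusters(prev, curr, threshold=0.3):
    """Return (prev_label, curr_label, score) for pairs scoring at or above threshold."""
    matches = []
    for (p, p_docs), (c, c_docs) in product(prev.items(), curr.items()):
        score = jaccard(p_docs, c_docs)
        if score >= threshold:
            matches.append((p, c, score))
    return matches


# Toy data: two overlapping windows sharing documents "b", "c" and "d".
# The labels stand in for the output of two independent clustering runs.
window1 = clusters_as_sets(["a", "b", "c", "d", "e"], [0, 0, 0, 1, -1])
window2 = clusters_as_sets(["b", "c", "d", "f", "g"], [5, 5, 7, 7, 5])

matches = match_clusters(window1, window2, threshold=0.3)
print(matches)  # [(0, 5, 0.5), (1, 7, 0.5)]
```

Matched pairs can then be treated as the same topic across windows, while clusters in the newer window with no match can be treated as new topics. The right threshold depends on how much the windows overlap, so it is something to tune on real data.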

If you want to get particularly fancy, you could try embedding the rolling data slices with AlignedUMAP (https://umap-learn.readthedocs.io/en/latest/aligned_umap_politics_demo.html) and then clustering that. That does not allow continuous updating, of course, which is something you may be looking for.
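AlignedUMAP needs to know which rows in one slice correspond to which rows in the next. A minimal sketch of building that mapping from document ids follows. The helper name `build_relations` and the toy ids are assumptions for illustration. The call shown in the final comment reflects my understanding of the umap-learn interface and is not executed here, so check it against the linked documentation.

```python
def build_relations(slices):
    """Build one dict per consecutive pair of slices.

    slices: list of lists of document ids, one list per rolling window,
    in the same row order as the feature matrix for that window.
    Each dict maps a row index in slice i to the row index of the same
    document in slice i + 1; documents absent from the next slice are skipped.
    """
    relations = []
    for prev_ids, next_ids in zip(slices[:-1], slices[1:]):
        next_pos = {doc_id: j for j, doc_id in enumerate(next_ids)}
        relations.append(
            {i: next_pos[doc_id] for i, doc_id in enumerate(prev_ids) if doc_id in next_pos}
        )
    return relations


# Toy rolling windows: each slice overlaps the next by two documents.
slices = [["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"]]
relations = build_relations(slices)
print(relations)  # [{1: 0, 2: 1}, {1: 0, 2: 1}]

# Assumed usage with umap-learn (not run here): pass one feature array per
# slice plus the relations list, e.g.
#   umap.AlignedUMAP().fit(list_of_feature_arrays, relations=relations)
```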

erwanlenagard commented 3 years ago

Thank you very much ! :)