scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Experiments on clustering tweets #87

Closed bwang482 closed 7 years ago

bwang482 commented 7 years ago

Excellent implementation! Thanks guys!!

I have done some experiments trying to cluster a bunch of tweets (about 350) using hdbscan, but the results, I have to say, are much worse than those from other, more 'mainstream' algorithms in sklearn.

I have tried:

In [62]: # note: metric='precomputed' expects a pairwise *distance* matrix,
    ...: # so if `sims` holds similarities they must be converted first (e.g. 1 - sim)
    ...: clusterer = hdbscan.HDBSCAN(min_cluster_size=N, metric='precomputed')
    ...: clusters = clusterer.fit_predict(sims)
    ...: print('Number of clusters =', clusters.max() + 1)  # labels run 0..max; -1 is noise
    ...: print(clusters)

Most of the time I am getting '-1' for cluster labels (is this normal?), and quite often all instances are labelled '-1' if my N is over 10. I wonder where the difficulty lies? Has hdbscan proven to be a below-average clustering tool for short and noisy text like tweets?

Thanks!

lmcinnes commented 7 years ago

In practice, that few data points in that high-dimensional a space is simply not enough for there actually to be clusters. So in some sense I think this is doing the right thing: I would be very surprised if there were actually clusters, especially with a minimum cluster size of 10 data points. You could try reducing the dimension via some dimension-reduction technique, but realistically you need more data when dealing with that sort of dimensionality, especially for density-based algorithms. If you want to know more about your data, it might be helpful to look at a hubness plot.
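
For concreteness, here is a minimal sketch of that suggestion: vectorize with tf-idf, reduce to a low-dimensional dense space, and only then run HDBSCAN. The variable names and parameter choices below are illustrative, not a prescription:

import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

tweets = ...  # stand-in for your own corpus of tweet strings

# Sparse, high-dimensional tf-idf vectors; density estimates are nearly
# meaningless in this space with only a few hundred points.
tfidf = TfidfVectorizer(min_df=2, stop_words='english').fit_transform(tweets)

# Project down to a dense, low-dimensional space first.
reduced = TruncatedSVD(n_components=50).fit_transform(tfidf)

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(reduced)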

bwang482 commented 7 years ago

OK, in theory there should be clusters, as these tweets were collected using common keywords and there should be sub-topics within them.

I have tried topic modelling, and also tf-idf with dimensionality reduction, on more tweets. The data now has 1000 instances with 50 features. I do get some clusters now, with one cluster labelled '-1'; is this normal?

Thanks for your reply :+1:

lmcinnes commented 7 years ago

Yes, that's pretty normal. If you just want to partition your data, with every point assigned to a cluster regardless of how much of an outlier it is, then you'll want a different algorithm. In practice, with small datasets, unless everything is very tidily grouped you can expect a fair number of noise points.
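
As a quick sanity check, you can look at how much of the data is being flagged as noise (a sketch, assuming 'clusters' is the label array returned by fit_predict above):

import numpy as np

n_clusters = clusters.max() + 1           # cluster labels run 0..max
noise_fraction = np.mean(clusters == -1)  # -1 marks noise, not a cluster
print('%d clusters, %.0f%% noise' % (n_clusters, noise_fraction * 100))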


bwang482 commented 7 years ago

Actually, what I ultimately want is to cluster my tweets into a set of sub-topics, then sub-sample a list of closely aligned tweets from each cluster, where (ideally) these sub-sampled tweets should be talking about similar things.

Do you have any suggestions on which algorithm, or which type of clustering algorithm, might be more suitable in my case? So far, after a few quick experiments, it seems Affinity Propagation works better for me.

Thanks very much for your help Leland!

lmcinnes commented 7 years ago

I would personally be quite surprised if Affinity Propagation gave particularly useful results. If you need a partitioning then Mean Shift is not a terrible idea, and I would also seriously consider a hierarchical clustering approach, which will give you a richer cluster structure to explore. On that front you could stick with HDBSCAN*, but instead of taking the labels directly you can explore the condensed tree, which gives a hierarchical decomposition of clusters that may be easier to work with than standard hierarchical clustering, since it is a simpler tree.
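
A minimal sketch of that route, reusing the reduced feature matrix from the earlier example (names illustrative); the condensed tree is exposed on the fitted clusterer:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(reduced)

tree = clusterer.condensed_tree_
tree.plot()            # visual overview of the cluster hierarchy
df = tree.to_pandas()  # parent / child / lambda_val / child_size rows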


bwang482 commented 7 years ago

Why do you think Affinity Propagation would not give useful results?

Affinity propagation simultaneously considers all data points as potential prototypes and passes soft information around until a subset of data points "win" and become the exemplars.

Doesn't that sound applicable in my case? And it doesn't require a pre-defined K for the number of clusters, like K-means does. I have tried Mean Shift a few times, and it returns a single cluster every time.
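
(As an aside: Mean Shift collapsing everything into one cluster often means the kernel bandwidth is too large for the data. A hedged sketch of sweeping it, with 'reduced' again standing in for the 1000 x 50 feature matrix:)

from sklearn.cluster import MeanShift, estimate_bandwidth

# Smaller quantiles give a smaller bandwidth, and hence more, finer clusters.
for quantile in (0.1, 0.2, 0.3):
    bw = estimate_bandwidth(reduced, quantile=quantile)
    labels = MeanShift(bandwidth=bw).fit_predict(reduced)
    print('quantile=%.1f: %d clusters' % (quantile, labels.max() + 1))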

The issue I have with hierarchical clustering is that it requires more careful parameter optimisation. I have to perform such clustering on many data sets, so I don't want to tune hyper-parameters for each one. Also, I think hierarchical agglomerative methods make hard decisions that can cause them to get stuck in poor solutions? Affinity Propagation is softer in making those decisions.

My research is not in clustering, so that might sound a bit naive... :simple_smile:

lmcinnes commented 7 years ago

I have personally had poor experiences getting Affinity Propagation to give good results, even on fairly easy-to-cluster data sets. Affinity Propagation was one of my favoured algorithms when I set out on a personal project to compare and contrast clustering algorithms over a wide range of datasets and clustering situations; by the time I was done, Affinity Prop was my least favourite. It can have a lot of difficulty actually getting good clusters, and it is extremely sensitive to parameters: in practice you have to play with the preference vector and with the damping parameter if you hope to get a good representative clustering, and the preference vector is a proxy parameter for the number of clusters, but in a non-intuitive and non-linear way.

I am happy to recommend Affinity Prop for clustering non-metric data, e.g. where you have asymmetric similarities, as it is one of the only algorithms that can do this. For general data under, say, a Euclidean metric, I have found it to very rarely be a good choice.
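
To make that sensitivity concrete, here is an illustrative sweep (not a recommendation) over the two parameters in question, with 'reduced' once more standing in for the feature matrix:

from sklearn.cluster import AffinityPropagation

# The number of clusters found swings widely with preference and damping;
# preference=None defaults to the median pairwise similarity.
for preference in (-50, -10, None):
    for damping in (0.5, 0.7, 0.9):
        ap = AffinityPropagation(preference=preference, damping=damping).fit(reduced)
        print('preference=%s, damping=%.1f: %d clusters'
              % (preference, damping, len(ap.cluster_centers_indices_)))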


bwang482 commented 7 years ago

Hmm, thanks for the suggestions! :+1: