scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Issues with Prediction and Hamming metric #201

Open armsp opened 6 years ago

armsp commented 6 years ago

I have been trying HDBSCAN to cluster text data and I am very happy with the results. However, I am fairly new to ML and Python. I was using the default metric for a while, and after some preliminary research I thought the Hamming metric would probably give better results (am I correct in this assumption?). Later I also started checking the predicted results, but I noticed some anomalies and a pattern, which I tabulate below. `s_matrix` denotes the TF-IDF sparse matrix that I pass as input.

Prediction = Disabled, Metric = Default

| Input | Runs | Error |
| --- | --- | --- |
| `s_matrix` | Yes | None |
| `s_matrix.toarray()` | Yes | None |

Prediction = Disabled, Metric = Hamming

| Input | Runs | Error |
| --- | --- | --- |
| `s_matrix` | No | TypeError |
| `s_matrix.toarray()` | Yes | None |

Prediction = Enabled, Metric = Default

| Input | Runs | Error |
| --- | --- | --- |
| `s_matrix` | No | ValueError |
| `s_matrix.toarray()` | Yes | None |

Prediction = Enabled, Metric = Hamming

| Input | Runs | Error |
| --- | --- | --- |
| `s_matrix` | No | TypeError |
| `s_matrix.toarray()` | Yes | None |

Here TypeError means `TypeError: Scipy distance metrics do not support sparse matrices`, and ValueError means `ValueError: setting an array element with a sequence`. Can you please tell me how I can mitigate these issues? I really need to use prediction and the Hamming metric (unless there is a better metric for text than Hamming). I looked at #174 but it doesn't seem to help much. Any advice on implementing our own prediction methods and distance metrics is also welcome.

lmcinnes commented 6 years ago

I think you might do better with Jaccard as a metric on TF-IDF matrices. I will admit, however, that sparse matrix support is a little ... sparse. It works, but not everything is available in the sparse matrix case (hence some of the errors you are seeing -- the TypeErrors at least). I don't think prediction will work with sparse matrices as input at all, but there should probably be a better error message than the one you are getting.

The other thing I would consider trying is using some dimension reduction first to get to a manageable dense array, as a lot of things will work better there. For these purposes, just using a truncated SVD to get down to 50 or 100 dimensions should be good enough. If you do that, then I think most of the rest of the functionality should work well.
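A minimal sketch of that workflow, assuming scikit-learn; the toy corpus is purely illustrative, and real data would use 50-100 components rather than 3:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Illustrative documents standing in for the real corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "tfidf turns text into sparse vectors",
    "svd turns sparse vectors into dense ones",
]

s_matrix = TfidfVectorizer().fit_transform(docs)  # scipy sparse matrix

# Truncated SVD works directly on the sparse matrix and returns a
# small dense array that hdbscan can consume without TypeErrors.
svd = TruncatedSVD(n_components=3, random_state=42)
dense = svd.fit_transform(s_matrix)

print(dense.shape)  # (4, 3)
```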

armsp commented 6 years ago

Thank you so much @lmcinnes. I tried "jaccard" as a metric, but it identifies every point as noise no matter what I do (changing min_cluster_size and min_samples). Is this a possible bug?

I tried truncated SVD on my TF-IDF matrix, and it seems to work with both Hamming and prediction enabled. Since Jaccard doesn't seem to be working at all, is there an informed guess I can make as to which metric I should employ? (I don't mind writing my own metric either; if I do, what would work best with text data in your opinion?)

lmcinnes commented 6 years ago

That sounds like a bug, but really jaccard ought to just work (I believe hdbscan uses upstream implementations from sklearn and scipy). Hamming isn't a terrible choice. You could also look at Dice if you want to try something else.

armsp commented 6 years ago

@lmcinnes oh, that's bad news then. Prompted by that, I tried a few other metrics too: "dice" as you suggested, and "cosine". But dice also failed to work, i.e. it recognized all the points as noise even with different parameters, and cosine gave me the error `ValueError: Unrecognized metric 'cosine'`. I will let you know about other metrics later. (I was giving them the SVD vectors, since SciPy distance metrics do not support sparse matrices.) The documents I am trying to cluster are effectively sentences, and very short: just a few words each (fewer than 10). What do you suggest would work well? Can I write my own distance metrics, and do you have any suggestions for that?

lmcinnes commented 6 years ago

Given data that sparse, it will be very hard to get much in the way of clusters. I would actually suggest going through dimension reduction (PCA, UMAP, or both) and then just clustering using Euclidean distance in the reduced space.

armsp commented 6 years ago

I am trying UMAP and will keep you posted. By the way, according to sklearn's official docs (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html), I found out that jaccard and dice are metrics intended for boolean-valued vector spaces. I don't think they would work at all on TF-IDF vectors, contrary to your earlier suggestion.

lmcinnes commented 6 years ago

Assuming very sparse data, hamming or dice will be fine -- they will just ignore the counts. On the other hand, if you have non-negligible counts that matter, yes, you will need something else. Cosine would be good (but has issues, as it is not a true metric). You can use angular distance instead: just L2-normalise the vectors and then use Euclidean distance.
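A short sketch of why the normalisation trick works: for unit vectors, squared Euclidean distance is exactly twice the cosine distance, since ||u - v||^2 = 2 * (1 - cos(u, v)). The random data here is a stand-in for dense post-SVD vectors.

```python
import numpy as np
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X = rng.random((5, 20))   # stand-in for dense vectors after SVD/UMAP
Xn = normalize(X)         # each row now has unit L2 norm

u, v = Xn[0], Xn[1]
cos_dist = 1.0 - u @ v                 # cosine distance of unit vectors
eucl_sq = float(np.sum((u - v) ** 2))  # squared Euclidean distance

# Euclidean distance on normalised vectors is a monotone function of
# cosine distance, so clustering with metric='euclidean' on Xn gives
# an angular-distance clustering.
print(np.isclose(eucl_sq, 2.0 * cos_dist))  # True
```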