scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Overflow and divide by zero when using validity_index #127

Open cmrosenberg opened 7 years ago

cmrosenberg commented 7 years ago

Hello! Thank you for this excellent Clustering module!

I am attempting to cluster text data using HDBSCAN, and I want to assess the resulting clustering by using the provided validity index. However, the validity_index procedure throws a series of numerical errors at me. Concretely, the following code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

import hdbscan

training_set = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(training_set.data)

clusterer = hdbscan.HDBSCAN()

clustering = clusterer.fit(vectors)

print(hdbscan.validity.validity_index(vectors.toarray(), clusterer.labels_))

Produces the following output:

<virtualenv>/lib/python3.6/site-packages/hdbscan/validity.py:34: RuntimeWarning: divide by zero encountered in power
  result **= (-1.0 / d)
<virtualenv>/lib/python3.6/site-packages/hdbscan/validity.py:31: RuntimeWarning: overflow encountered in power
  distance_matrix != 0]) ** d
<virtualenv>/lib/python3.6/site-packages/hdbscan/validity.py:375: RuntimeWarning: invalid value encountered in double_scalars
  max(min_density_sep, density_sparseness[i])
nan

I am on version 0.8.10. Am I missing something?

lmcinnes commented 7 years ago

I believe it is somehow taking the dimension of your vectors to be 0, which is what would cause the problem. As to why it is doing that -- I am not honestly sure at first inspection. You can try setting d explicitly via a keyword argument in the call. That may provide a workaround in the meantime.
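For example, something along these lines (an untested sketch, reusing vectors and clusterer from your snippet) should make the dimension explicit:

# Pass d explicitly so it is not inferred as 0 from the input
X = vectors.toarray()
print(hdbscan.validity.validity_index(X, clusterer.labels_, d=X.shape[1]))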

lmcinnes commented 7 years ago

As a side note, I would really recommend a dimensionality reduction step prior to running HDBSCAN here -- being a density-based algorithm, it won't do well with very high-dimensional data. Just using TruncatedSVD from scikit-learn would be a good start, although UMAP may be an interesting option (you can reduce to, say, 10 or 50 dimensions instead of just 2).
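For instance, a rough sketch of that pipeline (untested, adapted from the snippet above) might look like:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import hdbscan

training_set = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
vectors = TfidfVectorizer().fit_transform(training_set.data)

# Reduce the sparse TF-IDF matrix to ~50 dense dimensions before clustering
reduced = TruncatedSVD(n_components=50).fit_transform(vectors)

clusterer = hdbscan.HDBSCAN().fit(reduced)
print(hdbscan.validity.validity_index(reduced, clusterer.labels_))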

cmrosenberg commented 7 years ago

Thank you for your swift and thoughtful response, @lmcinnes! I'll investigate some more and report back if I find something. Will also keep your advice about dimensionality reduction in mind.

giosco commented 5 years ago

Hi, I am facing the same issues as above, although I print out and check for d=0 and it never fails; d is always greater than zero. Have you found an answer to this? I use PCA for dimensionality reduction (I work with around 15 columns).

Really like your work on hdbscan, thanks!!

lmcinnes commented 5 years ago

I admit that the validity index has never been very high on my priority list, so I haven't looked into this in any real detail. It is certainly related to the fact that we are raising something to the power of -1/d, so my best guess is that the value being raised to that power is zero, or sufficiently small that the internals of the power computation break. I suspect that means your answer is very close to zero. To be honest, I'm not sure of the best way to catch/handle that situation. You can check by altering the code to skip that line and seeing what sorts of values result takes.
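As a rough sketch of that check (mirroring the first half of all_points_core_distance, with distance_matrix and d as in that function):

import numpy as np

dm = distance_matrix.copy()
dm[dm != 0] = (1.0 / dm[dm != 0]) ** d
intermediate = dm.sum(axis=1) / (dm.shape[0] - 1)
# Any zeros here blow up in the final result ** (-1.0 / d) step
print("zero density estimates:", np.sum(intermediate == 0), "min:", intermediate.min())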

laminenoureddine commented 5 years ago

@lmcinnes : Hello, I'm facing the same problem with a precomputed distance matrix. When I tried to figure out where exactly it fails, I found that the value being raised to that power is zero. Why? Simply because all the points of the given cluster are exactly the same, i.e. all pairwise distances between the points of this cluster are 0, and then naturally 0**(-1/d) raises the error.
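For concreteness, the degenerate case can be reproduced in isolation (a minimal sketch, not the library code):

import numpy as np

# A "cluster" of identical points: every pairwise distance is 0
distance_matrix = np.zeros((4, 4))
d = 2.0
result = distance_matrix.sum(axis=1) / (distance_matrix.shape[0] - 1)
print(result ** (-1.0 / d))  # divide-by-zero warning, yields [inf inf inf inf]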

But OK, in such an exceptional situation, where all the points of a given cluster are exactly the same and the pairwise distances between them are 0, what can we say about the stability/validity of this cluster? Can we fix the code by treating it as 100% valid/stable when such a case occurs?

Thanks in advance for your answer.

zoiets commented 4 years ago

Hi, I have encountered the same issue in a similar setting (with features built by tf-idf). Although I do dimensionality reduction, I want to evaluate the clustering with the original metric (cosine distance) on the original feature space, because I choose the dimensionality reduction parameters in the same way as the other hyperparameters.

After some debugging I found two issues:

dmenager commented 4 years ago

I have a very similar situation to everyone here, and I would also benefit from a fix for this issue.

DogNick commented 3 years ago

I am also facing the same situation as @cmrosenberg, and I'm looking forward to a fix for this issue.

DogNick commented 3 years ago

@lmcinnes

tadorfer commented 2 years ago

I was facing the same issue and created a PR with a fix. The fix is simple: if all feature values in a cluster are the same, all pairwise distances are zero, and thus you end up evaluating 0**(-1/d), which triggers the divide-by-zero warning and ultimately yields nan, as @laminenoureddine pointed out. However, what the function should return in such a case is an array of zeros of length len(distance_matrix).

This can be demonstrated as follows:

1) Replica of the all_points_core_distance function:

def all_points_core_distance(distance_matrix, d=2.0):
    # Replace each non-zero distance with (1 / distance) ** d (note: this mutates the input in place)
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
            distance_matrix != 0]) ** d
    # Average over the other points to get a density estimate per point
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    # Convert the density back to a distance; 0 ** (-1.0 / d) blows up here
    result **= (-1.0 / d)

    return result

2) Generate a random distance matrix:

import numpy as np
distance_matrix = np.random.rand(5, 5)
distance_matrix
array([[0.56963041, 0.6252871 , 0.66563989, 0.16890913, 0.8897523 ],
       [0.99149002, 0.38510157, 0.78422476, 0.32735191, 0.21385746],
       [0.3270214 , 0.82219126, 0.27109467, 0.0402336 , 0.87518265],
       [0.6994377 , 0.31734435, 0.73074764, 0.39523852, 0.4298506 ],
       [0.18092312, 0.80021031, 0.35009419, 0.12110236, 0.43478347]])

3) Run the function

core_dists = all_points_core_distance(distance_matrix.copy(), d=distance_matrix.shape[0])  # pass a copy; the function mutates its input
core_dists
array([0.22265342, 0.27328665, 0.05308749, 0.38175941, 0.15563301])

4) Check how the output changes if the pair-wise distances are a lot smaller

distance_matrix = distance_matrix / 10**6 # decrease value by a factor of a million
core_dists = all_points_core_distance(distance_matrix, d=distance_matrix.shape[0])
core_dists
array([2.22653422e-07, 2.73286653e-07, 5.30874899e-08, 3.81759408e-07,
       1.55633010e-07])

Conclusion

The smaller the values in the distance_matrix, the smaller the output. Thus, if all distances are zero, this function should output an array of zeros of length len(distance_matrix).
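A sketch of the guard described above (the actual PR may differ in details):

import numpy as np

def all_points_core_distance(distance_matrix, d=2.0):
    # If every pairwise distance is zero, all points coincide; return zeros
    # instead of evaluating 0 ** (-1.0 / d), which yields inf/nan downstream
    if np.all(distance_matrix == 0):
        return np.zeros(len(distance_matrix))
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
            distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    result **= (-1.0 / d)

    return result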