scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Overflow and divide by zero when using validity_index #127

Open cmrosenberg opened 7 years ago

cmrosenberg commented 7 years ago

Hello! Thank you for this excellent Clustering module!

I am attempting to cluster text data using HDBSCAN, and I want to assess the resulting clustering by using the provided validity index. However, the validity_index procedure throws a series of numerical errors at me. Concretely, the following code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

import hdbscan

training_set = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(training_set.data)

clusterer = hdbscan.HDBSCAN()

clustering = clusterer.fit(vectors)

print(hdbscan.validity.validity_index(vectors.toarray(), clusterer.labels_))

Produces the following output:

<virtualenv>/lib/python3.6/site-packages/hdbscan/validity.py:34: RuntimeWarning: divide by zero encountered in power
  result **= (-1.0 / d)
<virtualenv>/lib/python3.6/site-packages/hdbscan/validity.py:31: RuntimeWarning: overflow encountered in power
  distance_matrix != 0]) ** d
<virtualenv>/lib/python3.6/site-packages/hdbscan/validity.py:375: RuntimeWarning: invalid value encountered in double_scalars
  max(min_density_sep, density_sparseness[i])
nan

I am on version 0.8.10. Am I missing something?

lmcinnes commented 7 years ago

I believe it is somehow taking the dimension of your vectors to be 0, which is what would cause the problem. As to why it is doing that -- I am not honestly sure at first inspection. You can try setting d explicitly via a keyword argument in the call. That may provide a workaround in the meantime.
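For example, something along these lines (an untested sketch, reusing vectors and clusterer from your snippet) should make the dimension explicit:

# Pass d explicitly so it is not inferred as 0 from the input
X = vectors.toarray()
print(hdbscan.validity.validity_index(X, clusterer.labels_, d=X.shape[1]))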

lmcinnes commented 7 years ago

As a side note, I would really recommend a dimensionality reduction step prior to running HDBSCAN here -- being a density-based algorithm, it won't do well with very high-dimensional data. Just using TruncatedSVD from scikit-learn would be a good start, although UMAP may be an interesting option (you can reduce to, say, 10 or 50 dimensions instead of just 2).
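For instance, a rough sketch of that pipeline (untested, adapted from the snippet above) might look like:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

import hdbscan

training_set = fetch_20newsgroups(subset='train', categories=['alt.atheism', 'sci.space'])
vectors = TfidfVectorizer().fit_transform(training_set.data)

# Reduce the sparse TF-IDF matrix to ~50 dense dimensions before clustering
reduced = TruncatedSVD(n_components=50).fit_transform(vectors)

clusterer = hdbscan.HDBSCAN().fit(reduced)
print(hdbscan.validity.validity_index(reduced, clusterer.labels_))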

cmrosenberg commented 7 years ago

Thank you for your swift and thoughtful response, @lmcinnes! I'll investigate some more and report back if I find something. Will also keep your advice about dimensionality reduction in mind.

giosco commented 5 years ago

Hi, I am facing the same issues as above, although I print out and check for d=0 and it never fails; d is always greater than zero. Have you found an answer to this? I use PCA for dimensionality reduction (I work with around 15 columns).

Really like your work on hdbscan, thanks!!

lmcinnes commented 5 years ago

I admit that the validity index has never been very high on my priority list, so I haven't looked into this in any real detail. It is certainly related to the fact that we are raising something to the power of -1/d, so my best guess is that the value being raised to that power is zero, or sufficiently small that the internals of the power computation break. I suspect that means your answer is very close to zero. To be honest, I'm not sure of the best way to catch/handle that situation. You can check by altering the code to skip that line and seeing what sorts of values result takes.
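As a rough sketch of that check (mirroring the first half of all_points_core_distance, with distance_matrix and d as in that function):

import numpy as np

dm = distance_matrix.copy()
dm[dm != 0] = (1.0 / dm[dm != 0]) ** d
intermediate = dm.sum(axis=1) / (dm.shape[0] - 1)
# Any zeros here blow up in the final result ** (-1.0 / d) step
print("zero density estimates:", np.sum(intermediate == 0), "min:", intermediate.min())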

laminenoureddine commented 5 years ago

@lmcinnes : Hello, I'm facing the same problem with a precomputed distance matrix. When I tried to figure out where exactly it fails, I found that the value being raised to that power is zero. Why? Simply because all the points of the given cluster are exactly the same, i.e. all pairwise distances between the points of this cluster are 0, and then naturally 0**(-1/d) raises the error.
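For concreteness, the degenerate case can be reproduced in isolation (a minimal sketch, not the library code):

import numpy as np

# A "cluster" of identical points: every pairwise distance is 0
distance_matrix = np.zeros((4, 4))
d = 2.0
result = distance_matrix.sum(axis=1) / (distance_matrix.shape[0] - 1)
print(result ** (-1.0 / d))  # divide-by-zero warning, yields [inf inf inf inf]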

But OK, in such an exceptional situation, where all the points of a given cluster are exactly the same and the pairwise distances between them are 0, what can we say about the stability/validity of this cluster? Can we fix the code by treating it as 100% valid/stable when such a case occurs?

Thanks in advance for your answer.

zoiets commented 4 years ago

Hi, I have encountered the same issue in a similar setting (with features built by tf-idf). Although I do dimensionality reduction, I want to evaluate the clustering with the original metric (cosine distance) on the original feature space, because I choose the dimensionality reduction parameters in the same way as the other hyperparameters.

After some debugging I found two issues:

dmenager commented 4 years ago

I have a very similar situation to everyone here, and I would also benefit from a fix for this issue.

DogNick commented 3 years ago

I am also facing the same situation as @cmrosenberg, and I'm looking forward to a fix for this issue.

DogNick commented 3 years ago

@lmcinnes

tadorfer commented 2 years ago

I was facing the same issue and created a PR with a fix. The fix is simple: if all feature values in a cluster are the same, all pairwise distances are zero, and thus you end up evaluating 0**(-1/d), which triggers the divide-by-zero warning and ultimately yields nan, as @laminenoureddine pointed out. However, what the function should return in such a case is an array of zeros of length len(distance_matrix).

This can be demonstrated as follows:

1) Replica of the all_points_core_distance function:

def all_points_core_distance(distance_matrix, d=2.0):
    # Replace each non-zero distance with (1 / distance) ** d (note: this mutates the input in place)
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
            distance_matrix != 0]) ** d
    # Average over the other points to get a density estimate per point
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    # Convert the density back to a distance; 0 ** (-1.0 / d) blows up here
    result **= (-1.0 / d)

    return result

2) Generate a random distance matrix:

import numpy as np
distance_matrix = np.random.rand(5, 5)
distance_matrix
array([[0.56963041, 0.6252871 , 0.66563989, 0.16890913, 0.8897523 ],
       [0.99149002, 0.38510157, 0.78422476, 0.32735191, 0.21385746],
       [0.3270214 , 0.82219126, 0.27109467, 0.0402336 , 0.87518265],
       [0.6994377 , 0.31734435, 0.73074764, 0.39523852, 0.4298506 ],
       [0.18092312, 0.80021031, 0.35009419, 0.12110236, 0.43478347]])

3) Run the function

core_dists = all_points_core_distance(distance_matrix.copy(), d=distance_matrix.shape[0])  # pass a copy; the function mutates its input
core_dists
array([0.22265342, 0.27328665, 0.05308749, 0.38175941, 0.15563301])

4) Check how the output changes if the pair-wise distances are a lot smaller

distance_matrix = distance_matrix / 10**6 # decrease value by a factor of a million
core_dists = all_points_core_distance(distance_matrix, d=distance_matrix.shape[0])
core_dists
array([2.22653422e-07, 2.73286653e-07, 5.30874899e-08, 3.81759408e-07,
       1.55633010e-07])

Conclusion

The smaller the values in the distance_matrix, the smaller the output. Thus, if all distances are zero, this function should output an array of zeros of length len(distance_matrix).
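A sketch of the guard described above (the actual PR may differ in details):

import numpy as np

def all_points_core_distance(distance_matrix, d=2.0):
    # If every pairwise distance is zero, all points coincide; return zeros
    # instead of evaluating 0 ** (-1.0 / d), which yields inf/nan downstream
    if np.all(distance_matrix == 0):
        return np.zeros(len(distance_matrix))
    distance_matrix[distance_matrix != 0] = (1.0 / distance_matrix[
            distance_matrix != 0]) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    result **= (-1.0 / d)

    return result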