cmrosenberg opened this issue 7 years ago (status: Open)
I believe it is somehow taking the dimension of your vectors to be 0, which is what would cause the problem. As to why it is doing that -- I am honestly not sure on first inspection. You can try setting d explicitly via a keyword argument in the call. That may provide a workaround in the meantime.
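For example, something along these lines might serve as a workaround (a sketch only: make_blobs stands in for your real data, and the exact signature of validity_index may vary between versions):

import hdbscan
from hdbscan.validity import validity_index
from sklearn.datasets import make_blobs

# Toy data standing in for the real dataset.
X, _ = make_blobs(n_samples=200, n_features=10, centers=3, random_state=42)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)

# Pass the dimension d explicitly instead of letting it be inferred.
score = validity_index(X, labels, d=X.shape[1])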
As a side note, I would really recommend you do a dimension reduction step prior to running HDBSCAN here -- it won't do well with very high dimensional data (being a density based algorithm). Just using TruncatedSVD from scikit-learn would be a good start, although potentially using UMAP may be an interesting step (you can reduce to say 10 or 50 dimensions instead of just 2).
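For instance, a minimal sketch of that pipeline (the 50-component target, the min_cluster_size value, and the X_high variable are illustrative assumptions, not prescriptions):

import hdbscan
from sklearn.decomposition import TruncatedSVD

# X_high is assumed to be your high-dimensional (e.g. sparse tf-idf) matrix.
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X_high)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(X_reduced)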
Thank you for your swift and thoughtful response, @lmcinnes! I'll investigate some more and report back if I find something. Will also keep your advice about dimensionality reduction in mind.
Hi, I am facing the same issue as above, although I print out and run checks for d=0 and it never fails; d is always greater than zero. Have you found an answer to this? I use PCA for dimensionality reduction (I work with around 15 columns).
Really like your work on hdbscan, thanks!!
I admit that the validity index has never been that high on my priority list, so I haven't looked into this in any real detail. It is certainly related to the fact that we are raising something to the power of -1/d, so my best guess is that the value being raised to that power is zero, or sufficiently small that the internals of the power computation are breaking. I suspect that means your answer is very close to zero. I'm not sure the best way to catch/handle that situation, to be honest. You can check by altering the code to skip that line and see what sorts of values result is taking.
@lmcinnes: Hello, I'm facing the same problem with a precomputed distance matrix. When I tried to figure out where exactly it fails, I found that the value being raised to that power is zero. Why? Because all the points of the given cluster are exactly the same, i.e. all pairwise distances between the points of this cluster are 0, and then naturally 0**(-1/d) raises the error.
But OK, in such an exceptional situation, where all the points of a given cluster are identical and the pairwise distances between them are 0, what can we say about the stability/validity of this cluster? Can we fix the code by setting it to 100% valid/stable when such a case occurs?
Thanks in advance for your answer.
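For what it's worth, this failure mode is easy to reproduce in isolation (a minimal illustration with an arbitrary d=15; plain Python raises, while NumPy emits a RuntimeWarning and yields a non-finite value):

import numpy as np

d = 15.0
try:
    0.0 ** (-1.0 / d)   # plain Python float raises immediately
except ZeroDivisionError as err:
    print(err)          # 0.0 cannot be raised to a negative power

print(np.float64(0.0) ** (-1.0 / d))  # NumPy: RuntimeWarning, result is inf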
Hi, I have encountered the same issue in a similar setting (with features built from tf-idf). Although I do dimensionality reduction, I wish to evaluate the clustering with the original metric (cosine distance) on the original feature space, because I choose the dimension-reduction parameters in the same way as the other hyperparameters.
After some debugging I found two issues:
I have a very similar situation to everyone here, and I would also benefit from a fix to this issue.
I also have the same situation as @cmrosenberg; I'm looking forward to a fix for this issue.
@lmcinnes
I was facing the same issue and created a PR with a fix. The fix is simple: if all feature values in a cluster are the same, their pairwise distances are zero and thus you'd have 0**(-1/d), which raises an error and returns nan, as @laminenoureddine pointed out. However, what it should return in such a case is an array of zeros of length len(distance_matrix) (see the sketch after the conclusion below).
This can be demonstrated as follows:
1) Replica of the all_points_core_distance function:
def all_points_core_distance(distance_matrix, d=2.0):
    # Invert and raise the non-zero pairwise distances to the power d
    # (note: this mutates distance_matrix in place).
    distance_matrix[distance_matrix != 0] = (
        1.0 / distance_matrix[distance_matrix != 0]
    ) ** d
    # Average each row over the other points in the cluster...
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    # ...then take the (-1/d)-th power; a row that sums to zero hits
    # 0**(-1/d) here, which is the failure discussed above.
    result **= (-1.0 / d)
    return result
2) Generate a random distance matrix:
import numpy as np
distance_matrix = np.random.rand(5, 5)
distance_matrix
array([[0.56963041, 0.6252871 , 0.66563989, 0.16890913, 0.8897523 ],
[0.99149002, 0.38510157, 0.78422476, 0.32735191, 0.21385746],
[0.3270214 , 0.82219126, 0.27109467, 0.0402336 , 0.87518265],
[0.6994377 , 0.31734435, 0.73074764, 0.39523852, 0.4298506 ],
[0.18092312, 0.80021031, 0.35009419, 0.12110236, 0.43478347]])
3) Run the function:
core_dists = all_points_core_distance(distance_matrix, d=distance_matrix.shape[0])
core_dists
array([0.22265342, 0.27328665, 0.05308749, 0.38175941, 0.15563301])
4) Check how the output changes if the pairwise distances are a lot smaller:
distance_matrix = distance_matrix / 10**6  # shrink all distances by a factor of a million
core_dists = all_points_core_distance(distance_matrix, d=distance_matrix.shape[0])
core_dists
array([2.22653422e-07, 2.73286653e-07, 5.30874899e-08, 3.81759408e-07,
1.55633010e-07])
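Pushing this to the limit, an all-zero matrix triggers the failure directly (a quick check of my own; depending on the NumPy version the result shows up as inf or nan, alongside a divide-by-zero RuntimeWarning):

distance_matrix = np.zeros((5, 5))
core_dists = all_points_core_distance(distance_matrix, d=5)
# RuntimeWarning: divide by zero encountered in power
# core_dists -> array([inf, inf, inf, inf, inf])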
Conclusion
The smaller the values in the distance_matrix, the smaller the output. Thus, if all distances are zero, this function should output an array of zeros of length len(distance_matrix).
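For reference, a minimal sketch of that kind of guard (my illustration of the idea; the actual PR may differ):

import numpy as np

def all_points_core_distance_fixed(distance_matrix, d=2.0):
    # Guard: if every pairwise distance is zero, the formula below would
    # evaluate 0**(-1/d), so return the all-zeros result argued for above.
    if not np.any(distance_matrix):
        return np.zeros(len(distance_matrix))
    # Otherwise proceed exactly as the original function does.
    distance_matrix[distance_matrix != 0] = (
        1.0 / distance_matrix[distance_matrix != 0]
    ) ** d
    result = distance_matrix.sum(axis=1)
    result /= distance_matrix.shape[0] - 1
    result **= (-1.0 / d)
    return result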
Hello! Thank you for this excellent clustering module!
I am attempting to cluster text data using HDBSCAN, and I want to assess the resulting clustering by using the provided validity index. However, the validity_index procedure throws a series of numerical errors at me. Concretely, the following code:
Produces the following output:
I am on version 0.8.10. Am I missing something?