scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Validity index calculation results in ValueError while calculating min/max #576

Open tacitvenom opened 1 year ago

tacitvenom commented 1 year ago

For a clustering use case, I tried different parameters, and while calculating the validity index I ran into the following ValueError:

<dir>/envs/venv/lib/python3.9/site-packages/hdbscan/validity.py:33: RuntimeWarning: invalid value encountered in divide
  result /= distance_matrix.shape[0] - 1
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [41], line 1
----> 1 validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 1, 0]))

File <dir>/envs/venv/lib/python3.9/site-packages/hdbscan/validity.py:372, in validity_index(X, labels, metric, d, per_cluster_scores, mst_raw_dist, verbose, **kwd_args)
    358     distances_for_mst, core_distances[
    359         cluster_id] = distances_between_points(
    360         X,
   (...)
    367         **kwd_args
    368     )
    370     mst_nodes[cluster_id], mst_edges[cluster_id] = \
    371         internal_minimum_spanning_tree(distances_for_mst)
--> 372     density_sparseness[cluster_id] = mst_edges[cluster_id].T[2].max()
    374 for i in range(max_cluster_id):
    376     if np.sum(labels == i) == 0:

File <dir>/envs/venv/lib/python3.9/site-packages/numpy/core/_methods.py:40, in _amax(a, axis, out, keepdims, initial, where)
     38 def _amax(a, axis=None, out=None, keepdims=False,
     39           initial=_NoValue, where=True):
---> 40     return umr_maximum(a, axis, None, out, keepdims, initial, where)

ValueError: zero-size array to reduction operation maximum which has no identity

Following is a toy example I could reproduce it with:

import numpy as np
from hdbscan import validity_index

validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 1, 0]))

lmcinnes commented 1 year ago

I think the catch may be having a cluster of size 1; that would definitely break something.
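
For reference, here is a minimal sketch of the failing reduction (assuming, per the traceback, that mst_edges rows are (from, to, weight) triples): a cluster whose internal MST has no edges yields a zero-size weight array, and calling .max() on it raises exactly this ValueError.

import numpy as np

mst_edges = np.empty((0, 3))  # hypothetical empty (from, to, weight) edge list
mst_edges.T[2].max()          # ValueError: zero-size array to reduction operation maximum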

tacitvenom commented 1 year ago

In this toy example, there are two clusters (cluster ids 0 and 1). The original data had around 3000 clusters, though.

armenbod commented 1 year ago

It appears that the ValueError is also present with a cluster of size 2, which I believe is not expected behaviour - you can test it using an adapted version of the toy example above:

validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 0, 0]))

(The size of the noise cluster -1 does not affect this.)

This issue is preventing me from using this otherwise useful metric on a data source that has several small (correct) clusters.
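
A plausible explanation, if validity.py follows the DBCV paper and keeps only "internal" MST edges (edges whose endpoints both have degree >= 2): a 2-point cluster's MST is a single edge between two leaves, so the internal edge set is empty. A sketch of that filtering logic (my own reconstruction, not the library's code):

import numpy as np

mst_edges = np.array([[0.0, 1.0, 0.7]])  # 2-point cluster: one MST edge (0)-(1)
endpoints = mst_edges[:, :2].astype(int)
degrees = np.bincount(endpoints.ravel())
# Keep only edges whose endpoints are both internal (degree >= 2).
internal = mst_edges[(degrees[endpoints[:, 0]] >= 2) & (degrees[endpoints[:, 1]] >= 2)]
print(internal.shape)  # (0, 3) -- no internal edges, so internal.T[2].max() raises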

mdagost commented 1 year ago

I'm having the same issue. It looks like it is indeed coming from clusters of size 2. Is there any update on a fix for that, or do I need to change the hyperparameters so that I don't get clusters that small?
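
Until there is a fix upstream, one workaround is to drop tiny clusters before scoring. A sketch (the helper name and the min_size threshold are my own, not part of the hdbscan API; it assumes at least two clusters survive the filter):

import numpy as np
from hdbscan import validity_index

def validity_index_without_tiny_clusters(X, labels, min_size=3):
    # Relabel clusters smaller than min_size as noise (-1), then renumber
    # the survivors 0..k-1 so validity_index sees contiguous label ids.
    labels = np.asarray(labels).copy()
    kept = [c for c in np.unique(labels)
            if c != -1 and np.sum(labels == c) >= min_size]
    relabeled = np.full_like(labels, -1)
    for new_id, c in enumerate(kept):
        relabeled[labels == c] = new_id
    return validity_index(X, relabeled)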

dboeckenhoff commented 10 months ago

Still seeing the same thing in 2023. This is a serious bug with size-2 clusters.