scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
2.8k stars 501 forks

"ValueError: zero-size array to reduction operation minimum which has no identity" with no leafs #151

Closed m-dz closed 6 years ago

m-dz commented 6 years ago

Hi, I think I might have accidentally tracked down an error related to #115 and #144; please see below.

Using the current master branch on Win 10 64-bit and Python 2.7.14:

import numpy as np
import matplotlib.pyplot as plt

from hdbscan import HDBSCAN

# Generate data
test_data = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.8, 1.0],
    [1.0, 0.8],
    [0.8, 0.8]])

# HDBSCAN
np.random.seed(1)
hdb_unweighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=True)
hdb_unweighted.fit(test_data)

fig = plt.figure()
cd = hdb_unweighted.condensed_tree_
cd.plot()
fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()

Whole traceback ("anonymised"):

Traceback (most recent call last):
  File "...\JetBrains\PyCharm 2017.2.4\helpers\pydev\pydev_run_in_console.py", line 37, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File ".../.PyCharm2017.3/config/scratches/scratch_2.py", line 22, in <module>
    cd.plot()
  File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 321, in plot
    max_rectangle_per_icicle=max_rectangles_per_icicle)
  File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 104, in get_plot_data
    leaves = _get_leaves(self._raw_tree)
  File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 44, in _get_leaves
    root = cluster_tree['parent'].min()
  File "C:\ProgramData\Anaconda3\envs\venv_temp_hdbscan_dev_py27\lib\site-packages\numpy\core\_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity

I have tracked down the problem to lines 42-45 of plots.py:

def _get_leaves(condensed_tree):
    cluster_tree = condensed_tree[condensed_tree['child_size'] > 1]
    root = cluster_tree['parent'].min()
    return _recurse_leaf_dfs(cluster_tree, root)

The cluster_tree created here is empty, so the .min() call on line 44 throws the error.

I am not sure if there is any solution to this, except maybe plotting the single_linkage_tree_ instead?
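For reference, the failure is easy to reproduce in isolation. This is a standalone sketch (with a made-up structured dtype resembling the condensed tree, not hdbscan's actual code path): filtering down to zero rows and then calling .min() raises exactly the ValueError from the traceback above.

```python
import numpy as np

# Five rows, every child having child_size == 1, as happens when no cluster
# with more than one point survives the condensing step.
condensed_tree = np.array(
    [(5, i, 0.5, 1) for i in range(5)],
    dtype=[('parent', np.int64), ('child', np.int64),
           ('lambda_val', np.float64), ('child_size', np.int64)])

cluster_tree = condensed_tree[condensed_tree['child_size'] > 1]  # empty array

try:
    cluster_tree['parent'].min()  # reduction over zero elements
except ValueError as err:
    print(err)  # zero-size array to reduction operation minimum which has no identity
```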

m-dz commented 6 years ago

On the other hand, it might be that the condense_tree method is missing nodes 5 - 8; please see the single linkage tree plot for the same data here:

(image: single linkage tree plot)

m-dz commented 6 years ago

"Fixed" with having two clusters:

test_data = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.8, 1.0],
    [1.0, 0.8],
    [0.8, 0.8],
    [0.0, 1.0],
    [0.0, 0.8],
    [0.2, 1.0],
    [0.2, 0.8]])

lmcinnes commented 6 years ago

This turned out to be a straightforward fix in a file I was working on anyway, so I think it is now fixed in the master branch. Let me know.

m-dz commented 6 years ago

Great. I wasn't sure what to do within the if cluster_tree.shape[0] == 0 branch. Will fetch and test tomorrow.

m-dz commented 6 years ago

Got another error with int64 not having len() on that line; fixed with:

if isinstance(leaves, np.int64):
    cluster_x_coords = {leaves: leaf_separation}
else:
    cluster_x_coords = dict(zip(leaves, [leaf_separation * x
                                         for x in range(len(leaves))]))

I can do a PR with "fix for fix for issue #151" or something along those lines.
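A slightly more general version of the same guard could use np.ndim(), which is 0 for any scalar leaf, so it also covers plain Python ints and other integer dtypes, not just np.int64. This is a sketch (leaf_x_coords is a hypothetical helper name, not the function in plots.py):

```python
import numpy as np

def leaf_x_coords(leaves, leaf_separation=1.0):
    """Map each leaf cluster id to an x coordinate for the icicle plot."""
    if np.ndim(leaves) == 0:  # a single leaf came back as a bare scalar
        return {np.asarray(leaves).item(): leaf_separation}
    # Normal case: a sequence of leaves, spaced leaf_separation apart.
    return {leaf: leaf_separation * x for x, leaf in enumerate(leaves)}
```

For example, leaf_x_coords(np.int64(7)) gives {7: 1.0}, while leaf_x_coords([3, 5]) gives {3: 0.0, 5: 1.0}.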

lmcinnes commented 6 years ago

Thanks!

kavin26 commented 6 years ago

Hi, I recently got to know of the HDBSCAN package for clustering, and I'm testing it with my news articles dataset, planning to deploy it in production. I executed this code:

hd_cluster_model = hdbscan.HDBSCAN(
    min_cluster_size=5, min_samples=5, alpha=0.8,
    memory='/media/kavin/kavin-linux-os-data/data1/news_recommendation/',
    prediction_data=True, cluster_selection_method='leaf', metric='manhattan')
hd_cluster_model.fit(train_tfidf_matrix)

for which I got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/hdbscan/hdbscan_.py", line 816, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/hdbscan/hdbscan_.py", line 565, in hdbscan
    match_reference_implementation) + \
  File "/usr/local/lib/python3.5/dist-packages/hdbscan/hdbscan_.py", line 62, in _tree_to_labels
    match_reference_implementation)
  File "hdbscan/_hdbscan_tree.pyx", line 610, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11757)
  File "hdbscan/_hdbscan_tree.pyx", line 691, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11205)
  File "hdbscan/_hdbscan_tree.pyx", line 607, in hdbscan._hdbscan_tree.get_cluster_tree_leaves (hdbscan/_hdbscan_tree.c:10449)
  File "/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity

What's the possible cause of this error?

lmcinnes commented 6 years ago

The current master branch on github resolves this issue, but the fix has not been pushed out in a release yet. You can either pull from the repository and build it yourself or patch your install according to commit 98eef99. Hopefully I'll be rolling out a new release in the not too distant future, but I wanted to potentially gather a few more patches/bug fixes before making a release.

kavin26 commented 6 years ago

I'd used the GitHub repository, but the error still hasn't been solved.

lmcinnes commented 6 years ago

That's a little more disconcerting, I believed this was resolved. Let me check a little further. I'm on holiday right now, so I can't promise prompt results unfortunately.

kavin26 commented 6 years ago

Let me try to debug it myself... I shouldn't spoil your holiday ;) But is there any specific reason the cosine metric is not supported for prediction data? The cosine metric gives me decent clusters compared with other distance metrics. I'm using a tf-idf matrix computed from raw news articles as the feature set in clusterer.fit, and my application is news recommendation.

lmcinnes commented 6 years ago

Cosine is not actually a distance metric (it fails the triangle inequality). The code makes heavy use of the triangle inequality, so supporting cosine would require doing all distance calculations by brute force, which is problematic for anything but very small datasets.
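One common workaround (my own suggestion, not an hdbscan feature): L2-normalize the rows first. On unit vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), a monotone function of cosine distance, so fitting with metric='euclidean' on normalized tf-idf rows approximates cosine-based clustering while using a true metric. A sketch:

```python
import numpy as np

def l2_normalize_rows(X):
    """Scale each row of X to unit L2 norm (all-zero rows are left as-is)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    return X / norms

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
Xn = l2_normalize_rows(X)
# Squared Euclidean distance on the normalized rows...
d2 = np.sum((Xn[0] - Xn[1]) ** 2)
# ...matches 2 * (1 - cosine similarity) of the original rows.
cos = X[0] @ X[1] / (np.linalg.norm(X[0]) * np.linalg.norm(X[1]))
assert np.allclose(d2, 2.0 * (1.0 - cos))
```

Note that HDBSCAN's density estimates depend on the actual distance values, not just their ranks, so the results are related but not identical to clustering on raw cosine distances.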

ameinel commented 6 years ago

Hi, I have encountered the same bug #151 with my dataset, using the current master version of the repository. I get the error "ValueError: zero-size array to reduction operation minimum which has no identity" only when I set cluster_selection_method = 'leaf' (with 'eom' everything is fine), and only when no clusters remain. It runs perfectly if N_clusters > 0, but it crashes when I, e.g., keep increasing min_samples until there are no clusters any more. Here is the full error message that I received:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      2 cluster_instance = hdbscan.HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True,
      3                                    gen_min_span_tree=False, metric='euclidean', min_cluster_size=param_tmp,
----> 4                                    cluster_selection_method='leaf', min_samples=param_tmp, p=None).fit(features_all)
      5 len(set(cluster_instance.labels_)) - (1 if -1 in cluster_instance.labels_ else 0)

~/venvs/clustering/lib/python3.6/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
~/venvs/clustering/lib/python3.6/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
~/venvs/clustering/lib/python3.6/site-packages/hdbscan/hdbscan_.py in _tree_to_labels(X, single_linkage_tree, min_cluster_size, cluster_selection_method, allow_single_cluster, match_reference_implementation)
hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11757)()
hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11205)()
hdbscan/_hdbscan_tree.pyx in hdbscan._hdbscan_tree.get_cluster_tree_leaves (hdbscan/_hdbscan_tree.c:10449)()
~/venvs/clustering/lib/python3.6/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims)
     27
     28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29     return umr_minimum(a, axis, None, out, keepdims)
     30
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation minimum which has no identity

lmcinnes commented 6 years ago

I believe that last commit should resolve the issue. Thanks for the report.

ameinel commented 6 years ago

Yes, it works for my case now. Thanks for fixing the issue so fast.

colobas commented 6 years ago

Hey, I'm getting this error too. It's caused by trying to compute .min() on an empty numpy array here:

cpdef list get_cluster_tree_leaves(np.ndarray cluster_tree):
    if cluster_tree.shape[0] == 0:
        return []
    root = cluster_tree['parent'].min()
    return recurse_leaf_dfs(cluster_tree, root)

lmcinnes commented 6 years ago

That is disconcerting. I believe the guard, just above, was supposed to cover just such a case with an early return. Can you share the data by any chance? I'm a little disconcerted because it seems like this shouldn't be possible according to what the code says -- somehow an empty array must have a non-zero shape.

colobas commented 6 years ago

Hey,

Unfortunately I cannot share data. I can however say that I'm using a precomputed metric, and so I'm passing a distance metric to the clusterer. Not sure if that can make a difference.

Thanks for the quick reply,
Guilherme Grijó Pires

colobas commented 6 years ago

Actually, you're right: it's not breaking on that line, but on the if-statement before it. I'm going to debug further and will update you.

lmcinnes commented 6 years ago

Thanks. It is possible this got fixed in master but an update didn't get rolled out to PyPI and conda-forge. Keep me posted, and thanks for taking the time to dig into this, it is greatly appreciated (especially when issues are hard to reproduce).

colobas commented 6 years ago

Didn't have much time to investigate further, but I got it to work by substituting if len(cluster_tree) == 0 for if cluster_tree.shape[0] == 0. This suggests cluster_tree is arriving as something other than a numpy array. When I have time I'll dig a bit deeper.

lmcinnes commented 6 years ago

So it would seem. The fix may be to go with len() then, since that is valid for numpy arrays as well as whatever is actually arriving there for you.
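For what it's worth, a quick sanity check (my own sketch, not part of the fix) that the swap is a no-op in the ordinary case: for a 1-D structured array, which is what cluster_tree normally is, len() and .shape[0] agree, so a len()-based guard behaves identically there while also tolerating list-like inputs.

```python
import numpy as np

# A structured dtype resembling the condensed-tree rows (an assumption for
# illustration; field names match the code above).
dtype = [('parent', np.int64), ('child', np.int64),
         ('lambda_val', np.float64), ('child_size', np.int64)]

empty_tree = np.empty(0, dtype=dtype)
full_tree = np.array([(0, 1, 0.5, 2)], dtype=dtype)

# len() and .shape[0] agree for 1-D structured arrays...
assert len(empty_tree) == empty_tree.shape[0] == 0
assert len(full_tree) == full_tree.shape[0] == 1
# ...and len() additionally works for plain Python sequences.
assert len([]) == 0
```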