scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Training models with cluster_selection_method leaf leads to errors #115

Open stefanloerwald opened 7 years ago

stefanloerwald commented 7 years ago

Using commit fbe0fe6b4be4caf49d7ae5eaf44f2ea47e8be7aa, hdbscan fails when cluster_selection_method is leaf and min_cluster_size or min_samples takes some (ridiculous) values:

Traceback (most recent call last):
  File "HDBSCAN.py", line 79, in Train
    trained_model = model.fit(x)
  File "build/bdist.linux-x86_64/egg/hdbscan/hdbscan_.py", line 864, in fit
  File "build/bdist.linux-x86_64/egg/hdbscan/hdbscan_.py", line 613, in hdbscan
  File "build/bdist.linux-x86_64/egg/hdbscan/hdbscan_.py", line 110, in _tree_to_labels
  File "hdbscan/_hdbscan_tree.pyx", line 610, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11757)
  File "hdbscan/_hdbscan_tree.pyx", line 691, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11205)
  File "hdbscan/_hdbscan_tree.pyx", line 607, in hdbscan._hdbscan_tree.get_cluster_tree_leaves (hdbscan/_hdbscan_tree.c:10449)
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity
Program exited with code 1
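
The last frames show that the exception is raised by a NumPy minimum reduction over an empty array, called from get_cluster_tree_leaves. The sketch below reproduces only that NumPy behaviour in isolation; it is not the hdbscan code. The helper name safe_min and the idea of returning a default for empty input are assumptions made for illustration, not the library's actual fix.

```python
import numpy as np


def safe_min(values, default=None):
    """Return the minimum of values, or default when there is nothing to reduce.

    Hypothetical helper for illustration only; not part of hdbscan.
    """
    arr = np.asarray(values)
    if arr.size == 0:
        # An empty array has no minimum, so hand back the caller's default
        # instead of letting NumPy raise.
        return default
    return arr.min()


# An empty array stands in for whatever empty input the tree code reduced.
empty = np.array([], dtype=np.intp)

try:
    empty.min()
    message = ""
except ValueError as exc:
    # Same ValueError text as in the traceback above.
    message = str(exc)

print(message)
print(safe_min(empty), safe_min([3, 1, 2]))
```

Checking the array size before reducing is one way to turn this crash into an explicit "no clusters" result; which value the caller should fall back to is a design decision for the library.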

The errors do not occur with cluster_selection_method eom.

This error is reproducible with the iris dataset and the following parameter settings of hdbscan: min_cluster_size = 70, min_samples = 500, metric = cityblock, alpha = 0.1, p = 1, algorithm = best, leaf_size = 4, approx_min_span_tree = True, gen_min_span_tree = True, cluster_selection_method = leaf, allow_single_cluster = True, match_reference_implementation = False
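
The settings above can be written as a short script. This is a sketch only: loading iris through scikit-learn's load_iris is an assumption (the report does not say how the data was loaded), the names PARAMS and reproduce are illustrative, and whether fit still raises depends on the installed hdbscan version.

```python
# Parameter settings copied from the report above.
PARAMS = dict(
    min_cluster_size=70,
    min_samples=500,
    metric="cityblock",
    alpha=0.1,
    p=1,
    algorithm="best",
    leaf_size=4,
    approx_min_span_tree=True,
    gen_min_span_tree=True,
    cluster_selection_method="leaf",
    allow_single_cluster=True,
    match_reference_implementation=False,
)


def reproduce(params=PARAMS):
    """Fit HDBSCAN on iris with the reported settings (illustrative helper)."""
    # Imported lazily so the parameter dict can be inspected without
    # hdbscan or scikit-learn installed.
    from sklearn.datasets import load_iris
    import hdbscan

    x = load_iris().data  # 150 samples, 4 features
    model = hdbscan.HDBSCAN(**params)
    return model.fit(x)
```

Calling reproduce() on the commit named above should hit the ValueError from the traceback; calling reproduce(dict(PARAMS, cluster_selection_method="eom")) is the comparison case that was reported to work.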

lmcinnes commented 7 years ago

I am amazed that didn't fail earlier in the process to be honest -- were you fuzzing the implementation or something? I'll see if I can track down what the right way to handle this is.