scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

ValueError: k must be less than or equal to the number of training points #263

Open Oolev opened 5 years ago

Oolev commented 5 years ago

When running the included code example on a different data set (3000 points), this error is raised:

Traceback (most recent call last):
  File "hdbscn_test.py", line 49, in <module>
    hdb = HDBSCAN(min_cluster_size=50).fit(X)
  File "build/bdist.linux-x86_64/egg/hdbscan/hdbscan.py", line 882, in fit
  File "build/bdist.linux-x86_64/egg/hdbscan/hdbscan.py", line 586, in hdbscan
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/memory.py", line 342, in __call__
    return self.func(*args, **kwargs)
  File "build/bdist.linux-x86_64/egg/hdbscan/hdbscan.py", line 265, in _hdbscan_boruvka_kdtree
  File "hdbscan/_hdbscan_boruvka.pyx", line 375, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 420, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "sklearn/neighbors/binary_tree.pxi", line 1309, in sklearn.neighbors.kd_tree.BinaryTree.query
ValueError: k must be less than or equal to the number of training points

Any idea how to fix this? Changing the parameters' values doesn't solve the issue.

lmcinnes commented 5 years ago

That seems remarkably odd; is your data transposed by accident? The error says you are asking for more neighbors than there are data points, and I believe that check is fairly robust, so I suspect something is off with your data.

Oolev commented 5 years ago

My data points (3000 in total) form a small, dense spherical shape surrounded by less dense 'noise'. The algorithm is expected to recognize the spherical shape (picture attached; do not mind the colors).

lmcinnes commented 5 years ago

I think the question is: if you call print(data.shape), does it print (3, 3000) or (3000, 3)? If it is the first one, then that is the issue; you'll want to run hdbscan on data.T.
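For anyone hitting the same error, here is a minimal sketch of that shape check. It assumes the usual scikit-learn convention that rows are points and columns are features, so a (3, 3000) array is read as 3 points in 3000 dimensions, which is fewer points than the neighbor query asks for. The helper name as_samples_by_features, its n_features argument, and the random stand-in data are hypothetical and only for illustration; they are not part of hdbscan.

```python
import numpy as np


def as_samples_by_features(data, n_features):
    """Return `data` shaped (n_samples, n_features), transposing if needed.

    Hypothetical helper for this sketch, not part of hdbscan.
    If both axes have length n_features the array is returned unchanged.
    """
    data = np.asarray(data)
    if data.ndim != 2:
        raise ValueError("expected a 2-D array, got shape %r" % (data.shape,))
    if data.shape[1] == n_features:
        return data        # already (points, features)
    if data.shape[0] == n_features:
        return data.T      # stored as (features, points): transpose
    raise ValueError("no axis of length %d in shape %r" % (n_features, data.shape))


# Stand-in for 3000 three-dimensional points stored the wrong way round.
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 3000))

X = as_samples_by_features(data, n_features=3)
print(data.shape, "->", X.shape)

# With hdbscan installed, clustering would then run on X, for example:
# import hdbscan
# labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(X)
```

If the printed shape already ends in the number of features, the transpose is not the problem and the cause lies elsewhere.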

rakeshskc commented 4 years ago

@lmcinnes data.T solved my problem, thanks.