Closed m-dz closed 6 years ago
On the other hand, it might be possible that the condense_tree method is missing nodes 5-8; please see the single linkage tree plot for the same data here:
"Fixed" by having two clusters:
test_data = np.array([[0.0, 0.0], [1.0, 1.0], [0.8, 1.0], [1.0, 0.8], [0.8, 0.8], [0.0, 1.0], [0.0, 0.8], [0.2, 1.0], [0.2, 0.8]])
This turned out to be a straightforward fix in a file I was working on anyway, so I think it is now fixed in the master branch. Let me know.
Great, wasn't sure what to do within if cluster_tree.shape[0] == 0. Will fetch and test tomorrow.
Got another error with int64 not having len() in that line; fixed with:
if isinstance(leaves, np.int64):
cluster_x_coords = {leaves: leaf_separation}
else:
cluster_x_coords = dict(zip(leaves, [leaf_separation * x
for x in range(len(leaves))]))
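An alternative sketch that avoids the isinstance branch (assuming leaves can arrive either as a lone np.int64 or as a sequence, as in the snippet above; leaf_separation here is just an illustrative value) would be to promote the scalar with np.atleast_1d:

```python
import numpy as np

leaf_separation = 1.0  # hypothetical value for illustration

# leaves may arrive as a lone np.int64 (single leaf) or as a sequence
for leaves in (np.int64(3), np.array([0, 1, 2])):
    leaves_arr = np.atleast_1d(leaves)  # promotes a scalar to a 1-element array
    cluster_x_coords = {leaf: leaf_separation * x
                        for x, leaf in enumerate(leaves_arr)}
    print(cluster_x_coords)
```

This keeps a single code path for both the single-leaf and multi-leaf cases.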
I can open a PR with "fix for fix for issue #151" or something along those lines.
Thanks!
Hi, I recently got to know of the HDBSCAN package for clustering and I'm testing it with my news articles dataset, planning to deploy it in production. I'd executed the code
hd_cluster_model = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=5, alpha=0.8, memory='/media/kavin/kavin-linux-os-data/data1/news_recommendation/', prediction_data=True, cluster_selection_method='leaf', metric='manhattan')
hd_cluster_model.fit(train_tfidf_matrix)
for which I got this error:
Traceback (most recent call last):
  File "/usr/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/hdbscan/hdbscan_.py", line 816, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/hdbscan/hdbscan_.py", line 565, in hdbscan
    match_reference_implementation) + \
  File "/usr/local/lib/python3.5/dist-packages/hdbscan/hdbscan_.py", line 62, in _tree_to_labels
    match_reference_implementation)
  File "hdbscan/_hdbscan_tree.pyx", line 610, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11757)
  File "hdbscan/_hdbscan_tree.pyx", line 691, in hdbscan._hdbscan_tree.get_clusters (hdbscan/_hdbscan_tree.c:11205)
  File "hdbscan/_hdbscan_tree.pyx", line 607, in hdbscan._hdbscan_tree.get_cluster_tree_leaves (hdbscan/_hdbscan_tree.c:10449)
  File "/usr/local/lib/python3.5/dist-packages/numpy/core/_methods.py", line 29, in _amin
    return umr_minimum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation minimum which has no identity
What's the possible cause of this error?
The current master branch on github resolves this issue, but the fix has not been pushed out in a release yet. You can either pull from the repository and build it yourself or patch your install according to commit 98eef99. Hopefully I'll be rolling out a new release in the not too distant future, but I wanted to potentially gather a few more patches/bug fixes before making a release.
I'd used the github repository, but the error still hasn't been solved.
That's a little more disconcerting, I believed this was resolved. Let me check a little further. I'm on holiday right now, so I can't promise prompt results unfortunately.
Let me try to debug it myself... I should not spoil your holidays ;) But is there any specific reason the cosine metric is not supported for prediction data? Because the cosine metric gives me decent clusters when compared with other distance metrics. I'm using a tf-idf matrix computed from raw news articles as the feature set in clusterer.fit, and my application is news recommendation.
Cosine is not actually a distance metric (it fails the triangle inequality). Due to heavy use of the triangle inequality in some of the code this makes it hard to work with unless you do all distance calculations by brute force which is problematic for anything but very small datasets.
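A concrete counterexample (not from the thread, just a minimal numpy sketch) shows the triangle inequality failing for cosine distance, along with a common workaround: L2-normalise the vectors first, after which Euclidean distance is a true metric whose pairwise ordering matches cosine distance:

```python
import numpy as np

def cosine_dist(u, v):
    # 1 - cosine similarity; NOT a true metric
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

# triangle inequality fails: d(a, c) > d(a, b) + d(b, c)
assert cosine_dist(a, c) > cosine_dist(a, b) + cosine_dist(b, c)

# workaround: normalise to unit length, then use euclidean (a true metric);
# on unit vectors, euclidean = sqrt(2 * cosine_dist), so the ordering of
# pairwise distances is preserved
def unit(v):
    return v / np.linalg.norm(v)

d_euc = np.linalg.norm(unit(a) - unit(b))
assert np.isclose(d_euc, np.sqrt(2.0 * cosine_dist(a, b)))
```

So for tf-idf data, normalising the rows and clustering with metric='euclidean' is often a reasonable stand-in for cosine.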
Hi, I have encountered the same bug #151 with my dataset. I used the current master version of the repository. Actually, I get the error "ValueError: zero-size array to reduction operation minimum which has no identity" only when I set cluster_selection_method = 'leaf' (for 'eom' everything is fine), and only once there are no clusters any more. It runs perfectly if N_clusters > 0, but then it crashes as I, e.g., keep increasing min_samples until there are no clusters left.
I believe that last commit should resolve the issue. Thanks for the report.
Yes, it works for my case now. Thanks for fixing the issue so fast.
Hey, I'm getting this error too. It's caused by trying to compute .min() on an empty numpy array here:
cpdef list get_cluster_tree_leaves(np.ndarray cluster_tree):
if cluster_tree.shape[0] == 0:
return []
root = cluster_tree['parent'].min()
return recurse_leaf_dfs(cluster_tree, root)
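For context, this is how the ValueError from the thread arises; a standalone numpy sketch (the structured dtype here is made up to resemble an empty cluster tree):

```python
import numpy as np

# an empty structured array, similar in shape to an empty cluster tree
empty_tree = np.zeros(0, dtype=[('parent', np.intp), ('child', np.intp)])

try:
    empty_tree['parent'].min()  # reduction over zero elements
except ValueError as exc:
    # e.g. "zero-size array to reduction operation minimum which has no identity"
    print(exc)
```

This is exactly the case the shape[0] guard above is supposed to short-circuit before .min() is ever reached.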
That is disconcerting. I believe the guard, just above, was supposed to cover just such a case with an early return. Can you share the data by any chance? I'm a little disconcerted because it seems like this shouldn't be possible according to what the code says -- somehow an empty array must have a non-zero shape.
Hey,
Unfortunately I cannot share data. I can however say that I'm using a precomputed metric, and so I'm passing a distance metric to the clusterer. Not sure if that can make a difference.
Thanks for the quick reply.
Actually, you're right: it's not breaking on that line, but on the if-statement before it. I'm going to debug further and I'll keep you updated.
Thanks. It is possible this got fixed in master but an update didn't get rolled out to PyPI and conda-forge. Keep me posted, and thanks for taking the time to dig into this, it is greatly appreciated (especially when issues are hard to reproduce).
Didn't have much time to investigate further, but got it to work by substituting if cluster_tree.shape[0] == 0 with if len(cluster_tree) == 0. This suggests that cluster_tree is arriving as something other than a numpy array. When I have time I'll dig a bit deeper.
So it would seem. The fix may be to go with len then, since that is valid for numpy arrays as well as whatever is actually arriving there for you.
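A quick sketch of why len() is the more forgiving check (assuming cluster_tree is normally a structured numpy array, but may be arriving as a plain Python sequence):

```python
import numpy as np

# for a genuine numpy array the two checks agree
tree = np.zeros(0, dtype=[('parent', np.intp), ('child', np.intp)])
assert tree.shape[0] == 0
assert len(tree) == 0

# but len() also covers a plain Python list, which has no .shape at all,
# so shape[0] would raise AttributeError where len() simply returns 0
assert len([]) == 0
assert not hasattr([], 'shape')
```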
Hi, I think I might have accidentally tracked down an error related to #115 and #144, please see below:
Using the current master branch on Win 10 64-bit and Python 2.7.14
Whole traceback ("anonymised"):
I have tracked down the problem to lines 42-45 of plots.py: the cluster_tree created there is empty, so line 44 throws an error. I am not sure if there is any solution to this except maybe plotting the single_linkage_tree_?