Open · Garrus990 opened this issue 4 years ago
Hi guys,

thank you for the implementation of the algorithm - it works incredibly well for the most part. But recently I encountered an error that, according to a quick Google search, has not been addressed anywhere. I am trying to cluster a dataset of size (560823, 2). I have successfully clustered 10x larger datasets, but this time I have a particular problem. In my dataset there is a central, dense mass of points that I would like to 'extract' from the rest, which are rather loose observations. HDBSCAN and other density-based methods seem perfect for that. To extract this mass and disregard all the rest, I am setting the parameter `min_cluster_size` to 10000 (I have also tried 5000 and 2500). In all cases the procedure halts and throws an error with the following stack trace:

The error is, as you can see, quite cryptic :)
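The code I am using boils down to this (a minimal sketch; the file name and loading step are placeholders, the rest is as described above):

```python
import numpy as np
import hdbscan

# Placeholder: load the (560823, 2) array of points described above.
X = np.load("points.npy")

# min_cluster_size raised to pull out the one big dense mass;
# everything else is left at its default.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10000)
labels = clusterer.fit_predict(X)
```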
I installed `hdbscan` ver. 0.8.26 via conda and I am running it on Python 3.7.7.

Further info: as I suspect that some internal object of the algorithm grows too big, I played a little with the parameters. It turns out that, for my dataset, the error starts popping up somewhere between 200'000 and 300'000 observations for a `min_cluster_size` of 2500, and somewhere between 50'000 and 100'000 for a `min_cluster_size` of 10000, so these two parameters are interconnected.

Cheers!
I am having an identical issue. Any thoughts on this?

The parameter `min_samples` is used for computing the linkage tree, and defaults to the value of `min_cluster_size`. When setting `min_cluster_size` to a large value, `min_samples` should be set to something smaller to reduce memory usage.

(Old issue, but still open, so adding this in case anyone else has similar trouble.)
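For example, something along these lines (a sketch; 10 is just an illustrative value for `min_samples`, not a recommendation, and the data is a random stand-in):

```python
import numpy as np
import hdbscan

# Hypothetical stand-in for a large 2-D dataset like the one above.
X = np.random.RandomState(0).standard_normal((560823, 2))

# Left unset, min_samples would default to min_cluster_size (here 10000),
# which makes the core-distance / linkage computation very memory-hungry.
# Keeping min_samples small decouples the memory cost from the cluster size.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10000, min_samples=10)
labels = clusterer.fit_predict(X)
```

`min_cluster_size` then still controls how big a group must be to be kept as a cluster, while `min_samples` alone controls how conservative the density estimate (and the memory footprint) is.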