[Open] robroc opened this issue 4 years ago
I'm running into a similar issue. I have a 574314 x 2 matrix and am running into memory issues with the following error:

```
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
```

I tried increasing `leaf_size`, but I still run into memory issues. Any idea how I can get around this?
You may have issues if your `min_cluster_size` is large and your `min_samples` is not set. You could try setting `min_samples` to something smallish and see if that helps.
You're right, setting `min_samples` to a small value got things working! Thanks for your help!
Hello, I encountered this same error on a much smaller ~37K x 2 matrix. I set `min_cluster_size` to 25 and tried values of `min_samples` all the way down to 3. The fitting function completes when the dataset is pared down to the first 16384 rows, but fails on anything larger, regardless of the parameter values. To rule out malformed data, I tried running on a 4K matrix that included the 16385th row and its neighbors, and it worked.
Note that when the exception occurs, it happens very quickly, almost as if a pre-check is being performed before clustering is attempted.
Is there anything else I can try besides adjusting min_samples to get this dataset to go through?
I'm running into a similar issue, but the problem only occurs with the haversine metric. Is the full distance matrix calculated when using this metric, or something like that?
Haversine should be using ball trees, and should not require the full distance matrix. However, depending on the data, it is possible that the ball tree search can still end up using a fair amount of memory. Even so, it should fit in memory for dataset sizes like the ones you describe. Ultimately this happens in the ball tree creation/search phase, which is all in scikit-learn. You could attempt ball tree construction and searches with your dataset directly and verify whether they work fine.
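The isolation test suggested above can be done in a few lines, entirely outside of hdbscan. The dataset sizes and `leaf_size` below are illustrative, not taken from anyone's actual data:

```python
# Sanity check: build a BallTree with the haversine metric and run a k-NN
# query directly in scikit-learn. If this fails, the memory problem is
# upstream of hdbscan. Sizes here are illustrative only.
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
lat = rng.uniform(-90, 90, size=50_000)
lon = rng.uniform(-180, 180, size=50_000)
rads = np.radians(np.column_stack([lat, lon]))  # haversine expects radians

tree = BallTree(rads, metric='haversine', leaf_size=100)
dist, ind = tree.query(rads[:10], k=5)
print(dist.shape)  # (10, 5)
```

If construction and querying succeed here but hdbscan still dies, the problem is more likely in the Boruvka MST phase than in the ball tree itself.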
I was trying to cluster a dataset of about 460,000 points and ran into a memory error on the `.fit` call. This is the pertinent code:

```python
eps = .05 / 6378
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=120,
    min_samples=100,
    cluster_selection_epsilon=eps,
    metric='haversine',
)
rads = np.radians(df[['latitude', 'longitude']].values)
clusterer.fit(rads)
```
This is the error trace:

```
<ipython-input-6-363d24b7c8bb> in make_clusters(data)
---> 15     clusterer.fit(rads)
     16     data['clusters'] = clusterer.labels_
     17     data['membership_strength'] = clusterer.probabilities_

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\hdbscan\hdbscan_.py in fit(self, X, y)
    913          self._condensed_tree,
    914          self._single_linkage_tree,
--> 915          self._min_spanning_tree) = hdbscan(X, **kwargs)
    916
    917         if self.prediction_data:

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\hdbscan\hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    624         approx_min_span_tree,
    625         gen_min_span_tree,
--> 626         core_dist_n_jobs, **kwargs)
    627
    628     return _tree_to_labels(X,

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    353
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356
    357     def call_and_shelve(self, *args, **kwargs):

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\hdbscan\hdbscan_.py in _hdbscan_boruvka_balltree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    308         leaf_size=leaf_size // 3,
    309         approx_min_span_tree=approx_min_span_tree,
--> 310         n_jobs=core_dist_n_jobs, **kwargs)
    311     min_spanning_tree = alg.spanning_tree()
    312     # Sort edges of the min_spanning_tree by weight

hdbscan\_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm.__init__()

hdbscan\dist_metrics.pyx in hdbscan.dist_metrics.DistanceMetric.pairwise()

MemoryError: Unable to allocate 32.0 GiB for an array with shape (65535, 65535) and data type float64
```
@lmcinnes suggested setting the `leaf_size` in the `HDBSCAN` constructor to 100, and that solved the issue. For reference, this was the working instance:

```python
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=120,
    min_samples=100,
    cluster_selection_epsilon=eps,
    metric='haversine',
    leaf_size=100,
)
```
I have the same problem. How did you solve it?
This may not be a satisfying answer, but FWIW, I used a two-step approach. This entailed (1) using a more scalable algorithm (dbscan, IIRC) to cluster the full sample into a large number of weighted clusters, then (2) clustering the centroids of those clusters using hdbscan. It was time-consuming and not ideal, but it did get the job done.
Also, have you tried decreasing min_samples? lmcinnes suggested this earlier in the thread. It didn't work for me, but appears to have worked for harpalsahota.