scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Memory error when clustering on a large dataset (~500,000 points) #345

Open robroc opened 4 years ago

robroc commented 4 years ago

I was trying to cluster on a dataset with about 460,000 points and ran into a memory error when .fit was called.

This was the pertinent code:

import hdbscan
import numpy as np

eps = .05 / 6378

clusterer = hdbscan.HDBSCAN(min_cluster_size = 120,
                            min_samples = 100,
                            cluster_selection_epsilon = eps,
                            metric = 'haversine')

# df is a pandas DataFrame with 'latitude' and 'longitude' columns in degrees
rads = np.radians(df[['latitude', 'longitude']].values)
clusterer.fit(rads)

This was the error trace:

<ipython-input-6-363d24b7c8bb> in make_clusters(data)
---> 15     clusterer.fit(rads)
     16     data['clusters'] = clusterer.labels_
     17     data['membership_strength'] = clusterer.probabilities_

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\hdbscan\hdbscan_.py in fit(self, X, y)
    913          self._condensed_tree,
    914          self._single_linkage_tree,
--> 915          self._min_spanning_tree) = hdbscan(X, **kwargs)
    916
    917         if self.prediction_data:

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\hdbscan\hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    624                                                approx_min_span_tree,
    625                                                gen_min_span_tree,
--> 626                                                core_dist_n_jobs, **kwargs)
    627
    628     return _tree_to_labels(X,

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    353
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356
    357     def call_and_shelve(self, *args, **kwargs):

~\AppData\Local\Continuum\miniconda3\envs\geo\lib\site-packages\hdbscan\hdbscan_.py in _hdbscan_boruvka_balltree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    308                                    leaf_size=leaf_size // 3,
    309                                    approx_min_span_tree=approx_min_span_tree,
--> 310                                    n_jobs=core_dist_n_jobs, **kwargs)
    311     min_spanning_tree = alg.spanning_tree()
    312     # Sort edges of the min_spanning_tree by weight

hdbscan\_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm.__init__()

hdbscan\dist_metrics.pyx in hdbscan.dist_metrics.DistanceMetric.pairwise()

MemoryError: Unable to allocate 32.0 GiB for an array with shape (65535, 65535) and data type float64
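
For scale, that is exactly a dense 65535 x 65535 float64 pairwise distance block, using only the numbers from the error message:

print(65535 * 65535 * 8 / 2**30)   # 8 bytes per float64 entry -> ~32.0 GiB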

@lmcinnes suggested setting the leaf_size in the HDBSCAN constructor to 100 and that solved the issue.

For reference, this was the working instance:

clusterer = hdbscan.HDBSCAN(min_cluster_size = 120,
                            min_samples = 100,
                            cluster_selection_epsilon = eps,
                            metric = 'haversine',
                            leaf_size = 100)

harpalsahota commented 4 years ago

I'm running into a similar issue. I have a 574314 x 2 matrix and am hitting memory problems with the following error:

joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}

I tried to increase the leaf_size but still run into memory issues. Any idea how I can get around this?

lmcinnes commented 4 years ago

You may have issues if your min_cluster_size is large and your min_samples is not set. You could try setting min_samples to something smallish and see if that helps.
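
For example (a minimal sketch; the parameter values here are illustrative, not taken from the thread):

import hdbscan

# Keep min_cluster_size large, but set min_samples explicitly to a small value
# instead of letting it default to min_cluster_size.
clusterer = hdbscan.HDBSCAN(min_cluster_size = 120,
                            min_samples = 10,
                            metric = 'haversine')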

harpalsahota commented 4 years ago

You're right: setting min_samples to a small value got things working! Thanks for your help!

ddifranco commented 3 years ago

> I'm running into a similar issue. I have a 574314 x 2 matrix and am hitting memory problems with the following error:
>
> joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
>
> I tried to increase the leaf_size but still run into memory issues. Any idea how I can get around this?

Hello, I encountered this same error on a much smaller ~37K x 2 matrix. I set min_cluster_size to 25 and tried values of min_samples all the way down to 3. The fitting function completes when the dataset is pared down to the first 16384 rows, but fails on anything larger, regardless of the parameter values. To rule out malformed data, I tried running on a 4K matrix that included the 16385th row and its neighbors, and it worked.

Note that when the exception occurs, it happens very quickly, almost as if a pre-check is being performed before clustering is attempted.

Is there anything else I can try besides adjusting min_samples to get this dataset to go through?

gperaza commented 3 years ago

I'm running into a similar issue, but the problem only occurs with the haversine metric. Is the full distance matrix calculated when using this metric, or something like that?

lmcinnes commented 3 years ago

Haversine should use ball trees and not require the full distance matrix. However, depending on the data, the ball tree search can still end up using a fair bit of memory; even so, it should fit in memory for dataset sizes like the ones mentioned here. Ultimately this happens in the ball tree creation / search phase, which is all in scikit-learn, so you could try building and querying a ball tree directly on the dataset and verify that that part works on its own.
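
For example, something along these lines (a minimal sketch; df and the query sizes below are placeholders, assuming the radian-converted coordinates from the original post):

import numpy as np
from sklearn.neighbors import BallTree

# df is assumed to be a DataFrame with 'latitude' and 'longitude' in degrees.
rads = np.radians(df[['latitude', 'longitude']].values)

# Build the ball tree with the haversine metric, roughly as hdbscan does internally.
tree = BallTree(rads, metric = 'haversine', leaf_size = 100)

# A k-nearest-neighbour query in the spirit of the core-distance computation;
# querying only a slice keeps the check cheap.
dist, ind = tree.query(rads[:10000], k = 100)
print(dist.shape, ind.shape)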

yyHMA commented 1 year ago

> I was trying to cluster on a dataset with about 460,000 points and ran into a memory error when .fit was called.
>
> [...]
>
> MemoryError: Unable to allocate 32.0 GiB for an array with shape (65535, 65535) and data type float64
>
> @lmcinnes suggested setting the leaf_size in the HDBSCAN constructor to 100 and that solved the issue.

I have the same problem. How did you solve it?

ddifranco commented 1 year ago

This may not be a satisfying answer, but FWIW, I used a two-step approach: (1) use a more scalable algorithm (DBSCAN, IIRC) to cluster the full sample into a large number of weighted clusters, then (2) cluster the centroids of those clusters with hdbscan. It was time-consuming and not ideal, but it got the job done.
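
In rough outline, the approach was something like the sketch below (the DBSCAN parameters, the haversine metric, and the unweighted centroid step are placeholder assumptions, not the original code; the cluster weighting is omitted):

import numpy as np
import hdbscan
from sklearn.cluster import DBSCAN

# rads: the radian-converted coordinates, as earlier in the thread.
rads = np.radians(df[['latitude', 'longitude']].values)

# Step 1: coarse clustering with a more scalable algorithm (DBSCAN here),
# producing a large number of small clusters. eps / min_samples are placeholders.
coarse = DBSCAN(eps = 0.02 / 6378, min_samples = 5, metric = 'haversine').fit(rads)
labels = coarse.labels_

# Step 2: cluster the centroids of the coarse clusters with HDBSCAN.
# The arithmetic mean of lat/lon radians is only a rough centroid, which is
# acceptable for small, tight clusters.
centroids = np.array([rads[labels == k].mean(axis = 0)
                      for k in np.unique(labels) if k != -1])
clusterer = hdbscan.HDBSCAN(min_cluster_size = 5, metric = 'haversine')
clusterer.fit(centroids)
print(np.unique(clusterer.labels_))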

Also, have you tried decreasing min_samples? lmcinnes suggested this earlier in the thread. It didn't work for me, but appears to have worked for harpalsahota.