scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

joblib error for (relatively) big min_cluster_size parameter values #372

Open Garrus990 opened 4 years ago

Garrus990 commented 4 years ago

Hi guys,

thank you for the implementation of the algorithm - it works incredibly well for the most part. Only recently have I encountered an error that, judging by a quick Google search, has not been addressed anywhere. I am trying to cluster a dataset of shape (560823, 2). I have successfully clustered 10x larger datasets, but this time I have a particular problem. My dataset contains a central, dense mass of points that I would like to 'extract' from the rest, which are rather loose observations. HDBSCAN and other density-based methods seem perfect for that. In order to extract this mass and disregard all the rest, I am setting the parameter min_cluster_size to 10000 (I have also tried 5000 and 2500). In all cases the procedure halts and throws an error with the following stack:

---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

error                                     Traceback (most recent call last)
<ipython-input-31-5141c7286f77> in <module>
      3 hdbscan_obj = HDBSCAN(min_cluster_size=2500, )
      4 # hdbscan_obj.fit(SWISSPROT_PCA_TABLE)
----> 5 hdbscan_obj.fit(SWISSPROT_UMAP_TABLE[['umap_0', 'umap_1']].sample(300000))
      6 labels = hdbscan_obj.labels_
      7 pkl.dump(labels, open('HDBSCAN_labels.pkl', 'wb'))

/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
    917          self._condensed_tree,
    918          self._single_linkage_tree,
--> 919          self._min_spanning_tree) = hdbscan(X, **kwargs)
    920 
    921         if self.prediction_data:

/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
    613                                              approx_min_span_tree,
    614                                              gen_min_span_tree,
--> 615                                              core_dist_n_jobs, **kwargs)
    616         else:  # Metric is a valid BallTree metric
    617             # TO DO: Need heuristic to decide when to go to boruvka;

/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353 
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356 
    357     def call_and_shelve(self, *args, **kwargs):

/opt/conda/lib/python3.7/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
    276                                  leaf_size=leaf_size // 3,
    277                                  approx_min_span_tree=approx_min_span_tree,
--> 278                                  n_jobs=core_dist_n_jobs, **kwargs)
    279     min_spanning_tree = alg.spanning_tree()
    280     # Sort edges of the min_spanning_tree by weight

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()

hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
   1015 
   1016             with self._backend.retrieval_context():
-> 1017                 self.retrieve()
   1018             # Make sure that we get a last message telling us we are done
   1019             elapsed_time = time.time() - self._start_time

/opt/conda/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    907             try:
    908                 if getattr(self._backend, 'supports_timeout', False):
--> 909                     self._output.extend(job.get(timeout=self.timeout))
    910                 else:
    911                     self._output.extend(job.get())

/opt/conda/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    560         AsyncResults.get from multiprocessing."""
    561         try:
--> 562             return future.result(timeout=timeout)
    563         except LokyTimeoutError:
    564             raise TimeoutError()

/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

error: 'i' format requires -2147483648 <= number <= 2147483647

The error is, as you can see, quite cryptic :) The code I am using:

hdbscan_obj = HDBSCAN(min_cluster_size=2500)
hdbscan_obj.fit(MY_TABLE[['var_0', 'var_1']])

I installed hdbscan version 0.8.26 via conda and I am running it on Python 3.7.7.
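
For what it's worth, the struct.error at the bottom of the stack comes from multiprocessing framing each inter-process message with a signed 32-bit length header (the struct.pack("!i", n) call visible in the trace), so any single result payload of 2 GiB or more cannot be sent back from a loky worker. A minimal sketch of the limit itself:

import struct

# multiprocessing frames each pipe message with a signed 32-bit length
# header, so any single payload of 2 GiB or more overflows it
struct.pack("!i", 2**31 - 1)  # ok: largest representable message size
try:
    struct.pack("!i", 2**31)  # one byte past the limit
except struct.error as e:
    print(e)  # 'i' format requires -2147483648 <= number <= 2147483647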

Further info: as I suspect that some internal object of the algorithm grows too big, I played a little with the parameters. It turns out that for my dataset the errors start popping up somewhere between 200'000 and 300'000 observations for a min_cluster_size of 2500, and somewhere between 50'000 and 100'000 observations for a min_cluster_size of 10000, so these two parameters are interconnected.
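
Since the trace shows the failure inside the parallel core-distance step (the joblib Parallel call reached from KDTreeBoruvkaAlgorithm._compute_bounds), one workaround worth trying, untested here, is forcing that step to run in-process so no result ever has to cross the loky pipe:

import hdbscan

# Untested workaround sketch: core_dist_n_jobs=1 makes joblib run the
# core-distance computation sequentially in the parent process, so no
# worker payload ever hits the 2 GiB pipe limit. Slower, but it should
# sidestep the struct.error. MY_TABLE stands in for the caller's data.
hdbscan_obj = hdbscan.HDBSCAN(min_cluster_size=2500, core_dist_n_jobs=1)
hdbscan_obj.fit(MY_TABLE[['var_0', 'var_1']])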

Cheers!

eparsonnet93 commented 4 years ago

I am having an identical issue. Any thoughts on this?

sa2329 commented 2 months ago

The parameter min_samples is used when computing the linkage tree, and it defaults to the value of min_cluster_size. When setting min_cluster_size to a large value, min_samples should be set to something smaller to reduce memory usage; a sketch follows below.

(Old issue, but still open so adding this in case anyone else has similar trouble)
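
A minimal sketch of that suggestion, assuming the same kind of 2-column data as above (the DataFrame name and parameter values are placeholders):

import hdbscan

# min_samples defaults to min_cluster_size, so a large min_cluster_size
# silently inflates the core-distance computation as well; pinning
# min_samples to a small value keeps that step cheap while still
# requiring clusters of at least 10000 points.
clusterer = hdbscan.HDBSCAN(min_cluster_size=10000, min_samples=10)
clusterer.fit(MY_TABLE[['var_0', 'var_1']])
labels = clusterer.labels_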