scikit-learn-contrib / hdbscan

A high performance implementation of HDBSCAN clustering.
http://hdbscan.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

boruvka joblib error #22

Open eyaler opened 8 years ago

eyaler commented 8 years ago

hdbscan 0.6.5, sklearn 0.17.0. Calling HDBSCAN.fit() with algorithm=boruvka_kdtree or algorithm=boruvka_balltree, I sometimes get the following error. It works fine with algorithm=prims_kdtree or prims_balltree.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\python2764\Lib\multiprocessing\forking.py", line 380, in main
    prepare(preparation_data)
  File "c:\python2764\Lib\multiprocessing\forking.py", line 495, in prepare
    '__parents_main__', file, path_name, etc
  ...( references to my code calling HDBSCAN.fit() )...
  File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 531, in fit
    self._min_spanning_tree) = hdbscan(X, **self.get_params())
  File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 363, in hdbscan
    gen_min_span_tree)
  File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan_.py", line 163, in _hdbscan_boruvka_kdtree
    alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
  File "hdbscan/_hdbscan_boruvka.pyx", line 335, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan/_hdbscan_boruvka.c:4746)
  File "hdbscan/_hdbscan_boruvka.pyx", line 364, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan/_hdbscan_boruvka.c:5401)
  File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 771, in __call__
    n_jobs = self._initialize_pool()
  File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 518, in _initialize_pool
    raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

lmcinnes commented 8 years ago

I believe this is a Windows-related error, and unfortunately my access to Windows systems to test and debug on is very limited. I'll try to get to this when I can, but I can't promise a swift resolution in this case.

lmcinnes commented 8 years ago

It seems this is a known issue with joblib on Windows ... you'll see similar problems with the scikit-learn RandomForest if you specify n_jobs. You need to wrap any part of your code that isn't function definitions or imports in an

if __name__ == "__main__":

to have it work properly. See http://pythonhosted.org/joblib/parallel.html for details, or https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg08093.html for examples of other ways this error can crop up.
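
For concreteness, a minimal sketch of that pattern (the data and parameters here are illustrative, not taken from the original report):

import numpy as np
import hdbscan

def run_clustering(data):
    # the Boruvka algorithms are where the joblib parallelism kicks in
    clusterer = hdbscan.HDBSCAN(algorithm='boruvka_kdtree')
    return clusterer.fit_predict(data)

if __name__ == "__main__":
    # everything that is not an import or a definition goes under the guard
    data = np.random.rand(1000, 10)  # placeholder data
    labels = run_clustering(data)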

eyaler commented 8 years ago

Indeed this solves the issue. However, maybe you could allow the user to run with the equivalent of n_jobs=1?

lmcinnes commented 8 years ago

Sorry for the long delay. Getting started on this now. In master you can set core_dist_n_jobs=1 to achieve this. This should appear in the next release.
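
A sketch of the intended usage (the data here is illustrative; core_dist_n_jobs as described above):

import numpy as np
import hdbscan

data = np.random.rand(500, 10)  # illustrative data
# core_dist_n_jobs=1 keeps the core-distance computation in a single
# process, so joblib never has to spawn workers
clusterer = hdbscan.HDBSCAN(algorithm='boruvka_kdtree', core_dist_n_jobs=1)
labels = clusterer.fit_predict(data)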

essandess commented 6 years ago

I see this issue on macOS in a Jupyter notebook working with scikit-learn and multicore processing. This MWE tickles the issue:

import numpy as np, numpy.random as npr
from sklearn import cluster

# 100 samples x 10 features of Poisson-distributed counts
data = npr.poisson(1, (100, 10))
algorithm = cluster.KMeans
algorithm_kwargs = dict(n_clusters=4, n_jobs=-1)  # n_jobs=-1: use all cores via joblib
estimator = algorithm(**algorithm_kwargs)
labels = estimator.fit_predict(data)

The code runs as expected when run as a Python script.

However, under a Jupyter notebook, it throws this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-64-5a0071cf17dc> in <module>()
      3 algorithm_kwargs = dict(n_clusters=4,n_jobs=-1)
      4 estimator = algorithm(**algorithm_kwargs)
----> 5 labels = estimator.fit_predict(data)

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y)
    915             Index of the cluster each sample belongs to.
    916         """
--> 917         return self.fit(X).labels_
    918 
    919     def fit_transform(self, X, y=None):

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
    894                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
    895                 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 896                 return_n_iter=True)
    897         return self
    898 

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
    361                                    # Change seed to ensure variety
    362                                    random_state=seed)
--> 363             for seed in seeds)
    364         # Get results with the lowest inertia
    365         labels, inertia, centers, n_iters = zip(*results)

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    747         self._aborting = False
    748         if not self._managed_backend:
--> 749             n_jobs = self._initialize_backend()
    750         else:
    751             n_jobs = self._effective_n_jobs()

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _initialize_backend(self)
    545         try:
    546             n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
--> 547                                              **self._backend_args)
    548             if self.timeout is not None and not self._backend.supports_timeout:
    549                 warnings.warn(

/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in configure(self, n_jobs, parallel, **backend_args)
    303         if already_forked:
    304             raise ImportError(
--> 305                 '[joblib] Attempting to do parallel computing '
    306                 'without protecting your import on a system that does '
    307                 'not support forking. To use parallel-computing in a '

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

I'm not sure if this is an issue with joblib or some downstream issue with Jupyter, but it is a problem.
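
A sketch of the serial fallback discussed earlier in this thread, assuming n_jobs=1 keeps joblib sequential so that no worker pool is ever forked:

import numpy as np, numpy.random as npr
from sklearn import cluster

data = npr.poisson(1, (100, 10))
# n_jobs=1 runs the k-means restarts sequentially instead of via joblib workers
estimator = cluster.KMeans(n_clusters=4, n_jobs=1)
labels = estimator.fit_predict(data)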

The very same issue arises in HDBSCAN, e.g. the following alternate code tickles the issue:

import hdbscan

algorithm = hdbscan.HDBSCAN
algorithm_kwargs = dict(min_cluster_size=10, allow_single_cluster=True)
estimator = algorithm(**algorithm_kwargs)
labels = estimator.fit_predict(data)

lmcinnes commented 6 years ago

I think this is an upstream issue with joblib, but thanks for reporting. I'll try to look into it when I can get some time, make sure that it is indeed upstream, and check whether there is a way to work around the issue internally in hdbscan.

essandess commented 6 years ago

Thanks. Or possibly upstream with Jupyter—I didn’t have this issue until recently, after upgrading my stack. Whether it’s Jupyter or joblib, joblib’s error message is borked.

simonhughes22 commented 6 years ago

Why is this closed? It is still an issue with the latest joblib (0.12) when running in Jupyter on a Mac. It works fine for a while, then randomly starts failing, forcing me to restart the whole kernel and lose a lot of work.

lmcinnes commented 6 years ago

The original issue (relating to Windows) was closed, and this was never re-opened. I haven't been able to reproduce it myself, and as far as I can tell it is a joblib issue that I can't do much about (I don't pretend to know or understand joblib well enough to suggest a fix).

simonhughes22 commented 6 years ago

Ah yes, I need to post this on the joblib repo, assuming they have one.

simonhughes22 commented 6 years ago

Posted here: https://github.com/joblib/joblib/issues/709

lmcinnes commented 6 years ago

Thanks!
