eyaler opened this issue 8 years ago
I believe this is a Windows related error, and unfortunately my access to Windows systems to test and debug on is very limited. I'll try to get to this when I can, but I can't promise a swift resolution in this case.
It seems this is a known issue with joblib on Windows ... you'll see similar problems with the scikit-learn RandomForest if you specify n_jobs. You need to wrap any part of your code that isn't function definitions or imports in an
if __name__ == "__main__":
block to have it work properly. See http://pythonhosted.org/joblib/parallel.html for details, or https://www.mail-archive.com/scikit-learn-general@lists.sourceforge.net/msg08093.html for examples of other ways this error can crop up.
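For anyone hitting this, a minimal sketch of the guard. It uses the stdlib multiprocessing module rather than joblib directly, but joblib imposes the identical requirement on platforms that spawn rather than fork:

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # On spawn-based platforms (e.g. Windows), child processes re-import
    # this module; the guard keeps them from re-executing the parallel
    # call recursively on import.
    with Pool(2) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```

Only the function definition and imports live at module level; everything that actually launches workers sits under the guard.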
Indeed, this solves the issue. However, maybe you could allow the user to run with the equivalent of n_jobs=1?
Sorry for the long delay. Getting started on this now. In master you can set core_dist_n_jobs=1 to achieve this. This should appear in the next release.
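For reference, a sketch of the serial call. The kwargs values here are illustrative; core_dist_n_jobs is the actual parameter named above:

```python
# core_dist_n_jobs=1 keeps the core-distance computation serial,
# sidestepping joblib's multiprocessing machinery entirely.
hdbscan_kwargs = dict(min_cluster_size=10, core_dist_n_jobs=1)

# Requires hdbscan from master / the next release:
# import hdbscan
# clusterer = hdbscan.HDBSCAN(**hdbscan_kwargs)
# labels = clusterer.fit_predict(data)
```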
I see this issue on macOS in a Jupyter notebook working with scikit-learn and multicore processing. This MWE tickles the issue:
import numpy as np
import numpy.random as npr
from sklearn import cluster

data = npr.poisson(1, (100, 10))  # 100 samples, 10 features
algorithm = cluster.KMeans
algorithm_kwargs = dict(n_clusters=4, n_jobs=-1)  # -1: use all cores
estimator = algorithm(**algorithm_kwargs)
labels = estimator.fit_predict(data)
The code runs as expected when run as a Python script.
However, under a Jupyter notebook, it throws this error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-64-5a0071cf17dc> in <module>()
3 algorithm_kwargs = dict(n_clusters=4,n_jobs=-1)
4 estimator = algorithm(**algorithm_kwargs)
----> 5 labels = estimator.fit_predict(data)
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit_predict(self, X, y)
915 Index of the cluster each sample belongs to.
916 """
--> 917 return self.fit(X).labels_
918
919 def fit_transform(self, X, y=None):
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in fit(self, X, y)
894 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
895 n_jobs=self.n_jobs, algorithm=self.algorithm,
--> 896 return_n_iter=True)
897 return self
898
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/cluster/k_means_.py in k_means(X, n_clusters, init, precompute_distances, n_init, max_iter, verbose, tol, random_state, copy_x, n_jobs, algorithm, return_n_iter)
361 # Change seed to ensure variety
362 random_state=seed)
--> 363 for seed in seeds)
364 # Get results with the lowest inertia
365 labels, inertia, centers, n_iters = zip(*results)
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
747 self._aborting = False
748 if not self._managed_backend:
--> 749 n_jobs = self._initialize_backend()
750 else:
751 n_jobs = self._effective_n_jobs()
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _initialize_backend(self)
545 try:
546 n_jobs = self._backend.configure(n_jobs=self.n_jobs, parallel=self,
--> 547 **self._backend_args)
548 if self.timeout is not None and not self._backend.supports_timeout:
549 warnings.warn(
/opt/local/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in configure(self, n_jobs, parallel, **backend_args)
303 if already_forked:
304 raise ImportError(
--> 305 '[joblib] Attempting to do parallel computing '
306 'without protecting your import on a system that does '
307 'not support forking. To use parallel-computing in a '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information
I'm not sure if this is an issue with joblib or some downstream issue with Jupyter, but it is a problem.
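One plausible culprit (an assumption on my part, not verified): joblib's multiprocessing backend behaves differently depending on the start method the interpreter uses, and a notebook kernel can end up on a spawn-style method, where child processes re-import the parent module instead of inheriting its state. You can check what your environment reports:

```python
import multiprocessing as mp

# Typically "fork" on Linux; "spawn" is the problematic case for code
# that isn't protected by an if __name__ == "__main__" guard.
print(mp.get_start_method())
```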
The very same issue arises in HDBSCAN, e.g. the following alternate code tickles the issue:
import hdbscan
algorithm = hdbscan.HDBSCAN
algorithm_kwargs = dict(min_cluster_size=10,allow_single_cluster=True)
I think this is an upstream issue with joblib, but thanks for reporting. I'll try to look into it when I can get some time and ensure that it is upstream, and check if there is a way to workaround the issue internally in hdbscan.
Thanks. Or possibly upstream with Jupyter; I didn't have this issue until recently, after upgrading my stack. Whether it's Jupyter or joblib, joblib's error message is borked.
Why is this closed? It is still an issue with the latest joblib (0.12) when running in Jupyter on a Mac. It works fine for a while, then randomly starts failing, forcing me to restart the whole kernel and lose a lot of work.
The original issue (relating to windows) was closed, and this was never re-opened. I haven't been able to reproduce it myself, and as far as I can tell it is a joblib issue that I can't do much about (I don't pretend to know or understand joblib well enough to suggest a fix).
Ah yes I need to post this on the joblib repo, assuming they have one
Posted here: https://github.com/joblib/joblib/issues/709
Thanks!
hdbscan 0.6.5, sklearn 0.17.0. Calling HDBSCAN.fit() with algorithm=boruvka_kdtree or algorithm=boruvka_balltree, I sometimes get the following error. It works fine with algorithm=prims_kdtree or algorithm=prims_balltree.
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\python2764\Lib\multiprocessing\forking.py", line 380, in main
prepare(preparation_data)
File "c:\python2764\Lib\multiprocessing\forking.py", line 495, in prepare
'__parents_main__', file, pathname, etc
...( references to my code calling HDBSCAN.fit() )...
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan.py", line 531, in fit
self._min_spanning_tree) = hdbscan(X, **self.get_params())
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan.py", line 363, in hdbscan
gen_min_span_tree)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
return self.func(*args, **kwargs)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\hdbscan\hdbscan.py", line 163, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric, leaf_size=leaf_size // 3)
File "hdbscan/_hdbscan_boruvka.pyx", line 335, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__ (hdbscan/_hdbscan_boruvka.c:4746)
File "hdbscan/_hdbscan_boruvka.pyx", line 364, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds (hdbscan/_hdbscan_boruvka.c:5401)
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 771, in __call__
n_jobs = self._initialize_pool()
File "C:\Users\eyalg\virtualenv\future64\lib\site-packages\sklearn\externals\joblib\parallel.py", line 518, in _initialize_pool
raise ImportError('[joblib] Attempting to do parallel computing '
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information