vene closed this issue 10 years ago
Under Linux or Windows?
Mac OS X Lion (10.7), can anybody reproduce?
And cross_val_score with n_jobs=2 works in the same IPython session? This is really weird. Could you try to add print statements or pdb.set_trace()? (The latter might be complicated in a multiprocessing environment with piped stdin and stdout...)
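For what it's worth, a minimal way to see where a forked worker hangs (a hypothetical helper, not code from scikit-learn or this thread) is to write the pid and a message to stderr and flush before each suspect call:

```python
import os
import sys

def trace(msg):
    # Forked workers inherit stderr from the parent; flushing makes sure
    # the message gets out before a potential hang or deadlock.
    sys.stderr.write("[pid %d] %s\n" % (os.getpid(), msg))
    sys.stderr.flush()

trace("before np.dot")
```

Sprinkling such calls around the suspect code narrows the hang down to a single line even when pdb is unusable in the workers.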
Not in the same session because the process needs to be killed, but it works in an identical IPython session in the same virtualenv. I don't really know how to debug this, I'll try some prints to see where it hangs.
Vlad
After a heavy tracing session I (strangely) found the hanging line to be a call to np.dot (!) in safe_sparse_dot, called from euclidean_distances in k_init. The exact same code runs when n_jobs=1 (i.e. avoiding the special branch that bypasses the call to Parallel when n_jobs=1).
The interesting part (though I haven't checked this extensively) is that this happens in all 3 of my virtualenvs: stable numpy&scipy, git head numpy&scipy and scipy-superpack.
Any ideas?
Can you try to attach a gdb to the running python processes and get a backtrace?
$ sudo gdb python <pid of one of the hanging python processes>
(gdb) bt # get the backtrace for the C program (python + libatlas.so)
(gdb) py-bt # get the python interpreter stacktrace, just to check
I'm trying to run scikit-learn for the first time on my Mac and am also getting a hang on sklearn.cluster.tests.test_k_means.test_k_means_plus_plus_init_2_jobs.
Mac OS X Lion (10.7), Python 2.7, latest scikit-learn master from Github, virtualenv, numpy==1.6.1, scipy==0.10.1
I haven't tried gdb or anything yet. I'll be at the PyData Workshop tomorrow (Saturday) in case someone is there and wants to poke around on my computer.
I did a Git bisect to narrow it down:
1520ad3ebb337ae44d1f4398ce15826c871a098d is the first bad commit
commit 1520ad3ebb337ae44d1f4398ce15826c871a098d
Author: Robert Layton <robertlayton@gmail.com>
Date: Sun Jan 29 13:13:03 2012 +1100
Parallel version of k-means.
:040000 040000 5d86a2a99b9ef029b32a7bcc8810c7ddb318521d 238d5519d937c1fa340dba051f9e41f4ede3421b M doc
:040000 040000 d9b083cee7addc2c403cb2d95458c411c5482a48 260f9a8fcbd3ac7b6d03fb4785193015d7d9bb0c M sklearn
So far I have been unable to get GDB working with Python. This probably isn't useful, but here's what I get when I Ctrl-C it after it hangs:
^CProcess PoolWorker-15:
Process PoolWorker-16:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/pool.py", line 85, in worker
task = get()
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/queues.py", line 376, in get
racquire()
KeyboardInterrupt
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/pool.py", line 85, in worker
task = get()
File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/queues.py", line 378, in get
return recv()
KeyboardInterrupt
OK, so the git bisect tells us that this bug was introduced at the same time as the parallel k-means feature itself. The python stacktrace might tell us that there is a race condition in multiprocessing:
this might be related to http://bugs.python.org/issue9205, which has been fixed in Python 3 but maybe not backported to Python 2.7.
By the way @vene and @njwilson, which versions of python are you using?
Edit: I mean exactly, such as python --version
I'm using Python 2.7.2
Same here... I could never reproduce this race condition. So this might not be linked to an already backported python bugfix.
How many cores / CPUs do you have on your machine? I am running a dual core machine.
Same. Dual core.
@njwilson can you try to checkout and build this branch and see if you can still reproduce the issue?
https://github.com/ogrisel/scikit-learn/compare/no-kmeans-profile
Cython profiling seems to be messing around with python threads, and I have heard recently that python forking and python threads are not good friends: http://bugs.python.org/issue6721
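To illustrate why fork and threads are not good friends: fork() copies the state of every lock, but only the calling thread survives in the child, so a lock held at fork time may never be released there (the crux of issue 6721). A minimal POSIX-only sketch (illustrative, not from this thread) showing that the child inherits the "held" state of a lock:

```python
import os
import threading

def child_sees_lock_held():
    # The parent thread holds the lock when it forks; the child inherits a
    # copy of the lock in the "held" state, so a non-blocking acquire fails
    # there even though no thread in the child ever acquired it.
    lock = threading.Lock()
    lock.acquire()
    pid = os.fork()
    if pid == 0:
        still_held = not lock.acquire(False)
        os._exit(0 if still_held else 1)
    _, status = os.waitpid(pid, 0)
    lock.release()
    return os.WEXITSTATUS(status) == 0

print(child_sees_lock_held())  # prints: True
```

When the lock in question lives inside a C library (libdispatch, a BLAS thread pool, ...), the child simply deadlocks instead of failing fast like this toy example.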
I tried the no-kmeans-profile branch and the problem still exists.
@cournape and I tracked the bug down to a segfault in Grand Central Dispatch on the BLAS DGEMM function (only after a fork):
https://gist.github.com/2027412
I have just reported the issue to Apple. Hopefully it will be fixed at some point. In the mean time we will have to skip this test on OSX I think.
Wow, great work!
Let's hope they are responsive.
Now (on Mountain Lion) I don't get hanging but I get random segfaults in sklearn.cluster.tests.test_k_means.test_k_means_plus_plus_init_2_jobs.
This is the output: http://pastebin.com/aDnYUBkX
I assume this is the same bug @ogrisel and @cournape found, which Apple confirmed as expected behaviour (you can't call Accelerate in forked subprocesses).
Can I compile my own parallel, optimized blas and lapack then? Did anybody do this on Lion or Mountain Lion because I can't find much about this endeavour.
Accelerate can be called after a fork, but you need to execv the subprocess with the python binary (which multiprocessing does not do under POSIX). I think @cournape is working on a numpy build that uses OpenBLAS, which could serve as a good alternative to Accelerate (even though early benchmarks show a slight decrease in performance).
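For readers on Python 3.4+ (not available to this thread's Python 2.7 users): multiprocessing grew a "spawn" start method that launches each worker through a fresh exec of the interpreter, which is exactly the fork+exec pattern Accelerate tolerates. A sketch under that assumption:

```python
import multiprocessing as mp

import numpy as np

def gram_shape(a):
    # The BLAS call runs in a freshly exec'd interpreter, so no
    # Accelerate/GCD state is inherited across a bare fork.
    return np.dot(a, a.T).shape

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(gram_shape, [np.ones((10, 10))] * 2))
```

The trade-off is that "spawn" re-imports the main module in each worker, so startup is slower and the script must guard its entry point with `if __name__ == "__main__":`.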
So is there any way we can avoid these problems on OS X? Can we check the blas before trying to do something in parallel?
I think @cournape would like to make it possible to switch the blas / lapack implementation for numpy at runtime but that might be pretty tricky to implement and will obviously only be available in some undetermined future version of numpy if ever.
I don't know if we can detect whether the BLAS in use is linked to Accelerate and issue a warning in joblib prior to forking in that case.
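A best-effort check along those lines (hypothetical, not actual joblib or scikit-learn code; `blas_opt_info` only exists on older numpy builds, hence the getattr guard) could look like:

```python
import warnings

import numpy as np

def blas_may_be_accelerate():
    # numpy records its build-time BLAS flags; Accelerate/vecLib builds
    # typically mention the framework in the link or compile arguments.
    info = getattr(np.__config__, "blas_opt_info", {})
    flags = info.get("extra_link_args", []) + info.get("extra_compile_args", [])
    return any("Accelerate" in f or "vecLib" in f for f in flags)

if blas_may_be_accelerate():
    warnings.warn("numpy appears to be linked against Accelerate; "
                  "BLAS calls may hang or crash in forked subprocesses on OSX.")
```

This is only a heuristic: it inspects build-time metadata, not the library actually loaded at runtime, so a relinked BLAS would evade it.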
Hm... ok, probably there is no chance really... should we just add this to the docs as a known issue on OS X?
+1 it never hurts to document stuff.
does anybody know if EPD on osx produces the same bug?
I use EPD with MKL on my mac and I've never had a problem so I would say no until I'm proved wrong...
I added a warning in bd3a6c2fcc3ea15ba17b1e2c9946c81f1f0b05a9. @ogrisel @vene could you please check that it is accurate? Should we close this issue or leave it open?
FYI there is a similar issue with OpenBLAS. I just reported here: xianyi/OpenBLAS#294
I'm having the same issue with Ubuntu here. It was working a few days ago.
uname -a: Linux 3.8.0-29-generic #42-Ubuntu SMP Tue Aug 13 19:40:39 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
python --version Python 2.7.4
>>> import numpy; print numpy.__version__
1.7.1
>>> import scipy; print scipy.__version__
0.11.0
>>> import sklearn; print sklearn.__version__
0.15-git
Recent upgrades via apt-get (commit log for Sun Sep 1 13:30:29 2013) installed the following packages: libblas-dev (1.2.20110419-5), liblapack-dev (3.4.2-1~exp3), liblapack-doc (3.4.2-1~exp3), liblapacke (3.4.2-1~exp3), liblapacke-dev (3.4.2-1~exp3)
KMeans with n_jobs=2 hangs, cross_validation with n_jobs=2 runs.
That's weird. I recently pushed efa74b491ed46427abe12a76ad4629dfae8a77d4 to master, but that was to skip this test if OpenBLAS is used, so it should not be the problem. Still, can you try to check out a past version (using git log) and retry, to see whether this commit is the source of the regression? You cannot git bisect easily because of the hanging behavior, so a manual check is probably required.
Also what is the output of:
from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
on your box?
/usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1504: UserWarning:
Atlas (http://math-atlas.sourceforge.net/) libraries not found.
Directories to search for the libraries can be specified in the
numpy/distutils/site.cfg file (section [atlas]) or by setting
the ATLAS environment variable.
warnings.warn(AtlasNotFoundError.__doc__)
{'libraries': ['blas'], 'library_dirs': ['/usr/lib'], 'language': 'f77', 'define_macros': [('NO_ATLAS_INFO', 1)]}
That's odd, I thought I installed Atlas. I need to reinstall numpy right?
Weird. Maybe numpy is confused. I just checked on an up-to-date Ubuntu 13.04 machine, and if I install libatlas-base-dev I get the following output:
python -c "from numpy.distutils.system_info import get_info; print(get_info('blas_opt'))"
Found executable /usr/bin/gfortran
ATLAS version 3.8.4 built by buildd on Mon Nov 5 22:34:08 UTC 2012:
UNAME : Linux komainu 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
INSTFLG : -1 0 -a 1
ARCHDEFS : -DATL_OS_Linux -DATL_ARCH_UNKNOWNx86 -DATL_CPUMHZ=1596 -DATL_USE64BITS -DATL_GAS_x8664
F2CDEFS : -DAdd_ -DF77_INTEGER=int -DStringSunStyle
CACHEEDGE: 262144
F77 : gfortran, version GNU Fortran (Ubuntu/Linaro 4.7.2-5ubuntu5) 4.7.2
F77FLAGS : -O -Wa,--noexecstack -fPIC -m64
SMC : gcc, version gcc (Ubuntu/Linaro 4.7.2-5ubuntu5) 4.7.2
SMCFLAGS : -O -fomit-frame-pointer -Wa,--noexecstack -fPIC -m64
SKC : gcc, version gcc (Ubuntu/Linaro 4.7.2-5ubuntu5) 4.7.2
SKCFLAGS : -O -fomit-frame-pointer -Wa,--noexecstack -fPIC -m64
{'libraries': ['ptf77blas', 'ptcblas', 'atlas'], 'library_dirs': ['/usr/lib/atlas-base'], 'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')], 'language': 'c', 'include_dirs': ['/usr/include/atlas']}
Note that the headers seem to be required: if I uninstall the headers from libatlas-base-dev while keeping the dynamic library from the package libatlas3-base, then I see the line:
$ apt-get remove libatlas-base-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
libatlas-base-dev
0 upgraded, 0 newly installed, 1 to remove and 48 not upgraded.
After this operation, 42.7 MB disk space will be freed.
Do you want to continue [Y/n]?
(Reading database ... 29054 files and directories currently installed.)
Removing libatlas-base-dev ...
update-alternatives: using /usr/lib/libblas/libblas.so to provide /usr/lib/libblas.so (libblas.so) in auto mode
Basically it switches back to the symlinks to the non-optimized versions of BLAS and LAPACK... So the dev headers are effectively a runtime dependency for optimized numpy under Ubuntu.
Just to confirm what I just said:
$ apt-get install libatlas-base-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
libblas-doc liblapack-doc
The following NEW packages will be installed:
libatlas-base-dev
0 upgraded, 1 newly installed, 0 to remove and 48 not upgraded.
Need to get 0 B/8,899 kB of archives.
After this operation, 42.7 MB of additional disk space will be used.
Selecting previously unselected package libatlas-base-dev.
(Reading database ... 29036 files and directories currently installed.)
Unpacking libatlas-base-dev (from .../libatlas-base-dev_3.8.4-9ubuntu1_amd64.deb) ...
Setting up libatlas-base-dev (3.8.4-9ubuntu1) ...
update-alternatives: using /usr/lib/atlas-base/atlas/libblas.so to provide /usr/lib/libblas.so (libblas.so) in auto mode
update-alternatives: using /usr/lib/atlas-base/atlas/liblapack.so to provide /usr/lib/liblapack.so (liblapack.so) in auto mode
Thanks for that, it got me on the right track. I followed the instructions here and it appears to work.
Hanging on a clean install with Anaconda on a new i7 MacBook Pro Retina and Mavericks.
Which version of scikit-learn? We believe that we have fixed this in master.
Hey @GaelVaroquaux, I'm using scikit-learn 0.14.1 np17py27_1 from Anaconda.
@kevindavenport can you please give us the output of:
from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
If you have gdb installed, can you also try?
$ sudo gdb python <pid of one of the hanging python processes>
(gdb) bt # get the backtrace for the C program (python + libatlas.so)
(gdb) py-bt # get the python interpreter stacktrace, just to check
@ogrisel Thanks for your help. The dedication you guys have to this project is incredible!
{'extra_link_args': ['-Wl,-framework', '-Wl,Accelerate'], 'extra_compile_args': ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers'], 'define_macros': [('NO_ATLAS_INFO', 3)]}
I found an article stating how I can install gdb with homebrew:
brew install https://raw.github.com/Homebrew/homebrew-dupes/master/gdb.rb
but I also read "gdb has been replaced by lldb, and is no longer supported. gcc and llvm-gcc are also gone, replaced by clang."
Alright: anaconda is apparently linked against OSX vecLib instead of MKL or Atlas which causes the issue. Therefore this is the same issue as previously. We don't have any clean solution for this in the short term besides advising you to build Atlas from source to use that instead of vecLib.
Peter Wang from continuum states that they build against Apple Accelerate Framework. Do any of you guys have experience with pointing my environment to Atlas instead?
Maybe numpy.distutils.system_info is lying then? Anyway the freeze looks real, and AFAIK MKL is fork-safe.
Numpy developers started working on a binary package for numpy on OSX that builds against an embedded Atlas that does not suffer from this bug; see the comments in numpy/numpy#4007. If this experiment is validated and such numpy whl packages get published on PyPI, we can close this bug and re-enable the skipped tests for future versions of numpy.
A few months ago I decided to just conda install MKL (https://store.continuum.io/cshop/mkl-optimizations/), which makes my python install use the Intel MKL optimizations instead of the built-in Accelerate framework. This let me use the parallel features of joblib in scikit-learn, but recently the anaconda 2.0 and/or numpy upgrade broke it again. Everything checks out here though:
from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
print(get_info('lapack_opt'))
{'libraries': ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'], 'library_dirs': ['/Users/balthasar/anaconda/lib'], 'define_macros': [('SCIPY_MKL_H', None)], 'include_dirs': ['/Users/balthasar/anaconda/include']}
{'libraries': ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'], 'library_dirs': ['/Users/balthasar/anaconda/lib'], 'define_macros': [('SCIPY_MKL_H', None)], 'include_dirs': ['/Users/balthasar/anaconda/include']}
This is bad news. Can you confirm that the following breaks with numpy + mkl?
import numpy as np
from joblib import Parallel, delayed
a = np.random.randn(3000, 3000)
# Force initialization of multithreaded BLAS context in main process
x = np.dot(a, a.T)
if __name__ == "__main__":
# Force usage of multithreaded BLAS in fork-child processes
Parallel(n_jobs=2)(delayed(np.dot)(a, a.T) for _ in range(3))
print('ok')
If you get a segfault with the OSX process crash report, please report the stack trace here.
I close this issue as the original bug has now been fixed by the new numpy wheel package on PyPI.
@kevindavenport could you please open a new issue if you can reproduce the crash with numpy MKL and the code snippet provided in my earlier comment?
I am still getting this bug with the latest versions of everything on a mac. numpy + mkl breaks.
mkl, not accelerate? I didn't realize that was a problem, too. You are not using the anaconda version, are you?
I first noticed this when make test hung. I tried with stable and bleeding-edge scipy (I initially thought it was something arpack-related).
The test sklearn.cluster.tests.test_k_means.test_k_means_plus_plus_init_2_jobs hangs the process. Running in IPython something like
KMeans(init='k-means++', n_jobs=2).fit(np.random.randn(100, 100))
hangs as well. I thought maybe there was something wrong with my setup, but cross_val_score works OK with n_jobs=2.