scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Parallel K-Means hangs on Mac OS X Lion #636

Closed vene closed 10 years ago

vene commented 12 years ago

I first noticed this when 'make test' hung. I tried with stable and bleeding-edge scipy (I initially thought it was something ARPACK-related).

The test sklearn.cluster.tests.test_k_means.test_k_means_plus_plus_init_2_jobs hangs the process.

Running something like KMeans(init='k-means++', n_jobs=2).fit(np.random.randn(100, 100)) in IPython hangs as well.

I thought maybe there was something wrong with my setup, but cross_val_score works OK with n_jobs=2.
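
For reference, a minimal sketch of what I am running (the data is arbitrary random noise):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(100, 100)

# Completes with n_jobs=1; hangs with n_jobs=2 on OS X Lion
KMeans(init='k-means++', n_jobs=2).fit(X)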

ogrisel commented 12 years ago

Under linux or windows?

vene commented 12 years ago

Mac OS X Lion (10.7), can anybody reproduce?

ogrisel commented 12 years ago

And cross_val_score with n_jobs=2 works in the same IPython session? This is really weird. Could you try adding print statements or pdb.set_trace()? (The latter might be complicated in a multiprocessing environment with piped stdin and stdout...)

vene commented 12 years ago

Not in the same session, because the process needs to be killed, but it works in an identical IPython session in the same virtualenv. I don't really know how to debug this; I'll try some prints to see where it hangs.
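
Probably something like this little helper (untested sketch; the point is to tag each message with the pid and flush immediately, so output from forked workers is not swallowed by buffered pipes):

import os
import sys

def trace(msg):
    # Tag messages with the process id and flush right away, since
    # output from multiprocessing workers can be interleaved or lost.
    sys.stderr.write("[pid %d] %s\n" % (os.getpid(), msg))
    sys.stderr.flush()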


vene commented 12 years ago

After a heavy tracing session I (strangely) found the hanging line to be a call to np.dot (!) in safe_sparse_dot, called from euclidean_distances in k_init.

The exact same code runs when n_jobs=1 (i.e., via the special branch that bypasses the call to Parallel when n_jobs=1).

The interesting part (though I haven't checked this extensively) is that this happens in all three of my virtualenvs: stable numpy & scipy, git-head numpy & scipy, and scipy-superpack.
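
To rule out anything k-means-specific, here is a distilled sketch of what seems to hang (assuming the culprit really is the BLAS call in a forked child):

import numpy as np
from multiprocessing import Process

a = np.random.randn(100, 100)
np.dot(a, a.T)  # initialize BLAS in the parent process

def child():
    np.dot(a, a.T)  # hangs here in the forked child on OS X
    print('ok')

p = Process(target=child)
p.start()
p.join()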

Any ideas?


ogrisel commented 12 years ago

Can you try to attach gdb to one of the running python processes and get a backtrace?

$ sudo gdb python <pid of one of the hanging python processes>
(gdb) bt  # get the backtrace for the C program (python + libatlas.so)
(gdb) py-bt # get the python interpreter stacktrace, just to check
njwilson commented 12 years ago

I'm trying to run scikit-learn for the first time on my Mac and I am also seeing a hang in sklearn.cluster.tests.test_k_means.test_k_means_plus_plus_init_2_jobs.

Mac OS X Lion (10.7), Python 2.7, latest scikit-learn master from GitHub, virtualenv, numpy==1.6.1, scipy==0.10.1

I haven't tried gdb or anything yet. I'll be at the PyData Workshop tomorrow (Saturday) in case someone is there and wants to poke around on my computer.

njwilson commented 12 years ago

I did a Git bisect to narrow it down:

1520ad3ebb337ae44d1f4398ce15826c871a098d is the first bad commit
commit 1520ad3ebb337ae44d1f4398ce15826c871a098d
Author: Robert Layton <robertlayton@gmail.com>
Date:   Sun Jan 29 13:13:03 2012 +1100

    Parallel version of k-means.

:040000 040000 5d86a2a99b9ef029b32a7bcc8810c7ddb318521d 238d5519d937c1fa340dba051f9e41f4ede3421b M  doc
:040000 040000 d9b083cee7addc2c403cb2d95458c411c5482a48 260f9a8fcbd3ac7b6d03fb4785193015d7d9bb0c M  sklearn

So far I have been unable to get GDB working with Python. This probably isn't useful, but here's what I get when I Ctrl-C it after it hangs:

^CProcess PoolWorker-15:
Process PoolWorker-16:
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/pool.py", line 85, in worker
    task = get()
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/queues.py", line 376, in get
    racquire()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/pool.py", line 85, in worker
    task = get()
  File "/usr/local/Cellar/python/2.7.2/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
KeyboardInterrupt
ogrisel commented 12 years ago

OK, so the git bisect tells us that this bug was introduced at the same time as the parallel k-means feature itself. The Python stacktrace might tell us that there is a race condition in multiprocessing:

this might be related to http://bugs.python.org/issue9205, which has been fixed in Python 3 but maybe not backported to Python 2.7.

By the way @vene and @njwilson, which versions of Python are you using?

Edit: I mean the exact version, e.g. the output of python --version.

njwilson commented 12 years ago

I'm using Python 2.7.2

ogrisel commented 12 years ago

Same here... I could never reproduce this race condition. So this might not be linked to an already backported python bugfix.

ogrisel commented 12 years ago

How many cores / CPUs do you have on your machine? I am running a dual core machine.

njwilson commented 12 years ago

Same. Dual core.

ogrisel commented 12 years ago

@njwilson can you try to check out and build this branch and see if you can still reproduce the issue?

https://github.com/ogrisel/scikit-learn/compare/no-kmeans-profile

Cython profiling seems to be messing around with Python threads, and I have heard recently that Python forking and Python threads are not good friends: http://bugs.python.org/issue6721

njwilson commented 12 years ago

I tried the no-kmeans-profile branch and the problem still exists.

ogrisel commented 12 years ago

@cournape and I tracked the bug down to a segfault in Grand Central Dispatch on the BLAS DGEMM function (only after a fork):

https://gist.github.com/2027412

I have just reported the issue to Apple. Hopefully it will be fixed at some point. In the meantime we will have to skip this test on OS X, I think.
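
The skip could be as simple as this (a sketch, assuming our nose-based test setup; the message text is illustrative):

import sys
from nose import SkipTest

def test_k_means_plus_plus_init_2_jobs():
    if sys.platform == 'darwin':
        raise SkipTest('Parallel KMeans hangs under Accelerate on OS X (#636)')
    # ... actual test body ...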

vene commented 12 years ago

Wow, great work!

Let's hope they are responsive.


vene commented 12 years ago

Now (on Mountain Lion) I don't get hanging, but I get random segfaults in sklearn.cluster.tests.test_k_means.test_k_means_plus_plus_init_2_jobs. This is the output: http://pastebin.com/aDnYUBkX

vene commented 12 years ago

I assume this is the same bug @ogrisel and @cournape found, which Apple confirmed as expected behaviour (you can't call Accelerate in subprocesses).

Can I compile my own parallel, optimized BLAS and LAPACK then? Has anybody done this on Lion or Mountain Lion? I can't find much about this endeavour.

ogrisel commented 12 years ago

Accelerate can be called after a fork, but you need to execv the subprocess with the python binary (which multiprocessing does not do under POSIX). I think @cournape is working on a numpy build that uses OpenBLAS, which could serve as a good alternative to Accelerate (even though early benchmarks show a slight decrease in performance).
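
To illustrate the execv point, a BLAS call run in a freshly exec'd interpreter is fine, while the same call in a fork()ed child is not (untested sketch):

import subprocess
import sys

code = "import numpy as np; a = np.random.randn(500, 500); np.dot(a, a.T); print('ok')"

# A fresh interpreter initializes Accelerate from scratch instead of
# inheriting the parent's GCD state across the fork.
print(subprocess.check_output([sys.executable, '-c', code]))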

amueller commented 12 years ago

So is there any way we can avoid these problems on OS X? Can we check the BLAS before trying to do something in parallel?

ogrisel commented 12 years ago

I think @cournape would like to make it possible to switch the BLAS/LAPACK implementation for numpy at runtime, but that might be pretty tricky to implement and will obviously only be available in some undetermined future version of numpy, if ever.

I don't know if we can detect whether the BLAS in use is linked against Accelerate and issue a warning in joblib prior to forking in that case.
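
A heuristic along these lines might do (untested sketch; it only covers numpy builds that recorded their BLAS link flags via numpy.distutils):

from numpy.distutils.system_info import get_info

def blas_linked_to_accelerate():
    # Look for the Accelerate / vecLib framework in the link flags
    # numpy recorded at build time.
    info = get_info('blas_opt') or {}
    flags = ' '.join(info.get('extra_link_args', []))
    return 'Accelerate' in flags or 'vecLib' in flags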

amueller commented 12 years ago

Hm... OK, probably there is no chance really... should we just add this to the docs as a known issue on OS X?

ogrisel commented 12 years ago

+1 it never hurts to document stuff.

satra commented 12 years ago

Does anybody know if EPD on OS X produces the same bug?

agramfort commented 12 years ago

> Does anybody know if EPD on OS X produces the same bug?

I use EPD with MKL on my Mac and I've never had a problem, so I would say no until I'm proved wrong...

amueller commented 11 years ago

I added a warning in bd3a6c2fcc3ea15ba17b1e2c9946c81f1f0b05a9. @ogrisel @vene could you please check that it is accurate? Should we close this issue or leave it open?

ogrisel commented 11 years ago

FYI there is a similar issue with OpenBLAS. I just reported it here: xianyi/OpenBLAS#294

robertlayton commented 11 years ago

I'm having the same issue on Ubuntu. It was working a few days ago.

$ uname -a
Linux 3.8.0-29-generic #42-Ubuntu SMP Tue Aug 13 19:40:39 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 2.7.4

>>> import numpy; print numpy.__version__
1.7.1
>>> import scipy; print scipy.__version__
0.11.0
>>> import sklearn; print sklearn.__version__
0.15-git

Recent upgrades via apt-get (commit log for Sun Sep 1 13:30:29 2013) installed the following packages: libblas-dev (1.2.20110419-5), liblapack-dev (3.4.2-1~exp3), liblapack-doc (3.4.2-1~exp3), liblapacke (3.4.2-1~exp3), liblapacke-dev (3.4.2-1~exp3).

KMeans with n_jobs=2 hangs, cross_validation with n_jobs=2 runs.

ogrisel commented 11 years ago

That's weird. I recently pushed efa74b491ed46427abe12a76ad4629dfae8a77d4 to master, but that was to skip this test if OpenBLAS is used, so it should not be the problem. Still, can you check out a past version (using git log) and retry, to see whether this commit is the source of the regression? You cannot git bisect easily because of the hanging behavior, so a manual check is probably required.

Also what is the output of:

from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))

on your box?

robertlayton commented 11 years ago
/usr/lib/python2.7/dist-packages/numpy/distutils/system_info.py:1504: UserWarning: 
    Atlas (http://math-atlas.sourceforge.net/) libraries not found.
    Directories to search for the libraries can be specified in the
    numpy/distutils/site.cfg file (section [atlas]) or by setting
    the ATLAS environment variable.
  warnings.warn(AtlasNotFoundError.__doc__)
{'libraries': ['blas'], 'library_dirs': ['/usr/lib'], 'language': 'f77', 'define_macros': [('NO_ATLAS_INFO', 1)]}

That's odd, I thought I installed Atlas. I need to reinstall numpy, right?

ogrisel commented 11 years ago

Weird. Maybe numpy is confused. I just checked on an up-to-date Ubuntu 13.04 machine, and if I install libatlas-base-dev I get the following output:

python -c "from numpy.distutils.system_info import get_info; print(get_info('blas_opt'))"
Found executable /usr/bin/gfortran
ATLAS version 3.8.4 built by buildd on Mon Nov  5 22:34:08 UTC 2012:
   UNAME    : Linux komainu 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
   INSTFLG  : -1 0 -a 1
   ARCHDEFS : -DATL_OS_Linux -DATL_ARCH_UNKNOWNx86 -DATL_CPUMHZ=1596 -DATL_USE64BITS -DATL_GAS_x8664
   F2CDEFS  : -DAdd_ -DF77_INTEGER=int -DStringSunStyle
   CACHEEDGE: 262144
   F77      : gfortran, version GNU Fortran (Ubuntu/Linaro 4.7.2-5ubuntu5) 4.7.2
   F77FLAGS : -O -Wa,--noexecstack -fPIC -m64
   SMC      : gcc, version gcc (Ubuntu/Linaro 4.7.2-5ubuntu5) 4.7.2
   SMCFLAGS : -O -fomit-frame-pointer -Wa,--noexecstack -fPIC -m64
   SKC      : gcc, version gcc (Ubuntu/Linaro 4.7.2-5ubuntu5) 4.7.2
   SKCFLAGS : -O -fomit-frame-pointer -Wa,--noexecstack -fPIC -m64
{'libraries': ['ptf77blas', 'ptcblas', 'atlas'], 'library_dirs': ['/usr/lib/atlas-base'], 'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')], 'language': 'c', 'include_dirs': ['/usr/include/atlas']}
ogrisel commented 11 years ago

Note that the headers seem to be required: if I uninstall the headers from libatlas-base-dev while keeping the dynamic library from the package libatlas3-base, then I see the following:

$ apt-get remove libatlas-base-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be REMOVED:
  libatlas-base-dev
0 upgraded, 0 newly installed, 1 to remove and 48 not upgraded.
After this operation, 42.7 MB disk space will be freed.
Do you want to continue [Y/n]?
(Reading database ... 29054 files and directories currently installed.)
Removing libatlas-base-dev ...
update-alternatives: using /usr/lib/libblas/libblas.so to provide /usr/lib/libblas.so (libblas.so) in auto mode

Basically it switches back to the symlinks to the non-optimized versions of BLAS and LAPACK... So the dev headers are effectively a runtime dependency for optimized numpy under Ubuntu.
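
If in doubt, the alternatives system will show which implementation libblas.so currently points at:

$ update-alternatives --display libblas.so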

ogrisel commented 11 years ago

Just to confirm what I just said:

$ apt-get install libatlas-base-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
Suggested packages:
  libblas-doc liblapack-doc
The following NEW packages will be installed:
  libatlas-base-dev
0 upgraded, 1 newly installed, 0 to remove and 48 not upgraded.
Need to get 0 B/8,899 kB of archives.
After this operation, 42.7 MB of additional disk space will be used.
Selecting previously unselected package libatlas-base-dev.
(Reading database ... 29036 files and directories currently installed.)
Unpacking libatlas-base-dev (from .../libatlas-base-dev_3.8.4-9ubuntu1_amd64.deb) ...
Setting up libatlas-base-dev (3.8.4-9ubuntu1) ...
update-alternatives: using /usr/lib/atlas-base/atlas/libblas.so to provide /usr/lib/libblas.so (libblas.so) in auto mode
update-alternatives: using /usr/lib/atlas-base/atlas/liblapack.so to provide /usr/lib/liblapack.so (liblapack.so) in auto mode
robertlayton commented 11 years ago

Thanks for that, it got me on the right track. I followed the instructions here and it appears to work.

kevindavenport commented 10 years ago

Hanging on a clean install with Anaconda on a new i7 MacBook Pro Retina and Mavericks.

GaelVaroquaux commented 10 years ago

> Hanging on a clean install with Anaconda on a new i7 MacBook Pro Retina and Mavericks.

Which version of scikit-learn? We believe that we have fixed this in master.

kevindavenport commented 10 years ago

Hey @GaelVaroquaux, I'm using scikit-learn 0.14.1 np17py27_1 from Anaconda.

ogrisel commented 10 years ago

@kevindavenport can you please give us the output of:

from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))

If you have gdb installed, can you also try:

$ sudo gdb python <pid of one of the hanging python processes>
(gdb) bt  # get the backtrace for the C program (python + libatlas.so)
(gdb) py-bt # get the python interpreter stacktrace, just to check
kevindavenport commented 10 years ago

@ogrisel Thanks for your help. The dedication you guys have to this project is incredible!

{'extra_link_args': ['-Wl,-framework', '-Wl,Accelerate'], 'extra_compile_args': ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers'], 'define_macros': [('NO_ATLAS_INFO', 3)]}

I found an article showing how to install gdb with homebrew:

brew install https://raw.github.com/Homebrew/homebrew-dupes/master/gdb.rb

but I also read "gdb has been replaced by lldb, and is no longer supported. gcc and llvm-gcc are also gone, replaced by clang."

ogrisel commented 10 years ago

Alright: Anaconda is apparently linked against OS X's vecLib instead of MKL or Atlas, which causes the issue. So this is the same issue as before. We don't have any clean solution in the short term besides advising you to build Atlas from source and use that instead of vecLib.

kevindavenport commented 10 years ago

Peter Wang from Continuum states that they build against Apple's Accelerate framework. Do any of you have experience with pointing my environment to Atlas instead?

ogrisel commented 10 years ago

Maybe numpy.distutils.system_info is lying then? Anyway the freeze looks real and AFAIK MKL is fork-safe.

ogrisel commented 10 years ago

Numpy developers have started working on a binary package for numpy on OS X that builds against an embedded Atlas and does not suffer from this bug; see the comments in numpy/numpy#4007. If this experiment is validated and such numpy whl packages get published on PyPI, we can close this bug and re-enable the skipped tests for future versions of numpy.
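
Once such wheels are published, picking up the fixed build should be as simple as (assuming a pip recent enough to install wheels):

$ pip install --upgrade numpy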

kevindavenport commented 10 years ago

A few months ago I decided to just conda install MKL (https://store.continuum.io/cshop/mkl-optimizations/), which makes my Python install use the Intel MKL optimizations instead of the built-in Accelerate framework. This let me use the parallel features of joblib in scikit-learn, but recently the Anaconda 2.0 and/or numpy upgrade broke it again. Everything checks out here, though:

from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
print(get_info('lapack_opt'))
{'libraries': ['mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'], 'library_dirs': ['/Users/balthasar/anaconda/lib'], 'define_macros': [('SCIPY_MKL_H', None)], 'include_dirs': ['/Users/balthasar/anaconda/include']}
{'libraries': ['mkl_lapack95_lp64', 'mkl_intel_lp64', 'mkl_intel_thread', 'mkl_core', 'iomp5', 'pthread'], 'library_dirs': ['/Users/balthasar/anaconda/lib'], 'define_macros': [('SCIPY_MKL_H', None)], 'include_dirs': ['/Users/balthasar/anaconda/include']}
ogrisel commented 10 years ago

This is bad news. Can you confirm that the following breaks with numpy + mkl?

import numpy as np
from joblib import Parallel, delayed

a = np.random.randn(3000, 3000)

# Force initialization of the multithreaded BLAS context in the main
# process, before any worker is forked
x = np.dot(a, a.T)

if __name__ == "__main__":
    # Force usage of the multithreaded BLAS in fork-child processes
    Parallel(n_jobs=2)(delayed(np.dot)(a, a.T) for _ in range(3))
    print('ok')
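
(With a fork-safe BLAS this snippet should print ok; with a fork-unsafe build I would expect it to hang or crash in the worker processes instead.)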
ogrisel commented 10 years ago

If you get a segfault with the OS X process crash report, please report the stack trace here.

ogrisel commented 10 years ago

I'm closing this issue, as the original bug has now been fixed by the new numpy wheel package on PyPI.

@kevindavenport could you please open a new issue if you can reproduce the crash with numpy + MKL and the code snippet provided in my earlier comment?

potatochip commented 9 years ago

I am still getting this bug with the latest versions of everything on a Mac. numpy + MKL breaks.

amueller commented 9 years ago

MKL, not Accelerate? I didn't realize that was a problem too. You are not using the Anaconda version, are you?