inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

clustering: new parallelization implementation #65

Closed MSusik closed 9 years ago

MSusik commented 9 years ago

Signed-off-by: Mateusz Susik mateusz.susik@cern.ch

glouppe commented 9 years ago

Besides my nitpicks, this looks good to me. Thanks! Two questions though:

MSusik commented 9 years ago

Does this solve the blocking issue we had?

Yes, just one small improvement is needed, as this implementation can hang.

Have you checked this works fine both for Python 2.7 and 3.4?

Yes.

glouppe commented 9 years ago

Great then, +1 for merge once my comments are fixed.

glouppe commented 9 years ago

CC: @ogrisel Just to let you know, we have had hanging issues with joblib -- processes all stall, CPU usage drops to 0, and then nothing more happens. Do you know where this could be coming from? These hangs are very difficult to reproduce and seem to appear at random... Using multiprocessing directly solves our immediate problem, but it would be nice if joblib could be used again.

MSusik commented 9 years ago

Note that the approach is completely different as joblib spawns a process for every block of data, while I keep a pool of processes that run all the time.
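
A minimal sketch of the persistent-pool idea described above (hypothetical code, not the PR's actual implementation; process_block stands in for the real per-block clustering work):

import multiprocessing as mp

def process_block(block):
    # hypothetical stand-in for the real per-block clustering work
    return sum(block)

if __name__ == '__main__':
    blocks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    pool = mp.Pool(processes=4)   # workers are created once...
    try:
        results = pool.map(process_block, blocks)  # ...and reused for every block
    finally:
        pool.close()
        pool.join()
    print(results)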

ogrisel commented 9 years ago

I want to make it possible to reuse joblib pools across several consecutive calls to Parallel.__call__ too. However, this is probably not the cause of the hanging.
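
A sketch of the call pattern being discussed, with a hypothetical stand-in task; with joblib as it stands, each call to Parallel.__call__ manages its own workers, which is what pool reuse would avoid:

from joblib import Parallel, delayed

def square(x):
    # hypothetical stand-in task
    return x * x

parallel = Parallel(n_jobs=4)
first = parallel(delayed(square)(i) for i in range(10))   # first __call__: workers are set up
second = parallel(delayed(square)(i) for i in range(10))  # second __call__: workers are set up again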

Which implementation of BLAS do you use when you observe the hanging? anaconda's MKL? OSX's built-in Accelerate?

ogrisel commented 9 years ago

Note that the approach is completely different as joblib spawns a process for every block of data, while I keep a pool of processes that run all the time.

Note: joblib spawns a pool with a fixed number of worker processes per call to Parallel (using a multiprocessing Pool instance and apply_async under the hood). But you can pass many blocks of data and the number of workers stays constant.
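
A minimal illustration of this point, with a hypothetical process_block function: however many blocks are passed in, the number of worker processes stays fixed:

from joblib import Parallel, delayed

def process_block(block):
    # hypothetical per-block task
    return len(block)

blocks = [list(range(n)) for n in range(1000)]
# 1000 tasks are dispatched to a pool of only 4 worker processes
results = Parallel(n_jobs=4)(delayed(process_block)(b) for b in blocks)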

ogrisel commented 9 years ago

BTW, could you also tell me if you observe the hanging behavior when enabling the forkserver start method under Python 3.4+? To try that, you need to modify the main block of the main script that starts your Python program with:

import multiprocessing as mp
# import your modules here

if __name__ == '__main__':
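    # 'forkserver' starts workers from a clean helper process instead of
    # fork()-ing the (possibly multi-threaded) parent, which often avoids
    # deadlocks involving threaded BLAS/OpenMP libraries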
    mp.set_start_method('forkserver')
    # call your code here

More details here: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

MSusik commented 9 years ago

Which implementation of BLAS do you use when you observe the hanging? anaconda's MKL?

We observed the error both when using anaconda's MKL and when working outside of anaconda. For example, here is numpy's config for a run without anaconda:

blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
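
For reference, a configuration report like the one above can be printed with numpy's built-in diagnostic, which shows which BLAS/LAPACK numpy was built against:

import numpy
numpy.show_config()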

ogrisel commented 9 years ago

That's weird. Do you use OpenMP-based libraries or compiled extensions (e.g. Cython prange constructs)?

It would be great if you could provide a standalone joblib snippet that reproduces the freeze, so that I can try to debug it.
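
A hypothetical skeleton of the kind of standalone snippet being requested (work is a stand-in task; this does not by itself reproduce the freeze):

from joblib import Parallel, delayed
import numpy as np

def work(seed):
    # stand-in for the real per-block workload; replace with the code path that hangs
    rng = np.random.RandomState(seed)
    return np.linalg.svd(rng.rand(200, 200))[1].sum()

if __name__ == '__main__':
    results = Parallel(n_jobs=4, verbose=5)(delayed(work)(i) for i in range(100))
    print(len(results))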