inspirehep / beard

Bibliographic Entity Automatic Recognition and Disambiguation

blocking: no multiprocessing for n_jobs=1 #83

Closed jacenkow closed 8 years ago

jacenkow commented 8 years ago

When n_jobs is set to 1, the blocking methods should skip the use of multiprocessing entirely. This gets really annoying when beard is used from Celery, since Celery tasks cannot spawn their own subprocesses.

CC: @glouppe, @kaplun
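For context, Celery's prefork pool workers are daemonic processes, and CPython's multiprocessing refuses to start children from a daemonic process. A minimal stand-alone sketch of that restriction (independent of beard and Celery):

```python
import multiprocessing as mp

def child():
    print("child running")

def daemonic_worker():
    try:
        # Starting a process from inside a daemonic process fails.
        mp.Process(target=child).start()
    except AssertionError as exc:
        # CPython raises: "daemonic processes are not allowed to have children"
        print("cannot spawn subprocess:", exc)

if __name__ == "__main__":
    worker = mp.Process(target=daemonic_worker, daemon=True)
    worker.start()
    worker.join()
```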

MSusik commented 8 years ago

You might want to implement something similar to Joblib's mechanism for handling the n_jobs parameter:

https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L489

However, please note that we decided to implement our own handling of the parallelization of clustering in the clustering/blocking.py file instead of using Joblib directly. On machines with Intel processors and the Math Kernel Library, we ran into a well-known bug when running beard on 8+ cores under Python 2 (check issue number 138 on the joblib repo; I don't want to paste the link here, as I've already referenced this bug some time ago :smile:).

The bug probably comes from the implementation of MKL. Anyway, by simplifying the code (using a fixed set of processes instead of mp.Pool), we managed to significantly decrease the frequency of race conditions. That is the only reason the queues and multiprocessing are in our source file.

Obviously, this can be improved.
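For illustration, a minimal sketch of the pattern described above: a fixed set of worker processes fed through queues rather than mp.Pool. The block payloads and the "fit" step are placeholders, not beard's actual code:

```python
import multiprocessing as mp

def _worker(tasks, results):
    # Each worker loops until it receives a sentinel, so the same
    # fixed set of processes handles every block (no mp.Pool involved).
    while True:
        block = tasks.get()
        if block is None:
            break
        results.put((block, len(block)))  # placeholder for the real fit

if __name__ == "__main__":
    n_jobs = 4
    tasks, results = mp.SimpleQueue(), mp.SimpleQueue()
    workers = [mp.Process(target=_worker, args=(tasks, results))
               for _ in range(n_jobs)]
    for w in workers:
        w.start()
    blocks = [["a", "b"], ["c"], ["d", "e", "f"]]
    for block in blocks:
        tasks.put(block)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    print([results.get() for _ in blocks])
    for w in workers:
        w.join()
```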

kaplun commented 8 years ago

The thing is that in production we are thinking of not using parallelism at all, because parallelism will already be provided at the block level (i.e. we are going to send several blocks in parallel to be computed). Every CPU will very likely be busy with a whole block, then the next one, and so on, letting Celery take this responsibility (at the moment multiprocessing conflicts with Celery).

MSusik commented 8 years ago

Understood. It would still be nice to keep the option of running beard in parallel on a single machine. The point is that you have to touch the multiprocessing code in order to let the library run without spawning any new processes.

Since it is not obvious why this implementation is there, I thought a small clarification might be handy.

glouppe commented 8 years ago

See #65 for further details regarding the hanging issues we had.

In any case, it should be easy to modify the current implementation to have a special case when n_jobs=1. The quickest solution is to instantiate queue.Queue objects instead of multiprocessing.SimpleQueue if self.n_jobs == 1 (see lines https://github.com/inveniosoftware/beard/blob/master/beard/clustering/blocking.py#L196 and below), and then call _parallel_fit directly instead of spawning processes.
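A sketch of that special case (fit_blocks and the worker body below are hypothetical stand-ins for beard's blocking code, not its real API): when n_jobs == 1, plain queue.Queue objects replace the multiprocessing queues and the worker is called inline, so no subprocess is ever spawned.

```python
import multiprocessing as mp
import queue

def _parallel_fit(tasks, results):
    # Simplified stand-in for beard's worker loop: consume blocks
    # until a sentinel arrives, emit one result per block.
    while True:
        block = tasks.get()
        if block is None:
            break
        results.put(("fitted", block))

def fit_blocks(blocks, n_jobs=1):
    if n_jobs == 1:
        # In-process queues; no subprocess is spawned, so this path
        # is safe inside a Celery task.
        tasks, results = queue.Queue(), queue.Queue()
        for block in blocks:
            tasks.put(block)
        tasks.put(None)
        _parallel_fit(tasks, results)  # called directly, same process
        return [results.get() for _ in blocks]

    tasks, results = mp.SimpleQueue(), mp.SimpleQueue()
    workers = [mp.Process(target=_parallel_fit, args=(tasks, results))
               for _ in range(n_jobs)]
    for w in workers:
        w.start()
    for block in blocks:
        tasks.put(block)
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    out = [results.get() for _ in blocks]
    for w in workers:
        w.join()
    return out

if __name__ == "__main__":
    print(fit_blocks([[1], [2, 3]], n_jobs=1))
```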

MSusik commented 8 years ago

> The quickest solution is to instantiate queue.Queue

I completely forgot about this one! Indeed, much better :smile:

jacenkow commented 8 years ago

#84

jacenkow commented 8 years ago

@MSusik @glouppe thanks for your help, really appreciate it :smile: