josejimenezluna / pyGPGO

Bayesian optimization for Python
http://pygpgo.readthedocs.io
MIT License

Parallelization in GPGO.GPGO returns BrokenProcessPool error #11

Status: Closed (RHammond2 closed this issue 5 years ago)

RHammond2 commented 6 years ago

Hello,

I am testing a number of Bayesian optimization packages for a hyperparameter search. I quite like the setup of pyGPGO, but I am running into an issue when parallelizing the code. I was using the tutorial from https://github.com/hawk31/pyGPGO/blob/master/tutorials/mlopt.ipynb as a reference.

When I change gpgo = GPGO(gp, acq, evaluateModel, params) to gpgo = GPGO(gp, acq, evaluateModel, params, n_jobs=4) in that example, I get a BrokenProcessPool error, both with your sample code and with my own similar example. The output after making this change is shown below.

I'm not sure whether the underlying issue is with joblib.Parallel or with pyGPGO. Any help would be greatly appreciated!

For reference, I am running on macOS with an Anaconda distribution of Python 3.6.

Evaluation   Proposed point       Current eval.      Best eval.
init     [ 3.28824672 -0.91429671].       0.7398555220462545     0.7399868590570241
init     [-0.49380448 -0.48245949].       0.7399868590570241     0.7399868590570241
init     [-3.75686293  1.00979526].       0.7399868590570241     0.7399868590570241

exception calling callback for <Future at 0x1c3b6f2198 state=finished raised BrokenProcessPool>
Traceback (most recent call last):
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 322, in _invoke_callbacks
    callback(self)
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 375, in __call__
    self.parallel.dispatch_next()
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 795, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 823, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 780, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 504, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 151, in submit
    fn, *args, **kwargs)
  File "/Users/[user]/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 993, in submit
    raise BrokenProcessPool(self._flags.broken)
joblib.externals.loky.process_executor.BrokenProcessPool: A process in the executor was terminated abruptly, the pool is not usable anymore.

---------------------------------------------------------------------------
BrokenProcessPool                         Traceback (most recent call last)
<ipython-input-28-133aafb98ba8> in <module>()
     24 from pyGPGO.GPGO import GPGO
     25 gpgo = GPGO(gp, acq, evaluateModel, params, n_jobs=4)
---> 26 gpgo.run(max_iter = 20)
     27 
     28 

~/anaconda3/lib/python3.6/site-packages/pyGPGO/GPGO.py in run(self, max_iter, init_evals, resume)
    189             self.logger._printInit(self)
    190         for iteration in range(max_iter):
--> 191             self._optimizeAcq()
    192             self.updateGP()
    193             self.logger._printCurrent(self)

~/anaconda3/lib/python3.6/site-packages/pyGPGO/GPGO.py in _optimizeAcq(self, method, n_start)
    136                                                                  method='L-BFGS-B',
    137                                                                  bounds=self.parameter_range) for start_point in
--> 138                                                start_points_arr)
    139             x_best = np.array([res.x for res in opt])
    140             f_best = np.array([np.atleast_1d(res.fun)[0] for res in opt])

~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
    992 
    993             with self._backend.retrieval_context():
--> 994                 self.retrieve()
    995             # Make sure that we get a last message telling us we are done
    996             elapsed_time = time.time() - self._start_time

~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    895             try:
    896                 if getattr(self._backend, 'supports_timeout', False):
--> 897                     self._output.extend(job.get(timeout=self.timeout))
    898                 else:
    899                     self._output.extend(job.get())

~/anaconda3/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    513         AsyncResults.get from multiprocessing."""
    514         try:
--> 515             return future.result(timeout=timeout)
    516         except LokyTimeoutError:
    517             raise TimeoutError()

~/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/_base.py in result(self, timeout)
    429                 raise CancelledError()
    430             elif self._state == FINISHED:
--> 431                 return self.__get_result()
    432             else:
    433                 raise TimeoutError()

~/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/_base.py in __get_result(self)
    380     def __get_result(self):
    381         if self._exception:
--> 382             raise self._exception
    383         else:
    384             return self._result

BrokenProcessPool: A process in the executor was terminated abruptly while the future was running or pending.

Thank you, Rob

josejimenezluna commented 6 years ago

Hello @RHammond2 ,

I'm unable to replicate this under CentOS 7 with Python 3.6.3 and joblib 0.11.

In [5]: from pyGPGO.GPGO import GPGO
    ...: gpgo = GPGO(gp, acq, evaluateModel, params, n_jobs=4)
    ...: gpgo.run(max_iter = 5)
    ...: 
Evaluation   Proposed point       Current eval.      Best eval.
init     [ 1.39850819 -1.17881649].       0.8493761140819965     0.8493761140819965
init     [ 3.7303667  4.7255962].     0.5148544266191325     0.8493761140819965
init     [-2.35468526  2.4390623 ].       0.6149732620320855     0.8493761140819965
1        [ 0.69315004 -0.49900347].       0.849227569816     0.849376114082
2        [ 1.6262495  -0.23700775].       0.884284016637     0.884284016637
3        [ 0.8895309  0.7722795].     0.889483065954     0.889483065954
4        [-0.26714709  0.84576647].       0.884432560903     0.889483065954
5        [ 0.01668053  1.81505441].       0.834373143197     0.889483065954

Maybe this is related. Which joblib version are you using?
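One way to answer the version question is to query the installed package metadata. The snippet below is a minimal sketch, not part of pyGPGO: `installed_version` is a hypothetical helper name, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library (on the Python 3.6 setup described above, `import joblib; print(joblib.__version__)` gives the same information).

```python
from importlib import metadata  # standard library from Python 3.8 onwards


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent.

    Hypothetical helper for illustration; not part of pyGPGO or joblib.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the joblib version so it can be pasted into the issue.
print("joblib:", installed_version("joblib"))
```

Posting this output together with the traceback makes it easier to compare environments.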

josejimenezluna commented 6 years ago

Hello again @RHammond2,

I was able to replicate this and can confirm that it is caused by the latest joblib version. Please use 0.11 in the meantime.

I will correct dependencies in the repo.
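Until the dependency is pinned, a script could guard the `n_jobs` setting itself. The following is only a sketch under stated assumptions: `version_tuple` and `choose_n_jobs` are hypothetical helpers, not pyGPGO or joblib API, and the rule "anything other than 0.11 falls back to a single job" is an assumption drawn from this thread, which confirms only that 0.11 works and that the then-latest release does not.

```python
import re

# The joblib release reported as working in this thread.
KNOWN_GOOD = (0, 11)


def version_tuple(version):
    """Turn a version string such as '0.11' or '0.12.5' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def choose_n_jobs(requested, joblib_version):
    """Hypothetical guard: keep `requested` on joblib 0.11.x, else use 1 job.

    Falling back to 1 assumes the non-parallel path is unaffected, which
    matches the report above that the unmodified example runs.
    """
    if version_tuple(joblib_version)[:2] == KNOWN_GOOD:
        return requested
    return 1


print(choose_n_jobs(4, "0.11"))    # known-good release keeps the request
print(choose_n_jobs(4, "0.12.5"))  # any other release falls back to serial
```

The result could then be passed as `n_jobs` to `GPGO(gp, acq, evaluateModel, params, n_jobs=...)`, the call used in the report above.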

RHammond2 commented 6 years ago

Hello @hawk31

I was able to run the example successfully when reverting to version 0.11.

Thanks so much! Rob