joblib / loky

Robust and reusable Executor for joblib
http://loky.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
529 stars 45 forks source link

Deadlock or over-subscription when running scikit-learn test suite in parallel #224

Closed rth closed 5 years ago

rth commented 5 years ago

With scikit-learn master, joblib 0.13.2 and Python 3.7 on a 12 cores CPU I get a deadlock in loky (or maybe it's just heavy oversubscription not sure) when using pytest-xdist to run tests in parallel,

pytest -n 12 sklearn/
``` ============================================================= test session starts ============================================================= platform linux -- Python 3.7.4, pytest-5.0.1, py-1.8.0, pluggy-0.12.0 rootdir: /home/rth/src/scikit-learn, inifile: setup.cfg plugins: forked-1.0.2, xdist-1.28.0 gw0 [13654] / gw1 [13654] / gw2 [13654] / gw3 [13654] / gw4 [13654] / gw5 ok / gw6 [13654] / gw7 [13654] / gw8 ok / gw9 ok / gw10 ok / gw11 [13gw0 [13654] / gw1 [13654] / gw2 [13654] / gw3 [13654] / gw4 [13654] / gw5 [13654] / gw6 [13654] / gw7 [13654] / gw8 ok / gw9 ok / gw10 ok / gw1gw0 [13654] / gw1 [13654] / gw2 [13654] / gw3 [13654] / gw4 [13654] / gw5 [13654] / gw6 [13654] / gw7 [13654] / gw8 [13654] / gw9 ok / gw10 ok gw0 [13654] / gw1 [13654] / gw2 [13654] / gw3 [13654] / gw4 [13654] / gw5 [13654] / gw6 [13654] / gw7 [13654] / gw8 [13654] / gw9 [13654] / gw1gw0 [13654] / gw1 [13654] / gw2 [13654] / gw3 [13654] / gw4 [13654] / gw5 [13654] / gw6 [13654] / gw7 [13654] / gw8 [13654] / gw9 [13654] / gw10 [13654] / gw11 [13654] ............................s............................................................................................................ [ 0%] ..............................................s.......................s............................ss..................s............... [ 1%] .........s..................s..s........s........s.s..........s.............................s..s...............s........s.........ss.s. [ 2%] ...................s....s........................s............................................................s.......................s. [ 3%] .....s...................................s..................................................................s..............s........... [ 4%] ................s..s..s...........s..................s.......s...........s........................s..................................... [ 5%] .................................................................s.........................s.s......................................... [ 6%] ....................................................................................................................................... [ 7%] ...................................................................x................................................................... [ 8%] ....................................................................................................................................... [ 9%] ............ ^C^CError in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/logging/__init__.py", line 2033, in shutdown Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/multiprocessing/util.py", line 265, in _run_finalizers Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/multiprocessing/util.py", line 299, in _exit_function Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/multiprocessing/util.py", line 265, in _run_finalizers finalizer() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/multiprocessing/util.py", line 189, in __call__ _run_finalizers(0)finalizer() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/multiprocessing/util.py", line 189, in __call__ res = self._callback(*self._args, **self._kwargs)KeyboardInterrupt File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/joblib/externals/loky/backend/synchronize.py", line 97, in _cleanup h.flush() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/logging/__init__.py", line 1009, in flush res = self._callback(*self._args, **self._kwargs) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 197, in _python_exit semaphore_tracker.unregister(name) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/joblib/externals/loky/backend/semaphore_tracker.py", line 125, in unregister self.ensure_running() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/joblib/externals/loky/backend/semaphore_tracker.py", line 66, in ensure_running thread.join() self.stream.flush() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/_pytest/capture.py", line 443, in __getattr__ File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/threading.py", line 1044, in join if self._check_alive(): File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/joblib/externals/loky/backend/semaphore_tracker.py", line 112, in _check_alive return getattr(object.__getattribute__(self, "buffer"), name) KeyboardInterrupt self._send('PROBE', '') File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/joblib/externals/loky/backend/semaphore_tracker.py", line 134, in _send nbytes = os.write(self._fd, msg) KeyboardInterrupt self._wait_for_tstate_lock() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock elif lock.acquire(block, timeout): KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! /home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/threading.py:300: KeyboardInterrupt (to show a full traceback on KeyboardInterrupt use --fulltrace) Traceback (most recent call last): File "/home/rth/miniconda3/envs/sklearn-dev/bin/pytest", line 11, in sys.exit(main()) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/_pytest/config/__init__.py", line 74, in main return config.hook.pytest_cmdline_main(config=config) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/hooks.py", line 289, in __call__ return self._hookexec(self, self.get_hookimpls(), kwargs) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/manager.py", line 87, in _hookexec return self._inner_hookexec(hook, methods, kwargs) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/manager.py", line 81, in firstresult=hook.spec.opts.get("firstresult") if hook.spec else False, File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/callers.py", line 208, in _multicall return outcome.get_result() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/callers.py", line 80, in get_result raise ex[1].with_traceback(ex[2]) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/callers.py", line 187, in _multicall res = hook_impl.function(*args) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/_pytest/main.py", line 250, in pytest_cmdline_main return wrap_session(config, _main) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/_pytest/main.py", line 243, in wrap_session session=session, exitstatus=session.exitstatus File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/hooks.py", line 289, in __call__ return self._hookexec(self, self.get_hookimpls(), kwargs) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/manager.py", line 87, in _hookexec return self._inner_hookexec(hook, methods, kwargs) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/manager.py", line 81, in firstresult=hook.spec.opts.get("firstresult") if hook.spec else False, File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/callers.py", line 203, in _multicall gen.send(outcome) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/_pytest/terminal.py", line 649, in pytest_sessionfinish outcome.get_result() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/callers.py", line 80, in get_result raise ex[1].with_traceback(ex[2]) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/pluggy/callers.py", line 187, in _multicall res = hook_impl.function(*args) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/xdist/dsession.py", line 89, in pytest_sessionfinish nm.teardown_nodes() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/xdist/workermanage.py", line 78, in teardown_nodes self.group.terminate(self.EXIT_TIMEOUT) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/execnet/multi.py", line 208, in terminate for gw in self._gateways_to_join File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/execnet/multi.py", line 297, in safe_terminate reply.get() File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/execnet/gateway_base.py", line 190, in get self.waitfinish(timeout) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/site-packages/execnet/gateway_base.py", line 197, in waitfinish if not self._result_ready.wait(timeout): File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/threading.py", line 552, in wait signaled = self._cond.wait(timeout) File "/home/rth/miniconda3/envs/sklearn-dev/lib/python3.7/threading.py", line 296, in wait waiter.acquire() KeyboardInterrupt ```

the CPU is then fully loaded on all cores, and ~500 threads seem to be spawned by the test suite (12 test jobs x 24 hyper-threads would be 288).

Possibly related to https://github.com/tomMoral/loky/issues/101 that was fixed.

Running with OMP_NUM_THREADS=1 makes this problem go away. Please let me know if you need additional information.

Overall it looks more like over-subscription than a deadlock I think? Still might be another data point in the "default number of threads" discussion from https://github.com/scikit-learn/scikit-learn/issues/14265

tomMoral commented 5 years ago

If the issue goes away when setting OMP_NUM_THREADS=1, this is most likely an over-subscription issue. Another possibility is a bad iteraction between a fork and native threadpools states. To rule this out, an interesting test would be to try and set OMP_NUM_THREADS=2 to see if there is a deadlock or if it runs fine. Also, looking at htop and seeing if there is a lot of system calls (in red in the load report) would prove valuable.

tomMoral commented 5 years ago

Also, running the test suit of loky with pytest-xdist seems to work fine and no cause any deadlocks. image

rth commented 5 years ago

Running loky and joblib tests suites also works fine for me when using pytest-xdist.

Running pytest -n 2 sklearn/ still shows the same behavior. It is indeed oversubscription not a deadlock as the run eventually succeeds, it's just very slow,

I'm not sure what is so fundamentally different between running 1 or 2 test processes but the results are certainly unexpected.

rth commented 5 years ago

OK this seem to be due to OpenMP in the new gradient boosting implementation,

pytest sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py  -v -n 2

is enough to reproduce it. Will open an issue at scikit-learn instead.

rth commented 5 years ago

Continued in https://github.com/scikit-learn/scikit-learn/issues/15078