rth closed this issue 5 years ago
If the issue goes away when setting `OMP_NUM_THREADS=1`, this is most likely an over-subscription issue. Another possibility is a bad interaction between a fork and the state of native thread pools. To rule this out, an interesting test would be to set `OMP_NUM_THREADS=2` and see whether there is a deadlock or whether it runs fine.
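A minimal sketch of how the `OMP_NUM_THREADS` experiment above could be scripted. The helper names `env_with_omp_threads` and `run_with_omp_threads` are made up for illustration; the only assumption is that the OpenMP runtime reads `OMP_NUM_THREADS` from the environment of the process it starts in.

```python
import os
import subprocess


def env_with_omp_threads(n_threads, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set.

    The original mapping is left untouched so that successive runs
    do not leak settings into each other.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_with_omp_threads(cmd, n_threads):
    """Run cmd (a list of strings) with the given OpenMP thread cap."""
    return subprocess.run(cmd, env=env_with_omp_threads(n_threads))


# Example (not executed here): compare the two settings discussed above.
# for n in (1, 2):
#     run_with_omp_threads(["pytest", "-n", "2", "sklearn/"], n)
```

Passing the variable through the child process environment, rather than setting it after import, avoids the question of whether the native thread pool was already initialized.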
Also, looking at `htop` and seeing whether a lot of time is spent in system calls (shown in red in the load report) would prove valuable.
Also, running the test suite of `loky` with `pytest-xdist` seems to work fine and does not cause any deadlocks.
Running the loky and joblib test suites also works fine for me when using pytest-xdist.
Running `pytest -n 2 sklearn/` still shows the same behavior. It is indeed oversubscription, not a deadlock, as the run eventually succeeds; it's just very slow:

- `pytest sklearn`: 3.6 min
- `OMP_NUM_THREADS=1 pytest sklearn`: 3.4 min
- `pytest sklearn -n 2`: 18 min
- `OMP_NUM_THREADS=1 pytest sklearn -n 2`: 1.8 min
- `OMP_NUM_THREADS=2 pytest sklearn -n 2`: 1.8 min

I'm not sure what is so fundamentally different between running 1 or 2 test processes, but the results are certainly unexpected.
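To put the timings above side by side, here is a small sketch that expresses each run relative to plain `pytest sklearn`. The dictionary and the `relative_to` helper are only for illustration; the numbers are copied from the measurements above.

```python
# Wall-clock times in minutes, copied from the measurements above.
timings_min = {
    "pytest sklearn": 3.6,
    "OMP_NUM_THREADS=1 pytest sklearn": 3.4,
    "pytest sklearn -n 2": 18.0,
    "OMP_NUM_THREADS=1 pytest sklearn -n 2": 1.8,
    "OMP_NUM_THREADS=2 pytest sklearn -n 2": 1.8,
}


def relative_to(baseline, timings):
    """Return each run's time divided by the baseline run's time."""
    base = timings[baseline]
    return {cmd: t / base for cmd, t in timings.items()}


ratios = relative_to("pytest sklearn", timings_min)
for cmd, ratio in ratios.items():
    print(f"{ratio:5.2f}x  {cmd}")
```

Read this way, the unrestricted two-worker run takes about five times as long as the serial baseline, while either thread cap brings the two-worker run down to about half of it.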
OK, this seems to be due to OpenMP in the new gradient boosting implementation: `pytest sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py -v -n 2` is enough to reproduce it. I will open an issue at scikit-learn instead.
With scikit-learn master, joblib 0.13.2 and Python 3.7 on a 12-core CPU, I get a deadlock in loky (or maybe it's just heavy oversubscription, I'm not sure) when using `pytest-xdist` to run tests in parallel. The CPU is then fully loaded on all cores, and ~500 threads seem to be spawned by the test suite (12 test jobs x 24 hyper-threads would be 288).
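The thread arithmetic above can be sketched as follows, along with one possible way to derive a per-worker cap. This is only an illustration, not what scikit-learn or loky actually do: the function names are hypothetical, and it assumes that pytest-xdist exposes the worker count to each worker through the `PYTEST_XDIST_WORKER_COUNT` environment variable.

```python
import os


def total_threads(n_workers, threads_per_worker):
    """Threads spawned if every worker starts its own full thread pool."""
    return n_workers * threads_per_worker


def threads_per_worker(n_cpus, n_workers):
    """Split the available CPUs evenly between workers, at least 1 each."""
    return max(1, n_cpus // n_workers)


def omp_limit_for_xdist(environ=None, n_cpus=None):
    """Suggest an OpenMP thread cap for the current pytest-xdist worker.

    Assumption: PYTEST_XDIST_WORKER_COUNT is set by pytest-xdist in
    worker processes; when it is absent, a single worker is assumed.
    """
    environ = os.environ if environ is None else environ
    n_cpus = (os.cpu_count() or 1) if n_cpus is None else n_cpus
    n_workers = int(environ.get("PYTEST_XDIST_WORKER_COUNT", "1"))
    return threads_per_worker(n_cpus, n_workers)


# 12 test jobs, each free to use all 24 hyper-threads.
print(total_threads(12, 24))
# With an even split, each of the 12 workers would get 2 threads.
print(threads_per_worker(24, 12))
```

With an even split, the total number of compute threads stays at or below the number of hardware threads, which is the condition that over-subscription violates.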
Possibly related to https://github.com/tomMoral/loky/issues/101 that was fixed.
Running with `OMP_NUM_THREADS=1` makes this problem go away. Please let me know if you need additional information.

Overall it looks more like over-subscription than a deadlock, I think? Still, it might be another data point in the "default number of threads" discussion from https://github.com/scikit-learn/scikit-learn/issues/14265