Open rth opened 3 years ago
By contrast, an equivalent example with PyTorch 1.9 works as expected:
import torch
from threadpoolctl import threadpool_limits

# Limit all thread pools detected by threadpoolctl to a single thread
with threadpool_limits(limits=1):
    X = torch.randn(10000, 10000)
    torch.matmul(X, X)
TF relies on MKL for this operation, right? If they have their own tool to detect the number of CPU cores and explicitly set it as the number of threads for MKL, that setting will take priority over threadpoolctl and there's nothing we can do about it. It would also explain why the env var doesn't work. But I really don't know what they actually do in TF, so maybe there's another reason.
They have multiple build options: https://www.tensorflow.org/install/source#optimizations
One is to use oneDNN ( https://github.com/oneapi-src/oneDNN ), which is part of MKL and also seems to support various threading runtimes: https://github.com/oneapi-src/oneDNN#linux
What is the output of the following command?
python -m threadpoolctl --import tensorflow
The output is:
[
  {
    "filepath": "<...>/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-09e95953.3.13.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.13",
    "num_threads": 24,
    "threading_layer": "pthreads"
  },
  {
    "filepath": "<...>/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 24,
    "threading_layer": "pthreads"
  }
]
so I imagine it's additionally using some other thread system that's not being detected.
These are the OpenBLAS libraries used by NumPy and SciPy. TensorFlow is probably using a linear algebra library of its own (e.g. Eigen?) whose threading layer is not handled by threadpoolctl.
Also, threadpoolctl cannot detect statically linked libraries, only dynamically linked libraries.
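One way to check whether a given library is visible to threadpoolctl is to look at the `internal_api` entries in its report. The sketch below works on a plain list of dicts shaped like the output quoted in this thread (abridged); the helper names `detected_apis` and `is_controlled` are made up for illustration and are not part of threadpoolctl.

```python
def detected_apis(pools):
    """Return the set of internal_api names found in a threadpoolctl-style report."""
    return {pool["internal_api"] for pool in pools}


def is_controlled(pools, internal_api):
    """True if at least one detected library exposes the given internal API."""
    return internal_api in detected_apis(pools)


# Abridged copy of the report quoted in this thread (filepaths omitted)
pools = [
    {"prefix": "libopenblas", "user_api": "blas", "internal_api": "openblas",
     "version": "0.3.13", "num_threads": 24, "threading_layer": "pthreads"},
    {"prefix": "libopenblas", "user_api": "blas", "internal_api": "openblas",
     "version": "0.3.9", "num_threads": 24, "threading_layer": "pthreads"},
]

print(detected_apis(pools))         # only the OpenBLAS copies are listed
print(is_controlled(pools, "mkl"))  # nothing MKL-based was detected
```

With threadpoolctl installed, the live report can be obtained with `threadpool_info()` after importing the library of interest and passed to the same helpers. A library missing from that list will not be affected by `threadpool_limits`.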
As far as I can tell, limiting the number of threads in TensorFlow with threadpoolctl currently doesn't work.
For instance, take the following minimal example with TensorFlow 2.5.0, example.py
running,
on a 64-core CPU, produces,
so the user (CPU) time is still >> the real (wall-clock) run time, meaning that many CPU cores are used.
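The same user-vs-real comparison can be made from inside Python. This is a rough sketch using only the standard library; `cpu_to_wall_ratio` and `busy_loop` are hypothetical names, and the pure-Python loop is only a single-threaded stand-in for the actual workload. Since `time.process_time()` sums CPU time over all threads of the current process, a ratio well above 1 suggests that several cores were busy at once. It does not count child processes.

```python
import time


def cpu_to_wall_ratio(func, *args, **kwargs):
    """Run func and return (result, CPU seconds / wall-clock seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time, all threads
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    ratio = cpu / wall if wall > 0 else float("nan")
    return result, ratio


def busy_loop(n):
    """Single-threaded placeholder workload."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, ratio = cpu_to_wall_ratio(busy_loop, 2_000_000)
print(f"CPU/wall ratio: {ratio:.2f}")
```

Wrapping a model's fit or predict call in `cpu_to_wall_ratio`, once inside and once outside a `threadpool_limits` block, would show whether the limit had any effect.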
This becomes an issue if people run scikit-learn's GridSearchCV or cross_validate on a Keras or TensorFlow model, since it then results in CPU over-subscription. I'm surprised there aren't more issues about this in scikit-learn.
TensorFlow also regrettably doesn't recognize any environment variables to limit the number of CPU cores. The only way around it I found is to set the CPU affinity mask with taskset. But then again, that wouldn't help for cross-validation, for instance, since joblib would then need to set the affinity mask when creating new processes, which is currently not supported.
Has anyone looked into this in the past by any chance?
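For reference, the affinity mask that taskset sets can also be set from within Python on Linux via `os.sched_setaffinity`. The sketch below is a minimal illustration, not a tested fix for the joblib case: `limit_cpu_affinity` is a hypothetical helper, and it assumes it is called early (before TensorFlow is imported and creates its threads), since threads started later inherit the mask of the thread that creates them.

```python
import os


def limit_cpu_affinity(n_cpus):
    """Restrict this process to at most n_cpus of its currently allowed CPUs.

    Returns the chosen CPU set, or None where the call is unavailable
    (os.sched_setaffinity only exists on some platforms, e.g. Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = sorted(os.sched_getaffinity(0))  # 0 means the caller itself
    chosen = set(allowed[:max(1, n_cpus)])
    os.sched_setaffinity(0, chosen)
    return chosen


chosen = limit_cpu_affinity(1)
print(chosen)
```

Whether such a call can be hooked into the start-up of joblib worker processes is exactly the open question raised above, so this only covers the single-process case.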