joblib / threadpoolctl

Python helpers to limit the number of threads used in native libraries that handle their own internal threadpool (BLAS and OpenMP implementations)
BSD 3-Clause "New" or "Revised" License

Limiting threads in TensorFlow #84

Open rth opened 3 years ago

rth commented 3 years ago

As far as I can tell, limiting the number of threads in TensorFlow with threadpoolctl currently doesn't work.

For instance, take the following minimal example with TensorFlow 2.5.0, saved as example.py:

import tensorflow as tf
import numpy as np

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    X = tf.constant(np.arange(0, 5000**2, dtype=np.int32), shape=(5000, 5000))

    tf.matmul(X, X)

running,

time python example.py

on a 64 cores CPU, produces,

real    0m3.781s
user    1m8.685s

so the user (CPU) time is still much larger than the real (wall-clock) run time, meaning that many CPUs are used.
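
The same comparison can be made from inside Python. The following is a minimal sketch, not part of threadpoolctl: the helper name cpu_to_wall_ratio is hypothetical, and it relies only on the standard library's time.process_time (user plus system CPU time of the process) and time.perf_counter (wall-clock time). A ratio well above 1 suggests that several cores were busy at once; with the numbers above, 68.685 s of user time over 3.781 s of real time gives a ratio of roughly 18.

```python
import time


def cpu_to_wall_ratio(fn, *args, **kwargs):
    """Run fn and return (cpu_seconds / wall_seconds, result of fn).

    Hypothetical helper for illustration only.
    """
    wall_start = time.perf_counter()
    # process_time sums user and system CPU time over all threads of this process
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    # Guard against a zero wall-clock interval for extremely short calls
    return cpu_elapsed / max(wall_elapsed, 1e-9), result


if __name__ == "__main__":
    ratio, _ = cpu_to_wall_ratio(sum, range(10_000_000))
    print(f"CPU time / wall time: {ratio:.2f}")
```

Wrapping the tf.matmul call from example.py in this helper, inside the threadpool_limits block, would be one way to check whether the limit had any effect without relying on the shell's time command.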

This becomes an issue if people run scikit-learn's GridSearchCV or cross_validate on a Keras or TensorFlow model, since it then results in CPU over-subscription. I'm surprised there aren't more issues about it at scikit-learn.

TensorFlow regrettably doesn't recognize any environment variable to limit the number of CPU cores either. The only workaround I found is to set the CPU affinity mask with taskset. But that wouldn't help for cross-validation, for instance, since joblib would then need to set the affinity mask when creating new processes, which is currently not supported.
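
For reference, the taskset workaround has a rough in-process equivalent in the standard library. This is a minimal sketch under stated assumptions: the context manager name cpu_affinity is hypothetical, os.sched_setaffinity and os.sched_getaffinity only exist on some platforms (notably Linux), and on Linux the mask applies to the calling thread and is inherited by threads and processes created afterwards, so threads that a library has already started keep their previous mask.

```python
import os
from contextlib import contextmanager


@contextmanager
def cpu_affinity(cpus):
    """Temporarily restrict the caller to the given CPU ids.

    Hypothetical helper for illustration only. Yields the active mask,
    or None on platforms without os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        # Not available here (e.g. macOS, Windows): leave affinity untouched
        yield None
        return
    previous = os.sched_getaffinity(0)  # 0 refers to the caller
    os.sched_setaffinity(0, cpus)
    try:
        yield os.sched_getaffinity(0)
    finally:
        # Always restore the original mask, even if the body raised
        os.sched_setaffinity(0, previous)


if __name__ == "__main__":
    with cpu_affinity({0}) as active:
        print("active CPU mask:", active)
```

Because already running threads are not affected, such a mask would presumably need to be set before TensorFlow creates its thread pool, which is also what launching the script under taskset achieves.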

Has anyone looked into this in the past by any chance?

rth commented 3 years ago

In contrast, an equivalent example with PyTorch 1.9 works as expected:

import torch

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    X = torch.randn(10000, 10000)
    torch.matmul(X, X)

jeremiedbb commented 3 years ago

TF relies on MKL for this operation, right? If they have their own tool to detect the number of CPU cores and explicitly set it as the number of threads for MKL, it will take priority over threadpoolctl and there's nothing we can do about it. It would also explain why the env var doesn't work. But I really don't know what they actually do in TF, so maybe there's another reason.

rth commented 3 years ago

They have multiple build options: https://www.tensorflow.org/install/source#optimizations. One is to use https://github.com/oneapi-src/oneDNN, which is part of MKL and also seems to support various threading runtimes: https://github.com/oneapi-src/oneDNN#linux

ogrisel commented 3 years ago

What is the output of the following command?

python -m threadpoolctl --import tensorflow

rth commented 3 years ago

The output is,

[
  {
    "filepath": "<...>/lib/python3.8/site-packages/numpy.libs/libopenblasp-r0-09e95953.3.13.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.13",
    "num_threads": 24,
    "threading_layer": "pthreads"
  },
  {
    "filepath": "<...>/lib/python3.8/site-packages/scipy.libs/libopenblasp-r0-085ca80a.3.9.so",
    "prefix": "libopenblas",
    "user_api": "blas",
    "internal_api": "openblas",
    "version": "0.3.9",
    "num_threads": 24,
    "threading_layer": "pthreads"
  }
]

so I imagine it's additionally using some other thread system that's not being detected.

ogrisel commented 3 years ago

These are the OpenBLAS libraries used by NumPy and SciPy. TensorFlow is probably using a linear algebra library of its own (e.g. Eigen?) whose threading layer is not handled by threadpoolctl.
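
To make that reading of the output concrete, here is a minimal sketch that groups detected libraries by user_api. The helper name summarize_threadpools is hypothetical, and sample_info is a trimmed copy of the entries reported above (file paths omitted). The only APIs listed are the two OpenBLAS builds; nothing attributable to TensorFlow itself shows up.

```python
def summarize_threadpools(info):
    """Map each user_api to a list of (internal_api, num_threads) pairs.

    Hypothetical helper for illustration only. `info` is a list of dicts
    shaped like the entries printed by `python -m threadpoolctl`.
    """
    summary = {}
    for module in info:
        summary.setdefault(module["user_api"], []).append(
            (module["internal_api"], module["num_threads"])
        )
    return summary


# Trimmed copy of the two entries reported in this thread
sample_info = [
    {"user_api": "blas", "internal_api": "openblas", "version": "0.3.13",
     "num_threads": 24, "threading_layer": "pthreads"},
    {"user_api": "blas", "internal_api": "openblas", "version": "0.3.9",
     "num_threads": 24, "threading_layer": "pthreads"},
]

if __name__ == "__main__":
    print(summarize_threadpools(sample_info))
    # {'blas': [('openblas', 24), ('openblas', 24)]}
```

On a real installation, the list returned by threadpoolctl.threadpool_info() could be passed in place of sample_info, after importing the library under investigation.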

ogrisel commented 3 years ago

Also, threadpoolctl cannot detect statically linked libraries, only dynamically linked libraries.
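
As a rough illustration of that distinction, the sketch below lists the shared objects currently mapped into the Python process by reading /proc/self/maps. It is an assumption-laden example: the helper name loaded_shared_libraries is hypothetical, the approach is Linux-only, and it is not necessarily how threadpoolctl performs its own detection. A library that is statically linked into another extension module has no file of its own, so it would not appear as a separate entry in such a listing.

```python
import os


def loaded_shared_libraries():
    """Return sorted paths of shared objects mapped into this process.

    Hypothetical helper for illustration only. Returns an empty list on
    systems without /proc/self/maps.
    """
    maps_path = "/proc/self/maps"
    if not os.path.exists(maps_path):
        return []
    paths = set()
    with open(maps_path) as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, [pathname]
            fields = line.split(None, 5)
            if len(fields) == 6:
                path = fields[5].strip()
                if ".so" in os.path.basename(path):
                    paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    for path in loaded_shared_libraries():
        print(path)
```

Running this after importing tensorflow and filtering for names such as libopenblas, libgomp, libiomp or libmkl would be one way to see which threading runtimes are present as separate dynamic libraries.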