oneapi-src / oneTBB

oneAPI Threading Building Blocks (oneTBB)
https://oneapi-src.github.io/oneTBB/
Apache License 2.0

Oversubscription slowdown when run under container CPU quotas #190

Open ogrisel opened 4 years ago

ogrisel commented 4 years ago

When running an application in a Linux container environment (e.g. docker containers), the orchestrator configuration (kubernetes, docker compose/swarm) often sets CPU quotas via Linux cgroups to prevent one container from consuming all the CPU of the neighboring apps running on the same host.

However TBB does not seem to introspect /sys/fs/cgroup/cpu/cpu.cfs_quota_us / /sys/fs/cgroup/cpu/cpu.cfs_period_us to understand how many tasks it can run concurrently, resulting in significant slowdowns caused by over-subscription. As it is not possible to set a TBB_NUM_THREADS environment variable in the container deployment configuration, this makes it challenging to efficiently deploy TBB-enabled apps on docker-managed servers.
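For illustration, here is a minimal sketch of the kind of introspection being suggested, assuming the cgroup v1 paths mentioned above (the helper names are mine, and the files may be absent on non-cgroup or cgroup v2 systems):

```python
import math

def cfs_cpu_limit(quota_us, period_us):
    """CPUs' worth of runtime per period implied by a CFS quota.

    A non-positive quota (the kernel uses -1) means "unlimited", in
    which case None is returned; non-integer ratios are rounded up.
    """
    if quota_us <= 0 or period_us <= 0:
        return None
    return math.ceil(quota_us / period_us)

def read_cfs_limit():
    """Read the cgroup v1 CFS files mentioned above, if present."""
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
    except OSError:
        return None  # not running under cgroup v1 CPU controller
    return cfs_cpu_limit(quota, period)
```

With `docker run --cpus 2` (quota 200000 µs, period 100000 µs), `cfs_cpu_limit` yields 2, which is the concurrency level the reporter would like TBB to pick up by default.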

Here is a reproducing setup using numpy from the default anaconda channel on a host machine with 48 threads (24 physical cores):

$ cat oversubscribe_tbb.py
import numpy as np
from time import time

data = np.random.randn(1000, 1000)
print(f"Calling np.linalg.eig shape={data.shape}:",
      end=" ", flush=True)
tic = time()
np.linalg.eig(data)
print(f"{time() - tic:.3f}s")

$ docker run --cpus 2 -ti -v `pwd`:/io continuumio/miniconda3 bash
(base) # conda install -y numpy tbb
(base) # MKL_THREADING_LAYER=tbb python /io/oversubscribe_tbb.py     
Calling np.linalg.eig shape=(1000, 1000): 20.227s

By using a sequential execution or OpenMP with appropriately configured environment, the problem disappears:

(base) # MKL_THREADING_LAYER=sequential python /io/oversubscribe_tbb.py
Calling np.linalg.eig shape=(1000, 1000): 1.636s
(base) # MKL_THREADING_LAYER=omp OMP_NUM_THREADS=2 python /io/oversubscribe_tbb.py
Calling np.linalg.eig shape=(1000, 1000): 1.484s

Of course, if OpenMP is used without setting OMP_NUM_THREADS to match the docker CPU quota, one also gets a similar over-subscription problem as encountered with TBB:

(base) # MKL_THREADING_LAYER=omp python /io/oversubscribe_tbb.py     
Calling np.linalg.eig shape=(1000, 1000): 22.703s

Edit: the first version of this report mentioned MKL_THREADING_LAYER=omp instead of MKL_THREADING_LAYER=tbb in the first command (with duration 20.227s). I confirm that we also get 20s+ with MKL_THREADING_LAYER=tbb.

ogrisel commented 4 years ago

Libraries version numbers:

(base) # conda list mkl
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
mkl                       2019.4                      243  
mkl-service               2.3.0            py37he904b0f_0  
mkl_fft                   1.0.14           py37ha843d7b_0  
mkl_random                1.1.0            py37hd6b4f25_0  
(base) # conda list tbb
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
tbb                       2019.8               hfd86e86_0

xvallspl commented 4 years ago

Hi, we ran into the same problem several weeks ago.

Sadly, the IPC solution described in the PR referenced above is not an option for us in terms of performance, and we had to resort back to reading cgroup files (https://github.com/root-project/root/blob/a7495ae4f697f9bf285835f004af3f14f330b0eb/core/imt/src/TPoolManager.cxx#L32).

However, we find it hard to believe we are the only ones running into this problem, given how widespread the virtualization of hardware resources is nowadays. Is this something the TBB team is thinking of addressing in the near future? Do you see it as something to be figured out on the user side, or do you agree TBB should be taking care of it?

ogrisel commented 4 years ago

One possible solution is to launch your program on a restricted number of CPU cores using the taskset command to match the CPU quotas of the docker container. In this case TBB is able to introspect the affinity constraints and limit the number of threads it spawns to match the available hardware resources.
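For reference, the effect of such an affinity restriction can be observed from Python using a Linux-only API; this count is what affinity-based introspection like TBB's would see:

```python
import os

# os.sched_getaffinity honours taskset / sched_setaffinity masks,
# unlike os.cpu_count(), which reports all online CPUs (Linux only).
allowed_cpus = os.sched_getaffinity(0)
print(f"{len(allowed_cpus)} of {os.cpu_count()} CPUs usable")
```

Running this under e.g. `taskset -c 0,1 python script.py` reports 2 usable CPUs, which is why the taskset workaround keeps TBB's thread count in check.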

ogrisel commented 4 years ago

However I am not sure how to select the CPU ids to make sure that containers running on the same host do not use overlapping affinity masks.

xvallspl commented 4 years ago

One possible solution is to launch your program on a restricted number of CPU cores using the taskset command to match the CPU quotas of the docker container. In this case TBB is able to introspect the affinity constraints and limit the number of threads it spawns to match the available hardware resources.

Thanks, @ogrisel. Unfortunately, setting affinity masks is not something we are interested in except in exceptional cases (for instance, when dealing with NUMA domains). For our case, I'd rather read the cgroup files once.

In any case, I am still interested in an answer from the TBB developers on future support for this issue.

anton-malakhov commented 4 years ago

@tbbdev please pay attention

alexey-katranov commented 4 years ago

Can you clarify, please, how TBB can interpret CPU quotas? As I understand it, the mentioned quotas specify the share of CPU time available to a particular application. They do not set the number of threads or affinity constraints. So it is the OS's responsibility to schedule threads in accordance with the quota.

xvallspl commented 4 years ago

Hi, @alexey-katranov.

In this context, my suggestion would be to interpret the CPU quota/period ratio to decide the default number of threads. If you read a quota of two times the period, that indicates you can use up to 2 CPUs worth of runtime for that period, which would map to two threads assuming they are busy 100% of the time. You could round up real numbers to the nearest integer.

A common example nowadays: You have 2 containers sharing a node with 8 cores, both of them running multithreaded workloads. These containers aren't pinned to particular CPUs but are assigned a fraction of the bandwidth of the machine, using CFS Bandwidth Control (for instance, launching the containers with Docker's --cpus option).

The scheduler implements this by running all threads of a control group for a fraction (cfs_quota) of an execution period (cfs_period); once they have used up the quota, it stops them. They remain stopped until the start of the next cycle. The problem arises when, in multithreaded workloads, the quota/period ratio is lower than the number of logical cores (TBB's default). For instance, if on this 8 core machine the cfs_quota is 800ms and the period is 200ms, we get a quota/period ratio of 4. This means we can use up to 4 CPUs' worth of runtime. To use this quota efficiently, it would be great if we spawned 4 threads in each container, but TBB will spawn 8. In this case, each thread will run on a different CPU, but only for 50% of an execution period, as we can only use up to 4 CPUs' worth of runtime. After half of the period, all 8 threads will be yanked out of execution by the operating system and put to wait until they can run again during the next period. These context switches turn out to be very costly.

Just scale the above example up to a machine with 100 logical cores, where each container gets assigned two CPUs' worth of quota. Now 100 threads are allowed to use only as much CPU as two threads at full speed. That means they constantly have to be switched out and in, and they spend 98% of the cfs_period waiting. That's not even counting the overhead of context-switching 100 threads.

xvallspl commented 4 years ago

Hi, @alexey-katranov

Didn't get an answer on future support for this issue :)

jeremyong commented 4 years ago

You have tbb::global_control, so why not observe the quota yourself, pass the desired concurrency limit as a command line argument/environment variable and just set it yourself?

alexey-katranov commented 4 years ago

@jeremyong, as I understand it, the issue is about deploying an existing TBB-based application, so recompiling the application is not an option. @xvallspl, I am thinking about the downside that TBB cannot utilize the system if other containers do not use their quotas. E.g., if we have two containers with equal quotas but only one of them needs parallelism (e.g. with TBB), we will get only about half of the possible utilization. Is that expected?

ogrisel commented 4 years ago

You have tbb::global_control, so why not observe the quota yourself, pass the desired concurrency limit as a command line argument/environment variable and just set it yourself?

Would it be possible to have an environment variable to set the equivalent of tbb::global_control? This would make it possible for sysadmin / devops in charge of the kubernetes / docker deployment configuration to tune their services for their production infrastructure without having to change the code of the applications they deploy.
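As a hedged sketch of what such a deployment-side control could look like today without code changes: a small launcher script (entirely hypothetical, not an existing tool) that derives a thread count from the cgroup v1 quota and exports the variables that MKL's OpenMP layer already honours. TBB itself has no equivalent variable, which is precisely the gap this request is about:

```python
import math
import os

def quota_threads(default):
    """Thread count implied by the cgroup v1 CFS files, falling back
    to `default` when no quota is configured or the files are absent."""
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
    except OSError:
        return default
    if quota <= 0 or period <= 0:
        return default  # -1 means "no limit"
    return max(1, math.ceil(quota / period))

def launch(argv):
    """Export thread-count variables honoured by OpenMP/MKL, then
    replace this process with the real application (illustrative)."""
    n = quota_threads(os.cpu_count())
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)
    os.execvp(argv[0], argv)  # does not return on success
```

A sysadmin could prepend such a launcher in the container entrypoint; the point of the feature request above is that TBB would need to expose an environment hook for this to work for the TBB threading layer as well.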

ogrisel commented 4 years ago

@xvallspl, I am thinking about the downside that TBB cannot utilize the system if other containers do not use their quotas. E.g., if we have two containers with equal quotas but only one of them needs parallelism (e.g. with TBB), we will get only about half of the possible utilization. Is that expected?

If you have 2 apps (A & B) deployed in containers, and each app/container has a quota of 50% of the host's CPUs, app A will never be able to use more than 50% of the CPUs, even when B is idle.

So I don't see how the solution proposed by @xvallspl could be detrimental.

alexey-katranov commented 4 years ago

app A will never be able to use more than 50% of the CPUs even if B is idle in any case

Is it desired behavior that an application cannot utilize more than 50% of the CPU? At first glance, it seems inefficient.

ogrisel commented 4 years ago

Is it desired behavior that an application cannot utilize more than 50% of the CPU? At first glance, it seems inefficient.

Yes, because they can be different services running concurrently in different containers on the same host, and you don't want one service to be able to degrade the performance of another.

In compute-intensive scenarios, you could have many spark / dask workers, each allocated 4 CPU threads via CFS quotas, running on a cluster of big machines with 48 physical cores each. The containers would be scheduled by a spark or dask orchestrator that talks to kubernetes to dynamically allocate more or fewer workers based on the cluster-wide load (pending tasks).

So it's the job of the spark / dask scheduler to allocate and release resources efficiently, and of kubernetes to pack pods / docker containers onto the physical machines (and possibly to talk to the underlying cloud infrastructure to dynamically provision or release machines in the cluster). But the workers should not trigger over-subscription by each trying to use 48 threads when they are only allowed 4 CPUs each.

alexey-katranov commented 4 years ago

Yes, because they can be different services running concurrently in different containers on the same host, and you don't want one service to be able to degrade the performance of another.

Sure, I do not want one service to degrade the performance of another. If each service can utilize its quota, that is OK, but my concern is: what if one service needs more while another service does not need all of its resources? In that case, can the OS temporarily increase the quota allocation for the first service? Or can the quota mechanism lead to system underutilization if some process does not use its quota?

xvallspl commented 4 years ago

@alexey-katranov The problem you are describing has to be solved when configuring the quota. Once the quota has been set, your service won't be able to exceed it, regardless of how many processors it runs on.

As @ogrisel pointed out, this is a desired, intended behaviour. It can lead to resource underutilization, but that's not the point. The idea behind CPU bandwidth control is to limit the amount of resources a task or a group of tasks can consume.

Sorry for the late reply, I missed this conversation.

ogrisel commented 3 years ago

For reference, loky (an alternative to concurrent.futures from the Python standard library) is CPU-quota aware using the following logic:

https://github.com/joblib/loky/blob/f15594a44420abfa9b398be7ff3c9180c2858bf4/loky/backend/context.py#L181-L195
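In the same spirit as the linked loky code (this is a re-sketch under my own names, not a verbatim copy), a quota-aware CPU count can combine the affinity mask and the CFS ratio, taking the smaller of the two:

```python
import math
import os

def cpu_count_quota_aware():
    """Approximate a usable-CPU count the way quota-aware libraries do:
    start from the affinity-restricted count, then cap it by the CFS
    quota/period ratio if a cgroup v1 bandwidth limit is configured."""
    # Affinity-restricted count (taskset, cpusets); Linux only.
    try:
        count = len(os.sched_getaffinity(0))
    except AttributeError:
        count = os.cpu_count() or 1
    # CFS bandwidth limit, if any (cgroup v1 paths; -1 = unlimited).
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0 and period > 0:
            count = min(count, math.ceil(quota / period))
    except OSError:
        pass
    return max(1, count)
```

This is essentially what the issue asks TBB to do internally when choosing its default concurrency.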

arunparkugan commented 3 weeks ago

@alexey-katranov is this issue still relevant?