pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Excessively high CPU usage in small multithreaded CPU ops #80777

Closed zplizzi closed 1 year ago

zplizzi commented 2 years ago

šŸ› Describe the bug

When running the following code sample, the process CPU usage (measured with htop) is <5% with torch.set_num_threads(1), ~300% with torch.set_num_threads(4), and ~3000% with the default number of threads on my machine (torch.get_num_threads() == 32). The same operation in numpy consumes <5% CPU. I would expect <5% CPU usage in the default configuration, without having to manually set the number of threads to 1 (which would presumably harm other operations that do benefit from more threads). Torch should detect that this is a small operation that won't benefit from multithreading.

import torch
import time
torch.set_num_threads(4)  # ~300% CPU here; with the default of 32 threads, ~3000%

a = torch.zeros((4, 96, 96))
b = torch.zeros((1, 4, 96, 96))
while True:
    b[0] = a          # a tiny copy, yet CPU usage scales with the thread count
    time.sleep(.001)
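
For comparison, a sketch of the numpy equivalent referred to above (a reconstruction of the comparison, not the original benchmark code); numpy performs this copy in the calling thread, so CPU usage stays below 5%:

import numpy as np
import time

a = np.zeros((4, 96, 96))
b = np.zeros((1, 4, 96, 96))
while True:
    b[0] = a          # same small copy, but no OpenMP worker threads are involved
    time.sleep(.001)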

Versions

Collecting environment information...
PyTorch version: 1.12.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.5 (default, Nov 23 2021, 15:27:38) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.13.0-30-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090

Nvidia driver version: 470.103.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] geotorch==0.2.0
[pip3] mypy==0.931
[pip3] mypy-boto3-ec2==1.17.41.0
[pip3] mypy-boto3-s3==1.17.41.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.2
[pip3] pytorch-lightning==1.2.6
[pip3] torch==1.12.0+cu113
[pip3] torchaudio==0.12.0+cu113
[pip3] torchmetrics==0.7.0
[pip3] torchvision==0.13.0+cu113
[conda] Could not collect

cc @VitalyFedyunin @ngimel

vadimkantorov commented 2 years ago

Maybe related: https://github.com/pytorch/pytorch/issues/42959

jingxu10 commented 1 year ago

This is because the workload is too light and there is a sleep call between iterations, so the OpenMP worker threads spend that time spin-waiting. You can either remove the sleep or set an environment variable to disable the GOMP spin wait: export OMP_WAIT_POLICY=PASSIVE
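
If you prefer to set this from Python rather than from the shell, here is a minimal sketch (assumption: the variable has to be in the environment before the OpenMP runtime initializes, i.e. before torch is imported):

import os

# Must be set before the OpenMP runtime starts, i.e. before importing torch.
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

import torch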

zplizzi commented 1 year ago

Ah, thanks @jingxu10, I understand what's happening.

From here:

> Note that the default behavior is implementation defined, for libgomp, it is documented to be active for a certain amount of time, and then switch to passive. This time can be tuned via GOMP_SPINCOUNT. If you see decremental performance with passive, try to use a lower value of GOMP_SPINCOUNT instead.

So in the case of the example I gave above, 32 threads are started, quickly perform the work, and then enter a period of "active waiting" (spinning) in which they consume a lot of CPU. Normally they would eventually switch to passive waiting and CPU usage would drop back to normal, but in my example the 1 ms sleep between iterations is evidently shorter than the spin timeout, so new work always arrives before the threads stop spinning and they stay in "active waiting" mode.

I tried the suggested export OMP_WAIT_POLICY=PASSIVE, which did fix the issue in this example. However, it increased the runtime by ~20% when using 32 threads, compared to either 32 actively-waiting threads or just 1 thread.
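
The quoted text also suggests lowering GOMP_SPINCOUNT as a middle ground: keep active waiting, but make idle threads back off sooner. A sketch (the value 1000 is only illustrative, not a tuned recommendation):

import os

# Cap the spin iterations so idle OpenMP threads go to sleep sooner.
# Like OMP_WAIT_POLICY, this must be set before torch is imported.
os.environ["GOMP_SPINCOUNT"] = "1000"

import torch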

Feel free to close this issue if there's no further desire to improve/document this edge case.

zplizzi commented 1 year ago

Oh, I remembered why this is a real issue. Imagine you have something similar to the above example, but happening in a number of parallel processes at the same time. Each process will grab 32 threads, even though it's just doing some tiny operation, and all 32 threads will spend some time busy-waiting. So if each process is doing a lot of small, fast CPU ops, quickly the entire machine's CPU is saturated and the whole thing bogs down, even though the total amount of real computation happening is negligible.

Basically this makes running many small CPU ops in parallel extremely slow in certain situations.

jgong5 commented 1 year ago

> Oh, I remembered why this is a real issue. Imagine you have something similar to the above example, but happening in a number of parallel processes at the same time. Each process will grab 32 threads, even though it's just doing some tiny operation, and all 32 threads will spend some time busy-waiting. So if each process is doing a lot of small, fast CPU ops, quickly the entire machine's CPU is saturated and the whole thing bogs down, even though the total amount of real computation happening is negligible.
>
> Basically this makes running many small CPU ops in parallel extremely slow in certain situations.

Yes, things get worse in multi-tenant situations. But thread oversubscription would happen anyway, with or without the busy spin, if you let every process use all of the machine's cores at the same time. Usually it is recommended to allocate CPU cores to each process explicitly, e.g. via numactl or taskset, to avoid oversubscribing the cores.
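
For example, a minimal sketch of doing this from Python on Linux (the core counts and the helper name are illustrative assumptions, not an existing PyTorch API):

import os
import torch

def pin_worker(worker_idx, cores_per_worker=8):
    # Restrict this process to its own slice of cores (Linux-only API) ...
    first = worker_idx * cores_per_worker
    os.sched_setaffinity(0, range(first, first + cores_per_worker))
    # ... and size torch's intra-op thread pool to match, so N worker
    # processes do not each spawn a full machine's worth of OpenMP threads.
    torch.set_num_threads(cores_per_worker)

The shell equivalent would be something like taskset -c 0-7 python worker.py for the first worker.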