llvm / llvm-project


[OpenMP] excessive power consumption for waiting threads #78485

Open marioroy opened 8 months ago

marioroy commented 8 months ago

Re-posting from https://forums.developer.nvidia.com/t/openmp-excessive-power-consumption-for-waiting-threads/279272

"The OpenMP power consumption test is with the -p argument to primes1 or primes3, which involves ordered output, or one thread writing output at a time. Other threads wait their turn, in order. I expect the waiting threads to be idle or to show low CPU utilization. That is not the case: I am seeing full 6400% CPU utilization (AMD Threadripper 3970X - 64 logical CPU threads) for printing prime numbers to /dev/null. Nothing like GNU GCC consuming just 173% for the same test."

I also see near 6400% CPU utilization using clang for the power consumption test, during ordered output.

Prime Demos

gcc -o primes1.gcc -O3 -fopenmp -I../src primes1.c -lm
clang -o primes1.clang -O3 -fopenmp -I../src primes1.c -lm
nvc -o primes1.nvc -O3 -mp=multicore -I../src primes1.c -lm

gcc -o primes3.gcc -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
clang -o primes3.clang -O3 -fopenmp -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm
nvc -o primes3.nvc -O3 -mp=multicore -I../src primes3.c -L/usr/local/lib64 -lprimesieve -lm

OpenMP Ordered Power Consumption Test

Threadripper 3970X idle (browser NV forums page)  120 watts

./primes1.gcc   1e10 -p >/dev/null   10.173 secs, 201 watts
./primes1.clang 1e10 -p >/dev/null   12.729 secs, 288 watts
./primes1.nvc   1e10 -p >/dev/null   21.346 secs, 322 watts

./primes3.gcc   1e10 -p >/dev/null    7.092 secs, 181 watts
./primes3.clang 1e10 -p >/dev/null    8.876 secs, 274 watts
./primes3.nvc   1e10 -p >/dev/null   11.080 secs, 361 watts

OpenMP Performance Test

Threadripper 3970X idle (browser NV forums page)  120 watts

./primes1.gcc   1e12                 16.168 secs, 399 watts
./primes1.clang 1e12                 16.274 secs, 395 watts
./primes1.nvc   1e12                 14.780 secs, 393 watts

./primes3.gcc   1e12                  5.762 secs, 437 watts
./primes3.clang 1e12                  6.277 secs, 434 watts
./primes3.nvc   1e12                  5.755 secs, 442 watts

I first witnessed the power consumption issue using Codon.

https://github.com/exaloop/codon/issues/456

Is it okay for waiting threads to spin the CPU during ordered or exclusive blocks? I wonder about cloud customers possibly paying for extra power consumption simply because threads are waiting their turn. The Intel oneAPI compilers are also impacted.

llvmbot commented 8 months ago

@llvm/issue-subscribers-openmp

ye-luo commented 8 months ago

Did you explore OMP_WAIT_POLICY?

marioroy commented 8 months ago

Just now. Thank you for the suggestion. I'm unable to see any difference in power consumption for clang or nvc.

OMP_WAIT_POLICY=passive ./primes1.clang 1e10 -p >/dev/null
OMP_WAIT_POLICY=passive ./primes1.nvc 1e10 -p >/dev/null

Still seeing near 6400% CPU utilization versus less than 200% running primes1.gcc.

marioroy commented 8 months ago

Interestingly, primes1.gcc (GNU gcc) supports OMP_WAIT_POLICY, and I can see both active and passive (default) working.

OMP_WAIT_POLICY=active  ./primes1.gcc 1e10 -p >/dev/null  6400% CPU utilization
OMP_WAIT_POLICY=passive ./primes1.gcc 1e10 -p >/dev/null  <200% CPU utilization

mjklemm commented 8 months ago

This is by design. The OpenMP threads spin-wait because that makes them much more responsive: they wake up more quickly when they are needed again. OMP_WAIT_POLICY=passive means that the threads go into a deep-sleep mode from which they have to be woken via an OS signal, which has much higher latency.

There's a default timeout after which threads that are spin-waiting go into that deep-sleep state. The default is about 200ms, but you should be able to change that via KMP_BLOCKTIME=50, which would set the spin-wait timeout to about 50ms.

marioroy commented 8 months ago

Thank you. Unfortunately, using clang/clang++, I'm unable to see a difference when setting OMP_WAIT_POLICY=passive. Top reports near 6400% CPU utilization, which equates to high power consumption. I'm grateful for the ability to set passive, but it does not seem to work with clang.

mjklemm commented 8 months ago

Hm, OK. Please try the explicit forms:

OMP_WAIT_POLICY: Decides whether threads spin (active) or yield (passive) while they are waiting. OMP_WAIT_POLICY=active is an alias for KMP_LIBRARY=turnaround, and OMP_WAIT_POLICY=passive is an alias for KMP_LIBRARY=throughput.

Does that change things?

marioroy commented 8 months ago

No change.

KMP_LIBRARY=turnaround ./primes1.clang 1e10 -p >/dev/null  6400% CPU utilization
OMP: Warning #182: OMP_WAIT_POLICY: ignored because KMP_LIBRARY has been defined
KMP_LIBRARY=throughput ./primes1.clang 1e10 -p >/dev/null  6400%
OMP: Warning #182: OMP_WAIT_POLICY: ignored because KMP_LIBRARY has been defined

mjklemm commented 8 months ago

@jpeyton52 Could you have a look at this at some point and see if there's a bug?

marioroy commented 8 months ago

[OpenMP] OMP_WAIT_POLICY=PASSIVE still keeps the threads without work running https://github.com/llvm/llvm-project/issues/63732

marioroy commented 8 months ago

I played around with GNU GCC. The OMP_WAIT_POLICY implementation behaves as described.

OMP_WAIT_POLICY – How waiting threads are handled in GNU GCC

"Description:

Specifies whether waiting threads should be active or passive. If the value is PASSIVE, waiting threads should not consume CPU power while waiting; while the value is ACTIVE specifies that they should. If undefined, threads wait actively for a short time before waiting passively."

                        ./primes1.gcc 1e10 -p >/dev/null   172% CPU Utilization
OMP_WAIT_POLICY=passive ./primes1.gcc 1e10 -p >/dev/null   133%
OMP_WAIT_POLICY=active  ./primes1.gcc 1e10 -p >/dev/null  6400%