madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

OMP multithreading gives very unstable and suboptimal throughputs on pmpe04 (eemumu, May 2021 software) #196

Closed: valassi closed this issue 1 year ago

valassi commented 3 years ago

So far I have mainly been checking OMP performance on my GPU-aware VM with 4 virtual cores. There the throughput always seemed to scale more or less as expected, by almost a factor 4.

See https://github.com/madgraph5/madgraph4gpu/commit/6f4916ec5552f57d8b58520f924c28e1e495673c, from itscrd70:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.290318e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.248269 sec
real    0m7.258s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.112854e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.630541 sec
real    0m3.640s
-------------------------------------------------------------------------

and

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.753192e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.764206 sec
real    0m3.774s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.830711e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.645810 sec
real    0m2.655s
-------------------------------------------------------------------------

In other words, going from 1 to 4 OMP threads increases:

- the no-SIMD matrix-element throughput from 1.29E6 to 5.11E6 events/s (roughly a factor 4.0);
- the 512y SIMD matrix-element throughput from 4.75E6 to 1.83E7 events/s (roughly a factor 3.9).

So one issue can already be seen above: OMP scaling is slightly suboptimal when SIMD is also enabled. This should be cross-checked, but I would not be surprised if part of it is a processor slowdown (check the clock frequency...).
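
A quick way to check the clock hypothesis (just a sketch using standard Linux tools; the build.512y directory name is assumed by analogy with the build.none and build.avx2 directories used further below) is to sample the CPU frequency while the SIMD build is running:

# launch the SIMD run in the background, then print the highest reported core clock once per second
./build.512y/check.exe -p 2048 256 12 > /dev/null &
pid=$!
while kill -0 $pid 2>/dev/null; do grep 'cpu MHz' /proc/cpuinfo | sort -k4 -n | tail -1; sleep 1; done

If the reported MHz drops while all threads execute AVX2/AVX512 code, part of the suboptimal SIMD scaling would simply be frequency throttling rather than an OMP problem.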

HOWEVER. The situation looks quite different on a machine with many more cores. I ran the following tests on pmpe04 (16 physical cores with 2-way hyperthreading, i.e. 32 logical cores), and @lfield also ran some tests on a machine with 64 logical cores, getting similar results.

FIRST issue: the throughput is very unstable. The scans below use the no-SIMD build: the result with 32 threads fluctuates wildly between 3.0E6 and 1.3E7 (more than a factor 4!). Something to do with NUMA?... (a possible cross-check is sketched after the three scans below)

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.485243e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.422320e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.223990e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.133416e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.900442e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.266690e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.786840e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.195912e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.220566e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.128356e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.927956e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.979217e+06                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.948936e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.202562e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.184464e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.095331e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.776658e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.585580e+06                 )  sec^-1
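
One way to start probing the NUMA hypothesis (a sketch only, using standard lscpu/numactl/OpenMP controls; I have not validated these bindings on pmpe04) is to look at the topology and repeat a scan with explicit thread and memory binding:

# show sockets, NUMA nodes and which logical CPUs belong to each node
lscpu | egrep 'Socket|NUMA|Thread'
numactl --hardware

# pin the OMP threads to cores (standard OpenMP env vars, honoured by gcc's libgomp)
OMP_NUM_THREADS=32 OMP_PLACES=cores OMP_PROC_BIND=close ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

# or restrict both the threads and the memory to a single NUMA node
OMP_NUM_THREADS=16 numactl --cpunodebind=0 --membind=0 ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

If the run-to-run spread disappears with pinning, the instability is most likely thread migration and remote-node memory access rather than anything in the code itself.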

SECOND issue: the actual scaling is suboptimal. In the example above, the best speedup at 16 threads (the number of physical cores) is 7.9E6 over 8.9E5, i.e. less than a factor 10, while I would expect a solid factor 16 here...
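
For completeness, a small variation of the scan above that also prints the speedup with respect to the single-thread throughput (a sketch; it assumes check.exe prints a single EvtsPerSec[MatrixElems] line per run, as in the scans above, with the number in awk field $5):

base=
for n in 1 2 4 8 16 32; do
  export OMP_NUM_THREADS=$n
  tput=$(./build.none/check.exe -p 256 256 1 | awk '/EvtsPerSec\[MatrixElems/{print $5}')
  [ -z "$base" ] && base=$tput
  # speedup = throughput(n threads) / throughput(1 thread)
  awk -v n=$n -v t=$tput -v b=$base 'BEGIN{printf "%2d threads: %s ev/s, speedup %.2f\n", n, t, t/b}'
done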

The numbers get better with a larger number of events, even with a single OMP thread (why?!...), but they are still suboptimal and fluctuating on pmpe04:

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.310010e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.703815e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.086431e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.266153e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 6.276633e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.829229e+06                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.305356e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.521695e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.100592e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.261600e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 9.154927e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.007438e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.293470e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.466667e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.104600e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.308291e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 9.185191e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.001564e+07                 )  sec^-1

Compare to the same scan on itscrd70 (the 4-core VM):

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.317429e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.621517e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.183589e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.730554e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.070596e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.027287e+06                 )  sec^-1

This is even more obvious with SIMD (the avx2 build): with 32 threads the maximum gain over a single thread is about a factor 6?...

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.052396e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.547816e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.733957e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.381696e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.946232e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.525760e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.034425e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.668128e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.894043e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.383712e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.351971e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.520908e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.035978e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.721879e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.800313e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.382836e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.917745e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.157618e+07                 )  sec^-1

TODO:

valassi commented 3 years ago

I made a few quick plots, because I want them for the talk and because I could reuse the benchmarking/HPC infrastructure I already had from last year. Eventually this could be cleaned up and integrated by @Andykyu12 into his infrastructure.

https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-nosimd.png https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-simd.png

These are VERY PRELIMINARY - I am somewhat confident that the scaling with multiple copies is correct and not just an artefact of how these short scripts are run, but this should be thoroughly cross-checked...

The OMP multithreading, instead, is suboptimal. I had a quick look at numactl but it does not seem to improve things. I would maybe reconsider an alternative to OMP: this is such a basic single loop that we could do it by hand with custom threads. Maybe we also need to split up the memory ourselves in advance across the threads.
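
Before dropping OMP entirely, one more thing worth trying (again only a sketch with standard numactl/OpenMP options; I have not verified that it helps on pmpe04) is to change the memory placement, since with the current setup all event buffers are presumably allocated and first-touched on a single NUMA node before the parallel loop starts:

# interleave the allocations across all NUMA nodes instead of leaving them on one node
OMP_NUM_THREADS=32 numactl --interleave=all ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

# and/or spread the OMP threads across the sockets rather than packing them
OMP_NUM_THREADS=32 OMP_PLACES=cores OMP_PROC_BIND=spread ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

If neither helps, the hand-rolled threads approach (each thread allocating and first-touching its own slice of the event buffers) is probably the cleaner way to make the memory placement explicit.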

Anyway, the good news is that the throughput does scale well when running multiple independent copies (see the plots above), even if OMP multithreading within a single process does not.

valassi commented 3 years ago

PS The memory plots are just a sanity check here, not the main point of these plots - it is not clear where each application spends most of its memory, but it may well be in the allocation of the large pages.

PPS I forgot to mention on the plots: these are all 'check.exe -p 2048 256 40', i.e. around 20-30 seconds for each test.

valassi commented 1 year ago

I am closing this old ticket about eemumu OMP multithreading with an old software version.

I now have a newer software version and have reenabled OMP MT there. I still get suboptimal results, also for ggttgg, but the results get better as I process more and more events. Anyway, I will follow up in that ticket: #575

Closing.