madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

OMP multithreading gives very unstable and suboptimal throughputs on pmpe04 (eemumu, May 2021 software) #196

Closed: valassi closed this issue 1 year ago

valassi commented 3 years ago

So far I have mainly been checking OMP performance on my GPU-aware VM with 4 virtual cores. There the throughput always seemed to scale more or less as expected, by almost a factor 4.

See https://github.com/madgraph5/madgraph4gpu/commit/6f4916ec5552f57d8b58520f924c28e1e495673c, from itscrd70:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.290318e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.248269 sec
real    0m7.258s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.112854e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.630541 sec
real    0m3.640s
-------------------------------------------------------------------------

and

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.753192e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.764206 sec
real    0m3.774s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.830711e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.645810 sec
real    0m2.655s
-------------------------------------------------------------------------

In other words, going from 1 to 4 OMP threads increases:

- the no-SIMD matrix-element throughput from 1.29E6 to 5.11E6 events/s (roughly a factor 4.0);
- the 512y SIMD matrix-element throughput from 4.75E6 to 1.83E7 events/s (roughly a factor 3.9).

So one issue can already be seen above: OMP scaling is slightly suboptimal when SIMD is also enabled. This should be cross-checked, but I would not be surprised if part of it is a processor slowdown (check the clock frequency...).
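
A quick way to check the clock hypothesis (just a sketch using standard Linux tools; the build.512y directory name is assumed by analogy with the build.none and build.avx2 directories used further below) is to sample the CPU frequency while the SIMD build is running:

# launch the SIMD run in the background, then print the highest reported core clock once per second
./build.512y/check.exe -p 2048 256 12 > /dev/null &
pid=$!
while kill -0 $pid 2>/dev/null; do grep 'cpu MHz' /proc/cpuinfo | sort -k4 -n | tail -1; sleep 1; done

If the reported MHz drops while all threads execute AVX2/AVX512 code, part of the suboptimal SIMD scaling would simply be frequency throttling rather than an OMP problem.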

HOWEVER. The situation looks quite different on a machine with many more cores. I ran the following tests on pmpe04 (16 physical cores with 2-way hyperthreading, i.e. 32 logical cores), and @lfield also ran some tests on a machine with 64 logical cores, getting similar results.

FIRST issue: the throughput is very unstable. The scans below use the no-SIMD build: the result with 32 threads fluctuates wildly between 3.0E6 and 1.3E7 (more than a factor 4!). Something to do with NUMA?... (a possible cross-check is sketched after the three scans below)

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.485243e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.422320e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.223990e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.133416e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.900442e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.266690e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.786840e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.195912e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.220566e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.128356e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.927956e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.979217e+06                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.948936e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.202562e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.184464e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.095331e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.776658e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.585580e+06                 )  sec^-1
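
One way to start probing the NUMA hypothesis (a sketch only, using standard lscpu/numactl/OpenMP controls; I have not validated these bindings on pmpe04) is to look at the topology and repeat a scan with explicit thread and memory binding:

# show sockets, NUMA nodes and which logical CPUs belong to each node
lscpu | egrep 'Socket|NUMA|Thread'
numactl --hardware

# pin the OMP threads to cores (standard OpenMP env vars, honoured by gcc's libgomp)
OMP_NUM_THREADS=32 OMP_PLACES=cores OMP_PROC_BIND=close ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

# or restrict both the threads and the memory to a single NUMA node
OMP_NUM_THREADS=16 numactl --cpunodebind=0 --membind=0 ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

If the run-to-run spread disappears with pinning, the instability is most likely thread migration and remote-node memory access rather than anything in the code itself.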

SECOND issue: the actual scaling is suboptimal. In the example above, the best speedup at 16 threads (the number of physical cores) is 7.9E6 over 8.9E5, i.e. less than a factor 10, while I would expect a solid factor 16 here...
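
For completeness, a small variation of the scan above that also prints the speedup with respect to the single-thread throughput (a sketch; it assumes check.exe prints a single EvtsPerSec[MatrixElems] line per run, as in the scans above, with the number in awk field $5):

base=
for n in 1 2 4 8 16 32; do
  export OMP_NUM_THREADS=$n
  tput=$(./build.none/check.exe -p 256 256 1 | awk '/EvtsPerSec\[MatrixElems/{print $5}')
  [ -z "$base" ] && base=$tput
  # speedup = throughput(n threads) / throughput(1 thread)
  awk -v n=$n -v t=$tput -v b=$base 'BEGIN{printf "%2d threads: %s ev/s, speedup %.2f\n", n, t, t/b}'
done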

The numbers get better with a larger number of events, even with a single OMP thread (why?!...), but they are still suboptimal and fluctuating on pmpe04:

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.310010e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.703815e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.086431e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.266153e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 6.276633e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.829229e+06                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.305356e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.521695e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.100592e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.261600e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 9.154927e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.007438e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.293470e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.466667e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.104600e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.308291e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 9.185191e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.001564e+07                 )  sec^-1

Compare to the same scan on itscrd70 (the 4-core VM):

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.317429e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.621517e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.183589e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.730554e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.070596e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.027287e+06                 )  sec^-1

This is even more obvious with SIMD (the avx2 build): with 32 threads the maximum gain over a single thread is about a factor 6?...

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.052396e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.547816e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.733957e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.381696e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.946232e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.525760e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.034425e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.668128e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.894043e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.383712e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.351971e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.520908e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.035978e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.721879e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.800313e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.382836e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.917745e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.157618e+07                 )  sec^-1

TODO:

valassi commented 3 years ago

I made a few quick plots, because I want them for the talk and because I could reuse the benchmarking/HPC infrastructure I already had from last year. Eventually this could be cleaned up and integrated by @Andykyu12 into his infrastructure.

https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-nosimd.png https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-simd.png

These are VERY PRELIMINARY - I am somewhat confident that the scaling with multiple copies is correct and not just an artefact of how these short scripts are run, but this should be thoroughly cross-checked...

The OMP multithreading, instead, is suboptimal. I had a quick look at numactl but it does not seem to improve things. I would maybe reconsider an alternative to OMP: this is such a basic single loop that we could do it by hand with custom threads. Maybe we also need to split up the memory ourselves in advance across the threads.
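
Before dropping OMP entirely, one more thing worth trying (again only a sketch with standard numactl/OpenMP options; I have not verified that it helps on pmpe04) is to change the memory placement, since with the current setup all event buffers are presumably allocated and first-touched on a single NUMA node before the parallel loop starts:

# interleave the allocations across all NUMA nodes instead of leaving them on one node
OMP_NUM_THREADS=32 numactl --interleave=all ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

# and/or spread the OMP threads across the sockets rather than packing them
OMP_NUM_THREADS=32 OMP_PLACES=cores OMP_PROC_BIND=spread ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'

If neither helps, the hand-rolled threads approach (each thread allocating and first-touching its own slice of the event buffers) is probably the cleaner way to make the memory placement explicit.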

Anyway, the good news is that the throughput does scale well when running multiple independent copies (see the plots above), even if OMP multithreading within a single process does not.

valassi commented 3 years ago

PS The memory plots are just a sanity check here, not the main point of these plots - it is not clear where each application spends most of its memory, but it may well be in the allocation of the large pages.

PPS I forgot to mention on the plots: these are all 'check.exe -p 2048 256 40', i.e. around 20-30 seconds for each test.

valassi commented 1 year ago

I am closing this old ticket about eemumu OMP multithreading with an old software version.

I now have a newer software version and have reenabled OMP MT there. I still get suboptimal results, also for ggttgg, but the results get better as I process more and more events. Anyway, I will follow up in that ticket: #575

Closing.