madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Retry OMP multithreading in cudacpp (and prototype custom multithreading, and compare to MP) - suboptimal results in ggttgg (Dec 2022) #575

Open valassi opened 1 year ago

valassi commented 1 year ago

With the changes for the random choice of helicity (#403, MR #570 and especially #415), the OMP multithreading loop has moved inside cudacpp. It is now in a place where maybe it could work better out of the box.

Note in fact that also Fortran OMP is now quite good (see #561), so I would expect something similar in cudacpp.

While doing the code move I disabled (commented out) the OMP pragmas. They should be reenabled and tested.

https://github.com/madgraph5/madgraph4gpu/blob/3780502a369c9583aa86cb878d50a9d7b9d491aa/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc#L878

#ifdef _OPENMP
    // (NB gcc9 or higher, or clang, is required)
    // - default(none): no variables are shared by default
    // - shared: as the name says
    // - private: give each thread its own copy, without initialising
    // - firstprivate: give each thread its own copy, and initialise with value from outside
#pragma omp parallel for default( none ) shared( allmomenta, allcouplings, allMEs, channelId, allNumerators, allDenominators )
#endif // _OPENMP
    */
    for( int ipagV2 = 0; ipagV2 < npagV2; ++ipagV2 )
    {
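As a self-contained illustration of the data-sharing clauses listed in the comments above, here is a minimal sketch of an OpenMP loop over event pages using `default( none )`. The names `computePage` and `computeAllPages`, and the toy sum-of-squares "matrix element", are hypothetical stand-ins and not the actual cudacpp code. The sketch only assumes that each loop iteration writes to its own output element, so no synchronisation is needed. The page counts are passed as plain `int` parameters and listed explicitly in `shared`, because `default( none )` requires an explicit data-sharing attribute for every variable used in the region that is not predetermined.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-page matrix element calculation
// (the real cudacpp kernel is far more complex): sum of squares of one page.
static double computePage( const double* momenta, int neppV )
{
  double sum = 0;
  for( int i = 0; i < neppV; ++i ) sum += momenta[i] * momenta[i];
  return sum;
}

// Process npagV event pages of neppV values each.
// Iteration ipagV reads only its own page and writes only allMEs[ipagV].
void computeAllPages( const double* allmomenta, double* allMEs, int npagV, int neppV )
{
  assert( npagV >= 0 && neppV > 0 );
#ifdef _OPENMP
  // default(none): every variable used in the region must be listed explicitly
  // (the loop index ipagV is private by construction)
#pragma omp parallel for default( none ) shared( allmomenta, allMEs, npagV, neppV )
#endif // _OPENMP
  for( int ipagV = 0; ipagV < npagV; ++ipagV )
  {
    allMEs[ipagV] = computePage( allmomenta + ipagV * neppV, neppV );
  }
}
```

For example, calling `computeAllPages` on 8 pages of 4 values, with every input equal to 2.0, fills each of the 8 output elements with 16.0. Without `-fopenmp` (or an equivalent flag) the pragma is compiled out and the loop runs serially, which is a convenient way to cross-check results.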
valassi commented 1 year ago

The idea is essentially the following:

For instance on a 4-core machine with AVX2

valassi commented 1 year ago

In particular this should be tested against pmpe04 or another node with 30+ cores. See the previous suboptimal results in #196.

valassi commented 1 year ago

Rather than open a new issue, I add a few ideas here.

OMP is one solution for MT in cudacpp. But custom multithreading is another possibility. What I am thinking of is the following

valassi commented 1 year ago

I reenabled OMP MT and I did a few tests.

It works, but I still get suboptimal results. I will follow up here with ggttgg on the previous results in #196 for eemumu (and I will close that ticket).

My observations

Things to do

Anyway, below are the numbers, on pmpe04 (16 physical cores with AVX2, 2xHT so 32 threads maximum). There is no CUDA, so this was built essentially with CUDA_HOME=none. These are not systematic tests, they are more or less the first numbers I got...

Without SIMD, 16k events

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.856625e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 6.716730e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.159144e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.938153e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.257169e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

Without SIMD, more events

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.857226e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.888137e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 6.885676e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.356782e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.474947e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.487260e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 1024 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.515134e+04                 )  sec^-1
MeanMatrixElemValue         = ( 2.475533e+02 +- 2.468621e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 64 256 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.536720e+04                 )  sec^-1
MeanMatrixElemValue         = ( 8.334117e+00 +- 6.373555e+00 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 1024 4 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.508096e+04                 )  sec^-1
MeanMatrixElemValue         = ( 6.551217e+01 +- 6.174046e+01 )  GeV^-4

With AVX2 SIMD, 16k events

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.108860e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.226741e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.540322e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 5.587768e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 8.366525e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 8.459457e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.319292e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

With AVX2 SIMD, more events

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.289656e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.566718e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.755945e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.972108e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 8.783962e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.849974e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 4 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.510871e+04                 )  sec^-1
MeanMatrixElemValue         = ( 2.558300e+02 +- 2.469487e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 4 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.855484e+04                 )  sec^-1
MeanMatrixElemValue         = ( 2.558300e+02 +- 2.469487e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.013572e+05                 )  sec^-1
MeanMatrixElemValue         = ( 6.863526e+01 +- 6.177879e+01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.954735e+04                 )  sec^-1
MeanMatrixElemValue         = ( 6.863526e+01 +- 6.177879e+01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 1024 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.019225e+05                 )  sec^-1
MeanMatrixElemValue         = ( 2.475533e+02 +- 2.468621e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 1024 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.080913e+05                 )  sec^-1
MeanMatrixElemValue         = ( 2.475533e+02 +- 2.468621e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 1024 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.103730e+05                 )  sec^-1
MeanMatrixElemValue         = ( 2.476884e+06 +- 2.476607e+06 )  GeV^-4

Note also that 'top' shows a varying load on the system. In some of the fastest tests it was 100% (3200 load) at points, but then fell temporarily to 70%. In other tests it showed a constant 92%... So in summary,

Again, all this should be compared to several independent single-threaded processes (and/or eventually to home-made MT).
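To make the scaling in the logs above easier to read, here is a small sketch that turns the `EvtsPerSec[MECalcOnly]` values into speedup and parallel efficiency relative to one thread. The struct and function names are hypothetical; the numbers are copied from the `-p 256 256 1` runs above, taking the first run where a command was repeated. Efficiency is computed per OMP thread, which is an assumption of this sketch: with 16 physical cores and 2xHT, the 32-thread figure is not expected to reach 1.0 even in the ideal case.

```cpp
#include <cassert>
#include <cstdio>

// One throughput measurement: number of OMP threads and events per second.
struct Measurement
{
  int nthreads;
  double evtsPerSec;
};

// Values copied from the "-p 256 256 1" runs (first run where repeated).
static const Measurement kNoSimd[] = { { 1, 1.888137e+03 }, { 4, 6.885676e+03 }, { 16, 2.356782e+04 }, { 32, 2.474947e+04 } };
static const Measurement kAvx2[] = { { 1, 7.289656e+03 }, { 4, 2.755945e+04 }, { 16, 7.972108e+04 }, { 32, 9.849974e+04 } };

// Speedup of a measurement relative to the single-thread throughput.
double speedup( const Measurement& m, const Measurement& ref1 )
{
  assert( ref1.nthreads == 1 && ref1.evtsPerSec > 0 );
  return m.evtsPerSec / ref1.evtsPerSec;
}

// Parallel efficiency: speedup divided by the number of threads (1.0 is ideal).
double efficiency( const Measurement& m, const Measurement& ref1 )
{
  assert( m.nthreads > 0 );
  return speedup( m, ref1 ) / m.nthreads;
}

// Print one line per measurement; ms[0] must be the single-thread run.
void printScaling( const char* tag, const Measurement* ms, int n )
{
  for( int i = 0; i < n; ++i )
    std::printf( "%-5s threads=%2d speedup=%6.2f efficiency=%4.2f\n",
                 tag, ms[i].nthreads, speedup( ms[i], ms[0] ), efficiency( ms[i], ms[0] ) );
}
```

Calling `printScaling( "none", kNoSimd, 4 )` and `printScaling( "avx2", kAvx2, 4 )` from a `main` gives one line per thread count. With these inputs the no-SIMD speedup is about 12.5 at 16 threads and only about 13 at 32 threads, and the AVX2 build shows a similar plateau, consistent with the "suboptimal" label in the title.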

valassi commented 1 year ago

I will create and merge a MR

NB One thing that I have not done is to reenable OMP tests in tmad/tput scripts. You need a very large number of events and long tests to get meaningful results.

Maybe something for @Jooorgen to test in your infrastructure?

valassi commented 1 year ago

I have reenabled this in gcc, but it failed in icpx and clang, see #578.

Anyway this one stays open for more performance studies