madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

ggttgg timing performance studies #146

Open valassi opened 3 years ago

valassi commented 3 years ago

I am dumping a few numbers for future reference from a first look at ggttgg, on my gcc9/cuda11.0 centos7 node itscrd70. This should eventually become our baseline, replacing eemumu.

I have fixed a minimal issue in PR #145 and added some timings in the log of commit https://github.com/madgraph5/madgraph4gpu/commit/dd8711d0aa22c85429802b23de3169a68a97f298

As expected, this is very different from eemumu, in the sense that the ME part is clearly dominant, and the ME throughput is the same as the throughput of the total workflow. The numbers I get so far on my machine:

Clearly this is a different beast from eemumu on the GPU: for instance, increasing the number of blocks from 64 to 2048 has minimal impact, only a ~15% improvement (see the quick ratio check below).
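
For reference, the ~15% comes directly from the EvtsPerSec[MatrixElems] values in the two gcheck.exe logs further down:

EvtsPerSec[MatrixElems]   64 blocks x 256 threads = 4.43E+05 /s
EvtsPerSec[MatrixElems] 2048 blocks x 256 threads = 5.07E+05 /s
ratio = 5.07E+05 / 4.43E+05 ~ 1.14, i.e. ~15% despite 32x more blocks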

I am opening this as an epic, as it would be useful to have detailed studies (e.g. of register pressure, memory access patterns, etc.) to see how we can optimize the CUDA code.
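
A minimal sketch of the kind of checks I have in mind (the NVCCFLAGS pass-through below is just a placeholder, to be adapted to whatever hook the cudacpp Makefile actually provides):

# dump per-kernel register and shared-memory usage at build time
make NVCCFLAGS="-Xptxas -v"
# experiment with a hard register cap, as in the eemumu studies of #26
make NVCCFLAGS="--maxrregcount 128"
# profile occupancy and memory traffic of a short run with Nsight Compute
ncu --set full ./gcheck.exe -p 64 256 1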

time ./check.exe -p 64 256 1
***********************************************************************
NumBlocksPerGrid           = 64
NumThreadsPerBlock         = 256
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.286574e+01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.286227e+01                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.472183e-03                 )  sec
TotalTime[Rambo]        (2)= ( 1.216353e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.285011e+01                 )  sec
MeanTimeInMatrixElems      = ( 1.285011e+01                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.285011e+01 ,  1.285011e+01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 16384
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.273459e+03                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.273803e+03                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.275009e+03                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 16384
MeanMatrixElemValue        = ( 1.698216e+00 +- 1.688421e+00 )  GeV^-4
[Min,Max]MatrixElemValue   = [ 1.576145e-09 ,  2.766358e+04 ]  GeV^-4
StdDevMatrixElemValue      = ( 2.161179e+02                 )  GeV^-4
MeanWeight                 = ( 2.812272e+01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 2.812272e+01 ,  2.812272e+01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
real    0m12.877s
user    0m12.865s
sys     0m0.010s
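
For reference, TotalEventsComputed = blocks x threads x iterations:

64 x 256 x 1    =   16384  events (this run)
2048 x 256 x 12 = 6291456  events (the -p 2048 256 12 run below)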

time ./gcheck.exe -p 64 256 1
***********************************************************************
NumBlocksPerGrid           = 64
NumThreadsPerBlock         = 256
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.806485e-02                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.744829e-02                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.165550e-04                 )  sec
TotalTime[Rambo]        (2)= ( 4.457960e-04                 )  sec
TotalTime[MatrixElems]  (3)= ( 3.700249e-02                 )  sec
MeanTimeInMatrixElems      = ( 3.700249e-02                 )  sec
[Min,Max]TimeInMatrixElems = [ 3.700249e-02 ,  3.700249e-02 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 16384
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.304234e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 4.375100e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 4.427810e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 16384
MeanMatrixElemValue        = ( 6.793453e+00 +- 6.755076e+00 )  GeV^-4
[Min,Max]MatrixElemValue   = [ 1.109086e-08 ,  1.106771e+05 ]  GeV^-4
StdDevMatrixElemValue      = ( 8.646498e+02                 )  GeV^-4
MeanWeight                 = ( 2.812272e+01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 2.812272e+01 ,  2.812272e+01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
real    0m0.916s
user    0m0.096s
sys     0m0.765s

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.255412e+01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.254601e+01                 )  sec
TotalTime[RndNumGen]    (1)= ( 8.112079e-03                 )  sec
TotalTime[Rambo]        (2)= ( 1.292305e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.241678e+01                 )  sec
MeanTimeInMatrixElems      = ( 1.034732e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.033975e+00 ,  1.035068e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 5.011467e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 5.014707e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 5.066899e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 2.486199e+00 +- 1.519583e+00 )  GeV^-4
[Min,Max]MatrixElemValue   = [ 9.367790e-09 ,  8.569965e+06 ]  GeV^-4
StdDevMatrixElemValue      = ( 3.811536e+03                 )  GeV^-4
MeanWeight                 = ( 2.812272e+01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 2.812272e+01 ,  2.812272e+01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
real    0m14.683s
user    0m8.392s
sys     0m6.237s
valassi commented 3 years ago

Example of related register studies in eemumu: https://github.com/madgraph5/madgraph4gpu/issues/26

valassi commented 3 years ago

In PR #204 I have done:

This means that Fortran seems to be a factor of two faster than C++. In eemumu, for comparison, it is only around 15% faster. With the vectorization work, even the scalar C++ in eemumu was improved. To be followed up.

I copy the notes below from PR #205


PR #205 addresses yesterday's discussion with @jtchilders @roiser @oliviermattelaer

Note that the Fortran throughput is a factor 2 higher than the C++ one, still to be understood (see the ratios summarised after the four measurements below).

Without fast math, fortran: https://github.com/madgraph5/madgraph4gpu/commit/b8779b0d691a840cb492ca52d81ab405f03a9632 TOTAL MATRIX1 : 2.0968s for 6511 calls => throughput is 3.11E+03 calls/s

With fast math, fortran: https://github.com/madgraph5/madgraph4gpu/commit/0e9b503e8e3b057090f1d94d468d3194fb095239 TOTAL MATRIX1 : 3.0558s for 12090 calls => throughput is 3.96E+03 calls/s

Without fast math, epoch2 c++: https://github.com/madgraph5/madgraph4gpu/commit/0e9b503e8e3b057090f1d94d468d3194fb095239 EvtsPerSec[MatrixElems] (3)= ( 1.345116e+03 ) sec^-1

With fast math, epoch2 c++: https://github.com/madgraph5/madgraph4gpu/commit/19292e92f177bf95e11d17f51f2ba83f77e47ffe EvtsPerSec[MatrixElems] (3)= ( 1.906572e+03 ) sec^-1
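
To make the factor 2 explicit, these are just the ratios of the numbers above:

without fast math : 3.11E+03 / 1.35E+03 ~ 2.3
with fast math    : 3.96E+03 / 1.91E+03 ~ 2.1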

(NB ALL NUMBERS ABOVE ON PMPE04 - NOT TRIED ON ITSCRD70)

These numbers are VERY PRELIMINARY

Clearly there are some things to be understood in the C++. Note that in eemumu the vectorization work had improved the throughput even without switching on SIMD. The above numbers are without SIMD.

If one imagines a factor of almost 4 from SIMD in C++, this would still end up almost a factor 2 faster than Fortran... but probably we can do better.
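
As a rough projection (assuming a hypothetical x4 SIMD gain on top of the fast-math C++ number above):

1.91E+03 x 4 ~ 7.6E+03 /s
7.6E+03 / 3.96E+03 ~ 1.9 with respect to the fast-math Fortran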

valassi commented 3 years ago

PS Note that the gridpack is from MG 3.1.0, with some algorithmic optimizations by Olivier and Kiran that may be missing in the C++ (which is derived from 2.9.x)... maybe this also explains part of the difference. We should check with epoch3.

valassi commented 3 years ago

Other issues related to this epic:

valassi commented 3 years ago

Updates: look at the nice results from @cvuosalo, for instance those presented on June 28: https://indico.cern.ch/event/1053713/. These include in particular maxregcount tests for issue #26.