madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

ggttgg timing performance studies #146

Open valassi opened 3 years ago

valassi commented 3 years ago

I am dumping a few numbers for future reference from a first look at ggttgg, on my gcc9/cuda11.0 centos7 node itscrd70. This should eventually become our baseline, replacing eemumu.

I have fixed a minimal issue in PR #145 and added some timings in the log of commit https://github.com/madgraph5/madgraph4gpu/commit/dd8711d0aa22c85429802b23de3169a68a97f298

As expected, this is very different from eemumu, in the sense that the ME part is clearly dominant, and the ME throughput is the same as the throughput of the total workflow. The numbers I get so far on my machine:

Clearly this is a different beast from eemumu on the GPU: for instance, increasing the number of blocks from 64 to 2048 has minimal impact, only a ~15% improvement (see the quick ratio check below).
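
For reference, the ~15% comes directly from the EvtsPerSec[MatrixElems] values in the two gcheck.exe logs further down:

EvtsPerSec[MatrixElems]   64 blocks x 256 threads = 4.43E+05 /s
EvtsPerSec[MatrixElems] 2048 blocks x 256 threads = 5.07E+05 /s
ratio = 5.07E+05 / 4.43E+05 ~ 1.14, i.e. ~15% despite 32x more blocks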

I am opening this as an epic, as it would be useful to have detailed studies (e.g. of register pressure, memory access patterns, etc.) to see how we can optimize the CUDA code.
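
A minimal sketch of the kind of checks I have in mind (the NVCCFLAGS pass-through below is just a placeholder, to be adapted to whatever hook the cudacpp Makefile actually provides):

# dump per-kernel register and shared-memory usage at build time
make NVCCFLAGS="-Xptxas -v"
# experiment with a hard register cap, as in the eemumu studies of #26
make NVCCFLAGS="--maxrregcount 128"
# profile occupancy and memory traffic of a short run with Nsight Compute
ncu --set full ./gcheck.exe -p 64 256 1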

time ./check.exe -p 64 256 1
***********************************************************************
NumBlocksPerGrid           = 64
NumThreadsPerBlock         = 256
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.286574e+01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.286227e+01                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.472183e-03                 )  sec
TotalTime[Rambo]        (2)= ( 1.216353e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.285011e+01                 )  sec
MeanTimeInMatrixElems      = ( 1.285011e+01                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.285011e+01 ,  1.285011e+01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 16384
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.273459e+03                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.273803e+03                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.275009e+03                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 16384
MeanMatrixElemValue        = ( 1.698216e+00 +- 1.688421e+00 )  GeV^-4
[Min,Max]MatrixElemValue   = [ 1.576145e-09 ,  2.766358e+04 ]  GeV^-4
StdDevMatrixElemValue      = ( 2.161179e+02                 )  GeV^-4
MeanWeight                 = ( 2.812272e+01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 2.812272e+01 ,  2.812272e+01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
real    0m12.877s
user    0m12.865s
sys     0m0.010s
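
For reference, TotalEventsComputed = blocks x threads x iterations:

64 x 256 x 1    =   16384  events (this run)
2048 x 256 x 12 = 6291456  events (the -p 2048 256 12 run below)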

time ./gcheck.exe -p 64 256 1
***********************************************************************
NumBlocksPerGrid           = 64
NumThreadsPerBlock         = 256
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.806485e-02                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.744829e-02                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.165550e-04                 )  sec
TotalTime[Rambo]        (2)= ( 4.457960e-04                 )  sec
TotalTime[MatrixElems]  (3)= ( 3.700249e-02                 )  sec
MeanTimeInMatrixElems      = ( 3.700249e-02                 )  sec
[Min,Max]TimeInMatrixElems = [ 3.700249e-02 ,  3.700249e-02 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 16384
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.304234e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 4.375100e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 4.427810e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 16384
MeanMatrixElemValue        = ( 6.793453e+00 +- 6.755076e+00 )  GeV^-4
[Min,Max]MatrixElemValue   = [ 1.109086e-08 ,  1.106771e+05 ]  GeV^-4
StdDevMatrixElemValue      = ( 8.646498e+02                 )  GeV^-4
MeanWeight                 = ( 2.812272e+01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 2.812272e+01 ,  2.812272e+01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
real    0m0.916s
user    0m0.096s
sys     0m0.765s

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.255412e+01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.254601e+01                 )  sec
TotalTime[RndNumGen]    (1)= ( 8.112079e-03                 )  sec
TotalTime[Rambo]        (2)= ( 1.292305e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.241678e+01                 )  sec
MeanTimeInMatrixElems      = ( 1.034732e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.033975e+00 ,  1.035068e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 5.011467e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 5.014707e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 5.066899e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 2.486199e+00 +- 1.519583e+00 )  GeV^-4
[Min,Max]MatrixElemValue   = [ 9.367790e-09 ,  8.569965e+06 ]  GeV^-4
StdDevMatrixElemValue      = ( 3.811536e+03                 )  GeV^-4
MeanWeight                 = ( 2.812272e+01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 2.812272e+01 ,  2.812272e+01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
real    0m14.683s
user    0m8.392s
sys     0m6.237s
valassi commented 3 years ago

Example of related register studies in eemumu: https://github.com/madgraph5/madgraph4gpu/issues/26

valassi commented 3 years ago

In PR #204 I have done:

This means that Fortran seems to be a factor of two faster than C++. In eemumu, for comparison, it is only around 15% faster. With the vectorization work, even the scalar C++ in eemumu was improved. To be followed up.

I copy the notes below from PR #205


PR #205 addresses yesterday's discussion with @jtchilders @roiser @oliviermattelaer

Note that the Fortran throughput is a factor 2 higher than the C++ one, still to be understood (see the ratios summarised after the four measurements below).

Without fast math, fortran: https://github.com/madgraph5/madgraph4gpu/commit/b8779b0d691a840cb492ca52d81ab405f03a9632 TOTAL MATRIX1 : 2.0968s for 6511 calls => throughput is 3.11E+03 calls/s

With fast math, fortran: https://github.com/madgraph5/madgraph4gpu/commit/0e9b503e8e3b057090f1d94d468d3194fb095239 TOTAL MATRIX1 : 3.0558s for 12090 calls => throughput is 3.96E+03 calls/s

Without fast math, epoch2 c++: https://github.com/madgraph5/madgraph4gpu/commit/0e9b503e8e3b057090f1d94d468d3194fb095239 EvtsPerSec[MatrixElems] (3)= ( 1.345116e+03 ) sec^-1

With fast math, epoch2 c++: https://github.com/madgraph5/madgraph4gpu/commit/19292e92f177bf95e11d17f51f2ba83f77e47ffe EvtsPerSec[MatrixElems] (3)= ( 1.906572e+03 ) sec^-1
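
To make the factor 2 explicit, these are just the ratios of the numbers above:

without fast math : 3.11E+03 / 1.35E+03 ~ 2.3
with fast math    : 3.96E+03 / 1.91E+03 ~ 2.1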

(NB ALL NUMBERS ABOVE ON PMPE04 - NOT TRIED ON ITSCRD70)

These numbers are VERY PRELIMINARY

Clearly there are some things to be understood in the C++. Note that in eemumu the vectorization work had improved the throughput even without switching on SIMD. The above numbers are without SIMD.

If one imagines a factor of almost 4 from SIMD in C++, this would still end up almost a factor 2 faster than Fortran... but probably we can do better.
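
As a rough projection (assuming a hypothetical x4 SIMD gain on top of the fast-math C++ number above):

1.91E+03 x 4 ~ 7.6E+03 /s
7.6E+03 / 3.96E+03 ~ 1.9 with respect to the fast-math Fortran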

valassi commented 3 years ago

PS Note that the gridpack is from MG 3.1.0, with some algorithmic optimizations by Olivier and Kiran that may be missing in the C++ (which is derived from 2.9.x)... maybe this also explains part of the difference. We should check with epoch3.

valassi commented 3 years ago

Other issues related to this epic:

valassi commented 3 years ago

Updates: look at the nice results from @cvuosalo, for instance those presented on June 28: https://indico.cern.ch/event/1053713/. These include in particular maxregcount tests for issue #26.