Example of related register studies in eemumu: https://github.com/madgraph5/madgraph4gpu/issues/26
In PR #204 I have done:
This suggests that Fortran is about a factor of two faster than C++. In eemumu, for comparison, it is only around 15% faster. With the vectorization work, even the scalar C++ had improved in eemumu. To be followed up.
I copy the notes below from PR #205.
PR #205 addresses yesterday's discussion with @jtchilders @roiser @oliviermattelaer.
Note that the Fortran throughput is a factor 2 higher than the C++ one, to be understood.
Without fast math, Fortran: https://github.com/madgraph5/madgraph4gpu/commit/b8779b0d691a840cb492ca52d81ab405f03a9632 - TOTAL MATRIX1 : 2.0968s for 6511 calls => throughput is 3.11E+03 calls/s
With fast math, Fortran: https://github.com/madgraph5/madgraph4gpu/commit/0e9b503e8e3b057090f1d94d468d3194fb095239 - TOTAL MATRIX1 : 3.0558s for 12090 calls => throughput is 3.96E+03 calls/s
Without fast math, epoch2 C++: https://github.com/madgraph5/madgraph4gpu/commit/0e9b503e8e3b057090f1d94d468d3194fb095239 - EvtsPerSec[MatrixElems] (3)= ( 1.345116e+03 ) sec^-1
With fast math, epoch2 C++: https://github.com/madgraph5/madgraph4gpu/commit/19292e92f177bf95e11d17f51f2ba83f77e47ffe - EvtsPerSec[MatrixElems] (3)= ( 1.906572e+03 ) sec^-1
(NB ALL NUMBERS ABOVE ON PMPE04 - NOT TRIED ON ITSCRD70)
These numbers are VERY PRELIMINARY
Clearly there are some things to be understood in the C++. Note that in eemumu the vectorization work had improved the throughput even without switching on SIMD. The above numbers are without SIMD.
If one assumes a factor of almost 4 from SIMD in C++, this would still be almost a factor of 2 faster than Fortran... but probably we can do better.
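To make the arithmetic explicit: the Fortran throughput is just calls divided by accumulated time, and the factor-4 SIMD gain is a hypothesis, not a measurement. A minimal sketch using the preliminary pmpe04 numbers above:

```cpp
#include <cstdio>

int main() {
  // Fortran MATRIX1 throughput = calls / accumulated time (fast math numbers above)
  const double fortran = 12090.0 / 3.0558;   // ~3.96e3 calls/s
  // Scalar C++ with fast math, from the EvtsPerSec[MatrixElems] line above
  const double cppScalar = 1.906572e3;       // ~1.91e3 MEs/s
  // Hypothetical SIMD gain on the C++ (an assumption, not yet measured for ggttgg)
  const double simdFactor = 4.0;
  std::printf("Fortran / scalar C++     : %.2f\n", fortran / cppScalar);              // ~2.1
  std::printf("SIMD C++ (x4) / Fortran  : %.2f\n", cppScalar * simdFactor / fortran); // ~1.9
  return 0;
}
```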
PS Note that the gridpack is from MG 3.1.0, with some algorithmic optimizations by Olivier and Kiran that may be missing in the C++ (which is derived from 2.9.x)... maybe this also explains part of the difference. We should check with epoch3.
Other issues related to this epic:
issue #183 (mentioned by @jtchilders in the June 21 meeting https://indico.cern.ch/event/1028452): rambo needs to generate massive momenta; it is currently massless, so it gives the wrong physics, as the top quark is not massless (and it gives different physics from the Kokkos implementation) - see the sketch after this list
issue #100 (mentioned by @oliviermattelaer in the June 21 meeting https://indico.cern.ch/event/1028452): some selection cuts need to be added to ggttgg, otherwise all cross sections are divergent
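On the rambo point in issue #183, the missing piece is essentially the mass-rescaling step of the RAMBO algorithm, which turns massless momenta into momenta with the physical masses at fixed sqrt(s). Below is a minimal C++ sketch of that step, for illustration only: the struct, function name and Newton solver are mine, and this is not the actual madgraph4gpu rambo code.

```cpp
#include <cmath>
#include <vector>

struct P4 { double e, px, py, pz; };  // simple four-momentum (illustration only)

// Rescale massless momenta (as produced by the current massless rambo) so that
// each final-state particle acquires its physical mass, keeping sqrt(s) = w fixed.
// This is the standard mass-rescaling step of the RAMBO algorithm.
void giveMasses(std::vector<P4>& p, const std::vector<double>& m, double w) {
  // Solve sum_i sqrt( m_i^2 + x^2 * |p_i|^2 ) = w for the scale factor x (Newton iteration)
  double x = 1.0;
  for (int iter = 0; iter < 50; ++iter) {
    double f = -w, df = 0.0;
    for (size_t i = 0; i < p.size(); ++i) {
      const double p2 = p[i].px * p[i].px + p[i].py * p[i].py + p[i].pz * p[i].pz;
      const double e = std::sqrt(m[i] * m[i] + x * x * p2);
      f += e;
      df += x * p2 / e;
    }
    if (std::fabs(f) < 1e-12 * w) break;
    x -= f / df;
  }
  // Rescale the three-momenta and recompute the energies with the physical masses
  for (size_t i = 0; i < p.size(); ++i) {
    p[i].px *= x; p[i].py *= x; p[i].pz *= x;
    const double p2 = p[i].px * p[i].px + p[i].py * p[i].py + p[i].pz * p[i].pz;
    p[i].e = std::sqrt(m[i] * m[i] + p2);
  }
}
```

Note that the full massive RAMBO algorithm also applies a corresponding correction to the phase-space weight, which is omitted in this sketch.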
Updates: see the nice results from @cvuosalo, for instance on June 28 (https://indico.cern.ch/event/1053713/). These include in particular the maxregcount tests for issue #26.
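For reference on what maxregcount-type tests usually look like (a generic illustration with a hypothetical kernel name, not the actual madgraph4gpu build or kernel): the register budget can be capped globally with the nvcc flag -maxrregcount, inspected with -Xptxas -v, or constrained per kernel with __launch_bounds__.

```cpp
// Compile e.g. with: nvcc -Xptxas -v -maxrregcount=128 regdemo.cu
// -Xptxas -v prints registers and spills per kernel; -maxrregcount caps registers per thread.

// Alternatively, the cap can be set per kernel with __launch_bounds__:
// the first argument is the maximum threads per block the kernel will be launched with,
// the optional second one is the desired minimum resident blocks per multiprocessor.
__global__ void __launch_bounds__(256, 2) sigmaKinDemo(const double* momenta, double* mes, int nevt)
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if (ievt < nevt) mes[ievt] = momenta[ievt] * momenta[ievt];  // placeholder for the real ME computation
}
```

Capping registers trades register pressure for occupancy (and possibly local-memory spills), which is exactly the kind of study referenced in issue #26.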
I am dumping a few numbers from my gcc9/cuda11.0 centos7 itscrd70 for future reference, from a first look at ggttgg. This should eventually become our baseline replacing eemumu.
I have fixed a minimal issue in PR #145 and added some timings in the log of commit https://github.com/madgraph5/madgraph4gpu/commit/dd8711d0aa22c85429802b23de3169a68a97f298
As expected, this is very different from eemumu, in the sense that the ME part is clearly dominant, and the ME throughput is the same as the throughput of the total workflow. The numbers I get so far on my machine:
Clearly this is a different beast from eemumu on the GPU: for instance, increasing the number of blocks from 64 to 2048 has minimal impact, only a 15% improvement.
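To make the "blocks from 64 to 2048" scan concrete, this is the kind of measurement involved; a minimal CUDA timing sketch with a dummy kernel (hypothetical names and sizes, not the actual madgraph4gpu test code):

```cpp
#include <cstdio>
#include <initializer_list>

__global__ void dummyME(double* out, int nevt)
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if (ievt < nevt) out[ievt] = ievt * 1e-3;  // stand-in for the real ME kernel
}

int main()
{
  const int threads = 256;
  for (int blocks : {64, 256, 2048}) {  // scan the grid size as in the tests above
    const int nevt = blocks * threads;
    double* d_out;
    cudaMalloc(&d_out, nevt * sizeof(double));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    dummyME<<<blocks, threads>>>(d_out, nevt);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0;
    cudaEventElapsedTime(&ms, t0, t1);
    std::printf("blocks=%4d  %.1e events/s\n", blocks, nevt / (ms * 1e-3));
    cudaFree(d_out);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
  }
  return 0;
}
```

Once enough blocks are in flight to keep all SMs busy, a compute-bound kernel sees only marginal gains from a larger grid, which would be consistent with the ~15% effect quoted above.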
I am opening this as an epic because it would be useful to have detailed studies, e.g. of register pressure, memory access patterns etc., to see how we can optimize the CUDA implementation.