madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Define improved performance metrics #22

Status: closed (valassi closed this issue 1 year ago)

valassi commented 4 years ago

As our proof-of-concept prototype advances (see issue #19), we should make sure we define better metrics for performance, to compare GPU implementations to one another and also to CPU implementations in C++ and Fortran.

This is a typical printout for our baseline now (see https://github.com/roiser/madgraph4gpu/commit/812f7438462c01e1052c4be48cafcdaef104195c):

time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 9.991298e-03 sec
MeanTimeInWaveFuncs       = 8.326082e-04 sec
StdDevTimeInWaveFuncs     = 1.905112e-05 sec
MinTimeInWaveFuncs        = 8.232730e-04 sec
MaxTimeInWaveFuncs        = 8.263180e-04 sec
---------------------------------------
NumMatrixElementsComputed = 6291456
MatrixElementsPerSec      = 6.296935e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.394735e-02 GeV^0
StdErrMatrixElemValue     = 3.034488e-06 GeV^0
StdDevMatrixElemValue     = 7.611337e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.142011 sec
0a ProcInit : 0.002600 sec
0b MemAlloc : 0.075228 sec
0c GenCreat : 0.014557 sec
1a GenSeed  : 0.000014 sec
1b GenRnGen : 0.007914 sec
2a RamboIni : 0.000115 sec
2b RamboFin : 0.000058 sec
2c CpDTHwgt : 0.006667 sec
2d CpDTHmom : 0.071620 sec
3a SigmaKin : 0.000090 sec
3b CpDTHmes : 0.009901 sec
4a DumpLoop : 0.022485 sec
9a DumpAll  : 0.046493 sec
9b GenDestr : 0.000235 sec
9c MemFree  : 0.021625 sec
9d CudReset : 0.040737 sec
TOTAL       : 0.462352 sec
TOTAL(n-2)  : 0.279603 sec
***************************************
real    0m0.473s
user    0m0.189s
sys     0m0.280s

There are essentially two metrics here: the total execution time and the ME throughput alone (see also the discussion in https://docs.google.com/document/d/1g2xwJ2FsSlxHvSUdPZjCyFW7zhsblMQ4g8UHlrkWyVw and in https://indico.cern.ch/event/945769/attachments/2086317/).

Note that our baseline is ~100x faster on GPU than CPP for the total time, and ~1500x faster on GPU than CPP for the ME throughput alone. The difference clearly indicates that we need to understand which metric is more relevant.

As discussed in issue #19, it would be useful to have a full cross section calculation, and also unweighted event generation. The latter, however, especially for low-arithmetic-intensity processes like eemumu, will be dominated by event I/O. It would probably be useful to have at least a total cross section calculation performance metric, based on a large enough number of events to remove side effects.

In any case, the more complex the process (e.g. ggttgg), the less random number generation and rambo or other phase space sampling will matter, so the "ME throughput" metric should converge to be representative of the "total time" metric for complex processes.
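
To make the difference concrete, here is a minimal sketch of how the two throughputs could be computed from the fields in the printout above (the numbers are copied from that log; the variable names are illustrative, not taken from the actual check.exe/gcheck.exe code):

```cpp
// Sketch only: relate the "ME throughput" and "total time" metrics discussed above.
#include <cstdio>

int main() {
  const double nEvents      = 6291456;      // NumMatrixElementsComputed
  const double timeInMEs    = 9.991298e-03; // TotalTimeInWaveFuncs [sec]
  const double timeTotalNm2 = 0.279603;     // TOTAL(n-2) [sec], excludes CudaFree/CudReset

  const double meThroughput    = nEvents / timeInMEs;    // ~6.3e8 /sec: ME throughput alone
  const double totalThroughput = nEvents / timeTotalNm2; // ~2.3e7 /sec: "total time" throughput

  std::printf( "ME-only: %.2e ev/s, total: %.2e ev/s, ratio: %.0f\n",
               meThroughput, totalThroughput, meThroughput / totalThroughput );
  return 0;
}
```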

valassi commented 4 years ago

Two minor improvements in https://github.com/madgraph5/madgraph4gpu/commit/c84756f8bff435a2fcedbedf3a13cd5629bdde7f

time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[32]
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 9.979362e-03 sec
MeanTimeInWaveFuncs       = 8.316135e-04 sec
StdDevTimeInWaveFuncs     = 1.851690e-05 sec
MinTimeInWaveFuncs        = 8.200980e-04 sec
MaxTimeInWaveFuncs        = 8.920910e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 8.141070e+07 sec^-1
MatrixElemEventsPerSec    = 6.304467e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.371972e-02 GeV^0
StdErrMatrixElemValue     = 3.270361e-06 GeV^0
StdDevMatrixElemValue     = 8.202972e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.140339 sec
0a ProcInit : 0.000539 sec
0b MemAlloc : 0.076777 sec
0c GenCreat : 0.014873 sec
1a GenSeed  : 0.000013 sec
1b GenRnGen : 0.007986 sec
2a RamboIni : 0.000110 sec
2b RamboFin : 0.000056 sec
2c CpDTHwgt : 0.006727 sec
2d CpDTHmom : 0.070388 sec
3a SigmaKin : 0.000083 sec
3b CpDTHmes : 0.009896 sec
4a DumpLoop : 0.022503 sec
9a DumpAll  : 0.023664 sec
9b GenDestr : 0.000271 sec
9c MemFree  : 0.021042 sec
9d CudReset : 0.042018 sec
TOTAL       : 0.437285 sec
TOTAL(n-2)  : 0.254928 sec
***************************************

real    0m0.448s
user    0m0.158s
sys     0m0.288s
valassi commented 4 years ago

Just to mark a fixed point (also in view of the LHCC talk), here's the latest numbers from a test of the current code https://github.com/madgraph5/madgraph4gpu/commit/199412390cd7484115d868b684de2d31e46ad2eb

For GPU:

| -p blocks threads iterations | throughput | total(n-2) | total events |
|---|---|---|---|
| -p 16384 32 12 | 6.2E8 | 0.32s | 6M |
| -p 16384 32 120 | 6.3E8 | 2.4s | 62M |
| -p 16384 32 1200 | 6.3E8 | 23s | 620M |
| -p 65536 128 1 | 6.5E8 | 1.3s | 8M |
| -p 65536 128 12 | 6.5E8 | 4.0s | 100M |
| -p 65536 128 60 | 6.3E8 | 17s | 500M |
| -p 65536 128 80 | 6.3E8 | 22s | 670M |
| -p 65536 128 120 | 6.5E8 | 116s* | 1G |

*Here the last result clearly has a problem, not necessarily in the GPU (swapping?). Out of 116s, there are 25s for dumploop and 62s for dumpall, so the effective time is really in the 40s for 1G events, in line with previous results.

For CPP:

| -p blocks threads iterations | throughput | total(n-2) | total events |
|---|---|---|---|
| -p 16384 32 1 | 3.6E5 | 1.7s | 520k |
| -p 16384 32 12 | 3.7E5 | 19s | 6M |
| -p 16384 32 120 | 3.7E5 | 197s* | 62M |

*Here 197s includes 172s for sigmakin and 20s for rambofin, so clearly CPP results are dominated by the two kernels.

One example for the same "-p 16384 32 120":

time ./gcheck.exe -p 16384 32 120
***************************************
NumIterations             = 120
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 120
TotalTimeInWaveFuncs      = 1.118124e-01 sec
MeanTimeInWaveFuncs       = 9.317698e-04 sec
StdDevTimeInWaveFuncs     = 6.033148e-05 sec
MinTimeInWaveFuncs        = 9.169860e-04 sec
MaxTimeInWaveFuncs        = 1.588512e-03 sec
---------------------------------------
TotalEventsComputed       = 62914560
RamboEventsPerSec         = 6.094670e+07 sec^-1
MatrixElemEventsPerSec    = 5.626797e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 62914560
MeanMatrixElemValue       = 1.371786e-02 GeV^0
StdErrMatrixElemValue     = 1.033868e-06 GeV^0
StdDevMatrixElemValue     = 8.200504e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.155957 sec
0a ProcInit : 0.000602 sec
0b MemAlloc : 0.450605 sec
0c GenCreat : 0.014945 sec
1a GenSeed  : 0.000099 sec
1b GenRnGen : 0.080919 sec
2a RamboIni : 0.000846 sec
2b RamboFin : 0.000549 sec
2c CpDTHwgt : 0.085258 sec
2d CpDTHmom : 0.945635 sec
3a SGoodHel : 0.001758 sec
3b SigmaKin : 0.000826 sec
3c CpDTHmes : 0.110987 sec
4a DumpLoop : 0.234899 sec
9a DumpAll  : 0.235393 sec
9b GenDestr : 0.000079 sec
9c MemFree  : 0.074370 sec
9d CudReset : 0.043488 sec
TOTAL       : 2.437213 sec
TOTAL(n-2)  : 2.237769 sec
***************************************

real    0m2.454s
user    0m1.422s
sys     0m1.029s

and

time ./check.exe -p 16384 32 120
***************************************
NumIterations             = 120
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = STD::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Curand generation         = HOST (C++ code)
---------------------------------------
NumberOfEntries           = 120
TotalTimeInWaveFuncs      = 1.720326e+02 sec
MeanTimeInWaveFuncs       = 1.433605e+00 sec
StdDevTimeInWaveFuncs     = 3.431106e-03 sec
MinTimeInWaveFuncs        = 1.431553e+00 sec
MaxTimeInWaveFuncs        = 1.464243e+00 sec
---------------------------------------
TotalEventsComputed       = 62914560
RamboEventsPerSec         = 2.936736e+06 sec^-1
MatrixElemEventsPerSec    = 3.657131e+05 sec^-1
***************************************
NumMatrixElements(notNan) = 62914560
MeanMatrixElemValue       = 1.371786e-02 GeV^0
StdErrMatrixElemValue     = 1.033868e-06 GeV^0
StdDevMatrixElemValue     = 8.200504e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
0a ProcInit : 0.000435 sec
0b MemAlloc : 0.460748 sec
0c GenCreat : 0.000999 sec
1a GenSeed  : 0.000212 sec
1b GenRnGen : 3.531995 sec
2a RamboIni : 1.158091 sec
2b RamboFin : 20.265198 sec
3b SigmaKin : 172.032593 sec
4a DumpLoop : 0.146312 sec
9a DumpAll  : 0.242435 sec
9b GenDestr : 0.000148 sec
9c MemFree  : 0.070157 sec
TOTAL       : 197.909332 sec
TOTAL(n-2)  : 197.838730 sec
***************************************

real    3m17.921s
user    3m17.461s
sys     0m0.457s

It probably makes sense to compare the total timings for steps 1, 2 and 3, or maybe even only steps 2 and 3, since here in CPP we are using curand anyway and it makes sense to keep it out of the equation?

valassi commented 4 years ago

Ok I have now added two metrics, TOTAL(23) and TOTAL(3). It is quite clear that the throughput ratios are around ~200 for the former and ~1600 for the latter (the latter is more or less equivalent to Stefan's old throughput). Note that the bottleneck in TOTAL(23) is the copy of momenta, which, both for an integration and for unweighted event generation, would NOT be done for every event, so this will still go MUCH faster on a GPU...
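
For reference, a partial total like TOTAL(23) is just the sum of the timer-map entries whose key starts with the corresponding step digit (e.g. "2c CpDTHwgt", "3a SigmaKin"). A minimal sketch of the idea, illustrative only and not the actual timermap implementation in the repository:

```cpp
#include <map>
#include <string>

// Sum the per-step timers whose key starts with one of the given step digits,
// e.g. steps="23" gives TOTAL(23), steps="3" gives TOTAL(3).
double partialTotal( const std::map<std::string, double>& timers, const std::string& steps )
{
  double sum = 0.;
  for ( const auto& kv : timers )
    if ( steps.find( kv.first[0] ) != std::string::npos ) // keys look like "2c CpDTHwgt"
      sum += kv.second;
  return sum;
}
```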

time ./gcheck.exe -p 16384 32 120
...
MatrixElemEventsPerSec    = 6.395433e+08 sec^-1
...
TOTAL       : 2.672413 sec
TOTAL(123)  : 1.008676 sec
TOTAL(23)   : 0.927415 sec
TOTAL(3)    : 0.100132 sec

and

time ./check.exe -p 16384 32 120
...
MatrixElemEventsPerSec    = 3.649238e+05 sec^-1
...
TOTAL       : 198.178268 sec
TOTAL(123)  : 197.277374 sec
TOTAL(23)   : 193.774826 sec
TOTAL(3)    : 172.404663 sec
valassi commented 3 years ago

I have made a few improvements in this area in the last couple of weeks. Most recently, PR #46 fixed several issues:

./check.exe -p 16384 32 12
TotalTime[RndNumGen]    (1)= ( 3.389900e-01 ) sec
TotalTime[Rambo]        (2)= ( 1.205101e+00 ) sec
TotalTime[MatrixElems]  (3)= ( 1.634346e+01 ) sec
TotalEventsComputed        = 6291456
EvtsPerSecRnd+Rmb+ME       = ( 3.517226e+05 ) sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.585169e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.849525e+05 ) sec^-1
TOTAL (1) : 0.338990 sec
TOTAL (2) : 1.205101 sec
TOTAL (3) : 16.343462 sec


- The "RamboEventsPerSec" metric described in the previous posts from August 2020 is misleading, if not useless. That metric is just the number of events divided by the time spent in rambo _alone_ (i.e. "step 2"). This is especially misleading in C++, where rambo is very fast and the ME is much slower. I have removed this metric. (It is even possible that I intended to define this as the "Rambo+ME" EventsPerSec and introduced an implementation bug.)

- In my fix I also made sure that the sums printed upfront, and used to compute throughputs, exactly match those printed at the end by the timer map. Previously a bug prevented this: the computation of good helicities (SGoodHel), which is done only once in the first iteration and as such should be considered an initialisation step, was included in the ME sum in the timer dump. This was step 3a; it is now step 0d.

- Another point to note is that I added a dump of the Rambo weights. I had added this weeks ago, but in PR #46 I improved the dump of the relevant stats. By definition, in fact, Rambo is a democratic sampler that defines a uniform distribution over the phase space, so each event has the same Rambo weight. The standard deviation, however, was not 0 initially. This is now fixed. The calculation of min, average and stddev used to be done in a single loop; now it is split in three: first I compute the min, then I compute the average from the average difference to the min, then I compute the stddev from the average of the squared differences to the average (see the sketch after this list). This is a simple algorithmic improvement to avoid numerical precision issues. I had also considered the [Kahan summation algorithm](https://en.wikipedia.org/wiki/Kahan_summation_algorithm), but in the end this seemed a better choice.
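
A minimal sketch of the three-pass min/mean/stddev calculation described in the last bullet (illustrative only, not the exact code merged in PR #46):

```cpp
#include <cmath>
#include <vector>

// Pass 1: minimum. Pass 2: mean, from the average offset to the minimum.
// Pass 3: stddev, from the average squared offset to the mean.
// Subtracting a nearby reference first avoids cancellation when all values are similar
// (e.g. identical Rambo weights), so the stddev of a constant sample comes out exactly 0.
void minMeanStddev( const std::vector<double>& v, double& vmin, double& mean, double& stddev )
{
  vmin = v[0]; // assumes v is not empty
  for ( double x : v ) if ( x < vmin ) vmin = x;
  double sumOff = 0.;
  for ( double x : v ) sumOff += ( x - vmin );
  mean = vmin + sumOff / v.size();
  double sumSq = 0.;
  for ( double x : v ) sumSq += ( x - mean ) * ( x - mean );
  stddev = std::sqrt( sumSq / v.size() );
}
```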

The numbers above are taken from the following full printout, using the code from https://github.com/madgraph5/madgraph4gpu/commit/6360c713b6700b64a29c8876ce07e32ee18b394c

./gcheck.exe -p 16384 32 12


NumBlocksPerGrid          = 16384
NumThreadsPerBlock        = 32
NumIterations             = 12

FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Random number generation  = CURAND DEVICE (CUDA code)

NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 8.869284e-02 ) sec
TotalTime[Rambo+ME]    (23)= ( 8.103600e-02 ) sec
TotalTime[RndNumGen]    (1)= ( 7.656840e-03 ) sec
TotalTime[Rambo]        (2)= ( 7.178315e-02 ) sec
TotalTime[MatrixElems]  (3)= ( 9.252852e-03 ) sec
MeanTimeInMatrixElems      = ( 7.710710e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.640640e-04 , 7.886470e-04 ] sec

TotalEventsComputed        = 6291456
EvtsPerSecRnd+Rmb+ME       = ( 7.093533e+07 ) sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.763779e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.799478e+08 ) sec^-1


NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03 ) GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00 )


00 CudaFree : 0.695882 sec
0a ProcInit : 0.000472 sec
0b MemAlloc : 0.035988 sec
0c GenCreat : 0.012723 sec
0d SGoodHel : 0.001747 sec
1a GenSeed  : 0.000100 sec
1b GenRnGen : 0.007557 sec
2a RamboIni : 0.000228 sec
2b RamboFin : 0.000132 sec
2c CpDTHwgt : 0.006080 sec
2d CpDTHmom : 0.065344 sec
3a SigmaKin : 0.000199 sec
3b CpDTHmes : 0.009054 sec
4a DumpLoop : 0.041213 sec
8a CompStat : 0.045585 sec
9a GenDestr : 0.000073 sec
9d DumpScrn : 0.000243 sec
9e DumpJson : 0.000007 sec
TOTAL       : 0.922628 sec
TOTAL (123) : 0.088693 sec
TOTAL (23)  : 0.081036 sec
TOTAL (1)   : 0.007657 sec
TOTAL (2)   : 0.071783 sec
TOTAL (3)   : 0.009253 sec


./check.exe -p 16384 32 12


NumBlocksPerGrid          = 16384
NumThreadsPerBlock        = 32
NumIterations             = 12

FP precision              = DOUBLE (nan=0)
Complex type              = STD::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Random number generation  = CURAND (C++ code)

NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.788755e+01 ) sec
TotalTime[Rambo+ME]    (23)= ( 1.754856e+01 ) sec
TotalTime[RndNumGen]    (1)= ( 3.389900e-01 ) sec
TotalTime[Rambo]        (2)= ( 1.205101e+00 ) sec
TotalTime[MatrixElems]  (3)= ( 1.634346e+01 ) sec
MeanTimeInMatrixElems      = ( 1.361955e+00 ) sec
[Min,Max]TimeInMatrixElems = [ 1.360818e+00 , 1.364657e+00 ] sec

TotalEventsComputed        = 6291456
EvtsPerSecRnd+Rmb+ME       = ( 3.517226e+05 ) sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.585169e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.849525e+05 ) sec^-1


NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03 ) GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00 )


0a ProcInit : 0.000416 sec
0b MemAlloc : 0.000041 sec
0c GenCreat : 0.000864 sec
1a GenSeed  : 0.000092 sec
1b GenRnGen : 0.338898 sec
2a RamboIni : 0.092733 sec
2b RamboFin : 1.112368 sec
3a SigmaKin : 16.343462 sec
4a DumpLoop : 0.035383 sec
8a CompStat : 0.040054 sec
9a GenDestr : 0.000076 sec
9d DumpScrn : 0.000258 sec
9e DumpJson : 0.000020 sec
TOTAL       : 17.964664 sec
TOTAL (123) : 17.887552 sec
TOTAL (23)  : 17.548563 sec
TOTAL (1)   : 0.338990 sec
TOTAL (2)   : 1.205101 sec
TOTAL (3)   : 16.343462 sec


valassi commented 3 years ago

The next steps in my plan to get better metrics are to get a more realistic workload: a full phase space integration (cross section calculation) and an unweighted event generation.

These two workloads should probably be two separate command line options, or they could be run sequentially (in any case the "phase space integration" is needed to precompute the maximum ME for the unweighting algorithm, as in the sketch below).
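
For context, a hit-or-miss unweighting step could look roughly like the sketch below (purely illustrative, not code from this repository); it shows why the maximum weight must be precomputed, e.g. in a preceding phase space integration pass:

```cpp
#include <random>
#include <vector>

// Accept event i with probability weights[i] / wMax: this requires a (safe estimate of
// the) maximum weight wMax, obtained beforehand from a phase space integration pass.
std::vector<size_t> unweight( const std::vector<double>& weights, double wMax )
{
  std::mt19937 rng( 12345 ); // fixed seed, illustrative only
  std::uniform_real_distribution<double> uni( 0., 1. );
  std::vector<size_t> accepted;
  for ( size_t i = 0; i < weights.size(); ++i )
    if ( uni( rng ) * wMax < weights[i] ) accepted.push_back( i );
  return accepted;
}
```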

valassi commented 3 years ago

@roiser, about the discussion we are having offline about the plot for tomorrow's workshop, see these numbers:

> ./gcheck.exe -p 16384 32 12
> EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.093533e+07                 )  sec^-1
> EvtsPerSec[Rmb+ME]     (23)= ( 7.763779e+07                 )  sec^-1
> EvtsPerSec[MatrixElems] (3)= ( 6.799478e+08                 )  sec^-1
> 
> ./check.exe -p 16384 32 12
> EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.517226e+05                 )  sec^-1
> EvtsPerSec[Rmb+ME]     (23)= ( 3.585169e+05                 )  sec^-1
> EvtsPerSec[MatrixElems] (3)= ( 3.849525e+05                 )  sec^-1

For this single point on the plot, ie NumBlocksPerGrid=16384 and NumThreadsPerBlock=32, the ratios are essentially:

EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.1e+07 / 3.5e+05 ) = 202
EvtsPerSec[Rmb+ME]     (23)= ( 7.8e+07 / 3.6e+05 ) = 217
EvtsPerSec[MatrixElems] (3)= ( 6.8e+08 / 3.8e+05 ) = 1766

So in summary, the GPU is ~1800x faster than a single CPU core for the ME alone (including the copy of the ME), but only ~220x for ME+rambo and ~200x for ME+rambo+random. Is this consistent with what you see?

valassi commented 3 years ago

I discussed with @roiser and we understood that there is one point where we were finding slightly different results: for Rnd+Rmb+ME workflows, he gets a GPU/CPU factor that is closer to 1000 than to 200.

What he does differently, however, is that he does not run ./check.exe with three parameters, but with a single one. (He actually means it as a number of events, while it is a number of iterations, but that is another story, see issue #56.) What this does in practice is run the software with "1 32" as the first two parameters.

My understanding is that this changes the behaviour significantly because a much higher number of iterations is executed. Previously, with the Madgraph internal random generator, this did not make a difference: the computing time for generating random numbers was proportional to the number of events. Now however, when using curand, my understanding is that curand does some precalculation and precaching of random numbers for every new seed, so the computing time is not proportional to the number of random numbers unless a sufficiently high number of random numbers is generated for each seed. What must be kept in mind is that we seed the generator (hence forcing a new random sequence precalculation) at every new iteration.

The following examples in my opinion show this quite clearly:

[avalassi@itscrd03 bash] > ./check.exe -p 1 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 512
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.772325e+04                 )  sec^-1
1a GenSeed  :     0.000103 sec
1b GenRnGen :     0.005606 sec

[avalassi@itscrd03 bash] > ./check.exe -p 4 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 2048
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.711470e+05                 )  sec^-1
1a GenSeed  :     0.000108 sec
1b GenRnGen :     0.005661 sec

[avalassi@itscrd03 bash] > ./check.exe -p 16 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 8192
EvtsPerSec[Rnd+Rmb+ME](123)= ( 2.788849e+05                 )  sec^-1
1a GenSeed  :     0.000099 sec
1b GenRnGen :     0.005909 sec

[avalassi@itscrd03 bash] > ./check.exe -p 64 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 32768
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.299350e+05                 )  sec^-1
1a GenSeed  :     0.000102 sec
1b GenRnGen :     0.007235 sec

[avalassi@itscrd03 bash] > ./check.exe -p 256 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 131072
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.460718e+05                 )  sec^-1
1a GenSeed  :     0.000106 sec
1b GenRnGen :     0.012580 sec

[avalassi@itscrd03 bash] > ./check.exe -p 1024 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.512415e+05                 )  sec^-1
1a GenSeed  :     0.000118 sec
1b GenRnGen :     0.033943 sec

[avalassi@itscrd03 bash] > ./check.exe -p 4096 32 16 | egrep '\]\(123| GenSeed|GenRn|Computed'
TotalEventsComputed        = 2097152
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.515863e+05                 )  sec^-1
1a GenSeed  :     0.000156 sec
1b GenRnGen :     0.117907 sec

The above tests show that the time spent in CPU C++ in the actual random number generation (GenRnGen) only starts being proportional to the number of computed events when there are at least 32k events per iteration. For fewer than ~500 events, instead, the generation time is practically constant (~0.006 sec), irrespective of the number of events.

Accordingly, the EvtsPerSec[Rnd+Rmb+ME] throughput only reaches a stable value if at least 32k events per iteration are generated.

In general, I would recommend doing comparisons of C++ and GPU using exactly the same three command line arguments, which by design with curand will also provide exactly the same random numbers and exactly the same "physics" values.

roiser commented 3 years ago

@valassi I think it's first important to understand why the curand behavior changes with those two additional parameters. Then we can fix the arguments in whatever way is useful.

valassi commented 3 years ago

Hi @roiser thanks. I think I understand it quite clearly, as mentioned above: the product of the first two parameters is the number of events per iteration. For curand, this number (times 16, I think; there are 16 random numbers per event) is the number of random numbers which are used downstream in the code in each iteration, i.e. for each seeded sequence. Remember that each iteration sets a new seed, so it starts a new random sequence.

The important point is that generating few random numbers per iteration (i.e. per sequence) has a large overhead: I do not know if it is because there is some precaching/pregenerating, or simply some initialization overhead, but clearly generating few random numbers is slow. This is clearly stated in the curand manual, https://docs.nvidia.com/cuda/curand/host-api-overview.html#performance-notes2: "In general you will get the best performance from the cuRAND library by generating blocks of random numbers that are as large as possible. Fewer calls to generate many random numbers is more efficient than many calls generating only a few random numbers."
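
A minimal host-API sketch of this pattern (one new seed per iteration, then a single large generate call). The generator type, seed values and buffer handling are illustrative; the real check.exe may differ:

```cpp
#include <curand.h>
#include <vector>

// Each "iteration" sets a new seed (i.e. starts a new sequence) and then generates one
// large block of random numbers. The per-seed setup cost is only amortized if the block
// size (nEventsPerIteration * nRandomsPerEvent) is large enough.
void generateAllIterations( int nIterations, int nEventsPerIteration, int nRandomsPerEvent )
{
  curandGenerator_t gen;
  curandCreateGeneratorHost( &gen, CURAND_RNG_PSEUDO_MRG32K3A );
  std::vector<double> rnd( (size_t)nEventsPerIteration * nRandomsPerEvent );
  for ( int iter = 0; iter < nIterations; ++iter )
  {
    curandSetPseudoRandomGeneratorSeed( gen, 20200805 + iter ); // new seed => new sequence
    curandGenerateUniformDouble( gen, rnd.data(), rnd.size() ); // one big generate call
    // ... feed rnd to rambo and the ME calculation for this iteration ...
  }
  curandDestroyGenerator( gen );
}
```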

Does this clarify?

About the arguments, as mentioned in #56, I would keep the option of passing three arguments, with the guarantee that the same three arguments give the same physics on c++ and cuda. I would do this even if on c++ the first two arguments individually have no meaning, and only their product has a meaning.

valassi commented 3 years ago

About one of the issues I mentioned at the beginning of this thread, the time spent in device-to-host copies: I just merged #201. I include here some of the comments.

Essentially, I have added a cudaDeviceSynchronize between sigmaKin and the copy of the MEs from device to host.

The overall throughput does not change (*), while the relative split of sigmakin and copyDtoH changes enormously. Now it is clear that the copy takes around the same time as the sigmakin calculation for eemumu, while for ggttgg it is totally negligible with respect to it.

(*) NB: I am observing big fluctuations on the VM, and they actually come from the copy, while the calculation is rather stable. For eemumu this translates into large fluctuations (between 6.0 and 6.5E8, lower than the usual 7.3E8 that it should be). This is discussed in https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF1789593
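
Schematically, the change amounts to the sketch below (placeholder kernel and hypothetical timer calls, not the actual gcheck.exe code): without the synchronization, the asynchronous kernel launch returns immediately and the kernel execution time is silently charged to the following device-to-host copy.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for sigmaKin.
__global__ void sigmaKinSketch( double* mes )
{
  mes[blockIdx.x * blockDim.x + threadIdx.x] = 1.;
}

void computeAndCopyMEs( double* devMEs, double* hstMEs, int nblocks, int nthreads )
{
  // timer.start( "3a SigmaKin" );  // hypothetical timer calls
  sigmaKinSketch<<<nblocks, nthreads>>>( devMEs ); // asynchronous launch
  cudaDeviceSynchronize(); // without this, the kernel time leaks into the copy timer
  // timer.start( "3b CpDTHmes" );
  cudaMemcpy( hstMEs, devMEs, sizeof(double) * nblocks * nthreads, cudaMemcpyDeviceToHost );
  // timer.stop();
}
```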

Before adding cudaDeviceSynchronize, https://github.com/madgraph5/madgraph4gpu/commit/ce6fd6574d0518a26f3660ae65c1cfeefa5d9426

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.568678e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
3a SigmaKin :     0.000086 sec
3b CpDTHmes :     0.009492 sec
TOTAL       :     0.742937 sec
     2,594,268,110      cycles                    #    2.658 GHz
     3,534,129,010      instructions              #    1.36  insn per cycle
       1.045660455 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.397884e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
3a SigmaKin :     0.000009 sec
3b CpDTHmes :     0.037245 sec
TOTAL       :     0.606138 sec
     2,201,855,033      cycles                    #    2.656 GHz
     2,955,778,217      instructions              #    1.34  insn per cycle
       0.890392160 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
=========================================================================

After adding cudaDeviceSynchronize, https://github.com/madgraph5/madgraph4gpu/commit/ff306ff1225a10aa9c881d6282d6004aff3d12f2

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.480230e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
3a SigmaKin :     0.004613 sec
3b CpDTHmes :     0.005095 sec
TOTAL       :     0.743895 sec
     2,601,684,149      cycles                    #    2.657 GHz
     3,532,142,253      instructions              #    1.36  insn per cycle
       1.049697294 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.427943e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
3a SigmaKin :     0.036976 sec
3b CpDTHmes :     0.000026 sec
TOTAL       :     0.651707 sec
     2,224,902,887      cycles                    #    2.657 GHz
     2,976,495,038      instructions              #    1.34  insn per cycle
       0.941813933 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
=========================================================================

By the way, I am using 2048 256 12 for eemumu, against 64 256 1 for ggttgg. This means I am copying 32x12, i.e. 384 times less data in ggttgg. In eemumu it takes 5095 usec to copy the data, in ggttgg 26 usec, so a factor ~200 difference, which looks vaguely reasonable.

I will also create a few plots with Nsight Systems for the vCHEP talk.

valassi commented 3 years ago

Hi @roiser @oliviermattelaer @hageboeck I have added a new throughput metric. In CUDA, I compute events per second by only including the ME calculation time, not the device-to-host copy time of the MEs. As discussed, for ggttgg and other complex processes this should be irrelevant, but for eemumu it gives almost a factor two (so the GPU to single CPU core ratio jumps from ~600 to ~1200). There is no difference for C++ of course (no device-to-host copies). I believe this is a more relevant metric as we move towards fortran integration and focus more on complex processes.
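
In terms of the timer buckets shown earlier, the two metrics differ only in whether the device-to-host copy time enters the denominator. A trivial sketch with illustrative names:

```cpp
// EvtsPerSec[MatrixElems] (3)  : ME kernel plus device-to-host copy of the MEs.
// EvtsPerSec[MECalcOnly]  (3a) : ME kernel only.
double evtsPerSecMatrixElems( double nEvents, double tSigmaKin, double tCpDTHmes )
{
  return nEvents / ( tSigmaKin + tCpDTHmes );
}

double evtsPerSecMECalcOnly( double nEvents, double tSigmaKin )
{
  return nEvents / tSigmaKin;
}
```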

Also, it should make throughput measurements much more stable: on the CERN VM, we see that the large fluctuations we observe are mainly in the copy time, not in the calculation time. See https://cern.service-now.com/service-portal?id=ticket&table=u_request_fulfillment&n=RQF1789593

See https://github.com/madgraph5/madgraph4gpu/commit/fcd226b81fc3d240605f9559a6cdfc6491a10213

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.203627e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362887e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.753617 sec
     2,534,760,915      cycles                    #    2.648 GHz
     3,444,528,039      instructions              #    1.36  insn per cycle
       1.051825861 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.309049e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.154623 sec
    19,149,345,085      cycles                    #    2.674 GHz
    48,585,100,801      instructions              #    2.54  insn per cycle
       7.164814148 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.911659e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.582398 sec
     9,078,069,473      cycles                    #    2.529 GHz
    16,497,598,830      instructions              #    1.82  insn per cycle
       3.592922555 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
=========================================================================
valassi commented 3 years ago

PS The reason I do this now is that I will next work on the neppV/neppM decoupling and other AOSOA things, and I want to do a few tests where I prefer to have more stability. In any case the old metric is still available.

valassi commented 1 year ago

Many of the issues here are still pending, but on the other hand we are using more solid performance metrics than in 2020. This can be closed. We can follow up on the remaining performance metric issues in other tickets as needed.