valassi opened 2 months ago
I am doing a few tests with sample_get_x towards vectorising it, see https://github.com/madgraph5/madgraph4gpu/issues/963

Apart from the issue reported in #968, I think I have identified another two trivial but useful improvements in sample_get_x. This is WIP, to be confirmed.

One, some minor streamlining of the xbin_min and xbin_max calculations seems to be useful. This might be this commit, but it seems too silly to have an effect, so maybe the gain was elsewhere: https://github.com/madgraph5/madgraph4gpu/pull/970/commits/23a1358aba77753a02079979df09ea664a75cd89

Two, I checked that in a case like CMS DY+3j the function is most often called with xmin=0 or xmax=1, and it is possible to cache the corresponding xbin values (a sketch of the idea follows). This is https://github.com/madgraph5/madgraph4gpu/pull/970/commits/291bcf5be4b96cd420aa40d227121a48c377e497
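The real sample_get_x lives in Fortran (MadEvent's dsample.f) and the actual change is in the commit linked above, so the following is only a minimal C++ sketch of the caching idea, under the assumption that the expensive part is looking up the grid bin containing a given x. The Grid struct and the findBin/findBinCached helpers are hypothetical names invented for this illustration, not the real code.

```cpp
// Minimal sketch (not the real dsample.f code): cache the bin index for the
// two argument values that dominate in DY+3j, xmin=0 and xmax=1, so that the
// bin search is done once per grid instead of once per call.
#include <cstdio>
#include <vector>

struct Grid // hypothetical stand-in for one per-dimension integration grid
{
  std::vector<double> edges; // bin edges in [0,1] (edges.front()=0, edges.back()=1)
  int cachedBinX0 = -1;      // cached result of findBin( *this, 0. ), -1 if not yet computed
  int cachedBinX1 = -1;      // cached result of findBin( *this, 1. ), -1 if not yet computed
};

// Hypothetical bin lookup (binary search): returns the index of the bin containing x.
int findBin( const Grid& g, double x )
{
  int lo = 0;
  int hi = static_cast<int>( g.edges.size() ) - 2; // last bin index
  while( lo < hi )
  {
    const int mid = ( lo + hi + 1 ) / 2;
    if( g.edges[mid] <= x ) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}

// Same lookup, but short-circuit the two most frequent arguments via the cache.
int findBinCached( Grid& g, double x )
{
  if( x == 0. )
  {
    if( g.cachedBinX0 < 0 ) g.cachedBinX0 = findBin( g, x );
    return g.cachedBinX0;
  }
  if( x == 1. )
  {
    if( g.cachedBinX1 < 0 ) g.cachedBinX1 = findBin( g, x );
    return g.cachedBinX1;
  }
  return findBin( g, x ); // general case, not cached
}

int main()
{
  Grid g;
  g.edges = { 0., 0.1, 0.3, 0.6, 1. }; // four bins with made-up edges
  std::printf( "bin(0)=%d bin(1)=%d bin(0.5)=%d\n",
               findBinCached( g, 0. ), findBinCached( g, 1. ), findBinCached( g, 0.5 ) );
  return 0;
}
```

One caveat, again an assumption about the real code rather than something shown above: the integration grid is refined between iterations, so any cached bin index would need to be invalidated whenever the grid is updated.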
See the difference below, starting from the default https://github.com/madgraph5/madgraph4gpu/pull/946/commits/079207d4b17410b6ac29e91c9f8370eb7e83a5d5
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
Found 997 events.
Wrote 59 events.
Actual xsec 5.9274488566377981
[COUNTERS] PROGRAM TOTAL : 4.6537s
[COUNTERS] Fortran Other ( 0 ) : 0.1603s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0673s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 3.4183s for 1170103 events => throughput is 2.92E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1002s for 49152 events => throughput is 2.04E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1307s for 16384 events => throughput is 7.98E-06 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0505s for 16384 events => throughput is 3.08E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0657s for 16384 events => throughput is 4.01E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1321s for 1170103 events => throughput is 1.13E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4682s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0257s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0346s for 16384 events => throughput is 2.11E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 4.6191s
[COUNTERS] OVERALL MEs ( 22 ) : 0.0346s for 16384 events => throughput is 2.11E-06 events/s
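In this baseline, Fortran Random2Momenta dominates the non-ME time (3.42s out of a 4.65s total), and since sample_get_x is called from the x-to-momenta sampling this is the counter where the two changes above should show up.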
And then change 1, removing a few xbin calls: https://github.com/madgraph5/madgraph4gpu/pull/946/commits/b69c61cfe05cee73a8c78d0b5119d1fbd8c51837
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.5494s
[COUNTERS] Fortran Other ( 0 ) : 0.1688s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0669s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 3.2830s for 1170103 events => throughput is 2.81E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1061s for 49152 events => throughput is 2.16E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1361s for 16384 events => throughput is 8.31E-06 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0519s for 16384 events => throughput is 3.17E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0649s for 16384 events => throughput is 3.96E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1366s for 1170103 events => throughput is 1.17E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4745s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0257s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0349s for 16384 events => throughput is 2.13E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 4.5145s
[COUNTERS] OVERALL MEs ( 22 ) : 0.0349s for 16384 events => throughput is 2.13E-06 events/s
And then change 2, caching the xbin values: https://github.com/madgraph5/madgraph4gpu/pull/946/commits/a6d57a841897f1bfd2ec5ead80dfa54ec4579ee7
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.2184s
[COUNTERS] Fortran Other ( 0 ) : 0.1695s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0672s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 2.9293s for 1170103 events => throughput is 2.50E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1094s for 49152 events => throughput is 2.23E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1379s for 16384 events => throughput is 8.42E-06 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0560s for 16384 events => throughput is 3.42E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0707s for 16384 events => throughput is 4.31E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1447s for 1170103 events => throughput is 1.24E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4719s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0267s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0350s for 16384 events => throughput is 2.13E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 4.1834s
[COUNTERS] OVERALL MEs ( 22 ) : 0.0350s for 16384 events => throughput is 2.13E-06 events/s
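In summary, PROGRAM TOTAL goes from 4.65s (default) to 4.55s (change 1) to 4.22s (changes 1+2), i.e. roughly a 9% overall reduction, driven by Fortran Random2Momenta dropping from 3.42s to 2.93s (about 14%). The ME timings are essentially unchanged (0.0346s vs 0.0350s), as expected since both changes only touch the sampling code.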
I think this could become a small standalone PR. To discuss with @oliviermattelaer