valassi opened 2 months ago
I am doing a few tests with sample_get_x towards vectorising it, see https://github.com/madgraph5/madgraph4gpu/issues/963

Apart from the issue reported in #968, I think I have identified another two trivial but useful improvements in sample_get_x. This is WIP, to be confirmed.

One, some minor streamlining of the xbin_min and xbin_max calculations seems to be useful. This might be this commit, but it seems too silly to have an effect, so maybe the gain was elsewhere: https://github.com/madgraph5/madgraph4gpu/pull/970/commits/23a1358aba77753a02079979df09ea664a75cd89

Two, I checked that in a case like CMS DY+3j the function is most often called with xmin=0 or xmax=1, and it is possible to cache the corresponding xbin values (a sketch of the idea follows). This is https://github.com/madgraph5/madgraph4gpu/pull/970/commits/291bcf5be4b96cd420aa40d227121a48c377e497
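The real sample_get_x lives in Fortran (MadEvent's dsample.f) and the actual change is in the commit linked above, so the following is only a minimal C++ sketch of the caching idea, under the assumption that the expensive part is looking up the grid bin containing a given x. The Grid struct and the findBin/findBinCached helpers are hypothetical names invented for this illustration, not the real code.

```cpp
// Minimal sketch (not the real dsample.f code): cache the bin index for the
// two argument values that dominate in DY+3j, xmin=0 and xmax=1, so that the
// bin search is done once per grid instead of once per call.
#include <cstdio>
#include <vector>

struct Grid // hypothetical stand-in for one per-dimension integration grid
{
  std::vector<double> edges; // bin edges in [0,1] (edges.front()=0, edges.back()=1)
  int cachedBinX0 = -1;      // cached result of findBin( *this, 0. ), -1 if not yet computed
  int cachedBinX1 = -1;      // cached result of findBin( *this, 1. ), -1 if not yet computed
};

// Hypothetical bin lookup (binary search): returns the index of the bin containing x.
int findBin( const Grid& g, double x )
{
  int lo = 0;
  int hi = static_cast<int>( g.edges.size() ) - 2; // last bin index
  while( lo < hi )
  {
    const int mid = ( lo + hi + 1 ) / 2;
    if( g.edges[mid] <= x ) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}

// Same lookup, but short-circuit the two most frequent arguments via the cache.
int findBinCached( Grid& g, double x )
{
  if( x == 0. )
  {
    if( g.cachedBinX0 < 0 ) g.cachedBinX0 = findBin( g, x );
    return g.cachedBinX0;
  }
  if( x == 1. )
  {
    if( g.cachedBinX1 < 0 ) g.cachedBinX1 = findBin( g, x );
    return g.cachedBinX1;
  }
  return findBin( g, x ); // general case, not cached
}

int main()
{
  Grid g;
  g.edges = { 0., 0.1, 0.3, 0.6, 1. }; // four bins with made-up edges
  std::printf( "bin(0)=%d bin(1)=%d bin(0.5)=%d\n",
               findBinCached( g, 0. ), findBinCached( g, 1. ), findBinCached( g, 0.5 ) );
  return 0;
}
```

One caveat, again an assumption about the real code rather than something shown above: the integration grid is refined between iterations, so any cached bin index would need to be invalidated whenever the grid is updated.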
See the difference below, starting from the default https://github.com/madgraph5/madgraph4gpu/pull/946/commits/079207d4b17410b6ac29e91c9f8370eb7e83a5d5
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
Found 997 events.
Wrote 59 events.
Actual xsec 5.9274488566377981
[COUNTERS] PROGRAM TOTAL : 4.6537s
[COUNTERS] Fortran Other ( 0 ) : 0.1603s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0673s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 3.4183s for 1170103 events => throughput is 2.92E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1002s for 49152 events => throughput is 2.04E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1307s for 16384 events => throughput is 7.98E-06 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0505s for 16384 events => throughput is 3.08E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0657s for 16384 events => throughput is 4.01E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1321s for 1170103 events => throughput is 1.13E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4682s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0257s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0346s for 16384 events => throughput is 2.11E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 4.6191s
[COUNTERS] OVERALL MEs ( 22 ) : 0.0346s for 16384 events => throughput is 2.11E-06 events/s
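In this baseline, Fortran Random2Momenta dominates the non-ME time (3.42s out of a 4.65s total), and since sample_get_x is called from the x-to-momenta sampling this is the counter where the two changes above should show up.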
And then change 1, removing a few xbin calls: https://github.com/madgraph5/madgraph4gpu/pull/946/commits/b69c61cfe05cee73a8c78d0b5119d1fbd8c51837
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.5494s
[COUNTERS] Fortran Other ( 0 ) : 0.1688s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0669s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 3.2830s for 1170103 events => throughput is 2.81E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1061s for 49152 events => throughput is 2.16E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1361s for 16384 events => throughput is 8.31E-06 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0519s for 16384 events => throughput is 3.17E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0649s for 16384 events => throughput is 3.96E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1366s for 1170103 events => throughput is 1.17E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4745s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0257s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0349s for 16384 events => throughput is 2.13E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 4.5145s
[COUNTERS] OVERALL MEs ( 22 ) : 0.0349s for 16384 events => throughput is 2.13E-06 events/s
And then change 2, caching the xbin values: https://github.com/madgraph5/madgraph4gpu/pull/946/commits/a6d57a841897f1bfd2ec5ead80dfa54ec4579ee7
CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
[COUNTERS] PROGRAM TOTAL : 4.2184s
[COUNTERS] Fortran Other ( 0 ) : 0.1695s
[COUNTERS] Fortran Initialise(I/O) ( 1 ) : 0.0672s
[COUNTERS] Fortran Random2Momenta ( 3 ) : 2.9293s for 1170103 events => throughput is 2.50E-06 events/s
[COUNTERS] Fortran PDFs ( 4 ) : 0.1094s for 49152 events => throughput is 2.23E-06 events/s
[COUNTERS] Fortran UpdateScaleCouplings ( 5 ) : 0.1379s for 16384 events => throughput is 8.42E-06 events/s
[COUNTERS] Fortran Reweight ( 6 ) : 0.0560s for 16384 events => throughput is 3.42E-06 events/s
[COUNTERS] Fortran Unweight(LHE-I/O) ( 7 ) : 0.0707s for 16384 events => throughput is 4.31E-06 events/s
[COUNTERS] Fortran SamplePutPoint ( 8 ) : 0.1447s for 1170103 events => throughput is 1.24E-07 events/s
[COUNTERS] CudaCpp Initialise ( 11 ) : 0.4719s
[COUNTERS] CudaCpp Finalise ( 12 ) : 0.0267s
[COUNTERS] CudaCpp MEs ( 19 ) : 0.0350s for 16384 events => throughput is 2.13E-06 events/s
[COUNTERS] OVERALL NON-MEs ( 21 ) : 4.1834s
[COUNTERS] OVERALL MEs ( 22 ) : 0.0350s for 16384 events => throughput is 2.13E-06 events/s
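In summary, PROGRAM TOTAL goes from 4.65s (default) to 4.55s (change 1) to 4.22s (changes 1+2), i.e. roughly a 9% overall reduction, driven by Fortran Random2Momenta dropping from 3.42s to 2.93s (about 14%). The ME timings are essentially unchanged (0.0346s vs 0.0350s), as expected since both changes only touch the sampling code.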
I think this could become a small standalone PR. To discuss with @oliviermattelaer