valassi opened 1 month ago
The functionality is in principle complete, including the backport to CODEGEN. I will run some functionality and performance tests.
There are some test failures in the CI only for some processes, I will need to take a look.
In any case the speed tests are surprisingly interesting (I refer to HELINL=L, where HelAmps.o is built as a separate object in C++ and CUDA; in CUDA this needs RDC):
This means that one could imagine a 'best' mixed mode where HELINL=0 is used for C++ but HELINL=L is used for CUDA.
I will run some madevent tests tonight too to compare HELINL=0 and HELINL=L. And then I should time the build times, without ccache, separately for CUDA and for each C++ backend.
Removing inlining by hand is an option, but the small tests I have done in the past showed really bad performance.
Note, this is related to #348 about reducing build times. The comment above is from https://github.com/madgraph5/madgraph4gpu/issues/348#issue-1114070762 (well, from Jan 2022...)
The motivation for doing these RDC tests is that the move of FFV functions to template functions (whether with an explicit inline parameter or not) increased build times with cuda 11.1 enormously.
And note, this is related to #51 about assessing RDC. The comment above is from https://github.com/madgraph5/madgraph4gpu/issues/51#issuecomment-1015141703 (again, from Jan 2022)
In any case the speed tests are surprisingly interesting (I refer to HELINL=L, where HelAmps.o is built as a separate object in C++ and CUDA; in CUDA this needs RDC):
* the build time for ggttggg is a factor 2 faster for HELINL=L than for the default HELINL=0: I assume that the speedup comes from cuda and not C++ (this is the time for 'make bldall' that builds all backends, I should test them separately)
* for C++, the new HELINL=L mode is actually a bit slower at runtime for ggttggg (and I assume that it is not much faster at build time)
* for CUDA, the new HELINL=L mode, which uses RDC, is surprisingly 5-10% faster than the default?! and I assume that it is the cuda build that is a factor 2 faster...
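As a minimal sketch of why the separate HelAmps.o needs RDC in cuda (file and function names here are hypothetical, not the real code): a __device__ function defined in one translation unit and called from another can only be resolved at device link time, which requires relocatable device code:

```cuda
// HelAmps.cu -- a device helper compiled into its own object file
__device__ double ffvToy( const double* w ) { return w[0] + w[1]; }

// CPPProcess.cu -- a separate translation unit calling the helper;
// without RDC, nvcc cannot resolve a cross-object __device__ call
extern __device__ double ffvToy( const double* w );
__global__ void sigmaKinToy( const double* w, double* out ) { *out = ffvToy( w ); }

// Build sketch (nvcc performs the device link when given -rdc objects):
//   nvcc -rdc=true -c HelAmps.cu CPPProcess.cu
//   nvcc -rdc=true HelAmps.o CPPProcess.o -o check.exe
```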
This is from https://github.com/madgraph5/madgraph4gpu/pull/978/commits/bc897191933a894bd7d141dbfeb2378e42d41d26
diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.338149e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02 ) sec^-1
-MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 2.242693 sec
-INFO: No Floating Point Exceptions have been reported
- 7,348,976,543 cycles # 2.902 GHz
- 16,466,315,526 instructions # 2.24 insn per cycle
- 2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME] (23) = ( 4.063038e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02 ) sec^-1
+MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
+TOTAL : 2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+ 7,969,059,552 cycles # 2.893 GHz
+ 17,401,037,642 instructions # 2.18 insn per cycle
+ 2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
Workflow summary = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME] (23) = ( 3.459662e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02 ) sec^-1
+EvtsPerSec[Rmb+ME] (23) = ( 3.835352e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02 ) sec^-1
MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 1.528240 sec
+TOTAL : 1.378567 sec
INFO: No Floating Point Exceptions have been reported
- 4,140,408,789 cycles # 2.703 GHz
- 9,072,597,595 instructions # 2.19 insn per cycle
- 1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:94048) (512y: 91) (512z: 0)
+ 3,738,350,469 cycles # 2.705 GHz
+ 8,514,195,736 instructions # 2.28 insn per cycle
+ 1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:80619) (512y: 89) (512z: 0)
-------------------------------------------------------------------------
There were some issues for ee_mumu, which I have now fixed. Let's see how the CI goes now.
Note, a more recent test on madevent (rather than standalone) showed that actually there is a runtime penalty of around 10-15% in both C++ and CUDA (which is more in line with what I thought I had observed in the past). However this could still be quite interesting if it does significantly reduce build times for very complex processes.
https://github.com/madgraph5/madgraph4gpu/pull/978/commits/125b7b49e42578c8c15f54f2e92ddf37cf666fcb
diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 128/128
[XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
[XSECTION] ChannelId = 1
[XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
[UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL : 320.6913s
- [COUNTERS] Fortran Overhead ( 0 ) : 4.5138s
- [COUNTERS] CudaCpp MEs ( 2 ) : 316.1312s for 90112 events => throughput is 2.85E+02 events/s
- [COUNTERS] CudaCpp HEL ( 3 ) : 0.0463s
+ [COUNTERS] PROGRAM TOTAL : 288.3304s
+ [COUNTERS] Fortran Overhead ( 0 ) : 4.4909s
+ [COUNTERS] CudaCpp MEs ( 2 ) : 283.7968s for 90112 events => throughput is 3.18E+02 events/s
+ [COUNTERS] CudaCpp HEL ( 3 ) : 0.0426s
-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
[OPENMPTH] omp_get_max_threads/nproc = 1/4
[NGOODHEL] ngoodhel/ncomb = 128/128
[XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
[XSECTION] ChannelId = 1
[XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
[UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL : 19.6663s
- [COUNTERS] Fortran Overhead ( 0 ) : 4.9649s
- [COUNTERS] CudaCpp MEs ( 2 ) : 13.4667s for 90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL ( 3 ) : 1.2347s
+ [COUNTERS] PROGRAM TOTAL : 18.0242s
+ [COUNTERS] Fortran Overhead ( 0 ) : 4.9891s
+ [COUNTERS] CudaCpp MEs ( 2 ) : 11.9530s for 90112 events => throughput is 7.54E+03 events/s
+ [COUNTERS] CudaCpp HEL ( 3 ) : 1.0821s
I am adding here some comments that I had started last week. I have renamed this PR and put it in WIP. Many features are complete, but I am moving on to other things and just want to document the status so far before I do.
(1) Description so far
Below is an update and a description before I move back to other things.
I added a new HELINL=L mode. This complements the default HELINL=0 mode and the experimental HELINL=1 mode.
HELINL=0 (default) aka "templates with moderate inlining". This has templated helas FFV functions. The template parameters are the memory access classes, i.e. essentially the template specialization depends on the AOSOA format used for momenta, wavefunctions and couplings. The sigmakin and calculate_wavefunction functions in CPPProcess.cc use these templated FFV functions, which are then implemented (and possibly inlined) there. The build times can be long, because the same templates are re-evaluated all over the place, but the runtime speed is good.
HELINL=1 aka "templates with aggressive inlining". This is the mode that I had introduced to mimic -flto, i.e. link time optimizations. The FFV functions (and others) are inlined with always_inline. This significantly increases the build times, because in practice it does the equivalent of link time optimizations while compiling CPPProcess.o. The runtime speed can get a significant boost for simple processes, where data access is important, but the speedups tend to decrease for complex processes, where arithmetic operations dominate. In a realistic madevent environment, this is probably not interesting: for simple processes it can be interesting, but the ME calculation is outnumbered by the non-ME fortran parts, so faster MEs do not help much; for complex processes, the build times become just too large.
HELINL=L aka "linked objects". This is the new mode I introduced here. The FFV functions are pre-compiled for the appropriate templates into .o object files. A technical detail: the HelAmps.cc file is common in SubProcesses, but it must be compiled in each P* subdirectory, because the memory access classes may be different: for instance, a subprocess with 3 final state particles and one with 4 particles have different AOSOA, hence different memory access classes. My tests so far show that the build times can improve by a factor two, while the runtime can degrade by around 10% for complex processes. (More detailed studies should show whether it is the cuda or C++ build times that improve, or both.) This goes somewhat in the direction of splitting kernels, which is the context where I had imagined it, but it is not exactly the same. It may become interesting for users, especially for complex processes, and especially as long as the non-ME part is still important (e.g. in DY+3j, where the cuda ME becomes 25% of the time and the non-ME sampling is over 50%, having a ME that is 10% slower is acceptable).
(2) To do (non-exhaustive list)
This is a non-exhaustive list of pending items (unfortunately I was interrupted last week while writing this, so I may be forgetting things).
I updated this with the latest master, as I am doing on all PRs.
- test this mode on HIP (what is the RDC equivalent?)
I had some LUMI shell running and I tried this (after also merging in #1007 with various AMD things).
There is a -fgpu-rdc flag, which succeeds at compilation, but the issues come at link time. With -fgpu-rdc --hip-link it then links, but it fails at runtime with #802. Note that #802 is actually a 'shared object initialization failed' error.
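For reference, the flags above would be used roughly as follows (a build-command sketch only; file names are hypothetical and this is the sequence that currently fails at runtime with #802):

```
# compile with relocatable device code (this step succeeds)
hipcc -fgpu-rdc -c HelAmps.cc -o HelAmps_hip.o
hipcc -fgpu-rdc -c CPPProcess.cc -o CPPProcess_hip.o
# device link (succeeds with --hip-link, but the executable then hits #802 at runtime)
hipcc -fgpu-rdc --hip-link HelAmps_hip.o CPPProcess_hip.o -o check_hip.exe
```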
So that is the status for now.
Now including upstream/master with v1.00.00 and also the AMD and v1.00.01 patches https://github.com/madgraph5/madgraph4gpu/pull/1014 and https://github.com/madgraph5/madgraph4gpu/pull/1012
WIP on removing template/inline from helas (related to splitting kernels)