madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) #978

Open valassi opened 1 month ago

valassi commented 1 month ago

WIP on removing template/inline from helas (related to splitting kernels)

valassi commented 1 month ago

The functionality is in principle complete, including the backport to CODEGEN. I will now run some functionality and performance tests.

valassi commented 1 month ago

There are some test failures in the CI, but only for some processes; I will need to take a look.

In any case the speed tests are surprisingly interesting (I refer to HELINL=L, where HelAmps.o is built as a separate object in both C++ and CUDA; in CUDA this requires RDC); the detailed numbers are in a later comment below.

This means that one could imagine a "best" mixed mode where HELINL=0 is used for C++ but HELINL=L is used for CUDA.

I will also run some madevent tests tonight to compare HELINL=0 and HELINL=L. I should then measure the build times, without ccache, separately for CUDA and for each C++ backend.
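
For concreteness, here is a minimal sketch (hypothetical file and function names, simplified signatures, not the actual repo code) of the CUDA pattern that makes RDC necessary when HelAmps is built as a separate object: a __device__ function defined in one translation unit and called from a kernel in another can only be resolved if both are compiled as relocatable device code and device-linked.

// HelAmps.h (hypothetical): declaration of a device helper visible to callers
__device__ void helas_FFV(const double* w1, const double* w2, double* amp);

// HelAmps.cu -> HelAmps.o (compiled with nvcc -rdc=true -c)
__device__ void helas_FFV(const double* w1, const double* w2, double* amp) {
  *amp = w1[0] * w2[0]; // placeholder for the real helicity amplitude maths
}

// CPPProcess.cu -> CPPProcess.o (also compiled with -rdc=true, then device-linked)
__global__ void sigmaKin(const double* w1, const double* w2, double* amp) {
  helas_FFV(w1, w2, amp); // cross-object device call: this is what requires RDC
}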

valassi commented 1 month ago

Removing inlining by hand is an option, but the small tests I did in the past showed really bad performance.

Note, this is related to #348 about reducing build times. The comment above is from https://github.com/madgraph5/madgraph4gpu/issues/348#issue-1114070762 (well, from Jan 2022...)

The motivation for doing these RDC tests is that the move of FFV functions to template functions (whether with an explicit inline parameter or not) enormously increased build times with CUDA 11.1.

And note, this is related to #51 about assessing RDC. The comment above is from https://github.com/madgraph5/madgraph4gpu/issues/51#issuecomment-1015141703 (again, from Jan 2022)

valassi commented 1 month ago

In any case the speed tests are surprisingly interesting (I refer to HELINL=L, where HelAmps.o is built as a separate object in both C++ and CUDA; in CUDA this requires RDC):

* the build for ggttggg is a factor 2 faster for HELINL=L than for the default HELINL=0: I assume that the speedup comes from CUDA and not from C++ (this is the time for 'make bldall', which builds all backends; I should test them separately)

* for C++, the new HELINL=L mode is actually a bit slower at runtime for ggttggg (and I assume that it is not much faster at build time)

* for CUDA, the new HELINL=L mode, which uses RDC, is surprisingly 5-10% faster at runtime than the default?! And I assume that it is the CUDA build that is a factor 2 faster...

This is from https://github.com/madgraph5/madgraph4gpu/pull/978/commits/bc897191933a894bd7d141dbfeb2378e42d41d26

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.338149e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543      cycles                           #    2.902 GHz
-    16,466,315,526      instructions                     #    2.24  insn per cycle
-       2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.063038e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552      cycles                           #    2.893 GHz
+    17,401,037,642      instructions                     #    2.18  insn per cycle
+       2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.459662e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.835352e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.528240 sec
+TOTAL       :     1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789      cycles                           #    2.703 GHz
-     9,072,597,595      instructions                     #    2.19  insn per cycle
-       1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,738,350,469      cycles                           #    2.705 GHz
+     8,514,195,736      instructions                     #    2.28  insn per cycle
+       1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
 -------------------------------------------------------------------------
valassi commented 1 month ago

There were some issues for ee_mumu, which I have now fixed. Let's see how the CI goes now.

Note that a more recent test in madevent (rather than standalone) showed that there is actually a runtime penalty of around 10-15% in both C++ and CUDA (which is more in line with what I thought I had observed in the past). However, this could still be quite interesting if it significantly reduces build times for very complex processes.

https://github.com/madgraph5/madgraph4gpu/pull/978/commits/125b7b49e42578c8c15f54f2e92ddf37cf666fcb

diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt

-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :  320.6913s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.5138s
- [COUNTERS] CudaCpp MEs      ( 2 ) :  316.1312s for    90112 events => throughput is 2.85E+02 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0463s
+ [COUNTERS] PROGRAM TOTAL          :  288.3304s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.4909s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :  283.7968s for    90112 events => throughput is 3.18E+02 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    0.0426s

-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
  [OPENMPTH] omp_get_max_threads/nproc = 1/4
  [NGOODHEL] ngoodhel/ncomb = 128/128
  [XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
  [XSECTION] ChannelId = 1
  [XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
  [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL          :   19.6663s
- [COUNTERS] Fortran Overhead ( 0 ) :    4.9649s
- [COUNTERS] CudaCpp MEs      ( 2 ) :   13.4667s for    90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL      ( 3 ) :    1.2347s
+ [COUNTERS] PROGRAM TOTAL          :   18.0242s
+ [COUNTERS] Fortran Overhead ( 0 ) :    4.9891s
+ [COUNTERS] CudaCpp MEs      ( 2 ) :   11.9530s for    90112 events => throughput is 7.54E+03 events/s
+ [COUNTERS] CudaCpp HEL      ( 3 ) :    1.0821s
valassi commented 1 month ago

I am now adding some comments that I had started writing last week. I have renamed this PR and marked it as WIP. Many features are complete, but I am moving on to other things and just want to document the status so far.

(1) Description so far

Below is an update and a description before I move back to other things.

I added a new HELINL=L mode. This complements the default HELINL=0 mode and the experimental HELINL=1 mode.

HELINL=0 (default), aka "templates with moderate inlining". This mode has templated helas FFV functions. The template parameters are the memory access classes, i.e. essentially the template specialization depends on the AOSOA format used for momenta, wavefunctions and couplings. The sigmaKin and calculate_wavefunction functions in CPPProcess.cc use these templated FFV functions, which are then instantiated (and possibly inlined) there. The build times can be long, because the same templates are re-instantiated all over the place, but the runtime speed is good.
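
Schematically, the HELINL=0 pattern looks like the following sketch (hypothetical names and simplified signatures; the real code uses the repo's own types and access classes): a header-only template over the memory access classes, re-instantiated in every translation unit that uses it.

// HelAmps.h (hypothetical sketch): header-only templated helas function.
// W_ACCESS and C_ACCESS stand for memory access classes encoding the AOSOA
// layout of wavefunctions and couplings; 'double' stands in for the fptype.
template <class W_ACCESS, class C_ACCESS>
inline void FFV1_0(const double* allW1, const double* allW2, const double* allW3,
                   const double* allCOUP, double* allAmp) {
  // read wavefunctions and couplings through the access classes, compute the
  // helicity amplitude, write it out (details omitted in this sketch)
}

// CPPProcess.cc (hypothetical sketch): every call instantiates the template for
// the process-specific access classes, and the compiler may inline it here;
// this gives good runtime, but the same code is recompiled in every caller, e.g.
//   FFV1_0<HostAccessWavefunctions, HostAccessCouplings>(w1, w2, w3, coup, amp);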

HELINL=1, aka "templates with aggressive inlining". This is the mode that I had introduced to mimic -flto, i.e. link time optimizations. The FFV functions (and others) are inlined with always_inline. This significantly increases the build times, because in practice it does the equivalent of link time optimizations while compiling CPPProcess.o. The runtime speed can get a significant boost for simple processes, where data access is important, but the speedups tend to decrease for complex processes, where arithmetic operations dominate. In a realistic madevent environment this is probably not interesting: for simple processes it can be interesting in principle, but the ME calculation is outweighed by the non-ME Fortran parts, so faster MEs do not help much; for complex processes the build times become just too large.
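
Schematically, HELINL=1 is the same header-only template with inlining forced, e.g. with an always_inline attribute (again a hypothetical sketch; the exact mechanism in the code may differ):

// Hypothetical sketch of the HELINL=1 variant: the same templated helas
// function, but the compiler is forced to inline it at every call site, which
// effectively performs LTO-like inlining while compiling CPPProcess.o and
// therefore inflates its build time considerably.
template <class W_ACCESS, class C_ACCESS>
__attribute__((always_inline)) inline void FFV1_0(const double* allW1,
    const double* allW2, const double* allW3, const double* allCOUP,
    double* allAmp) {
  // ... same helicity amplitude computation as in the HELINL=0 sketch ...
}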

HELINL=L, aka "linked objects". This is the new mode I introduced here. The FFV functions are pre-compiled, for the appropriate template specializations, into separate .o object files. A technical detail: the HelAmps.cc file is common to the SubProcesses directory, but it must be compiled in each P* subdirectory, because the memory access classes may differ: for instance, a subprocess with 3 final-state particles and one with 4 have different AOSOA layouts, hence different memory access classes. My tests so far show that the build times can improve by a factor of two, while the runtime can degrade by around 10% for complex processes. (More detailed studies should show whether it is the CUDA or the C++ build times that improve, or both.) This work goes somewhat in the direction of splitting kernels, which is the context in which I first imagined it, but it is not exactly the same thing. It may become interesting for users, especially for complex processes and especially as long as the non-ME part is still important (e.g. in DY+3j, where the CUDA ME is around 25% of the total and non-ME sampling is over 50%, an ME that is 10% slower is acceptable).
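
And a minimal sketch of the HELINL=L split (hypothetical names, simplified signatures): callers only see a declaration, while each P* subdirectory builds its own HelAmps.o containing the definition plus explicit template instantiations for that subprocess's memory access classes; in the CUDA build the resulting cross-object device calls are what require RDC, as illustrated earlier.

// HelAmps.h (hypothetical sketch): declaration only, plus dummy stand-ins for
// the memory access classes; CPPProcess.cc/.cu no longer sees the definition
// and therefore cannot re-instantiate or inline it.
struct HostAccessWavefunctions {};
struct HostAccessCouplings {};
template <class W_ACCESS, class C_ACCESS>
void FFV1_0(const double* allW1, const double* allW2, const double* allW3,
            const double* allCOUP, double* allAmp);

// HelAmps.cc (hypothetical sketch), compiled once per P* subdirectory into its
// own HelAmps.o: the definition plus explicit instantiations for that
// subprocess's own AOSOA access classes (which differ between subprocesses).
template <class W_ACCESS, class C_ACCESS>
void FFV1_0(const double* allW1, const double* allW2, const double* allW3,
            const double* allCOUP, double* allAmp) {
  // ... helicity amplitude computation ...
}
template void FFV1_0<HostAccessWavefunctions, HostAccessCouplings>(
    const double*, const double*, const double*, const double*, double*);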

(2) To do (non-exhaustive list)

This is a non-exhaustive list of pending items (unfortunately I was interrupted last week while writing this, so I may be forgetting things)

valassi commented 2 weeks ago

I have updated this with the latest master, as I am doing on all PRs.

  • test this mode on HIP (what is the RDC equivalent?)

I had a LUMI shell running and tried this (after also merging in #1007 with various AMD changes).

There is a -fgpu-rdc flag with which compilation succeeds, but the issues come at link time.

Note that #802 is actually a 'shared object initialization failed' error

So the status is

valassi commented 2 days ago

Now including upstream/master with v1.00.00, and also the AMD and v1.00.01 patches https://github.com/madgraph5/madgraph4gpu/pull/1014 and https://github.com/madgraph5/madgraph4gpu/pull/1012.