valassi opened 3 years ago
In PR #230 I checked -flto also on Intel processors: again I get large throughput increases
Compare
Double:
No LTO
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.315891e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.960773e+06 ) sec^-1
LTO
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.377819e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.070241e+07 ) sec^-1
Float:
No LTO
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.207104e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 8.852403e+06 ) sec^-1
LTO
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.575539e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 2.353434e+07 ) sec^-1
In summary: -flto speeds up the scalar build by a factor ~3.3 (double) and ~3.8 (float), and the best SIMD ('512y') build by a factor ~2.2 (double) and ~2.7 (float).
Note also that again my objdump disassembly fails to give useful results when -flto is used
I tried to build with -flto also with clang12, but it requires the Gold linker, which is not yet installed. I opened https://sft.its.cern.ch/jira/browse/SPI-1933
Using gcc10, I get similar speedups as with gcc9. In gcc10 the lto-dump tool is present: I should try to have a look...
I changed the title to also cover inlining. I created a WIP PR #231
It turns out that large speedups, similar to those of LTO, are possible by inlining. This makes sense: for code as small as ours, -flto was giving all its benefits just on check.cc and CPPProcess.cc, so it is enough to study those two. Actually, it looks like everything happens inside CPPProcess.cc? This is very similar to the RDC optimizations in CUDA (issue #51).
Not all benefits of LTO are yet recovered.
And, for instance, SSE is slower than scalar after inlining?... This looks very strange. Does inlining actually trigger some vectorization without being asked?...
A first hint: 'inline' does not completely inline.
[avalassi@itscrd70 gcc9.2/cvmfs] ~/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum> ls -l _*/build.*/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2122023 Jul 9 16:24 _INLINE/build.512y_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2044850 Jul 9 16:24 _INLINE/build.none_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2076861 Jul 9 16:27 ____LTO/build.512y_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2044750 Jul 9 16:27 ____LTO/build.none_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2241115 Jul 9 16:25 _NO_LTO/build.512y_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2214849 Jul 9 16:25 _NO_LTO/build.none_d/check.exe.objdump
[avalassi@itscrd70 gcc9.2/cvmfs] ~/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum> egrep '^+[[:xdigit:]]+ <M.*FFV1P0' _*/build.*/check.exe.objdump
_INLINE/build.512y_d/check.exe.objdump:0000000000413d30 <MG5_sm::FFV1P0_3(mgOnGpu::cxtype_v const*, mgOnGpu::cxtype_v const*, std::complex<double>, double, double, mgOnGpu::cxtype_v*)>:
_NO_LTO/build.512y_d/check.exe.objdump:0000000000415200 <MG5_sm::FFV1P0_3(mgOnGpu::cxtype_v const*, mgOnGpu::cxtype_v const*, std::complex<double>, double, double, mgOnGpu::cxtype_v*)>:
_NO_LTO/build.none_d/check.exe.objdump:0000000000414060 <MG5_sm::FFV1P0_3(std::complex<double> const*, std::complex<double> const*, std::complex<double>, double, double, std::complex<double>*)>:
Try with 'always inline'? https://stackoverflow.com/a/22767621 https://gcc.gnu.org/onlinedocs/gcc/Inline.html
And I now confirm that adding always_inline recovers all advantages of LTO (within 3-5%), see PR #233
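For reference, this is roughly the kind of change involved; a minimal sketch only, with a hypothetical macro and helper name standing in for the actual FFV functions (not the real cudacpp code):
#include <complex>
// Plain 'inline' is only a hint (plus an ODR relaxation): gcc may still emit an
// out-of-line copy, as the FFV1P0_3 symbol in the objdump above shows.
// __attribute__((always_inline)) instead forces inlining at every call site.
#if defined(__GNUC__) || defined(__clang__)
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#else
#define ALWAYS_INLINE inline
#endif
// Hypothetical FFV-like helper (illustrative only)
ALWAYS_INLINE std::complex<double>
multiplyByCoupling( const std::complex<double>& amp, const std::complex<double>& coup )
{
  return amp * coup; // trivial body: with always_inline no call overhead is left
}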
I still keep this disabled for the moment.
There are advantages from inlining also in clang (a bit smaller than with gcc, but still a factor 2 or even much more).
The benefits of SIMD over scalar code are still clear after inlining, even if the speedup due to SIMD is lower after inlining than it was before (some Amdahl effect at play here? see the ratios worked out after the comparison below).
Compare
Double:
Baseline (no inlining, no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.315659e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 2.542390e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.926921e+06 ) sec^-1
Inlining (no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.583564e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 6.069095e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.110668e+07 ) sec^-1
Float:
Baseline (no inlining, no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.209677e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.534024e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 8.871476e+06 ) sec^-1
Inlining (no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.897949e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.233099e+07 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 2.381717e+07 ) sec^-1
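Working out the ratios from the numbers just above:
Double: SIMD(512y)/scalar is 4.93e6/1.32e6 ≈ 3.7 in the baseline but 1.11e7/4.58e6 ≈ 2.4 after inlining; inlining itself gains 1.11e7/4.93e6 ≈ 2.3 for 512y.
Float: SIMD(512y)/scalar is 8.87e6/1.21e6 ≈ 7.3 in the baseline but 2.38e7/4.90e6 ≈ 4.9 after inlining; inlining itself gains 2.38e7/8.87e6 ≈ 2.7 for 512y.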
I think we should eventually switch this on in production: it gives a factor ~2.5 speedup for the best SIMD on gcc.
However
This completes the investigation for the moment... all of the advantages I had seen in LTO were essentially recovered in a different way.
(Ah ok, I should test clang with the Gold linker anyway, just to see if it makes a difference.)
PS By the way, note also that the objdump categorizations now make perfect sense. As in the case without LTO/inlining, after adding inlining '512y' is better than AVX2 because of a few symbols from AVX512VL.
In PR #237 I have added the option to decide from outside, in make, whether to use inlining or not. This allows testing the two options side by side (it would have been useful on CORI for instance, with the limited time available for tests).
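Schematically, the switch just becomes a -D define chosen by make; a sketch of the mechanism, assuming the MGONGPU_INLINE_HELAMPS macro visible in the build command quoted further below (the INLINE_HELAMP macro and helper name here are hypothetical):
// The Makefile decides whether to add -DMGONGPU_INLINE_HELAMPS to the compile flags;
// the code then selects the inlining attribute for the helicity amplitude helpers.
#ifdef MGONGPU_INLINE_HELAMPS
#define INLINE_HELAMP __attribute__((always_inline)) inline // aggressive inlining (inl=1)
#else
#define INLINE_HELAMP inline // leave the decision to the -O3 heuristics (inl=0)
#endif
// Hypothetical helicity-amplitude helper using the switch
INLINE_HELAMP void addMomenta( const double* p1, const double* p2, double* pSum )
{
  for ( int i = 0; i < 4; i++ ) pSum[i] = p1[i] + p2[i];
}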
Interestingly, a nice study of always_inline performance was also produced at CERN for the GeantV project in 2015, https://indico.cern.ch/event/386232/sessions/159923/
I found the link in this answer, according to which the gcc doc is incomplete (?) https://stackoverflow.com/a/48212527 Maybe this explains why in my case always_inline has a clear effect also on code that was already compiled with -O3
Note that gcc10.3 builds of the complex ggttgg process do take quite some time in aggressive inlining mode. This was somewhat expected, but it is clearly noticeable... my build has been stuck for almost one minute on
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/10.3.0-f5826/x86_64-centos7/bin/g++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -DMGONGPU_PVW512 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_INLINE_HELAMPS -DMGONGPU_CURAND_ONDEVICE -I/usr/local/cuda-11.4/include/ -c CPPProcess.cc -o build.512z_d_inl1/CPPProcess.o
I am repeating here the same comment I made in #173 about AVX512 (and earlier in #71).
A brief update on this issue (copying this from the older issue #71 that I just closed).
All results discussed so far in this issue #173 were about the vectorization of the simple physics process e e to mu mu. In epochX (issue #244) I have now backported vectorization to the Python code generator, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to the LHC!) ggttgg process, i.e. g g to t t g g (four particles in the final state instead of two, with QCD rather than QED: more Feynman diagrams and more complex diagrams, hence more CPU/GPU intensive and slower).
The very good news is that I observe similar speedups there, or even slightly better. With respect to basic C++ with no SIMD, I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float.
I also tested more precisely the effect of aggressive inlining (issue #229), mimicking LTO link-time optimization. This seemed to give large performance boosts for the simpler eemumu (for reasons that I had not fully understood), but for the more complex/realistic ggttgg it seems irrelevant at best, if not counterproductive. This was an optional feature, and I will keep it disabled by default.
The details are below. See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto
DOUBLE
NO SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.809004e+03 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.809004e+03 ) sec^-1
512y SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.490002e+03 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.490002e+03 ) sec^-1
For double, INLINING does not pay off, with or without SIMD: it is worse than no inlining. What is interesting is that 512z is better than 512y in that case.
FLOAT
NO SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.793838e+03 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.793838e+03 ) sec^-1
512y SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.397076e+04 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.397076e+04 ) sec^-1
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 7775) (512y: 29) (512z: 0)
512z SIMD, INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=1]
FP precision = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.391933e+04 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.391933e+04 ) sec^-1
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 4075) (512y: 7) (512z:39296)
By the way, note en passant that I moved from gcc9.2 to gcc10.3 for all these results. But here I am still on a Xeon Silver.
Concerning the issue of AVX512 with 512bit width zmm registers ("512z") discussed in this thread #173, the results are essentially unchanged.
Now that I have ggttgg vectorized, at some point I will rerun the same tests on other machines, including Xeon Platinum or Skylake. I need to document how to run the epochX tests for ggttgg, but it is essentially the same as the epoch1 tests for eemumu.
Concerning this specific issue #229 of inlining and LTO, I would conclude for the moment that it was good to test it and to have it as an option, but it seems that for ggttgg and other complex/realistic processes we are probably better off without it. So I will keep this disabled for now. It also seems much more difficult to predict/understand. But we can still reassess the situation on other processes, with other compilers, and/or on other CPU hardware. So I keep this open for now.
But the one line conclusion for the moment is: KEEP INLINING DISABLED BY DEFAULT.
I have done a few more tests of aggressive inlining in #332 (RDC/cuda11.5/inline tests) and in #328 (templated FFV functions).
The motivation is that the move to templated FFV functions effectively amounts to "more aggressive" inlining of the FFV functions: even if no explicit "inline" keyword is added, those templated FFVs are effectively considered more seriously for inlining (ansatz).
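To illustrate the point (illustrative code only, not the actual #328 implementation; the helper names are hypothetical):
#include <complex>
// Non-templated FFV-style helper: if its definition lives in its own .cc file,
// the compiler can only inline it across translation units with -flto (or by
// moving the definition into a header and marking it always_inline).
std::complex<double> ffvHelperPlain( const std::complex<double>& a, const std::complex<double>& b );
// Templated FFV-style helper: it must be defined where it is instantiated
// (typically in a header), its body is visible at every call site, and it may
// be defined in several translation units just like an inline function, so the
// -O3 heuristics are free to inline it aggressively without any 'inline' keyword.
// Adding __attribute__((noinline)) would be the escape hatch if build times explode.
template<typename FPType>
std::complex<FPType> ffvHelperTemplated( const std::complex<FPType>& a, const std::complex<FPType>& b )
{
  return a * b; // placeholder body
}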
Certainly, the build time (with cuda 11.1) exploded with templated FFVs, so I considered using explicit "noinline" keywords to speed up the builds.
I was also worried that, by moving to templated FFVs, the code would "look like" aggressive inlining, which in ggttgg I had seen to be slower.
Anyway, in the end I moved to cuda 11.5, which - without aggressive inlining, i.e. with inl=0 - solves both the build-time issue and the runtime-performance issue when using templated FFVs.
After doing that, i.e. using cuda 11.5 with templated FFVs, I still investigated whether switching on aggressive inlining makes sense. I found results comparable to the previous ones, namely eemumu C++/512y is faster (even by a factor 2), but for ggtt and especially ggttgg there is a penalty, and generally speaking things look very strange/unpredictable.
So the one line conclusion remains: KEEP INLINING DISABLED BY DEFAULT.
Eventually I might just remove this whole infrastructure.
Keep the issue open for the moment.
This is a spinoff of the Power9 issue #223.
I realised that adding -flto there (gcc link-time optimization) gains almost a factor 4 for scalar C++ code (and more than a factor 2 for SIMD code). I am using gcc8 there. Compare:
Amongst the things to be understood: