valassi opened 3 years ago
In PR #230 I checked -flto also on Intel processors: again I get large throughput increases
Compare
Double:
No LTO
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.315891e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.960773e+06 ) sec^-1
LTO
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.377819e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.070241e+07 ) sec^-1
Float:
No LTO
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.207104e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 8.852403e+06 ) sec^-1
LTO
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.575539e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 2.353434e+07 ) sec^-1
In summary: -flto speeds up the scalar build by a factor ~3.3 (double) and ~3.8 (float), and the best SIMD ('512y') build by a factor ~2.2 (double) and ~2.7 (float).
Note also that again my objdump disassembly fails to give useful results when -flto is used
I tried to build with -flto also with clang12, but it requires the Gold linker, which is not yet installed. I opened https://sft.its.cern.ch/jira/browse/SPI-1933
Using gcc10, I get similar speedups as with gcc9. In gcc10 the lto-dump tool is present: I should try to have a look...
I changed the title to also cover inlining. I created a WIP PR #231
It turns out that large speedups, similar to those of LTO, are possible by inlining. This makes sense: for code as small as ours, -flto was giving all its benefits just on check.cc and CPPProcess.cc, so it is enough to study those two. Actually, it looks like everything happens inside CPPProcess.cc? This is very similar to the RDC optimizations in CUDA (issue #51).
Not all benefits of LTO are yet recovered.
And, for instance, SSE is slower than scalar after inlining?... This looks very strange. Does inlining actually trigger some vectorization without being asked?...
A first hint: 'inline' does not completely inline.
[avalassi@itscrd70 gcc9.2/cvmfs] ~/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum> ls -l _*/build.*/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2122023 Jul 9 16:24 _INLINE/build.512y_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2044850 Jul 9 16:24 _INLINE/build.none_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2076861 Jul 9 16:27 ____LTO/build.512y_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2044750 Jul 9 16:27 ____LTO/build.none_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2241115 Jul 9 16:25 _NO_LTO/build.512y_d/check.exe.objdump
-rw-r--r--. 1 avalassi zg 2214849 Jul 9 16:25 _NO_LTO/build.none_d/check.exe.objdump
[avalassi@itscrd70 gcc9.2/cvmfs] ~/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum> egrep '^+[[:xdigit:]]+ <M.*FFV1P0' _*/build.*/check.exe.objdump
_INLINE/build.512y_d/check.exe.objdump:0000000000413d30 <MG5_sm::FFV1P0_3(mgOnGpu::cxtype_v const*, mgOnGpu::cxtype_v const*, std::complex<double>, double, double, mgOnGpu::cxtype_v*)>:
_NO_LTO/build.512y_d/check.exe.objdump:0000000000415200 <MG5_sm::FFV1P0_3(mgOnGpu::cxtype_v const*, mgOnGpu::cxtype_v const*, std::complex<double>, double, double, mgOnGpu::cxtype_v*)>:
_NO_LTO/build.none_d/check.exe.objdump:0000000000414060 <MG5_sm::FFV1P0_3(std::complex<double> const*, std::complex<double> const*, std::complex<double>, double, double, std::complex<double>*)>:
Try with 'always inline'? https://stackoverflow.com/a/22767621 https://gcc.gnu.org/onlinedocs/gcc/Inline.html
And I now confirm that adding always_inline recovers all advantages of LTO (within 3-5%), see PR #233
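For reference, this is roughly the kind of change involved; a minimal sketch only, with a hypothetical macro and helper name standing in for the actual FFV functions (not the real cudacpp code):
#include <complex>
// Plain 'inline' is only a hint (plus an ODR relaxation): gcc may still emit an
// out-of-line copy, as the FFV1P0_3 symbol in the objdump above shows.
// __attribute__((always_inline)) instead forces inlining at every call site.
#if defined(__GNUC__) || defined(__clang__)
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#else
#define ALWAYS_INLINE inline
#endif
// Hypothetical FFV-like helper (illustrative only)
ALWAYS_INLINE std::complex<double>
multiplyByCoupling( const std::complex<double>& amp, const std::complex<double>& coup )
{
  return amp * coup; // trivial body: with always_inline no call overhead is left
}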
I still keep this disabled for the moment.
There are advantages from inlining also in clang (a bit smaller than with gcc, but still a factor 2 or even much more).
The benefits of SIMD over scalar code are still clear after inlining, even if the speedup due to SIMD is lower after inlining than it was before (some Amdahl effect at play here? see the ratios worked out after the comparison below).
Compare
Double:
Baseline (no inlining, no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.315659e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 2.542390e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.926921e+06 ) sec^-1
Inlining (no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.583564e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 6.069095e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.110668e+07 ) sec^-1
Float:
Baseline (no inlining, no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 1.209677e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.534024e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 8.871476e+06 ) sec^-1
Inlining (no LTO)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MECalcOnly] (3a) = ( 4.897949e+06 ) sec^-1
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.233099e+07 ) sec^-1
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 2.381717e+07 ) sec^-1
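Working out the ratios from the numbers just above:
Double: SIMD(512y)/scalar is 4.93e6/1.32e6 ≈ 3.7 in the baseline but 1.11e7/4.58e6 ≈ 2.4 after inlining; inlining itself gains 1.11e7/4.93e6 ≈ 2.3 for 512y.
Float: SIMD(512y)/scalar is 8.87e6/1.21e6 ≈ 7.3 in the baseline but 2.38e7/4.90e6 ≈ 4.9 after inlining; inlining itself gains 2.38e7/8.87e6 ≈ 2.7 for 512y.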
I think we should eventually switch this on in production: it gives a factor ~2.5 speedup for the best SIMD on gcc.
However
This completes the investigation for the moment... all of the advantages I had seen in LTO were essentially recovered in a different way.
(Ah ok, I should test clang with the Gold linker anyway, just to see if it makes a difference.)
PS By the way, note also that the objdump categorizations now make perfect sense. As in the case without LTO/inlining, after adding inlining '512y' is better than AVX2 because of a few symbols from AVX512VL.
In PR #237 I have added the option to decide from outside, in make, whether to use inlining or not. This allows testing the two options side by side (it would have been useful on CORI for instance, with the limited time available for tests).
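Schematically, the switch just becomes a -D define chosen by make; a sketch of the mechanism, assuming the MGONGPU_INLINE_HELAMPS macro visible in the build command quoted further below (the INLINE_HELAMP macro and helper name here are hypothetical):
// The Makefile decides whether to add -DMGONGPU_INLINE_HELAMPS to the compile flags;
// the code then selects the inlining attribute for the helicity amplitude helpers.
#ifdef MGONGPU_INLINE_HELAMPS
#define INLINE_HELAMP __attribute__((always_inline)) inline // aggressive inlining (inl=1)
#else
#define INLINE_HELAMP inline // leave the decision to the -O3 heuristics (inl=0)
#endif
// Hypothetical helicity-amplitude helper using the switch
INLINE_HELAMP void addMomenta( const double* p1, const double* p2, double* pSum )
{
  for ( int i = 0; i < 4; i++ ) pSum[i] = p1[i] + p2[i];
}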
Interestingly, a nice study of always_inline performance was also produced at CERN for the GeantV project in 2015, https://indico.cern.ch/event/386232/sessions/159923/
I found the link in this answer, according to which the gcc doc is incomplete (?) https://stackoverflow.com/a/48212527 Maybe this explains why in my case always_inline has a clear effect also on code that was already compiled with -O3
Note that gcc10.3 builds of the complex ggttgg process do take quite some time in aggressive inlining mode. This was somewhat expected, but it is clearly noticeable... my build has been stuck for almost one minute on
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/10.3.0-f5826/x86_64-centos7/bin/g++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -DMGONGPU_PVW512 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_INLINE_HELAMPS -DMGONGPU_CURAND_ONDEVICE -I/usr/local/cuda-11.4/include/ -c CPPProcess.cc -o build.512z_d_inl1/CPPProcess.o
I am repeating here the same comment I made in #173 about AVX512 (and earlier in #71).
A brief update on this issue (copying this from the older issue #71 that I just closed).
All results discussed so far in this issue #173 were about the vectorization of the simple physics process e e to mu mu. In epochX (issue #244) I have now backported vectorization to the Python code generator, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to the LHC!) ggttgg process, i.e. g g to t t g g (four particles in the final state instead of two, with QCD rather than QED: more Feynman diagrams and more complex diagrams, hence more CPU/GPU intensive and slower).
The very good news is that I observe similar speedups there, or even slightly better. With respect to basic C++ with no SIMD, I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float.
I also tested more precisely the effect of aggressive inlining (issue #229), mimicking LTO link-time optimization. This seemed to give large performance boosts for the simpler eemumu (for reasons that I had not fully understood), but for the more complex/realistic ggttgg it seems irrelevant at best, if not counterproductive. This was an optional feature, and I will keep it disabled by default.
The details are below. See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto
DOUBLE
NO SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.809004e+03 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.809004e+03 ) sec^-1
512y SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.490002e+03 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.490002e+03 ) sec^-1
For double, INLINING does not pay off, with or without SIMD: it is worse than no inlining. What is interesting is that 512z is better than 512y in that case.
FLOAT
NO SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.793838e+03 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.793838e+03 ) sec^-1
512y SIMD, NO INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.397076e+04 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.397076e+04 ) sec^-1
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 7775) (512y: 29) (512z: 0)
512z SIMD, INLINING
Process = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=1]
FP precision = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.391933e+04 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.391933e+04 ) sec^-1
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 4075) (512y: 7) (512z:39296)
By the way, note en passant that I moved from gcc9.2 to gcc10.3 for all these results. But here I am still on a Xeon Silver.
Concerning the issue of AVX512 with 512bit width zmm registers ("512z") discussed in this thread #173, the results are essentially unchanged.
Now that I have ggttgg vectorized, at some point I will rerun the same tests on other machines, including Xeon Platinum or Skylake. I need to document how to run the epochX tests for ggttgg, but it is essentially the same as the epoch1 tests for eemumu.
Concerning this specific issue #229 of inlining and LTO, I would conclude for the moment that it was good to test it and to have it as an option, but it seems that for ggttgg and other complex/realistic processes we are probably better off without it. So I will keep this disabled for now. It also seems much more difficult to predict/understand. But we can still reassess the situation on other processes, with other compilers, and/or on other CPU hardware. So I keep this open for now.
But the one line conclusion for the moment is: KEEP INLINING DISABLED BY DEFAULT.
I have done a few more tests of aggressive inlining in #332 (RDC/cuda11.5/inline tests) and in #328 (templated FFV functions).
The motivation is that the move to templated FFV functions effectively amounts to "more aggressive" inlining of the FFV functions: even if no explicit "inline" keyword is added, those templated FFVs are effectively considered more seriously for inlining (ansatz).
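To illustrate the point (illustrative code only, not the actual #328 implementation; the helper names are hypothetical):
#include <complex>
// Non-templated FFV-style helper: if its definition lives in its own .cc file,
// the compiler can only inline it across translation units with -flto (or by
// moving the definition into a header and marking it always_inline).
std::complex<double> ffvHelperPlain( const std::complex<double>& a, const std::complex<double>& b );
// Templated FFV-style helper: it must be defined where it is instantiated
// (typically in a header), its body is visible at every call site, and it may
// be defined in several translation units just like an inline function, so the
// -O3 heuristics are free to inline it aggressively without any 'inline' keyword.
// Adding __attribute__((noinline)) would be the escape hatch if build times explode.
template<typename FPType>
std::complex<FPType> ffvHelperTemplated( const std::complex<FPType>& a, const std::complex<FPType>& b )
{
  return a * b; // placeholder body
}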
Certainly, the build time (with cuda 11.1) exploded with templated FFVs, so I considered using explicit "noinline" keywords to speed up the builds.
I was also worried that, by moving to templated FFVs, the code would "look like" aggressive inlining, which in ggttgg I had seen to be slower.
Anyway, in the end I moved to cuda 11.5, which - without aggressive inlining, i.e. with inl=0 - solves both the build-time issue and the runtime-performance issue when using templated FFVs.
After doing that, i.e. using cuda 11.5 with templated FFVs, I still investigated whether switching on aggressive inlining makes sense. I found results comparable to the previous ones, namely eemumu C++/512y is faster (even by a factor 2), but for ggtt and especially ggttgg there is a penalty, and generally speaking things look very strange/unpredictable.
So the one line conclusion remains: KEEP INLINING DISABLED BY DEFAULT.
Eventually I might just remove this whole infrastructure.
Keep the issue open for the moment.
This is a spinoff of the Power9 issue #223.
I realised that adding -flto there (gcc link-time optimization) gains almost a factor 4 for scalar C++ code (and more than a factor 2 for SIMD code). I am using gcc8 there. Compare:
Amongst the things to be understood: