madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

kernel launchers and SIMD vectorization #71

Closed — valassi closed 3 years ago

valassi commented 3 years ago

I finally found some time to pursue some earlier tests on an idea I had from the beginning, namely trying to implement SIMD vectorization in the C++ code at the same time as SIMT/SPMD parallelisation on the GPU in CUDA.

The idea is always the same: event-level parallelism, with execution in lockstep (all events go through exactly the same sequence of computations).

I pushed a few initial tests in https://github.com/valassi/madgraph4gpu/tree/klas and will create a WIP PR about that. @roiser , @oliviermattelaer , @hageboeck , I would especially be interested in feedback from you :-)

Implementing SIMD in the C++ is closely linked to the API of the kernel launchers (and of the methods the kernels call internally) on the GPU. In my previous eemumu_AV implementation, the signature of some C++ methods was modified with respect to the CUDA signature by adding nevt (a number of events) or, alternatively, ievt (an index over events), but some lines of code (e.g. loops on ievt=1..nevt) were commented out, as they were just reminders of possible future changes.

The main idea behind all the changes I did is simple: bring the event loop more and more towards the inside. Eventually, the event loop must be the innermost loop. This is because you eventually want to perform every single floating point addition or multiplication in parallel over several events. In practice, one concrete consequence of this is that I had to invert the order of the helicity loop: so far, there was an outer event loop, with an inner loop over helicities within each event, while now there is an outer helicity loop, with an inner loop over events for each helicity.
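To make the structure concrete, here is a minimal sketch of the loop inversion; the names (calcAllHelInner, calcAllEvtInner, helAmp, ncomb) are illustrative placeholders and not the actual sigmaKin code:

// Illustrative placeholders only: ncomb helicity combinations, one matrix
// element accumulator per event, and a dummy stand-in for the real amplitude.
constexpr int ncomb = 16;

inline double helAmp( int ihel, const double* momenta, int ievt )
{
  return momenta[ievt] * ( 1. + ihel * 1e-3 ); // dummy computation
}

// Before: outer event loop, inner helicity loop (each event is processed on its own).
void calcAllHelInner( const double* momenta, double* MEs, int nevt )
{
  for ( int ievt = 0; ievt < nevt; ++ievt )
    for ( int ihel = 0; ihel < ncomb; ++ihel )
      MEs[ievt] += helAmp( ihel, momenta, ievt );
}

// After: outer helicity loop, inner event loop; the innermost loop now performs
// the same floating point operations in lockstep over many events, which is
// what SIMD on the CPU (and SIMT on the GPU) needs.
void calcAllEvtInner( const double* momenta, double* MEs, int nevt )
{
  for ( int ihel = 0; ihel < ncomb; ++ihel )
    for ( int ievt = 0; ievt < nevt; ++ievt )
      MEs[ievt] += helAmp( ihel, momenta, ievt );
}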

One limitation of the present code (possible in a simple eemumu calculation) is that there is no loop over nprocesses, because nprocesses=1. This was already assumed, but now I made it much more explicit, removing all dead code and adding FIXME warnings.

So far, I got to this point

A lot is still missing

These changes may result in significant changes in the current interfaces, but I think they would normally lead to a better interface and structure also on the GPU. I'll continue in the next few days...

valassi commented 3 years ago

Repeating a few points I noted in https://github.com/madgraph5/madgraph4gpu/issues/82#issuecomment-738014896 (where multithreading is discussed, e.g. using OpenMP), the general parallelization strategy would be:

valassi commented 3 years ago

While playing with pragma omp parallel for, I also saw there is a pragma omp simd, https://bisqwit.iki.fi/story/howto/openmp/#SimdConstructOpenmp%204%200. Maybe that can be a simpler alternative to compiler vector extensions (but I still think I need to pass vectors somehow in and out). To be kept in mind.
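For reference, a toy example of what such a pragma looks like (the function and variable names are made up, and this is not the madgraph4gpu code):

#include <cstddef>

// Toy example only: ask the compiler to vectorize the event loop via OpenMP SIMD;
// compile with -fopenmp or -fopenmp-simd.
void scaleMEs( double* MEs, const double* weights, std::size_t nevt )
{
#pragma omp simd
  for ( std::size_t ievt = 0; ievt < nevt; ++ievt )
    MEs[ievt] *= weights[ievt]; // one SIMD lane per event
}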

lfield commented 3 years ago

There is an interesting chapter in the Data Parallel C++ book on 'Programming for CPUs', with a specific subsection on 'SIMD Vectorization on CPU'. You may be interested to take a look.

valassi commented 3 years ago

There is an interesting chapter in the Data Parallel C++ book on 'Programming for CPUs', with a specific subsection on 'SIMD Vectorization on CPU'. You may be interested to take a look.

Thanks Laurence. I assume you mean https://link.springer.com/chapter/10.1007/978-1-4842-5574-2_11. It is interesting.

I prefer another reference by Sebastien, however, as it contains many more practical details: http://sponce.web.cern.ch/sponce/CSC/slides/PracticalVectorization.booklet.pdf. I also got from him a few useful headers and papers on intrinsics (but I hope I do not need to go that way).

valassi commented 3 years ago

And... it is finally paying off :-)

I am now at around a factor 4 speedup gained through SIMD vectorization in C++, from ~0.5E6 to ~2E6 MEs per second. (Maybe more? I am not 100% sure where I was starting from.)

Now: https://github.com/valassi/madgraph4gpu/commit/5f9ff6dc2c31d21456bae6d6b592f7efc7f490bf

TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.384488e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.493997e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.082485e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0

One week ago: https://github.com/valassi/madgraph4gpu/commit/3178e95dffa663a5d080f139d8d5f878954f0096

TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.472268e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.553968e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.843094e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0

Using also some good advice from @sponce, I debugged the issues. Then I slowly added more and more stuff to vectors. I managed to stay with autovectorization on compiler vector extensions, no Vc/VecCore or others.

Many more things to clean up and analyse, but this is definitely very promising!

Achieving this was faster than I expected, bits and pieces of about 10 days. It would not have been possible without the work on AOSOA this summer, however.

oliviermattelaer commented 3 years ago

Excellent, this is really good news and impressive progress. We should have a small chat about this to see how to move forward.

sponce commented 3 years ago

This is excellent news! Really impressive, I must say. As far as I know, you have been running with AVX2 and with doubles, is that correct? So a factor 4 would mean a perfect speed-up in that context, which is amazing. Did you also change something else, like performing some cleanup or improving memory structures?

valassi commented 3 years ago

Thanks a lot Olivier and Sebastien! I was not sure how to answer your question, so I have done a bit more analysis and prototyping with compiler flags, and some cleanup in the code.

I have decided to cleanly hardcode and support only three scenarios: scalar, AVX2 and AVX512 (maybe I will add SSE, but what's the point today?). Sebastien, I was a bit inspired by Arthur's work, which we had discussed in the past, https://gitlab.cern.ch/lhcb/LHCb/-/blob/master/Kernel/LHCbMath/LHCbMath/SIMDWrapper.h. What I took is the use of ifdefs, with AVX512F and AVX2. So now I have

Then I tried several combinations of Makefiles, all with -O3, and measured the matrix element throughput:

So all in all I would say:

All these numbers must be taken with a pinch of salt, as there may be some fluctuations due to the load on the VM (in principle I was told this should not happen, but it is not completely ruled out). Anyway, I repeated these tests a few times in the same time frame, so there should be no large fluctuation.

I think that being close to a perfect speedup is excellent news, but I am not completely surprised, as I am only timing the number crunching, which is perfectly parallelizable: all events go through exactly the same calculation, so the calculation is fully in lockstep. I might even recover a tiny bit of what is missing between 3.82 and 4 when the last bits and pieces are also vectorized (the amp[2]). Note that on the GPU we have no evidence of thread divergence, and this is exactly the same thing.

If you have any comments about AVX512, please let me know! But all I have heard from @sponce and @hageboeck sounds like it is better to stay at AVX2 and not bother further. I might try a KNL for fun at some point (see https://colfaxresearch.com/knl-avx512), but it is probably pointless.

Ah, another question I had was whether alignas can make any difference here. I have the impression that the double __attribute__ ((vector_size (32))) is already an aligned RRRR. I have added an alignas to the complex vector just in case, but now that I think of it, it is irrelevant by definition (the operations are either on RRRR or on IIII).
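Not the actual code in the klas branch, but a minimal sketch of the two points above (the ifdef-based choice of vector width, and the alignment of the compiler vector types); fptype_v and cxtype_v are assumed names:

// Choose the SIMD vector width at compile time from the predefined macros
// set by -mavx2 / -mavx512f (or -march=...).
#if defined __AVX512F__
typedef double fptype_v __attribute__ (( vector_size( 64 ) )); // 8 doubles per vector
#elif defined __AVX2__
typedef double fptype_v __attribute__ (( vector_size( 32 ) )); // 4 doubles per vector
#else
typedef double fptype_v; // scalar fallback
#endif

// On gcc/clang a vector_size type is already aligned to its own size (e.g. 32
// bytes in the AVX2 case), so the extra alignas below is effectively redundant:
// operations act either on the RRRR vector or on the IIII vector, never across them.
static_assert( alignof( fptype_v ) == sizeof( fptype_v ), "fptype_v aligned to its size" );

struct alignas( sizeof( fptype_v ) ) cxtype_v // complex of vectors: RRRR + IIII
{
  fptype_v real;
  fptype_v imag;
};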

oliviermattelaer commented 3 years ago

My (small) experience with AVX512 is to let the compiler decide when to use it. It seems that (at least the Intel) compiler knows pretty well when it provides a speed boost (and therefore it avoids it quite often).

I indeed hear a lot of negative feedback about it.

Cheers,

Olivier

sponce commented 3 years ago

Very nice work, and it answers a lot of my questions. Here are a few thoughts/remarks:

hageboeck commented 3 years ago

If you have any comments about AVX512, please let me know! But all I have heard from @sponce and @hageboeck sounds like it is better to stay at AVX2 and not bother further. I might try a KNL for fun at some point (see https://colfaxresearch.com/knl-avx512), but it is probably pointless.

My comment is that middle-aged compilers are not good with it. In RooFit, only very few things profited from AVX512, and not so many "standard" CPUs support it. We settled for "we do one library for skylake 512, one for AVX2 (= AMD and most Intel CPUs out there) and one for SSE4.1". If you don't fall into this, you get normal scalars.

Oh, and all we ever did was autovectorisation. Recent compilers are great if you write simple code!

And lastly: Our main focus remains AVX2. It's supported everywhere, and AMD is totally killing it on their modern CPUs. Edit: I don't think AMD will give us 512, so we might as well not invest too much time. Again, the Ryzens are killing it already with AVX2.

And lastly lastly: clang's diagnostics are amazing! Here is part of a script that I use to automatically recompile a file with switched-on diagnostics. You can e.g. pipe the compile command into it, or use cmake to extract the compile command from the build system.

cd $directory
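# $directory and $compileCommand are assumed to be set by the surrounding (not shown)
# part of the script, e.g. from a compile command piped in or extracted from the
# cmake build system as described above.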
if [[ "$compileCommand" =~ ^.*clang.*$ ]]; then
  clangFlags="-Xclang -fcolor-diagnostics -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize -fno-math-errno"
  # Run with diagnostics or on compiler error run without redirecting output
  $compileCommand $clangFlags >/tmp/vecReport_all.txt 2>&1 || $compileCommand || exit 1

  # Not interested in std library vectorisation reports:
  grep -v "/usr/" /tmp/vecReport_all.txt > /tmp/vecReport.txt

  sed -nE '/remark.*(not vectorized|vector.*not benef)/{N;{p}}' /tmp/vecReport.txt | sed -n 'N; s/\n/LINEBREAK/p' | sort -u | sed -n 's/LINEBREAK/\n/p'
  grep --color "vectorized loop" /tmp/vecReport.txt
else
  gccFlags="-ftree-vectorizer-verbose=2 -fdiagnostics-color=always"
  $compileCommand $gccFlags -fopt-info-vec-missed 2>&1 | grep -vE "^/usr/|googletest" | sort -u || $compileCommand || exit 1
  $compileCommand $gccFlags -fopt-info-vec-optimized 2>&1 | grep --color -E "^/home.*vectorized"
fi

sponce commented 3 years ago

I realize I forgot a point in my comment (although the last one somehow encompasses it): did you compute how much you gain overall with this vectorization, i.e. the full processing time and not only the vectorized part? I ask because I've seen so many cases of perfect vectorization (a factor 4 here) where the final software was slower overall, the reason being that you lose more time later dealing with vectorized data than you gained initially. Of course, that all depends on how big the vectorized part is and how badly the vector data is used later (maybe you actually even gain there).

valassi commented 3 years ago

Hi, thanks both :-)

@hageboeck, looks like I am more or less along your lines already in the klas branch

@sponce, good points:

Finally, a very good point on the real speedup and Amdahl's law: I know, here I am talking only of the matrix element. But funnily enough, even for a simple LEP eemumu process, on C++/CPU this is the dominant part (on a GPU it is totally negligible). Quoting from memory, something like 12s for MEs against 1s for the rest, now reduced to 3s+1s, i.e. roughly a factor 3 overall (13s/4s). When we go to LHC processes, the ME part will be MUCH larger, both on CPU and GPU, I think. So any speedup from porting this part is really great...

valassi commented 3 years ago

I moved the momenta array from a C-style AOSOA to an AOSOA where the final "A" is a vector type. I am not sure exactly why, but I seem to have gained another factor ~1.5 speedup: from 2E6 to 3E6.
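Roughly speaking, with made-up names (np4, npar, MomentaPage*) rather than the actual constants in the code, the change is the following: the innermost 'A' of the momenta AOSOA becomes a compiler vector type instead of a plain C array.

typedef double fptype_v __attribute__ (( vector_size( 32 ) )); // 4 doubles (AVX2)

constexpr int np4 = 4;  // E, px, py, pz
constexpr int npar = 4; // external particles in e+ e- -> mu+ mu-

// C-style AOSOA page: momenta[ipar][ip4][ieppV], innermost a plain double[4]
typedef double MomentaPageC[npar][np4][4];

// Vector-type AOSOA page: the trailing double[4] becomes a single fptype_v, so
// each (particle, component) pair loads directly into one SIMD register
typedef fptype_v MomentaPageV[npar][np4];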

Now https://github.com/valassi/madgraph4gpu/commit/a2f8cf90f3e33aef576157e5a601a3570ca31fe5

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.965374e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.689275e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.760994e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.534033e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.735872e-01                 )  sec
MeanTimeInMatrixElems      = ( 1.735872e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.735872e-01 ,  1.735872e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.768033e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.949552e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.020316e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************

Was https://github.com/valassi/madgraph4gpu/commit/2d276bfdc1bdbea4137dd5aea33090d49c57bbcd

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.720043e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.441720e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.783231e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.215697e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.520150e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.520150e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.520150e-01 ,  2.520150e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.409360e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.523332e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.080384e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
valassi commented 3 years ago

A large chunk of work on this issue will soon be merged from PR #152.

PR #152 replaces the two previous draft PRs #72 and #132, which I have closed. The reason these two are obsolete is that I completely rebased my SIMD work (which is in epoch1) on epoch2-level code.

I have finally completed the "merge" of epoch2 and epoch1 of issue #139 in PR #151 (the "ep12" of "klas2ep12"). Presently epoch1 and epoch2 are identical. I will merge my SIMD work in epoch1 and keep epoch2 as-is, pre-vectorization, as a reference.

I am copying here a few comments I made in PR #152

The CURRENT BASELINE BEFORE VECTORIZATION is that at the end of PR #151:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.050711 sec
real    0m8.079s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.233023 sec
real    0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059035 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.177079 sec
real    0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

My CURRENT (WIP) BASELINE WITH VECTORIZATION is that in the FINAL MERGE OF 'origin/ep2to2ep1' into klas2ep12: https://github.com/madgraph5/madgraph4gpu/commit/870b8b342dd5b2fd1923550d8eaab6b48e88b4c2

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.319998e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.247976 sec
real    0m7.274s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.847687e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.181390 sec
real    0m1.488s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] (AVX512F)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.718034e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.749072 sec
real    0m3.775s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.806047e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.199100 sec
real    0m1.505s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130720e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     8.045962 sec
real    0m8.072s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.855929e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.207904 sec
real    0m1.528s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

So, if I compare the vectorization branch to the current master, I see


A few additional comments (not in PR #152):

I will probably merge #152 tomorrow.

valassi commented 3 years ago

I think that this old issue #71 can now be closed.

In epochX (issue #244) I have now backported vectorization to the Python code-generating code, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to LHC!) ggttgg process. I observe similar speedups there, or even slightly better, for reasons to be understood. With respect to basic C++ with no SIMD, through the appropriate use of SIMD, e.g. AVX512 in 256-bit mode (see also #173), and LTO-like aggressive inlining (see #229), I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float.

See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto

DOUBLE

NO SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.809004e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.809004e+03                 )  sec^-1

512y SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.490002e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.490002e+03                 )  sec^-1

For double, INLINING does not pay off: it is worse than NO INLINING, both without and with SIMD. What is interesting is that 512z is better than 512y in that case.

FLOAT

NO SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.793838e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.793838e+03                 )  sec^-1

512y SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.397076e+04                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.397076e+04                 )  sec^-1
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 7775) (512y:   29) (512z:    0)

512z SIMD, INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.391933e+04                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.391933e+04                 )  sec^-1
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 4075) (512y:    7) (512z:39296)

That is to say, with float, INLINING eventually gives the same maximum speed as NO INLINING, but the former case is with AVX512/z and the latter with AVX512/y. Strange. In the simpler eemumu process, inlining did seem to provide a major performance boost (which I could not explain). The summary is that we should use ggttgg for real studies - but also that we get VERY promising results there!

Anyway, I am closing this and will repost these numbers on the LTO study issue #229 and the AVX512 study issue #173.