madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

kernel launchers and SIMD vectorization #71

Closed — valassi closed 3 years ago

valassi commented 3 years ago

I finally found some time to pursue some earlier tests on an idea I had from the beginning, namely trying to implement SIMD vectorization in the C++ code at the same time as SIMT/SPMD parallelisation on the GPU in CUDA.

The idea is always the same: event-level parallelism, with execution in lockstep (all events go through exactly the same sequence of computations).

I pushed a few initial tests in https://github.com/valassi/madgraph4gpu/tree/klas and will create a WIP PR about that. @roiser , @oliviermattelaer , @hageboeck , I would especially be interested in feedback from you :-)

Implementing SIMD in the C++ is closely linked to the API of the kernel launchers (and of the methods the kernels call internally) on the GPU. In my previous eemumu_AV implementation, the signature of some C++ methods was modified with respect to the CUDA signature by adding nevt (a number of events) or, alternatively, ievt (an index over events), but some lines of code (e.g. loops on ievt=1..nevt) were commented out, as they were just reminders of possible future changes.

The main idea behind all the changes I did is simple: bring the event loop more and more towards the inside. Eventually, the event loop must be the innermost loop. This is because you eventually want to perform every single floating point addition or multiplication in parallel over several events. In practice, one concrete consequence of this is that I had to invert the order of the helicity loop: so far, there was an outer event loop, with an inner loop over helicities within each event, while now there is an outer helicity loop, with an inner loop over events for each helicity.
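To make the structure concrete, here is a minimal sketch of the loop inversion; the names (calcAllHelInner, calcAllEvtInner, helAmp, ncomb) are illustrative placeholders and not the actual sigmaKin code:

// Illustrative placeholders only: ncomb helicity combinations, one matrix
// element accumulator per event, and a dummy stand-in for the real amplitude.
constexpr int ncomb = 16;

inline double helAmp( int ihel, const double* momenta, int ievt )
{
  return momenta[ievt] * ( 1. + ihel * 1e-3 ); // dummy computation
}

// Before: outer event loop, inner helicity loop (each event is processed on its own).
void calcAllHelInner( const double* momenta, double* MEs, int nevt )
{
  for ( int ievt = 0; ievt < nevt; ++ievt )
    for ( int ihel = 0; ihel < ncomb; ++ihel )
      MEs[ievt] += helAmp( ihel, momenta, ievt );
}

// After: outer helicity loop, inner event loop; the innermost loop now performs
// the same floating point operations in lockstep over many events, which is
// what SIMD on the CPU (and SIMT on the GPU) needs.
void calcAllEvtInner( const double* momenta, double* MEs, int nevt )
{
  for ( int ihel = 0; ihel < ncomb; ++ihel )
    for ( int ievt = 0; ievt < nevt; ++ievt )
      MEs[ievt] += helAmp( ihel, momenta, ievt );
}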

One limitation of the present code (possible in a simple eemumu calculation) is that there is no loop over nprocesses, because nprocesses=1. This was already assumed, but now I made it much more explicit, removing all dead code and adding FIXME warnings.

So far, I got to this point

A lot is still missing

These changes may result in significant changes in the current interfaces, but I think they would normally lead to a better interface and structure also on the GPU. I'll continue in the next few days...

valassi commented 3 years ago

Repeating a few points I noted in https://github.com/madgraph5/madgraph4gpu/issues/82#issuecomment-738014896 (where multithreading is discussed, e.g. using OpenMP), the general parallelization strategy would be:

valassi commented 3 years ago

While playing with pragma omp parallel for, I also saw there is a pragma omp simd, https://bisqwit.iki.fi/story/howto/openmp/#SimdConstructOpenmp%204%200. Maybe that can be a simpler alternative to compiler vector extensions (but I still think I need to pass vectors somehow in and out). To be kept in mind.
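For reference, a toy example of what such a pragma looks like (the function and variable names are made up, and this is not the madgraph4gpu code):

#include <cstddef>

// Toy example only: ask the compiler to vectorize the event loop via OpenMP SIMD;
// compile with -fopenmp or -fopenmp-simd.
void scaleMEs( double* MEs, const double* weights, std::size_t nevt )
{
#pragma omp simd
  for ( std::size_t ievt = 0; ievt < nevt; ++ievt )
    MEs[ievt] *= weights[ievt]; // one SIMD lane per event
}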

lfield commented 3 years ago

There is an interesting chapter in the Data Parallel C++ book on 'Programming for CPUs', with a specific subsection on 'SIMD Vectorization on CPU'. You may be interested to take a look.

valassi commented 3 years ago

There is an interesting chapter in the Data Parallel C++ book on 'Programming for CPUs', with a specific subsection on 'SIMD Vectorization on CPU'. You may be interested to take a look.

Thanks Laurence. I assume you mean https://link.springer.com/chapter/10.1007/978-1-4842-5574-2_11. It is interesting.

I prefer another reference by Sebastien, however, as it contains many more practical details: http://sponce.web.cern.ch/sponce/CSC/slides/PracticalVectorization.booklet.pdf. I also got from him a few useful headers and papers on intrinsics (but I hope I do not need to go that way).

valassi commented 3 years ago

And... it is finally paying off :-)

I am now at around a factor 4 speedup gained through SIMD vectorization in C++, from ~0.5E6 to ~2E6 MEs per second. (Maybe more? I am not 100% sure where I was starting from.)

Now: https://github.com/valassi/madgraph4gpu/commit/5f9ff6dc2c31d21456bae6d6b592f7efc7f490bf

TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.384488e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.493997e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.082485e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0

One week ago: https://github.com/valassi/madgraph4gpu/commit/3178e95dffa663a5d080f139d8d5f878954f0096

TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 3.472268e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 3.553968e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.843094e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0

Using also some good advice from @sponce, I debugged the issues. Then I slowly added more and more stuff to vectors. I managed to stay with autovectorization on compiler vector extensions, no Vc/VecCore or others.

Many more things to clean up and analyse, but this is definitely very promising!

Achieving this was faster than I expected, bits and pieces of about 10 days. It would not have been possible without the work on AOSOA this summer, however.

oliviermattelaer commented 3 years ago

Excellent, this is really good news and impressive progress. We should have a small chat about this to see how to move forward.

sponce commented 3 years ago

This is excellent news! Really impressive, I must say. As far as I know, you have been running with AVX2 and with doubles, is that correct? So a factor 4 would mean a perfect speed-up in that context, which is amazing. Did you also change something else, like performing some cleanup or improving memory structures?

valassi commented 3 years ago

Thanks a lot Olivier and Sebastien! I was not sure how to answer your question, so I have done a bit more analysis and prototyping with compiler flags, and some cleanup in the code.

I have decided to cleanly hardcode and support only three scenarios: scalar, AVX2 and AVX512 (maybe I will add SSE, but what's the point today?). Sebastien, I was a bit inspired by Arthur's work, which we had discussed in the past, https://gitlab.cern.ch/lhcb/LHCb/-/blob/master/Kernel/LHCbMath/LHCbMath/SIMDWrapper.h. What I took is the use of ifdefs, with AVX512F and AVX2. So now I have

Then I tried several combinations of Makefiles, all with -O3, and measured the matrix element throughput:

So all in all I would say:

All these numbers must be taken with a pinch of salt, as there may be some fluctuations due to the load on the VM (in principle I was told this should not happen, but it is not completely ruled out). Anyway, I repeated these tests a few times in the same time frame, so there should be no large fluctuation.

I think that being close to a perfect speedup is excellent news, but I am not completely surprised, as I am only timing the number crunching, which is perfectly parallelizable: all events go through exactly the same calculation, so the calculation is fully in lockstep. I might even recover a tiny bit of what is missing between 3.82 and 4 when the last bits and pieces are also vectorized (the amp[2]). Note that on the GPU we have no evidence of thread divergence, and this is exactly the same thing.

If you have any comments about AVX512, please let me know! But all I have heard from @sponce and @hageboeck sounds like it is better to stay at AVX2 and not bother further. I might try a KNL for fun at some point (see https://colfaxresearch.com/knl-avx512), but it is probably pointless.

Ah, another question I had was whether alignas can make any difference here. I have the impression that the double __attribute__ ((vector_size (32))) is already an aligned RRRR. I have added an alignas to the complex vector just in case, but now that I think of it, it is irrelevant by definition (the operations are either on RRRR or on IIII).
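Not the actual code in the klas branch, but a minimal sketch of the two points above (the ifdef-based choice of vector width, and the alignment of the compiler vector types); fptype_v and cxtype_v are assumed names:

// Choose the SIMD vector width at compile time from the predefined macros
// set by -mavx2 / -mavx512f (or -march=...).
#if defined __AVX512F__
typedef double fptype_v __attribute__ (( vector_size( 64 ) )); // 8 doubles per vector
#elif defined __AVX2__
typedef double fptype_v __attribute__ (( vector_size( 32 ) )); // 4 doubles per vector
#else
typedef double fptype_v; // scalar fallback
#endif

// On gcc/clang a vector_size type is already aligned to its own size (e.g. 32
// bytes in the AVX2 case), so the extra alignas below is effectively redundant:
// operations act either on the RRRR vector or on the IIII vector, never across them.
static_assert( alignof( fptype_v ) == sizeof( fptype_v ), "fptype_v aligned to its size" );

struct alignas( sizeof( fptype_v ) ) cxtype_v // complex of vectors: RRRR + IIII
{
  fptype_v real;
  fptype_v imag;
};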

oliviermattelaer commented 3 years ago

My (small) experience with AVX512 is to let the compiler decide when to use it. It seems that (at least the Intel) compiler knows pretty well when it provides a speed boost (and therefore it avoids it quite often).

I indeed hear a lot of negative feedback about it.

Cheers,

Olivier

sponce commented 3 years ago

Very nice work, and it answers a lot of my questions. Here are a few thoughts/remarks:

hageboeck commented 3 years ago

If you have any comments about AVX512, please let me know! But all I have heard from @sponce and @hageboeck sounds like it is better to stay at AVX2 and not bother further. I might try a KNL for fun at some point (see https://colfaxresearch.com/knl-avx512), but it is probably pointless.

My comment is that middle-aged compilers are not good with it. In RooFit, only very few things profited from AVX512, and not so many "standard" CPUs support it. We settled for "we do one library for skylake 512, one for AVX2 (= AMD and most Intel CPUs out there) and one for SSE4.1". If you don't fall into this, you get normal scalars.

Oh, and all we ever did was autovectorisation. Recent compilers are great if you write simple code!

And lastly: Our main focus remains AVX2. It's supported everywhere, and AMD is totally killing it on their modern CPUs. Edit: I don't think AMD will give us 512, so we might as well not invest too much time. Again, the Ryzens are killing it already with AVX2.

And lastly lastly: clang's diagnostics are amazing! Here is part of a script that I use to automatically recompile a file with switched-on diagnostics. You can e.g. pipe the compile command into it, or use cmake to extract the compile command from the build system.

cd $directory
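# $directory and $compileCommand are assumed to be set by the surrounding (not shown)
# part of the script, e.g. from a compile command piped in or extracted from the
# cmake build system as described above.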
if [[ "$compileCommand" =~ ^.*clang.*$ ]]; then
  clangFlags="-Xclang -fcolor-diagnostics -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize -fno-math-errno"
  # Run with diagnostics or on compiler error run without redirecting output
  $compileCommand $clangFlags >/tmp/vecReport_all.txt 2>&1 || $compileCommand || exit 1

  # Not interested in std library vectorisation reports:
  grep -v "/usr/" /tmp/vecReport_all.txt > /tmp/vecReport.txt

  sed -nE '/remark.*(not vectorized|vector.*not benef)/{N;{p}}' /tmp/vecReport.txt | sed -n 'N; s/\n/LINEBREAK/p' | sort -u | sed -n 's/LINEBREAK/\n/p'
  grep --color "vectorized loop" /tmp/vecReport.txt
else
  gccFlags="-ftree-vectorizer-verbose=2 -fdiagnostics-color=always"
  $compileCommand $gccFlags -fopt-info-vec-missed 2>&1 | grep -vE "^/usr/|googletest" | sort -u || $compileCommand || exit 1
  $compileCommand $gccFlags -fopt-info-vec-optimized 2>&1 | grep --color -E "^/home.*vectorized"
fi

sponce commented 3 years ago

I realize I forgot a point in my comment (although the last one somehow encompasses it): did you compute how much you gain overall with this vectorization, i.e. the full processing time and not only the vectorized part? I ask because I've seen so many cases of perfect vectorization (a factor 4 here) where the final software was slower overall, the reason being that you lose more time later dealing with vectorized data than you gained initially. Of course, that all depends on how big the vectorized part is and how badly the vector data is used later (maybe you actually even gain there).

valassi commented 3 years ago

Hi, thanks both :-)

@hageboeck, looks like I am more or less along your lines already in the klas branch

@sponce, good points:

Finally, a very good point on the real speedup and Amdahl's law: I know, here I am talking only of the matrix element. But funnily enough, even for a simple LEP eemumu process, on C++/CPU this is the dominant part (on a GPU it is totally negligible). Quoting from memory, something like 12s for MEs against 1s for the rest, now reduced to 3s+1s, i.e. roughly a factor 3 overall (13s/4s). When we go to LHC processes, the ME part will be MUCH larger, both on CPU and GPU, I think. So any speedup from porting this part is really great...

valassi commented 3 years ago

I moved the momenta array from a C-style AOSOA to an AOSOA where the final "A" is a vector type. I am not sure exactly why, but I seem to have gained another factor ~1.5 speedup: from 2E6 to 3E6.
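Roughly speaking, with made-up names (np4, npar, MomentaPage*) rather than the actual constants in the code, the change is the following: the innermost 'A' of the momenta AOSOA becomes a compiler vector type instead of a plain C array.

typedef double fptype_v __attribute__ (( vector_size( 32 ) )); // 4 doubles (AVX2)

constexpr int np4 = 4;  // E, px, py, pz
constexpr int npar = 4; // external particles in e+ e- -> mu+ mu-

// C-style AOSOA page: momenta[ipar][ip4][ieppV], innermost a plain double[4]
typedef double MomentaPageC[npar][np4][4];

// Vector-type AOSOA page: the trailing double[4] becomes a single fptype_v, so
// each (particle, component) pair loads directly into one SIMD register
typedef fptype_v MomentaPageV[npar][np4];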

Now https://github.com/valassi/madgraph4gpu/commit/a2f8cf90f3e33aef576157e5a601a3570ca31fe5

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.965374e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.689275e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.760994e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.534033e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.735872e-01                 )  sec
MeanTimeInMatrixElems      = ( 1.735872e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.735872e-01 ,  1.735872e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.768033e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.949552e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.020316e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************

Was https://github.com/valassi/madgraph4gpu/commit/2d276bfdc1bdbea4137dd5aea33090d49c57bbcd

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.720043e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.441720e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.783231e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.215697e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.520150e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.520150e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.520150e-01 ,  2.520150e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.409360e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.523332e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.080384e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
valassi commented 3 years ago

A large chunk of work on this issue will soon be merged from PR #152.

PR #152 replaces the two previous draft PRs #72 and #132, which I have closed. The reason these two are obsolete is that I completely rebased my SIMD work (which is in epoch1) on epoch2-level code.

I have finally completed the "merge" of epoch2 and epoch1 of issue #139 in PR #151 (the "ep12" of "klas2ep12"). Presently epoch1 and epoch2 are identical. I will merge my SIMD work in epoch1 and keep epoch2 as-is, pre-vectorization, as a reference.

I am copying here a few comments I made in PR #152

The CURRENT BASELINE BEFORE VECTORIZATION is that at the end of PR #151:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.050711 sec
real    0m8.079s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.233023 sec
real    0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     8.059035 sec
real    0m8.086s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
TOTAL       :     1.177079 sec
real    0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

My CURRENT (WIP) BASELINE WITH VECTORIZATION is that in the FINAL MERGE OF 'origin/ep2to2ep1' into klas2ep12: https://github.com/madgraph5/madgraph4gpu/commit/870b8b342dd5b2fd1923550d8eaab6b48e88b4c2

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.319998e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.247976 sec
real    0m7.274s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.847687e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.181390 sec
real    0m1.488s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] (AVX512F)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.718034e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.749072 sec
real    0m3.775s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.806047e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.199100 sec
real    0m1.505s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.130720e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     8.045962 sec
real    0m8.072s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.855929e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.207904 sec
real    0m1.528s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

So, if I compare the vectorization branch to the current master, I see


A few additional comments (not in PR #152):

I will probably merge #152 tomorrow.

valassi commented 3 years ago

I think that this old issue #71 can now be closed.

In epochX (issue #244) I have now backported vectorization to the Python code-generating code, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to LHC!) ggttgg process. I observe similar speedups there, or even slightly better, for reasons to be understood. With respect to basic C++ with no SIMD, through the appropriate use of SIMD, e.g. AVX512 in 256-bit mode (see also #173), and LTO-like aggressive inlining (see #229), I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float.

See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto

DOUBLE

NO SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.809004e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.809004e+03                 )  sec^-1

512y SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.490002e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.490002e+03                 )  sec^-1

For double, INLINING does not pay off: it is worse than NO INLINING, both without and with SIMD. What is interesting is that 512z is better than 512y in that case.

FLOAT

NO SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.793838e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.793838e+03                 )  sec^-1

512y SIMD, NO INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.397076e+04                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.397076e+04                 )  sec^-1
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 7775) (512y:   29) (512z:    0)

512z SIMD, INLINING
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.3.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.391933e+04                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.391933e+04                 )  sec^-1
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 4075) (512y:    7) (512z:39296)

That is to say, with float, INLINING eventually gives the same maximum speed as NO INLINING, but the former case is with AVX512/z and the latter with AVX512/y. Strange. In the simpler eemumu process, inlining did seem to provide a major performance boost (which I could not explain). The summary is that we should use ggttgg for real studies - but also that we get VERY promising results there!

Anyway, I am closing this and will repost these numbers on the LTO study issue #229 and the AVX512 study issue #173.