There are essentially two things to be ported.
For reference, the latter are the following commits, which are quite few:
[avalassi@itscrd70 bash] ~/GPU2020/madgraph4gpuTer/epoch2/cuda/ee_mumu> git diff upstream/master
[avalassi@itscrd70 bash] ~/GPU2020/madgraph4gpuTer/epoch2/cuda/ee_mumu> git log .

commit ff021bdfafa5068b841fbd1754052524270301b2
Author: Stephan Hageboeck <stephan.hageboeck@cern.ch>
Date:   Wed Dec 16 18:22:26 2020 +0100

    [ep2 cuda eemm] Port fixes in Makefile to epoch2.

commit 57497a79292cf0616e04ab9b866bba305eb93f54
Author: Stephan Hageboeck <stephan.hageboeck@cern.ch>
Date:   Wed Dec 16 17:05:24 2020 +0100

    [ep2 cuda eemm] Port CUDA tests to epoch2.

commit 05d3a6f8de663db276b4755785d0713b27b043bd
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Wed Dec 2 13:52:59 2020 +0100

    port into MG5aMC the change from https://github.com/madgraph5/madgraph4gpu/pull/78

commit a6c18e2715bfa5b39727ba6407031f6c7633ab78
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Mon Nov 30 23:26:50 2020 +0100

    cpp compilation is working

commit a683a247e13d3aedc88f62c7d3f20aefde6943d5
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Sun Nov 29 21:38:08 2020 +0100

    fix issue with ixxxx

commit 389aaaa72343ad05f83168bc4fbd390ccee013e0
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Fri Nov 27 10:28:45 2020 +0100

    adding json info/ more plot from PR#61

commit c092e9a053e7f037791f25462fcf8598232fab49
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Thu Nov 26 20:59:28 2020 +0100

    first version of ee_mumu coming from madgraph --some PR still need to be included here
I merged the first batch of changes from PR #140: clean up and rename files in epoch2/eemumu.
These changes are all in epoch2, essentially:
Running EVENTUALLY-TODO:
I have decided to split the remaining tasks further into two PRs. I have done everything except CPPProcess, but this is the most complex part (and I actually even see minor performance differences). I will split that out into a third PR.
Recap about issue #139
More details about this PR #149 are given below, copied from the text of the PR.
In src:
1) Parameters_sm.h Remove "using namespace std;" in epoch2. Otherwise almost identical. Copy epoch2 to epoch1.
2) Parameters_sm.cc Add explicit std:: in epoch2. Otherwise almost identical. Copy epoch2 to epoch1. (See the sketch after this list.)
3) read_slha.h Identical but for indentation: fix them manually and make them equal. (clang-format would bring too many changes)
4) read_slha.cc Identical but for a default parameter value in implementation in epoch2. Fix by copying epoch1 to epoch2.
5) rambo.h/cc Identical in epoch2 and epoch1, nothing to do
6) mgOnGpuConfig.h Identical, except for a comment (did the percent sign disturb the metacode?). Fix by copying epoch1 to epoch2.
7) mgOnGpuTypes.h Identical.
8) Makefile Almost identical, but ep1 has OMP, fastmath, Wextra. Fix by copying epoch1 to epoch2.
9) HelAmps.h/cc MISSING IN EPOCH1! Do this later...
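As a minimal illustration of the change in items 1 and 2, this is the pattern only, with hypothetical class and function names, not the actual Parameters_sm code:

```cpp
// Illustrative pattern only (hypothetical names, not the actual Parameters_sm code).
// Header part: no "using namespace std;", so that including this header does not
// inject the whole std namespace into every translation unit that includes it.
#include <string>
class ExampleParameters
{
public:
  void print( const std::string& title ) const; // types fully qualified with std::
};

// Implementation part (.cc): use explicit std:: qualifiers instead of a using-directive.
#include <iostream>
void ExampleParameters::print( const std::string& title ) const
{
  std::cout << title << std::endl;
}
```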
In SubProcesses and below:
1) timer.h Identical
2) Makefile Almost identical, but epoch1 has much more: cosmetics, and copy epoch1 to epoch2. Now also added to epoch2, as in epoch1: OMP, fastmath, Wextra, clang patch, host info.
Note: at this stage, epoch1 is slightly faster than epoch2 in C++, but the reverse in CUDA.
3) Memory.h, nvtx.h, perf.py Identical (but a symlink is missing, to be added in epoch1)
4) timermap.h Copy epoch1 to epoch2 to add missing gcc pragmas for nvtx warnings
5) perf/data Only in epoch1 - one json file, keep it there
6) profile.sh Only in epoch1 - should bring it forward eventually (anyway the basis will be epoch1)
7) runTest.cc Initially identical, but the tests had different names (e.g. EP1_CUDA_GPU vs EP2_CUDA_GPU). This is fixed by adding epoch_process_id.h, where a different macro is defined per epoch; runTest.cc is now identical. (See the sketch below.)
8) check.cc
First batch of changes
Minimal changes in epoch1:
Port to epoch2 many changes from epoch1:
7bis) runTest.cc 8bis) check.cc
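Regarding item 7 above, here is a minimal sketch of how a per-epoch header can give the shared runTest.cc distinct test names. This is illustrative only; the actual epoch_process_id.h and runTest.cc may differ:

```cpp
// epoch_process_id.h (illustrative sketch, the real file may differ)
#ifndef EPOCH_PROCESS_ID_H
#define EPOCH_PROCESS_ID_H
#define MG_EPOCH_PROCESS_ID EP2_EEMUMU // defined as EP1_EEMUMU in epoch1
#endif

// runTest.cc can then stay byte-identical across epochs and build the test
// suite name (e.g. EP1_EEMUMU_GPU vs EP2_EEMUMU_GPU) by macro concatenation,
// for instance:
// #define TESTID( s ) s##_GPU
// #define XTESTID( s ) TESTID( s )
// TEST( XTESTID( MG_EPOCH_PROCESS_ID ), runTest ) { ... }
```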
A large batch of additional changes (mainly in PR #144) came from fixing epoch2 check.cc to use fptype for random numbers, as in epoch1. This triggered many additional checks about single precision, which are included in PR #144 together with a better treatment of NaNs.
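As a reminder of what "use fptype" means here, this is a sketch of the kind of compile-time precision switch involved (illustrative; the actual mgOnGpuConfig.h may differ):

```cpp
// Illustrative sketch of a single/double precision switch (the actual
// mgOnGpuConfig.h may differ). Using fptype consistently, including for the
// random numbers fed into rambo, keeps the whole chain in a single precision
// and avoids silent float/double mismatches.
#ifdef MGONGPU_FPTYPE_FLOAT
typedef float fptype;  // single precision build
#else
typedef double fptype; // double precision build (default)
#endif
```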
This is the situation at the time of this PR (after some previous ones). The rest will then be about CPPProcess.
I have FINALLY also completed the third big part, PR #151. Now epoch1 and epoch2 (before vectorization) are strictly identical. I will now go on and develop on top of epoch1 (vectorization and more), while keeping epoch2 as a pre-vectorization reference.
This is a summary of what this contains:
In general
Changes in file directory structure
Changes in file formatting, cosmetics, minor content issues
Code cleanup potentially affecting performance
Changes in XXX functions
Changes in FFV functions
Changes in sigmakin or calculate_wavefunction
Changes in CPPProcess other than XXX, FFV or formatting
Printouts and performance tools
TODO EVENTUALLY (after vectorization; add to the running list from previous PRs)
I will now self-merge that.
I have merged #151.
I keep this issue #139 open because I'd like to do some cleanup after merging vectorization. From bits and pieces of my previous TODO EVENTUALLY list:
My CURRENT BASELINE PERFORMANCE (before vectorization) is described in https://github.com/madgraph5/madgraph4gpu/commit/1c25007f59488e86c87f3e4d46043f09a140aafd
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.050711 sec
real 0m8.079s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.233023 sec
real 0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.059035 sec
real 0m8.086s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.177079 sec
real 0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
The bulk of the tasks described here were completed long ago.
The pending items were also essentially all completed in one way or another in epochX (issue #244).
Answering point by point on my own latest comment:
"go back to check_sa.cc instead of check.cc": this was done in the epochX3 PR. See https://github.com/madgraph5/madgraph4gpu/commit/2723b954a941c3ccfd6afa1d698e8a41993c51cf#diff-1d071842bcb7eebdc7da20d041523263ec4c0e38a7b12f8662186a62de2e545b
"get rid of .cu to .cc symlinks, use the nvcc option to treat cc as cu instead": this is not done. I did remove grambo.cu (as in CUDA this is in any case included: I include rambo.cc). I mention the pending issues in issue #54 about a general cleanup of the CUDA/C++ single source.
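For context, the nvcc option referred to is `-x cu`, which forces nvcc to treat an input file as CUDA regardless of its extension. A hypothetical Makefile fragment (not the repository's actual build rules) could look like this:

```makefile
# Hypothetical fragment, not the actual madgraph4gpu Makefile:
# compile the same .cc source once as plain C++ and once as CUDA,
# without any .cu -> .cc symlink.
check_cpp.o: check_sa.cc
	$(CXX) $(CXXFLAGS) -c $< -o $@
check_gpu.o: check_sa.cc
	$(NVCC) -x cu $(CUFLAGS) -c $< -o $@
```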
"remove XXX and FFV functions from CPPProcess, move them only to HelAmps": this has been done for quite some time now; I do not remember since when. Note that HelAmps.cc is included in CPPProcess both in CUDA (no rdc) and in C++ (aggressive inlining mimicking LTO) for performance reasons, but this is another story. Also, HelAmps is in src and can eventually be included by several different files if we go to nprocesses>1 (see #272).
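A minimal sketch of the inclusion pattern mentioned above (illustrative; the actual CPPProcess.cc is of course much larger):

```cpp
// Illustrative sketch, not the actual CPPProcess.cc.
// Pulling the HelAmps implementation into this translation unit means that
// CUDA needs no relocatable device code (-rdc), and that the C++ compiler can
// aggressively inline the XXX/FFV helpers, mimicking link-time optimisation.
#include "HelAmps.h"
#include "HelAmps.cc" // deliberately include the .cc, not only the .h
```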
"move runTests.cc to SubProcesses and add a link in the PSigma directory": this is done in epochX (probably since very early on).
"use a beautifier on the code": this is discussed in issue #49. However, in epochX (issue #244) I spent a lot of effort to ensure that the code generation produces code that is now "beautifully" indented and formatted. So something like clang-format becomes less relevant (I would actually avoid it at this point).
"My CURRENT BASELINE PERFORMANCE (before vectorization) is described in 1c25007": note that now in epochX all performances are described in logs committed to the repo, rather than in git commit logs. The equivalent of the above (with the same performance within fluctuations) is the epochX2 golden tag https://github.com/madgraph5/madgraph4gpu/blob/golden_epochX2/epochX/cudacpp/tput/throughputX_log_eemumu.txt
This issue can now be closed
Hi @roiser @oliviermattelaer @hageboeck
All my latest developments (master, klas/vectorization, heterogeneous, unweighting etc.) are in epoch1/eemumu. It was my mistake not to base these on epoch2/eemumu.
Luckily, the changes that have been made in epoch2/eemumu since its initial creation are relatively few.
As agreed with @roiser on Friday, I will essentially merge those epoch2 changes into epoch1. I proposed to also upgrade epoch2 to the level of epoch1, but we agreed that it is better to use epoch1/eemumu as the basis, and eventually use this to backport to MG and create epoch3.
We also agreed that I will self-merge essentially all of this stuff, as there is quite a bit to be done. I will try to document here what I am doing and why.