There are essentially two things to be ported.
For reference, the latter are the following commits, which are quite few:
[avalassi@itscrd70 bash] ~/GPU2020/madgraph4gpuTer/epoch2/cuda/ee_mumu> git diff upstream/master
[avalassi@itscrd70 bash] ~/GPU2020/madgraph4gpuTer/epoch2/cuda/ee_mumu> git log .

commit ff021bdfafa5068b841fbd1754052524270301b2
Author: Stephan Hageboeck <stephan.hageboeck@cern.ch>
Date:   Wed Dec 16 18:22:26 2020 +0100

    [ep2 cuda eemm] Port fixes in Makefile to epoch2.

commit 57497a79292cf0616e04ab9b866bba305eb93f54
Author: Stephan Hageboeck <stephan.hageboeck@cern.ch>
Date:   Wed Dec 16 17:05:24 2020 +0100

    [ep2 cuda eemm] Port CUDA tests to epoch2.

commit 05d3a6f8de663db276b4755785d0713b27b043bd
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Wed Dec 2 13:52:59 2020 +0100

    port into MG5aMC the change from https://github.com/madgraph5/madgraph4gpu/pull/78

commit a6c18e2715bfa5b39727ba6407031f6c7633ab78
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Mon Nov 30 23:26:50 2020 +0100

    cpp compilation is working

commit a683a247e13d3aedc88f62c7d3f20aefde6943d5
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Sun Nov 29 21:38:08 2020 +0100

    fix issue with ixxxx

commit 389aaaa72343ad05f83168bc4fbd390ccee013e0
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Fri Nov 27 10:28:45 2020 +0100

    adding json info/ more plot from PR#61

commit c092e9a053e7f037791f25462fcf8598232fab49
Author: Olivier Mattelaer <olivier.mattelaer@uclouvain.be>
Date:   Thu Nov 26 20:59:28 2020 +0100

    first version of ee_mumu coming from madgraph --some PR still need to be included here
I merged the first batch of changes from PR #140: clean up and rename files in epoch2/eemumu.
These changes are all in epoch2, essentially:
Running EVENTUALLY-TODO:
I have decided to split the remaining tasks further into two PRs. I have done everything except CPPProcess, but this is the most complex part (and I actually even see minor performance differences). I will split that out into a third PR.
Recap about issue #139
More details about this PR #149 are given below, copied from the text of the PR.
In src:
1) Parameters_sm.h Remove "using namespace std;" in epoch2. Otherwise almost identical. Copy epoch2 to epoch1.
2) Parameters_sm.cc Add explicit std:: in epoch2. Otherwise almost identical. Copy epoch2 to epoch1. (See the sketch after this list.)
3) read_slha.h Identical but for indentation: fix them manually and make them equal. (clang-format would bring too many changes)
4) read_slha.cc Identical but for a default parameter value in implementation in epoch2. Fix by copying epoch1 to epoch2.
5) rambo.h/cc Identical in epoch2 and epoch1, nothing to do
6) mgOnGpuConfig.h Identical, except for a comment (did the percent sign disturb the metacode?). Fix by copying epoch1 to epoch2.
7) mgOnGpuTypes.h Identical.
8) Makefile Almost identical, but ep1 has OMP, fastmath, Wextra. Fix by copying epoch1 to epoch2.
9) HelAmps.h/cc MISSING IN EPOCH1! Do this later...
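As a minimal illustration of the change in items 1 and 2, this is the pattern only, with hypothetical class and function names, not the actual Parameters_sm code:

```cpp
// Illustrative pattern only (hypothetical names, not the actual Parameters_sm code).
// Header part: no "using namespace std;", so that including this header does not
// inject the whole std namespace into every translation unit that includes it.
#include <string>
class ExampleParameters
{
public:
  void print( const std::string& title ) const; // types fully qualified with std::
};

// Implementation part (.cc): use explicit std:: qualifiers instead of a using-directive.
#include <iostream>
void ExampleParameters::print( const std::string& title ) const
{
  std::cout << title << std::endl;
}
```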
In SubProcesses and below:
1) timer.h Identical
2) Makefile Almost identical, but epoch1 has much more: cosmetics, and copy epoch1 to epoch2. Now also added to epoch2, as in epoch1: OMP, fastmath, Wextra, clang patch, host info.
Note: at this stage, epoch1 is slightly faster than epoch2 in C++, but the reverse in CUDA.
3) Memory.h, nvtx.h, perf.py Identical (but a symlink is missing, to be added in epoch1)
4) timermap.h Copy epoch1 to epoch2 to add missing gcc pragmas for nvtx warnings
5) perf/data Only in epoch1 - one json file, keep it there
6) profile.sh Only in epoch1 - should bring it forward eventually (anyway the basis will be epoch1)
7) runTest.cc Initially identical, but the tests had different names (e.g. EP1_CUDA_GPU vs EP2_CUDA_GPU). This is fixed by adding epoch_process_id.h, where a different macro is defined per epoch; runTest.cc is now identical. (See the sketch below.)
8) check.cc
First batch of changes
Minimal changes in epoch1:
Port to epoch2 many changes from epoch1:
7bis) runTest.cc 8bis) check.cc
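Regarding item 7 above, here is a minimal sketch of how a per-epoch header can give the shared runTest.cc distinct test names. This is illustrative only; the actual epoch_process_id.h and runTest.cc may differ:

```cpp
// epoch_process_id.h (illustrative sketch, the real file may differ)
#ifndef EPOCH_PROCESS_ID_H
#define EPOCH_PROCESS_ID_H
#define MG_EPOCH_PROCESS_ID EP2_EEMUMU // defined as EP1_EEMUMU in epoch1
#endif

// runTest.cc can then stay byte-identical across epochs and build the test
// suite name (e.g. EP1_EEMUMU_GPU vs EP2_EEMUMU_GPU) by macro concatenation,
// for instance:
// #define TESTID( s ) s##_GPU
// #define XTESTID( s ) TESTID( s )
// TEST( XTESTID( MG_EPOCH_PROCESS_ID ), runTest ) { ... }
```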
A large batch of additional changes (mainly in PR #144) came from fixing epoch2 check.cc to use fptype for random numbers, as in epoch1. This triggered many additional checks about single precision, which are included in PR #144 together with a better treatment of NaNs.
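As a reminder of what "use fptype" means here, this is a sketch of the kind of compile-time precision switch involved (illustrative; the actual mgOnGpuConfig.h may differ):

```cpp
// Illustrative sketch of a single/double precision switch (the actual
// mgOnGpuConfig.h may differ). Using fptype consistently, including for the
// random numbers fed into rambo, keeps the whole chain in a single precision
// and avoids silent float/double mismatches.
#ifdef MGONGPU_FPTYPE_FLOAT
typedef float fptype;  // single precision build
#else
typedef double fptype; // double precision build (default)
#endif
```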
This is the situation at the time of this PR (after some previous ones). The rest will then be about CPPProcess.
I have FINALLY also completed the third big part, PR #151. Now epoch1 and epoch2 (before vectorization) are strictly identical. I will now go on and develop on top of epoch1 (vectorization and more), while keeping epoch2 as a pre-vectorization reference.
This is a summary of what this contains:
In general
Changes in file directory structure
Changes in file formatting, cosmetics, minor content issues
Code cleanup potentially affecting performance
Changes in XXX functions
Changes in FFV functions
Changes in sigmakin or calculate_wavefunction
Changes in CPPProcess other than XXX, FFV or formatting
Printouts and performance tools
TODO EVENTUALLY (after vectorization; add to the running list from previous PRs)
I will now self-merge that.
I have merged #151.
I keep this issue #139 open because I'd like to do some cleanup after merging vectorization. From bits and pieces of my previous TODO EVENTUALLY list:
My CURRENT BASELINE PERFORMANCE (before vectorization) is described in https://github.com/madgraph5/madgraph4gpu/commit/1c25007f59488e86c87f3e4d46043f09a140aafd
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.133317e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.050711 sec
real 0m8.079s
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.852279e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.233023 sec
real 0m1.552s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.132827e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 8.059035 sec
real 0m8.086s
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.870531e+08 ) sec^-1
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
TOTAL : 1.177079 sec
real 0m1.485s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
The bulk of the tasks described here were completed long ago.
The pending items were also essentially all completed in one way or another in epochX (issue #244).
Answering point by point on my own latest comment:
"go back to check_sa.cc instead of check.cc": this was done in the epochX3 PR. See https://github.com/madgraph5/madgraph4gpu/commit/2723b954a941c3ccfd6afa1d698e8a41993c51cf#diff-1d071842bcb7eebdc7da20d041523263ec4c0e38a7b12f8662186a62de2e545b
"get rid of .cu to .cc symlinks, use the nvcc option to treat cc as cu instead": this is not done. I did remove grambo.cu (as in CUDA this is in any case included: I include rambo.cc). I mention the pending issues in issue #54 about a general cleanup of the CUDA/C++ single source.
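For context, the nvcc option referred to is `-x cu`, which forces nvcc to treat an input file as CUDA regardless of its extension. A hypothetical Makefile fragment (not the repository's actual build rules) could look like this:

```makefile
# Hypothetical fragment, not the actual madgraph4gpu Makefile:
# compile the same .cc source once as plain C++ and once as CUDA,
# without any .cu -> .cc symlink.
check_cpp.o: check_sa.cc
	$(CXX) $(CXXFLAGS) -c $< -o $@
check_gpu.o: check_sa.cc
	$(NVCC) -x cu $(CUFLAGS) -c $< -o $@
```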
"remove XXX and FFV functions from CPPProcess, move them only to HelAmps": this has been done for quite some time now; I do not remember since when. Note that HelAmps.cc is included in CPPProcess both in CUDA (no rdc) and in C++ (aggressive inlining mimicking LTO) for performance reasons, but this is another story. Also, HelAmps is in src and can eventually be included by several different files if we go to nprocesses>1 (see #272).
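A minimal sketch of the inclusion pattern mentioned above (illustrative; the actual CPPProcess.cc is of course much larger):

```cpp
// Illustrative sketch, not the actual CPPProcess.cc.
// Pulling the HelAmps implementation into this translation unit means that
// CUDA needs no relocatable device code (-rdc), and that the C++ compiler can
// aggressively inline the XXX/FFV helpers, mimicking link-time optimisation.
#include "HelAmps.h"
#include "HelAmps.cc" // deliberately include the .cc, not only the .h
```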
"move runTests.cc to SubProcesses and add a link in the PSigma directory": this is done in epochX (probably since very early on).
"use a beautifier on the code": this is discussed in issue #49. However, in epochX (issue #244) I spent a lot of effort to ensure that the code generation produces code that is now "beautifully" indented and formatted. So something like clang-format becomes less relevant (I would actually avoid it at this point).
"My CURRENT BASELINE PERFORMANCE (before vectorization) is described in 1c25007": note that now in epochX all performances are described in logs committed to the repo, rather than in git commit logs. The equivalent of the above (with the same performance within fluctuations) is the epochX2 golden tag https://github.com/madgraph5/madgraph4gpu/blob/golden_epochX2/epochX/cudacpp/tput/throughputX_log_eemumu.txt
This issue can now be closed
Hi @roiser @oliviermattelaer @hageboeck
All my latest developments (master, klas/vectorization, heterogeneous, unweighting etc.) are in epoch1/eemumu. It was my mistake not to base these on epoch2/eemumu.
Luckily, the changes that have been made in epoch2/eemumu since its initial creation are relatively few.
As agreed with @roiser on Friday, I will essentially merge those epoch2 changes into epoch1. I proposed to also upgrade epoch2 to the level of epoch1, but we agreed that it is better to use epoch1/eemumu as the basis, and eventually use this to backport to MG and create epoch3.
We also agreed that I will self-merge essentially all of this stuff, as there is quite a bit to be done. I will try to document here what I am doing and why.