valassi closed this issue 1 year ago
Note that icc 19.1 should already be supported with CUDA 11.0 https://docs.nvidia.com/cuda/archive/11.0/cuda-installation-guide-linux/index.html
The vector extensions cause some issues on icc; this needs some debugging. The error from the compiler makes no sense: either both arguments are double or both are fptype_v...?
ccache icpc -O3 -std=c++17 -I. -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=nehalem -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST -c rambo.cc -o rambo.o
In file included from rambo.h(6),
from rambo.cc(1):
mgOnGpuVectors.h(63): error: a value of type "mgOnGpu::fptype_v" cannot be used to initialize an entity of type "double"
cxtype_v( const fptype_v& r, const fptype_v& i ) : m_real{r}, m_imag{i} {}
^
Note that in principle simple things should work https://stackoverflow.com/a/43801280
Brief status:
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [icc 1910]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.209906e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 6.724400 sec
19,820,008,713 cycles # 2.669 GHz
47,411,758,319 instructions # 2.39 insn per cycle
6.753384559 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 2572) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icc 1910]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.310986e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371781e-02 +- 3.268987e-06 ) GeV^0
TOTAL : 6.221624 sec
17,585,205,984 cycles # 2.669 GHz
44,086,482,510 instructions # 2.51 insn per cycle
6.260991680 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 2052) (avx2: 0) (512y: 0) (512z: 0)
=========================================================================
Note that PR #225 does not use vectorization yet.
Maybe use the OpenCL vector types? https://software.intel.com/content/www/us/en/develop/documentation/iocl-opg/top/coding-for-the-intel-cpu-opencl-device/using-vector-data-types.html
PS: No, drop that idea... it seems too complex... better to stick to gcc and clang.
I now understand that we should use icx (based on clang) rather than icc in the future. I am testing this and it looks much better (e.g. the vector extensions of clang work out of the box!); I will give an update soon. https://software.intel.com/content/www/us/en/develop/articles/porting-guide-for-icc-users-to-dpcpp-or-icx.html
I have added some patches in PR #225 (previously for icc, now for icx!).
SIMD works out of the box with clang vector extensions.
Performance is quite good. Note that the 512y build uses AVX512 instructions (on top of AVX2) in a way that is very different from gcc's. Note also that 512z with zmm registers is still slower than avx2.
Performance is as in https://github.com/madgraph5/madgraph4gpu/commit/5291e90056a14820c145b88aa61994759c31719b
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.279997e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 6.202581 sec
18,381,246,706 cycles # 2.660 GHz
47,024,518,007 instructions # 2.56 insn per cycle
6.230900555 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1256) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 2.703703e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.640068 sec
11,461,025,252 cycles # 2.651 GHz
27,399,149,934 instructions # 2.39 insn per cycle
3.668955160 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3575) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 5.352981e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 2.526700 sec
7,959,938,892 cycles # 2.500 GHz
14,869,683,474 instructions # 1.87 insn per cycle
2.555407655 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2983) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 5.424606e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 2.448236 sec
7,902,434,434 cycles # 2.501 GHz
13,833,834,610 instructions # 1.75 insn per cycle
2.476706923 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2662) (512y: 23) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 3.765503e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 2.961650 sec
7,888,560,957 cycles # 2.151 GHz
12,777,170,156 instructions # 1.62 insn per cycle
2.990664088 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3414) (512y: 16) (512z: 1195)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.346366e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371780e-02 +- 3.268978e-06 ) GeV^0
TOTAL : 5.821107 sec
16,426,527,461 cycles # 2.665 GHz
45,368,664,996 instructions # 2.76 insn per cycle
5.841815941 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1731) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 5.233977e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371780e-02 +- 3.268977e-06 ) GeV^0
TOTAL : 2.436466 sec
7,319,517,943 cycles # 2.645 GHz
16,533,218,574 instructions # 2.26 insn per cycle
2.457423776 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 4248) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.088952e+07 ) sec^-1
MeanMatrixElemValue = ( 1.371786e-02 +- 3.269407e-06 ) GeV^0
TOTAL : 1.750793 sec
5,333,417,013 cycles # 2.529 GHz
10,175,903,936 instructions # 1.91 insn per cycle
1.771695841 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3703) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.096628e+07 ) sec^-1
MeanMatrixElemValue = ( 1.371786e-02 +- 3.269407e-06 ) GeV^0
TOTAL : 1.747508 sec
5,341,359,998 cycles # 2.530 GHz
9,923,934,126 instructions # 1.86 insn per cycle
1.768237794 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3118) (512y: 38) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 7.953015e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371786e-02 +- 3.269407e-06 ) GeV^0
TOTAL : 1.985831 sec
5,261,809,712 cycles # 2.265 GHz
9,428,840,503 instructions # 1.79 insn per cycle
2.006741173 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3784) (512y: 0) (512z: 1865)
=========================================================================
Amongst the things to do:
Within PR #328 I have added a few patches for both icx2021 (based on clang13) and icx2022 (based on clang14). The former is supported by the latest CUDA 11.6, to which I moved in that same PR. The latter is not yet supported by any CUDA release.
The performance with icx2021 is quite interesting https://github.com/madgraph5/madgraph4gpu/pull/328/commits/6350a75040cb554b078683edb9c750e1352ff5c0
The results with icx2022 are very similar performance-wise https://github.com/madgraph5/madgraph4gpu/pull/328/commits/8d04e49ec95c6086c60f8b68958e975a1c4181fc
I have now added support for icx2023 in MR #593. This uses clang16 internally.
Its results are very similar to those of clang14 as seen in MR #591. I previously had the impression that icx was slightly better than clang, but this no longer seems to be the case.
The results of clang itself, conversely, do differ from gcc and are worth investigating. I will open an issue.
Note that the bug #338 about icx2022 giving different physics results has been fixed (though I am not sure where it was fixed).
This can be closed as done.
This was discussed during the June 28 meeting https://indico.cern.ch/event/1053713/
Would this help with the AVX512 issue #173?