valassi opened 3 years ago
First patch for Marconi M100 (Power9+A100) is PR #224
(1) From http://cdn.openpowerfoundation.org/wp-content/uploads/resources/Intrinsics-Reference_final/Intrinsics-Reference-20200811.pdf: "IBM extended VMX by introducing the Vector-Scalar Extension (VSX) for the POWER7 family of processors. VSX adds sixty-four 128-bit vector-scalar registers (VSRs); however, to optimize the amount of per-process register state, the registers overlap with the VRs and the scalar floating-point registers (FPRs) (see Section 1.2, “The Unified Vector Register Set” [2]). The VSRs can represent all the data types representable by the VRs, and can also be treated as containing two 64-bit integers or two 64-bit double-precision floating-point values. However, ISA support for two 64-bit integers in VSRs was limited until Version 2.07 (POWER8) of the Power ISA, and only the VRs are supported for these instructions."
(2) From https://developer.ibm.com/technologies/linux/tutorials/migrate-app-on-lop/: "Most applications can be compiled and run on Power Systems without the need to modify the source codes. Architecture-specific options can be applied to improve the performance, such as [...]"
"Some applications include x86-specific (such as #ifdef SSE, #ifdef __x86_64) defines to enable performance on the x86 system. In this case, manually adding -DSSE or -DPPC__ is necessary to map or replace x86 functions to Power-specific functions."
"IBM has published tips (https://www.ibm.com/support/pages/vectorizing-fun-and-performance) and a guide to help porting code containing MMX/SSE/AVX to VSX (https://openpowerfoundation.org/?resource_lib=linux-power-porting-guide-vector-intrinsics)"
"IBM Advance Toolchain compiler or GCC 8+
(3) From https://www.ibm.com/support/pages/vectorizing-fun-and-performance: "Each VSR is 128 bits. Thus, each VSR can hold 2 double-precision (64 bits each) or 4 single-precision (32 bits each) floating point quantities."
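As a minimal sketch of the kind of architecture guard these tips describe, assuming GCC on either platform (the file layout and names below are illustrative, taken neither from this repo nor from the IBM guide):

// guard_sketch.cpp: minimal sketch of an architecture guard for intrinsics.
// On x86 this path uses SSE2 intrinsics; on Power it uses VSX via altivec.h.
#include <cstdio>
#if defined(__x86_64__)
#include <emmintrin.h> // x86 SSE2 intrinsics
#elif defined(__PPC__) && defined(__VSX__)
#include <altivec.h>   // Power VSX/AltiVec intrinsics
#endif

int main()
{
#if defined(__x86_64__)
  __m128d v = _mm_set_pd( 2.0, 1.0 );  // two doubles in one 128-bit XMM register
  v = _mm_add_pd( v, v );
  double out[2];
  _mm_storeu_pd( out, v );
#elif defined(__PPC__) && defined(__VSX__)
  __vector double v = { 1.0, 2.0 };    // two doubles in one 128-bit VSR
  v = vec_add( v, v );
  double out[2] = { v[0], v[1] };
#else
  double out[2] = { 2.0, 4.0 };        // scalar fallback
#endif
  printf( "%f %f\n", out[0], out[1] );
  return 0;
}

Alternatively, if I understand the porting guide correctly, recent GCC on ppc64le also ships x86-compatibility headers that map many SSE intrinsics onto VSX when building with -DNO_WARN_X86_INTRINSICS.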
In summary, I imagine that it should be possible to vectorize the C++ code on Power9 using the 128-bit VSX registers, with a theoretical speedup of x2 for doubles and x4 for floats (a sketch of the corresponding vector types is below).
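Concretely, the "VECTOR[2] double / VECTOR[4] float" layout of one 128-bit VSR can be sketched with the GCC vector extensions as follows (the type names are illustrative, not the actual fptype_sv typedefs in the code):

// vsx_sketch.cpp: one 128-bit VSR as 2 doubles or 4 floats, via GCC vector
// extensions (build assumption: g++ -O3 -mcpu=power9 vsx_sketch.cpp).
#include <cstdio>

typedef double double_v __attribute__ ((vector_size (16))); // VECTOR[2] doubles
typedef float  float_v  __attribute__ ((vector_size (16))); // VECTOR[4] floats

int main()
{
  double_v d1 = { 1.0, 2.0 };
  double_v d2 = { 10.0, 20.0 };
  double_v dsum = d1 + d2;             // one SIMD add over both double lanes

  float_v f1 = { 1.f, 2.f, 3.f, 4.f };
  float_v fsum = f1 + f1;              // one SIMD add over all four float lanes

  printf( "doubles: %f %f\n", dsum[0], dsum[1] );
  printf( "floats : %f %f %f %f\n", fsum[0], fsum[1], fsum[2], fsum[3] );
  return 0;
}

The attraction of this approach is that the same d1 + d2 syntax should compile to SSE instructions on x86 and to VSX instructions on Power, with no explicit intrinsics needed in the ME code.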
I have committed a basic patch for vectorization on Power9/M100 in PR #228
By the way, the CUDA build options probably need to be reviewed too. I am building with gcc as the host compiler now, and I am not sure what "-Xcompiler" does (see the sketch below).
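If I understand correctly, nvcc's "-Xcompiler" simply forwards the option that follows it verbatim to the host compiler. A hypothetical check that a Power flag really reaches gcc (the file name and flags below are illustrative, not the actual Makefile settings):

// xcompiler_check.cu: trivial source to verify host-compiler flag forwarding.
// Hypothetical build line (illustrative flags, not the actual Makefile):
//   nvcc -arch=sm_70 -Xcompiler -mcpu=power9 xcompiler_check.cu -o check
// Each "-Xcompiler <opt>" should hand "<opt>" verbatim to the gcc host
// compiler, so host code is tuned for POWER9 while device code targets sm_70.
#include <cstdio>

int main()
{
#ifdef _ARCH_PWR9 // predefined by gcc when -mcpu=power9 reaches the host pass
  printf( "-mcpu=power9 reached the host compiler\n" );
#else
  printf( "-mcpu=power9 did NOT reach the host compiler\n" );
#endif
  return 0;
}

If the forwarding works as expected, the same -mcpu=power9 tuning used for the C++ build should also apply to the host side of the CUDA build.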
The following is the current baseline (for C++) from https://github.com/madgraph5/madgraph4gpu/commit/ad4f39f5c5bf7f7968c4f36f92c797364d3d73cf
On login01 [CPU: PowerNV 8335-GTG, POWER9, altivec supported] [GPU: 4x Tesla V100-SXM2-16GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 1.419728e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 5.771237 sec
24,251,588,159 cycles:u # 3.716 GHz (48.16%)
48,358,693 stalled-cycles-frontend:u # 0.20% frontend cycles idle (47.23%)
12,281,012,627 stalled-cycles-backend:u # 50.64% backend cycles idle (10.48%)
33,076,547,650 instructions:u # 1.36 insn per cycle
# 0.37 stalled cycles per insn (21.05%)
5.776777270 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv: 12) (^xs: 562) (^xx: 81) (^xv: 48)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': PPC VSX, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 2.168020e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.432267 sec
18,823,598,617 cycles:u # 3.643 GHz (46.70%)
63,975,332 stalled-cycles-frontend:u # 0.34% frontend cycles idle (45.86%)
9,742,740,551 stalled-cycles-backend:u # 51.76% backend cycles idle (10.79%)
20,224,117,135 instructions:u # 1.07 insn per cycle
# 0.48 stalled cycles per insn (21.63%)
4.436642027 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv: 76) (^xs: 19) (^xx: 432) (^xv: 1960)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 1.357362e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371787e-02 +- 3.269419e-06 ) GeV^0
TOTAL : 5.669633 sec
22,873,682,828 cycles:u # 3.773 GHz (46.30%)
28,696,480 stalled-cycles-frontend:u # 0.13% frontend cycles idle (48.05%)
11,724,563,802 stalled-cycles-backend:u # 51.26% backend cycles idle (11.38%)
31,800,642,662 instructions:u # 1.39 insn per cycle
# 0.37 stalled cycles per insn (22.77%)
5.673647465 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv: 17) (^xs: 633) (^xx: 59) (^xv: 65)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[4] ('sse4': PPC VSX, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 4.660189e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371787e-02 +- 3.269419e-06 ) GeV^0
TOTAL : 2.388031 sec
10,371,827,387 cycles:u # 3.746 GHz (41.91%)
38,874,146 stalled-cycles-frontend:u # 0.37% frontend cycles idle (45.04%)
5,190,512,607 stalled-cycles-backend:u # 50.04% backend cycles idle (12.84%)
12,224,037,376 instructions:u # 1.18 insn per cycle
# 0.42 stalled cycles per insn (25.84%)
2.391942889 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv: 11) (^xs: 207) (^xx: 470) (^xv: 2077)
=========================================================================
Note that vectorization (with 128-bit registers) gains a factor ~1.5 for doubles (2.168e6 / 1.420e6 ≈ 1.53) and ~3.4 for floats (4.660e6 / 1.357e6 ≈ 3.43), against a theoretical x2 and x4. The gain is a bit low for doubles. (In the "=Symbols" lines above, "xs" counts VSX scalar mnemonics and "xv" counts VSX vector mnemonics in the CPPProcess.o disassembly, so the jump from ^xv: 48 to ^xv: 1960 for doubles confirms that the vector build really uses SIMD instructions.)
I updated the title. Marconi100 is actually Power9 plus V100, not A100. This is visible in the logs I posted above, which report 4x Tesla V100-SXM2-16GB.
As discussed at the June 21 meeting: https://indico.cern.ch/event/1028452/
The M100 system documentation is here: https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2%3A+MARCONI100+UserGuide