Build and test on Marconi M100 (Power9 + V100)

valassi commented 3 years ago

As discussed at the June 21 meeting https://indico.cern.ch/event/1028452/

The M100 system documentation is here: https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.2%3A+MARCONI100+UserGuide

valassi commented 3 years ago

First patch for Marconi M100 (Power9+A100) is PR #224

Disable vectorization in C++, need to understand better the hardware and the compiler
Port throughput12.sh script (different /proc/cpuinfo, different nvidia-smi, different ncu...)

valassi commented 3 years ago

(1) From http://cdn.openpowerfoundation.org/wp-content/uploads/resources/Intrinsics-Reference_final/Intrinsics-Reference-20200811.pdf: "IBM extended VMX by introducing the Vector-Scalar Extension (VSX) for the POWER7 family of processors. VSX adds sixty-four 128-bit vector-scalar registers (VSRs); however, to optimize the amount of per-process register state, the registers overlap with the VRs and the scalar floatingpoint registers (FPRs) (see Section 1.2, “The Unified Vector Register Set” [2]). The VSRs can represent all the data types representable by the VRs, and can also be treated as containing two 64-bit integers or two 64-bit double-precision floating-point values. However, ISA support for two 64-bit integers in VSRs was limited until Version 2.07 (POWER8) of the Power ISA, and only the VRs are supported for these instructions."

(2) From https://developer.ibm.com/technologies/linux/tutorials/migrate-app-on-lop/ "Most applications can be compiled and run on Power Systems without the need to modify the source codes. Architecture-specific options can be applied to improve the performance, such as

compiler flags: -O3 -flto -fpeel-loops -funroll-loops -ftree-vectorize -ffast-math -mcpu=power9 -mtune=power9
SIMD code: -DNO_WARN_X86_INTRINSICS to map MMX/SSE to VSX (only GCC 8+ or at11.0)"

"Some applications include x86-specific (such as #ifdef SSE, #ifdef __x86_64) defines to enable performance on the x86 system. In this case, manually adding -DSSE or -DPPC__ is necessary to map or replace x86 functions to Power-specific functions."

"IBM has published tips (https://www.ibm.com/support/pages/vectorizing-fun-and-performance) and a guide to help porting code containing MMX/SSE/AVX to VSX (https://openpowerfoundation.org/?resource_lib=linux-power-porting-guide-vector-intrinsics)"

"IBM Advance Toolchain compiler or GCC 8+

Remove -msse or -msse4.1 and other AVX compiler options.
Add -DNO_WARN_X86_INTRINSICS to compiler flags.
Define default CPP in the code: x86_64 or amd64. Add/ replace with PPC or PPC64 or -Dx86_64 to CPPFLAGS.
Search for SSE,SSE4.1,… AVX in source code and define it in CPPFLAGS.
Check if config.guess includes ppc64le to configure files. If not, run autoreconf -if to reconfigure the code.
By default, char type on POWER is unsigned, whereas, it is signed on x86. Always include -fsigned-char when compiling.
Add POWER processor-specific options: -mcpu=power9, -mtune=power9 (if no SSE conversion)"

(3) From https://www.ibm.com/support/pages/vectorizing-fun-and-performance "Each VSR is 128 bits. Thus, each VSR can hold 2 double-precision (64 bits each) or 4 single-precision (32 bits each) floating point quantities."

In summary, I imagine that it should be possible to

get a performance similar to SSE42, ie a factor 2 for double precision
should manually add -D__SSE4_2__
maybe should manually add -DNO_WARN_X86_INTRINSICS?
maybe should manually add some or all of -flto -fpeel-loops -funroll-loops -ftree-vectorize -ffast-math -mcpu=power9 -mtune=power9?

valassi commented 3 years ago

I have committed a basic patch for vectorization on Power9/M100 in PR #228

essentially I just added -D__SSE4_2__
I also added -mcpu=power9 -mtune=power9 which seem to gain a few 2-3%
the option -flto can give HUGE improvements (a factor 4 without vectorization??) but needs further analysis, I will open an issue for that
all other build options do not really seem to give much
I also made some basic fixes of the objdump parser (but it does not work with -flto)

By the way, the CUDA build options probably needs to be reviewed too. I am building with gcc as host compiler now, not sure what "-Xcompiler" does.

The following is the current baseline (for C++) from https://github.com/madgraph5/madgraph4gpu/commit/ad4f39f5c5bf7f7968c4f36f92c797364d3d73cf

On login01 [CPU: PowerNV 8335-GTG, POWER9, altivec supported] [GPU: 4x Tesla V100-SXM2-16GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 1.419728e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     5.771237 sec
    24,251,588,159      cycles:u                  #    3.716 GHz                      (48.16%)
        48,358,693      stalled-cycles-frontend:u #    0.20% frontend cycles idle     (47.23%)
    12,281,012,627      stalled-cycles-backend:u  #   50.64% backend cycles idle      (10.48%)
    33,076,547,650      instructions:u            #    1.36  insn per cycle
                                                  #    0.37  stalled cycles per insn  (21.05%)
       5.776777270 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv:   12) (^xs:  562) (^xx:   81) (^xv:   48)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': PPC VSX, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 2.168020e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.432267 sec
    18,823,598,617      cycles:u                  #    3.643 GHz                      (46.70%)
        63,975,332      stalled-cycles-frontend:u #    0.34% frontend cycles idle     (45.86%)
     9,742,740,551      stalled-cycles-backend:u  #   51.76% backend cycles idle      (10.79%)
    20,224,117,135      instructions:u            #    1.07  insn per cycle
                                                  #    0.48  stalled cycles per insn  (21.63%)
       4.436642027 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv:   76) (^xs:   19) (^xx:  432) (^xv: 1960)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 1.357362e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371787e-02 +- 3.269419e-06 )  GeV^0
TOTAL       :     5.669633 sec
    22,873,682,828      cycles:u                  #    3.773 GHz                      (46.30%)
        28,696,480      stalled-cycles-frontend:u #    0.13% frontend cycles idle     (48.05%)
    11,724,563,802      stalled-cycles-backend:u  #   51.26% backend cycles idle      (11.38%)
    31,800,642,662      instructions:u            #    1.39  insn per cycle
                                                  #    0.37  stalled cycles per insn  (22.77%)
       5.673647465 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv:   17) (^xs:  633) (^xx:   59) (^xv:   65)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 8.3.1]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': PPC VSX, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 128
EvtsPerSec[MECalcOnly] (3a) = ( 4.660189e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371787e-02 +- 3.269419e-06 )  GeV^0
TOTAL       :     2.388031 sec
    10,371,827,387      cycles:u                  #    3.746 GHz                      (41.91%)
        38,874,146      stalled-cycles-frontend:u #    0.37% frontend cycles idle     (45.04%)
     5,190,512,607      stalled-cycles-backend:u  #   50.04% backend cycles idle      (12.84%)
    12,224,037,376      instructions:u            #    1.18  insn per cycle
                                                  #    0.42  stalled cycles per insn  (25.84%)
       2.391942889 seconds time elapsed
=Symbols (vs) in CPPProcess.o= (^mtv:   11) (^xs:  207) (^xx:  470) (^xv: 2077)
=========================================================================

Note that vectorization (with 128bit registers) gets a factor ~1.5 for doubles and ~3.5 for floats against some theoretical x2 and x4. The gain is a bit low for doubles.

valassi commented 3 years ago

I updated the title. Marconi100 is actually Power9 plus V100, not A100. This is visible in the logs I posted above,

madgraph5 / madgraph4gpu

Build and test on Marconi M100 (Power9 + V100) #223