madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Build and test using the Intel compiler #220

Closed · valassi closed this 1 year ago

valassi commented 3 years ago

This was discussed during the June 28 meeting https://indico.cern.ch/event/1053713/

Would this help with the AVX512 issue #173 ?

valassi commented 3 years ago

Note that icc 19.1 should be supported already with cuda 11.0 https://docs.nvidia.com/cuda/archive/11.0/cuda-installation-guide-linux/index.html

valassi commented 3 years ago

Vector extensions cause some issues on icc; this needs some debugging. The compiler error makes no sense: the two sides of the initialization should both be double or both be fptype_v...?

ccache icpc  -O3 -std=c++17 -I.  -Wall -Wshadow -Wextra -fopenmp  -ffast-math  -march=nehalem  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_COMMONRAND_ONHOST  -c rambo.cc -o rambo.o
In file included from rambo.h(6),
                 from rambo.cc(1):
mgOnGpuVectors.h(63): error: a value of type "mgOnGpu::fptype_v" cannot be used to initialize an entity of type "double"
      cxtype_v( const fptype_v& r, const fptype_v& i ) : m_real{r}, m_imag{i} {}
                                                                ^

Note that in principle simple vector-extension code should work on icc: https://stackoverflow.com/a/43801280
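
For reference, here is a minimal self-contained reduction of the rejected pattern (a hypothetical reduction, not the actual mgOnGpuVectors.h): fptype_v is a gcc/clang vector extension type, and both gcc and clang accept brace-initializing a vector member from another vector, while icc 19.1 bails out with the bogus type error above.

typedef double fptype;
typedef fptype fptype_v __attribute__ ((vector_size (32))); // 4 doubles

class cxtype_v // complex number built from two SIMD vectors
{
public:
  cxtype_v( const fptype_v& r, const fptype_v& i ) : m_real{ r }, m_imag{ i } {}
private:
  fptype_v m_real; // the 4 real parts
  fptype_v m_imag; // the 4 imaginary parts
};

int main()
{
  fptype_v re = { 1., 2., 3., 4. };
  fptype_v im = { 0., 0., 0., 0. };
  cxtype_v c( re, im ); // gcc/clang: fine; icc 19.1: bogus error as above
  (void)c; // silence unused-variable warnings
  return 0;
}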

valassi commented 3 years ago

Brief status:

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [icc 1910]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.209906e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     6.724400 sec
    19,820,008,713      cycles                    #    2.669 GHz                    
    47,411,758,319      instructions              #    2.39  insn per cycle         
       6.753384559 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 2572) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icc 1910]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.310986e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371781e-02 +- 3.268987e-06 )  GeV^0
TOTAL       :     6.221624 sec
    17,585,205,984      cycles                    #    2.669 GHz                    
    44,086,482,510      instructions              #    2.51  insn per cycle         
       6.260991680 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 2052) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================
valassi commented 3 years ago

Note that PR #225 does not use vectorization yet.

Maybe use the OpenCL vector types? https://software.intel.com/content/www/us/en/develop/documentation/iocl-opg/top/coding-for-the-intel-cpu-opencl-device/using-vector-data-types.html

PS: No, drop that idea... it seems too complex... better to stick to gcc and clang.

valassi commented 3 years ago

I now understand that we should use icx (based on clang) rather than icc in the future. I am testing this and it looks much better (e.g. the clang vector extensions work out of the box!); I will give an update soon. https://software.intel.com/content/www/us/en/develop/articles/porting-guide-for-icc-users-to-dpcpp-or-icx.html
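
As a standalone check (hypothetical test.cc, not from the repo), this compiles and vectorizes out of the box with clang and with clang-based icx, e.g. via icx -O3 -std=c++17 -march=haswell test.cc:

#include <cstdio>

typedef double fptype_v __attribute__ ((vector_size (32))); // 4 doubles, one ymm register

int main()
{
  fptype_v a = { 1., 2., 3., 4. };
  fptype_v b = { 10., 20., 30., 40. };
  fptype_v c = a * b + a; // element-wise SIMD arithmetic, no intrinsics needed
  for ( int i = 0; i < 4; i++ ) printf( "c[%d] = %f\n", i, c[i] );
  return 0;
}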

valassi commented 3 years ago

I have added some patches in PR #225 (previously for icc, now for icx!).

SIMD works out of the box with clang vector extensions.

Performance is quite good. Note that the 512y build uses AVX512 instructions (on top of AVX2) in a mix that is very different from the one produced by gcc. Note also that 512z with zmm registers is still slower than avx2.
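
For reference, this is how the five builds map onto double-precision vector widths (the typedef names are illustrative, not the repo's exact ones; the -march=nehalem flag for 'sse4' is confirmed by the icc log above, while the skylake-avx512 flags for 512y/512z are my assumption):

typedef double fptype;
typedef fptype fptype_v2 __attribute__ ((vector_size (16))); // 'sse4': VECTOR[2], xmm registers
typedef fptype fptype_v4 __attribute__ ((vector_size (32))); // 'avx2' and '512y': VECTOR[4], ymm registers
typedef fptype fptype_v8 __attribute__ ((vector_size (64))); // '512z': VECTOR[8], zmm registers

The '512y' build keeps 256-bit ymm vectors but lets the compiler emit AVX512VL instructions (e.g. via -march=skylake-avx512 -mprefer-vector-width=256), which is why its symbol counts below show mostly avx2 entries plus a handful of 512y ones.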

valassi commented 3 years ago

Performance as in https://github.com/madgraph5/madgraph4gpu/commit/5291e90056a14820c145b88aa61994759c31719b

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.279997e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     6.202581 sec
    18,381,246,706      cycles                    #    2.660 GHz
    47,024,518,007      instructions              #    2.56  insn per cycle
       6.230900555 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1256) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 2.703703e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.640068 sec
    11,461,025,252      cycles                    #    2.651 GHz
    27,399,149,934      instructions              #    2.39  insn per cycle
       3.668955160 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3575) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 5.352981e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.526700 sec
     7,959,938,892      cycles                    #    2.500 GHz
    14,869,683,474      instructions              #    1.87  insn per cycle
       2.555407655 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2983) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 5.424606e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.448236 sec
     7,902,434,434      cycles                    #    2.501 GHz
    13,833,834,610      instructions              #    1.75  insn per cycle
       2.476706923 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2662) (512y:   23) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 3.765503e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.961650 sec
     7,888,560,957      cycles                    #    2.151 GHz
    12,777,170,156      instructions              #    1.62  insn per cycle
       2.990664088 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3414) (512y:   16) (512z: 1195)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.346366e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268978e-06 )  GeV^0
TOTAL       :     5.821107 sec
    16,426,527,461      cycles                    #    2.665 GHz
    45,368,664,996      instructions              #    2.76  insn per cycle
       5.841815941 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1731) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 5.233977e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268977e-06 )  GeV^0
TOTAL       :     2.436466 sec
     7,319,517,943      cycles                    #    2.645 GHz
    16,533,218,574      instructions              #    2.26  insn per cycle
       2.457423776 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 4248) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.088952e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     1.750793 sec
     5,333,417,013      cycles                    #    2.529 GHz
    10,175,903,936      instructions              #    1.91  insn per cycle
       1.771695841 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3703) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 1.096628e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     1.747508 sec
     5,341,359,998      cycles                    #    2.530 GHz
     9,923,934,126      instructions              #    1.86  insn per cycle
       1.768237794 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3118) (512y:   38) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [icx 202110 (clang 12.0.0, gcc 9.2.0)]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
Random number generation    = COMMON RANDOM (C++ code)
EvtsPerSec[MECalcOnly] (3a) = ( 7.953015e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     1.985831 sec
     5,261,809,712      cycles                    #    2.265 GHz
     9,428,840,503      instructions              #    1.79  insn per cycle
       2.006741173 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3784) (512y:    0) (512z: 1865)
=========================================================================
valassi commented 3 years ago

Amongst the things to do:

valassi commented 2 years ago

Within PR #328 I have added a few patches for both icx2021 (based on clang13) and icx2022 (based on clang14). The former is supported by the latest cuda11.6, to which I moved in that same PR. The latter is not yet supported by any cuda release.
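
As a side note, one can check which clang an icx release wraps with a trivial program (hypothetical version.cc; __INTEL_LLVM_COMPILER is the macro that icx defines, and __clang_major__ etc. are available because icx is clang-based; this is roughly where the "icx ... (clang ...)" tags in the logs above come from):

#include <cstdio>

int main()
{
#ifdef __INTEL_LLVM_COMPILER
  printf( "icx %d\n", __INTEL_LLVM_COMPILER ); // release-specific integer
#endif
  printf( "clang %d.%d.%d\n", __clang_major__, __clang_minor__, __clang_patchlevel__ );
  return 0;
}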

The performance with icx2021 is quite interesting https://github.com/madgraph5/madgraph4gpu/pull/328/commits/6350a75040cb554b078683edb9c750e1352ff5c0

The results with icx2022 are very similar performance-wise https://github.com/madgraph5/madgraph4gpu/pull/328/commits/8d04e49ec95c6086c60f8b68958e975a1c4181fc

valassi commented 1 year ago

I have now added support for icx2023 in MR #593. This uses clang16 internally.

Its results are very similar to those of clang14, as seen in MR #591. In the past I had the impression that icx was slightly faster than clang, but this no longer seems to be the case.

Conversely, the results of clang itself do differ from those of gcc and are worth investigating; I will open an issue about that.

Note that bug #338 about icx2022 giving different physics results has been fixed (not sure in which change).

This can be closed as done.