madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Build and test on AMD EPYC CPUs #239

Open valassi opened 3 years ago

valassi commented 3 years ago

It would be useful to build and test on AMD x86 CPUs and not only Intel.

Thanks to @lfield and his colleagues I have had access to an AMD EPYC at CERN. The results are in PR #238.

Note that this node does not support AVX512. It is an AMD EPYC 7302, which is apparently a Zen2, see https://en.wikichip.org/wiki/amd/epyc/7302

Instead AVX512 will be supported in 2021 by Zen4 https://www.techpowerup.com/279129/amd-zen-4-microarchitecture-to-support-avx-512

Anyway, the results on this older Zen2 are already quite interesting. Thanks again Laurence!
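For reference, one way to check which SIMD levels a Linux node advertises is to look at the `flags` line of `/proc/cpuinfo`. The snippet below is only a minimal sketch under that assumption: the function name `simd_flags` and the sample text are illustrative, and this is not how the madgraph4gpu build itself selects a SIMD mode.

```python
def simd_flags(cpuinfo_text):
    """Report which SIMD flags appear in the first 'flags' line of /proc/cpuinfo text."""
    wanted = ("sse4_2", "avx2", "avx512f")
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            present = set(value.split())
            return {name: name in present for name in wanted}
    # No 'flags' line found (e.g. non-x86 or non-Linux input)
    return {name: False for name in wanted}


# Illustrative, shortened flags line (NOT copied from the actual EPYC node)
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 sse4_1 sse4_2 avx avx2 fma\n"
print(simd_flags(sample))
```

On a real node the same function could be fed the output of `open("/proc/cpuinfo").read()`.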

valassi commented 2 years ago

Note: Juwels Booster also has Zen2 with only AVX2 (actually a 7402 rather than a 7302, but still Zen2), see PR #381

https://en.wikichip.org/wiki/amd/epyc/7402

valassi commented 2 years ago

A summary of results at Juwels Booster is here (thanks to @roiser for getting access!) https://github.com/madgraph5/madgraph4gpu/blob/a69d7f9ea37dd6445cd375e6b29a33f6a884e681/epochX/cudacpp/tput/summaryTable_juwels.txt#L50

*** FPTYPE=d ******************************************************************
+++ REVISION df441ad +++
On jwb0085.juwels [CPU: AMD EPYC 7402 24-Core Processor] [GPU: 4x NVIDIA A100-SXM4-40GB]:

[nvcc 11.5.50 (gcc 11.2.0)] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    1.57e+09    1.69e+08    2.37e+07    9.45e+05    2.04e+04    
CPP/none    2.26e+06    2.63e+05    2.89e+04    2.17e+03    9.11e+01    
CPP/sse4    4.34e+06    3.94e+05    5.41e+04    4.15e+03    1.72e+02    
CPP/avx2    8.58e+06    8.69e+05    1.26e+05    9.92e+03    3.68e+02    

*** FPTYPE=f ******************************************************************
+++ REVISION df441ad +++
On jwb0085.juwels [CPU: AMD EPYC 7402 24-Core Processor] [GPU: 4x NVIDIA A100-SXM4-40GB]:

[nvcc 11.5.50 (gcc 11.2.0)] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    3.80e+09    4.78e+08    5.73e+07    1.80e+06    3.74e+04    
CPP/none    2.36e+06    2.74e+05    3.09e+04    2.30e+03    9.95e+01    
CPP/sse4    8.72e+06    6.15e+05    1.08e+05    9.14e+03    3.84e+02    
CPP/avx2    1.74e+07    1.50e+06    2.50e+05    1.98e+04    7.36e+02    

There are nice speedups from AVX2 over the no-SIMD build: roughly x4 in double precision and up to about x8 in single precision (there is no AVX512).
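The speedup factors can be read off the two tables above as the ratio of each SIMD row to the corresponding CPP/none row. A minimal sketch of that arithmetic follows; the numbers are copied from the Juwels tables, while the dictionary layout and the `speedups` helper are just illustrative names.

```python
processes = ["eemumu", "ggtt", "ggttg", "ggttgg", "ggttggg"]

# C++ throughput values copied from the Juwels tables above (revision df441ad)
throughput = {
    "d": {  # FPTYPE=d (double precision)
        "none": [2.26e06, 2.63e05, 2.89e04, 2.17e03, 9.11e01],
        "sse4": [4.34e06, 3.94e05, 5.41e04, 4.15e03, 1.72e02],
        "avx2": [8.58e06, 8.69e05, 1.26e05, 9.92e03, 3.68e02],
    },
    "f": {  # FPTYPE=f (single precision)
        "none": [2.36e06, 2.74e05, 3.09e04, 2.30e03, 9.95e01],
        "sse4": [8.72e06, 6.15e05, 1.08e05, 9.14e03, 3.84e02],
        "avx2": [1.74e07, 1.50e06, 2.50e05, 1.98e04, 7.36e02],
    },
}


def speedups(table, simd):
    """Ratio of the given SIMD row to the no-SIMD row, one value per process."""
    return [v / n for v, n in zip(table[simd], table["none"])]


for fptype, table in throughput.items():
    for simd in ("sse4", "avx2"):
        ratios = ", ".join(
            f"{p}={r:.2f}" for p, r in zip(processes, speedups(table, simd))
        )
        print(f"FPTYPE={fptype} {simd}/none: {ratios}")
```

For ggttggg, for instance, this gives an AVX2 speedup of about 4.0 in double precision and about 7.4 in single precision.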

Note that the non-vectorized C++ performance in this single-threaded test is halfway between an Intel Silver and an Intel Gold, for instance for ggttggg CPP/none double: