madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package

Detailed comparisons of Nvidia A100 and V100 #156

Open valassi opened 3 years ago

valassi commented 3 years ago

Hi @ingvildh,

as discussed this morning, one thing that could be quite useful, for studying A100s, would be to start by comparing the performance to our V100 baseline.

I would suggest starting from the gg to ggtt code of epoch2 (issue #146, e.g. commit https://github.com/madgraph5/madgraph4gpu/commit/dd8711d0aa22c85429802b23de3169a68a97f298): run Nsight Compute to get a baseline on the V100, then rerun on the A100 and see if you can compare the various metrics. I have never used Nsight Compute to compare profiles from two different systems, but I assume it is possible.
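
A minimal sketch of such a workflow, assuming the ncu command-line profiler is available on both nodes; the executable name and arguments below (gcheck.exe -p 2048 256 1) are only placeholders for whatever configuration is actually profiled:

```sh
# On the V100 node: profile the run and save a full report
ncu --set full -o v100_baseline ./gcheck.exe -p 2048 256 1

# On the A100 node: repeat the identical run and save a second report
ncu --set full -o a100_run ./gcheck.exe -p 2048 256 1

# First look from the command line: per-kernel summaries of each report
ncu --import v100_baseline.ncu-rep --print-summary per-kernel
ncu --import a100_run.ncu-rep --print-summary per-kernel
```

For a metric-by-metric comparison, both .ncu-rep files can then be opened in the Nsight Compute GUI and one of them added as a baseline, so that the A100 numbers are displayed as deltas against the V100 ones.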

Later on, one idea may be to look at tensor cores (#118), maybe for the color algebra (#155).

Thanks! Andrea

valassi commented 2 years ago

This can be useful for tensor cores #118 and specifically for color algebra on tensor cores #155. But we need to find the A100 cards first...

valassi commented 2 years ago

Thanks to Stefan we now have an A100: see PR #381.

valassi commented 2 years ago

Without yet looking at tensor cores, the face-value performance of the same implementation on V100 and A100 is compared in the Juwels Booster tests in PR #381. See https://github.com/madgraph5/madgraph4gpu/blob/a69d7f9ea37dd6445cd375e6b29a33f6a884e681/epochX/cudacpp/tput/summaryTable_juwels.txt#L50

*** FPTYPE=d ******************************************************************

+++ REVISION c2e67b4 +++
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:

[nvcc 11.6.55 (gcc 10.2.0)] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    1.35e+09    1.41e+08    1.45e+07    5.20e+05    1.18e+04    

+++ REVISION df441ad +++
On jwb0085.juwels [CPU: AMD EPYC 7402 24-Core Processor] [GPU: 4x NVIDIA A100-SXM4-40GB]:

[nvcc 11.5.50 (gcc 11.2.0)] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    1.57e+09    1.69e+08    2.37e+07    9.45e+05    2.04e+04    

*** FPTYPE=f ******************************************************************

+++ REVISION c2e67b4 +++
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:

[nvcc 11.6.55 (gcc 10.2.0)] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    3.26e+09    3.79e+08    4.75e+07    9.71e+05    2.66e+04    

+++ REVISION df441ad +++
On jwb0085.juwels [CPU: AMD EPYC 7402 24-Core Processor] [GPU: 4x NVIDIA A100-SXM4-40GB]:

[nvcc 11.5.50 (gcc 11.2.0)] 
HELINL=0 HRDCOD=0
            eemumu      ggtt        ggttg       ggttgg      ggttggg     
CUD/none    3.80e+09    4.78e+08    5.73e+07    1.80e+06    3.74e+04    

The throughput increase ranges from roughly 10-20% for the simplest processes to almost a factor of 2 for the more complex ones (a quick check of the ratios is sketched below). Eventually one could try to understand this better with Nsight Compute profiling.
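
As a rough cross-check of that statement, the A100/V100 ratios can be computed directly from the two double-precision rows of the tables above (a throwaway awk one-liner, with the throughput values copied in by hand):

```sh
awk 'BEGIN {
  n = split("eemumu ggtt ggttg ggttgg ggttggg", proc, " ");
  split("1.35e9 1.41e8 1.45e7 5.20e5 1.18e4", v100, " ");  # V100S row, FPTYPE=d
  split("1.57e9 1.69e8 2.37e7 9.45e5 2.04e4", a100, " ");  # A100 row,  FPTYPE=d
  for (i = 1; i <= n; i++)
    printf "%-8s A100/V100 = %.2f\n", proc[i], a100[i] / v100[i];
}'
```

This gives about 1.16-1.20 for eemumu and ggtt, and 1.6-1.8 for ggttg, ggttgg and ggttggg, consistent with the summary above; the single-precision rows can be checked the same way.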