Open valassi opened 3 years ago
This can be useful for tensor cores #118 and specifically for color algebra on tensor cores #155. But we need to find the A100 cards first...
Thanks to Stefan we now have A100s, see PR #381.
Without yet looking at tensor cores, the face-value performance of the same implementation on V100 and A100 is compared in the Juwels Booster tests in PR #381. See https://github.com/madgraph5/madgraph4gpu/blob/a69d7f9ea37dd6445cd375e6b29a33f6a884e681/epochX/cudacpp/tput/summaryTable_juwels.txt#L50
*** FPTYPE=d ******************************************************************
+++ REVISION c2e67b4 +++
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
[nvcc 11.6.55 (gcc 10.2.0)]
HELINL=0 HRDCOD=0
           eemumu    ggtt      ggttg     ggttgg    ggttggg
CUD/none   1.35e+09  1.41e+08  1.45e+07  5.20e+05  1.18e+04
+++ REVISION df441ad +++
On jwb0085.juwels [CPU: AMD EPYC 7402 24-Core Processor] [GPU: 4x NVIDIA A100-SXM4-40GB]:
[nvcc 11.5.50 (gcc 11.2.0)]
HELINL=0 HRDCOD=0
           eemumu    ggtt      ggttg     ggttgg    ggttggg
CUD/none   1.57e+09  1.69e+08  2.37e+07  9.45e+05  2.04e+04
*** FPTYPE=f ******************************************************************
+++ REVISION c2e67b4 +++
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
[nvcc 11.6.55 (gcc 10.2.0)]
HELINL=0 HRDCOD=0
           eemumu    ggtt      ggttg     ggttgg    ggttggg
CUD/none   3.26e+09  3.79e+08  4.75e+07  9.71e+05  2.66e+04
+++ REVISION df441ad +++
On jwb0085.juwels [CPU: AMD EPYC 7402 24-Core Processor] [GPU: 4x NVIDIA A100-SXM4-40GB]:
[nvcc 11.5.50 (gcc 11.2.0)]
HELINL=0 HRDCOD=0
           eemumu    ggtt      ggttg     ggttgg    ggttggg
CUD/none   3.80e+09  4.78e+08  5.73e+07  1.80e+06  3.74e+04
The throughput increase seems to range from roughly 15-25% for the simplest processes (eemumu, ggtt) to almost a factor 2 (about 1.8x for ggttgg) for the more complex processes. Eventually one could try to understand it better with nsight profiling.
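For reference, the A100/V100 ratios can be recomputed from the numbers quoted above with a few lines of Python. This is only a minimal sketch: the values are copied by hand from the `CUD/none` rows of the table, and the names `PROCESSES`, `THROUGHPUT` and `speedups` are made up for this illustration, they are not part of the repository. Note also that the two sets of measurements differ in code revision, nvcc version and host CPU, so the ratios are indicative rather than a clean GPU-only comparison.

```python
# Throughputs copied by hand from the summary table quoted above
# (row "CUD/none", HELINL=0 HRDCOD=0). All names here are illustrative only.
PROCESSES = ["eemumu", "ggtt", "ggttg", "ggttgg", "ggttggg"]

THROUGHPUT = {
    "d": {  # FPTYPE=d (double precision)
        "V100": [1.35e09, 1.41e08, 1.45e07, 5.20e05, 1.18e04],
        "A100": [1.57e09, 1.69e08, 2.37e07, 9.45e05, 2.04e04],
    },
    "f": {  # FPTYPE=f (single precision)
        "V100": [3.26e09, 3.79e08, 4.75e07, 9.71e05, 2.66e04],
        "A100": [3.80e09, 4.78e08, 5.73e07, 1.80e06, 3.74e04],
    },
}


def speedups(fptype):
    """Return {process: A100/V100 throughput ratio} for one FPTYPE."""
    v100 = THROUGHPUT[fptype]["V100"]
    a100 = THROUGHPUT[fptype]["A100"]
    return {p: a / v for p, v, a in zip(PROCESSES, v100, a100)}


if __name__ == "__main__":
    for fptype in ("d", "f"):
        print(f"FPTYPE={fptype}")
        for proc, ratio in speedups(fptype).items():
            print(f"  {proc:8s} A100/V100 = {ratio:.2f}")
```

With these inputs every ratio is above 1.1 and below 2, and the largest gain in both precisions is for ggttgg.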
Hi @ingvildh,
as discussed this morning, one thing that could be quite useful for studying A100s would be to start by comparing their performance to our V100 baseline.
I would suggest starting with the gg to ggtt code of epoch2 (issue #146, e.g. commit https://github.com/madgraph5/madgraph4gpu/commit/dd8711d0aa22c85429802b23de3169a68a97f298): run nsight compute to get a baseline for V100, then rerun on A100 and see if you can compare the various metrics. I have never used nsight compute to compare profiles from two different systems, but I assume it is possible.
Later on, one idea may be to look at tensor cores (#118), maybe for color algebra (#155).
Thanks! Andrea