Build and test using ARM vector extensions

valassi commented 3 years ago

This was discussed during the June 28 meeting https://indico.cern.ch/event/1053713/ This is mainly about ARMv9 Scalable Vector Extension 2 (SVE2)... of course we need to find a test system first.

What would this give on the AVX512 issue #173 ?

In theory, vector widths up to 2048 bits could be supported? See for instance https://levelup.gitconnected.com/armv9-what-is-the-big-deal-4528f20f78f3

valassi commented 3 years ago

A few interesting links about this issue

https://arxiv.org/abs/1803.06185 The ARM Scalable Vector Extension (2018)
https://levelup.gitconnected.com/armv9-what-is-the-big-deal-4528f20f78f3 (2021)
https://developer.arm.com/documentation/100891/0612/sve-overview/introducing-sve

roiser commented 3 years ago

That's interesting, I have some connections to UK ARM company and HPC centers. Let's discuss next time.

valassi commented 2 years ago

Hi @roiser I dump here a few comments, which wer etriggered by your PR #421 but go beyond that. (This even goes beyond ARM so maybe I should open a new issue, but it is easier to discuss it here).

Until now, before your PR #421 for mac M1 ARMv8, we had only two types of SIMD:

the full spectrum of Intel, where I initially developed this: none, sse4 (128bit), avx2 (256bit), 512y (avx512 limited to 256bt ymm registers), 512z (avx512 with 512bit zmm registers)
Power9 from CINECA (PR #223), which only has two options: none, or 128bit VSX

The important points here in what I designed are two

The SIMD level must be strictly configurable: I want to be able to test the same platform with no simd and compare it to several levels of SIMD. This means I want to be able to have a 'none' option.
SIMD is implemented via compiler vector extensions that are fully controlled in the code. Depending on some ifdefs, the length neppV of the fptype_v vector is set. There is no autovectorization in the matrix element calculation (but there may be some elsewhere, for the moment...).

The patch I added for Power9 is quite ugly, but that's what it is. Since Power9 has no SSE4, but the logic of the code internally depends on -D__SSE4_2__, in the Makefile I am simply adding this flag for Power9 if AVX=sse4 is set https://github.com/madgraph5/madgraph4gpu/blob/898e8b4c98356c5e31b8bd4755b9f8cdd8d98865/epochX/cudacpp/ee_mumu/SubProcesses/Makefile#L216

Your patch in PR #421 is using a different (probably better) logic: instead of relying on the SSE4 defines, the code internally looks for the __ARM_NEON__ defines, which are proper to ARM v8 (I guess on any ARM, not only on Mac M1), https://github.com/madgraph5/madgraph4gpu/blob/02c26af4f0554e05c67eab2fa78169eac62ddf04/epochX/cudacpp/gg_tt/src/mgOnGpuConfig.h#L146

The problem I have with this are two

One, I would prefer to have a consistent approach: Power9 VSX and ARM NEON are exactly in the same situation, they only have 128 bit registers, and now they are being handled differently. I prefer moving the ARM version to the same approach as Power9 (to be eventually modified as described below, using a MG specific ifdef for "128" SIMD).
Two, the patch as it is has no way of defining a "none" option. The ARM_NEON define is always there, because there is no modifier applied to the CXXFLAGS to distinguish between no-simd and 128-simd.

As briefly mentioned above, I think that eventually it would probably be better to stop relying on the Intel specific defines, and introduce MG-specific defines for 'none', 'sse4', 'avx2', '512y' and '512z' (and even these names could be changes as they are the Intel namess, but I have a huge amount of scripts and logs with them, and it may be easier to stay with them). Note by the way that there is already a MG-specific define: using only gcc native flags and defines, it is impossible to distinguish between 512y and 512z, because actually the difference is in the MG code (use 4 or 8 doubles per fptype_v), and this is handled with the MG-specific -DMGONGPU_PVW512.

About the fact that ARM_NEON is currently always defined, I have done some investigations, eg using

echo | c++ -E -dM -
echo | c++ -v -E -

which are the default settings. I think that one option would be using these two sets of flags, if we wanted to disable ARM_NEON by modifying CXXFLAGS:

echo | c++ -E -dM - -march=armv8.3a
echo | c++ -v -E - -march=armv8.3a
echo | c++ -E -dM - -march=armv8.3a+nosimd
echo | c++ -v -E - -march=armv8.3a+nosimd

These are actually documented under gcc flags, even if c++ is clang here (not armclang): https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html This may be useful later for ARMv9 (which was the original point of this #221).

About yet another realted issue, which is discovering in the Makefile which is the 'best' SIMD level one can support, this was done with /proc/cpuinfo on (Intel) Linux. On Mac (for which I only have ARM), this is is with sysctl, for instance

> sysctl -a | grep -i armv
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1

> sysctl -a | grep -i neon
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1

I might add a check based on this, too.

Summary for now, about PR #421: I will modify that to use the same approach as in Power9, ie using SSE4 defines. This makes it unnecessary to change the -march and -mcpu compiler flags, but will allow a distinction between "none" and "sse4" (ie neon here) with -D__SSE4_2__.

One important point: I will test on Mac only from the command line with make: no cmake, no xcode. I hope this does not break things...

As for later on: I guess one option would be to change the code based on equivalents of the "prefer vector width" define DMGONGPU_PVW512. For instance:

SCALAR for none
PVW128 for sse4/Intel, vsx/Power9, neon/armv8
PVW256 (with AVX512 not defined) for avx2/intel
PVW256 (with AVX512 defined) for 512y/intel
PVW512 for 512z/Intel This approach should also result in build errors for unsupported configurations (I guess 8 doubles per vector will fail compilation unless AVX512 is on?).

Long post, feedback more than welcome! Andrea

valassi commented 2 years ago

The new implementation is in WIP PR #425 (which also includes many other Mac patches, not just the handling of NEON SIMD on ARM)

valassi commented 2 years ago

This is a summary table from the tests of all five processes double/float on Mac M1 ARMv8 with and without NEON SIMD in PR #425. One nicely sees factors close to 2x and 4x for double and float https://github.com/madgraph5/madgraph4gpu/blob/b5ef53ca10a77bece14c0814e5d5ef0275dee53e/epochX/cudacpp/tput/summaryTable_macm1.txt

*** FPTYPE=d ******************************************************************

+++ cudacpp REVISION ea28661 +++
On mac-1T0-438.local [CPU: Apple M1] [GPU: none]:

[Apple clang 12.0.5] 
HELINL=0 HRDCOD=0
                eemumu          ggtt         ggttg        ggttgg       ggttggg
         [2048/256/12]  [2048/256/1]    [64/256/1]    [64/256/1]     [1/256/1]
CPP/none      3.51e+06      5.77e+05      7.02e+04      5.43e+03      2.24e+02
CPP/sse4      7.55e+06      9.79e+05      1.39e+05      1.11e+04      3.65e+02

*** FPTYPE=f ******************************************************************

+++ cudacpp REVISION ea28661 +++
On mac-1T0-438.local [CPU: Apple M1] [GPU: none]:

[Apple clang 12.0.5] 
HELINL=0 HRDCOD=0
                eemumu          ggtt         ggttg        ggttgg       ggttggg
         [2048/256/12]  [2048/256/1]    [64/256/1]    [64/256/1]     [1/256/1]
CPP/none      3.57e+06      5.76e+05      7.10e+04      5.59e+03      2.61e+02
CPP/sse4      1.50e+07      1.85e+06      2.48e+05      2.16e+04      7.21e+02

roiser commented 2 years ago

Hi @valassi thanks for adding the "none" option, that's certainly useful.

For what concerns multiple architectures, I agree its a bit ugly to use e.g. SSE4_2 for Power, so going to generic names would probably make sense.

valassi commented 2 years ago

Hi @roiser, thanks a lot, good that we agree :-) I have opened (non urgent) issue #426 about this as a reminder.

madgraph5 / madgraph4gpu

Build and test using ARM vector extensions #221