madgraph5 / madgraph4gpu

GPU development for the Madgraph5_aMC@NLO event generator software package
30 stars 33 forks source link

Build and test using ARM vector extensions #221

Open valassi opened 3 years ago

valassi commented 3 years ago

This was discussed during the June 28 meeting https://indico.cern.ch/event/1053713/ This is mainly about ARMv9 Scalable Vector Extension 2 (SVE2)... of course we need to find a test system first.

What would this give on the AVX512 issue #173 ?

In theory, vector widths up to 2048 bits could be supported? See for instance https://levelup.gitconnected.com/armv9-what-is-the-big-deal-4528f20f78f3

valassi commented 3 years ago

A few interesting links about this issue

roiser commented 3 years ago

That's interesting, I have some connections to UK ARM company and HPC centers. Let's discuss next time.

valassi commented 2 years ago

Hi @roiser I dump here a few comments, which wer etriggered by your PR #421 but go beyond that. (This even goes beyond ARM so maybe I should open a new issue, but it is easier to discuss it here).

Until now, before your PR #421 for mac M1 ARMv8, we had only two types of SIMD:

The important points here in what I designed are two

The patch I added for Power9 is quite ugly, but that's what it is. Since Power9 has no SSE4, but the logic of the code internally depends on -D__SSE4_2__, in the Makefile I am simply adding this flag for Power9 if AVX=sse4 is set https://github.com/madgraph5/madgraph4gpu/blob/898e8b4c98356c5e31b8bd4755b9f8cdd8d98865/epochX/cudacpp/ee_mumu/SubProcesses/Makefile#L216

Your patch in PR #421 is using a different (probably better) logic: instead of relying on the SSE4 defines, the code internally looks for the __ARM_NEON__ defines, which are proper to ARM v8 (I guess on any ARM, not only on Mac M1), https://github.com/madgraph5/madgraph4gpu/blob/02c26af4f0554e05c67eab2fa78169eac62ddf04/epochX/cudacpp/gg_tt/src/mgOnGpuConfig.h#L146

The problem I have with this are two

As briefly mentioned above, I think that eventually it would probably be better to stop relying on the Intel specific defines, and introduce MG-specific defines for 'none', 'sse4', 'avx2', '512y' and '512z' (and even these names could be changes as they are the Intel namess, but I have a huge amount of scripts and logs with them, and it may be easier to stay with them). Note by the way that there is already a MG-specific define: using only gcc native flags and defines, it is impossible to distinguish between 512y and 512z, because actually the difference is in the MG code (use 4 or 8 doubles per fptype_v), and this is handled with the MG-specific -DMGONGPU_PVW512.

About the fact that ARM_NEON is currently always defined, I have done some investigations, eg using

echo | c++ -E -dM -
echo | c++ -v -E -

which are the default settings. I think that one option would be using these two sets of flags, if we wanted to disable ARM_NEON by modifying CXXFLAGS:

echo | c++ -E -dM - -march=armv8.3a
echo | c++ -v -E - -march=armv8.3a
echo | c++ -E -dM - -march=armv8.3a+nosimd
echo | c++ -v -E - -march=armv8.3a+nosimd

These are actually documented under gcc flags, even if c++ is clang here (not armclang): https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html This may be useful later for ARMv9 (which was the original point of this #221).

About yet another realted issue, which is discovering in the Makefile which is the 'best' SIMD level one can support, this was done with /proc/cpuinfo on (Intel) Linux. On Mac (for which I only have ARM), this is is with sysctl, for instance

> sysctl -a | grep -i armv
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1

> sysctl -a | grep -i neon
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1

I might add a check based on this, too.

Summary for now, about PR #421: I will modify that to use the same approach as in Power9, ie using SSE4 defines. This makes it unnecessary to change the -march and -mcpu compiler flags, but will allow a distinction between "none" and "sse4" (ie neon here) with -D__SSE4_2__.

One important point: I will test on Mac only from the command line with make: no cmake, no xcode. I hope this does not break things...

As for later on: I guess one option would be to change the code based on equivalents of the "prefer vector width" define DMGONGPU_PVW512. For instance:

Long post, feedback more than welcome! Andrea

valassi commented 2 years ago

The new implementation is in WIP PR #425 (which also includes many other Mac patches, not just the handling of NEON SIMD on ARM)

valassi commented 2 years ago

This is a summary table from the tests of all five processes double/float on Mac M1 ARMv8 with and without NEON SIMD in PR #425. One nicely sees factors close to 2x and 4x for double and float https://github.com/madgraph5/madgraph4gpu/blob/b5ef53ca10a77bece14c0814e5d5ef0275dee53e/epochX/cudacpp/tput/summaryTable_macm1.txt

*** FPTYPE=d ******************************************************************

+++ cudacpp REVISION ea28661 +++
On mac-1T0-438.local [CPU: Apple M1] [GPU: none]:

[Apple clang 12.0.5] 
HELINL=0 HRDCOD=0
                eemumu          ggtt         ggttg        ggttgg       ggttggg
         [2048/256/12]  [2048/256/1]    [64/256/1]    [64/256/1]     [1/256/1]
CPP/none      3.51e+06      5.77e+05      7.02e+04      5.43e+03      2.24e+02
CPP/sse4      7.55e+06      9.79e+05      1.39e+05      1.11e+04      3.65e+02

*** FPTYPE=f ******************************************************************

+++ cudacpp REVISION ea28661 +++
On mac-1T0-438.local [CPU: Apple M1] [GPU: none]:

[Apple clang 12.0.5] 
HELINL=0 HRDCOD=0
                eemumu          ggtt         ggttg        ggttgg       ggttggg
         [2048/256/12]  [2048/256/1]    [64/256/1]    [64/256/1]     [1/256/1]
CPP/none      3.57e+06      5.76e+05      7.10e+04      5.59e+03      2.61e+02
CPP/sse4      1.50e+07      1.85e+06      2.48e+05      2.16e+04      7.21e+02
roiser commented 2 years ago

Hi @valassi thanks for adding the "none" option, that's certainly useful.

For what concerns multiple architectures, I agree its a bit ugly to use e.g. SSE4_2 for Power, so going to generic names would probably make sense.

valassi commented 2 years ago

Hi @roiser, thanks a lot, good that we agree :-) I have opened (non urgent) issue #426 about this as a reminder.