Open valassi opened 3 years ago
A few interesting links about this issue
That's interesting, I have some connections to UK ARM company and HPC centers. Let's discuss next time.
Hi @roiser I dump here a few comments, which wer etriggered by your PR #421 but go beyond that. (This even goes beyond ARM so maybe I should open a new issue, but it is easier to discuss it here).
Until now, before your PR #421 for mac M1 ARMv8, we had only two types of SIMD:
The important points here in what I designed are two
The patch I added for Power9 is quite ugly, but that's what it is. Since Power9 has no SSE4, but the logic of the code internally depends on -D__SSE4_2__, in the Makefile I am simply adding this flag for Power9 if AVX=sse4 is set https://github.com/madgraph5/madgraph4gpu/blob/898e8b4c98356c5e31b8bd4755b9f8cdd8d98865/epochX/cudacpp/ee_mumu/SubProcesses/Makefile#L216
Your patch in PR #421 is using a different (probably better) logic: instead of relying on the SSE4 defines, the code internally looks for the __ARM_NEON__ defines, which are proper to ARM v8 (I guess on any ARM, not only on Mac M1), https://github.com/madgraph5/madgraph4gpu/blob/02c26af4f0554e05c67eab2fa78169eac62ddf04/epochX/cudacpp/gg_tt/src/mgOnGpuConfig.h#L146
The problem I have with this are two
As briefly mentioned above, I think that eventually it would probably be better to stop relying on the Intel specific defines, and introduce MG-specific defines for 'none', 'sse4', 'avx2', '512y' and '512z' (and even these names could be changes as they are the Intel namess, but I have a huge amount of scripts and logs with them, and it may be easier to stay with them). Note by the way that there is already a MG-specific define: using only gcc native flags and defines, it is impossible to distinguish between 512y and 512z, because actually the difference is in the MG code (use 4 or 8 doubles per fptype_v), and this is handled with the MG-specific -DMGONGPU_PVW512.
About the fact that ARM_NEON is currently always defined, I have done some investigations, eg using
echo | c++ -E -dM -
echo | c++ -v -E -
which are the default settings. I think that one option would be using these two sets of flags, if we wanted to disable ARM_NEON by modifying CXXFLAGS:
echo | c++ -E -dM - -march=armv8.3a
echo | c++ -v -E - -march=armv8.3a
echo | c++ -E -dM - -march=armv8.3a+nosimd
echo | c++ -v -E - -march=armv8.3a+nosimd
These are actually documented under gcc flags, even if c++ is clang here (not armclang): https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html This may be useful later for ARMv9 (which was the original point of this #221).
About yet another realted issue, which is discovering in the Makefile which is the 'best' SIMD level one can support, this was done with /proc/cpuinfo on (Intel) Linux. On Mac (for which I only have ARM), this is is with sysctl, for instance
> sysctl -a | grep -i armv
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.armv8_3_compnum: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_gpi: 1
> sysctl -a | grep -i neon
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1
I might add a check based on this, too.
Summary for now, about PR #421: I will modify that to use the same approach as in Power9, ie using SSE4 defines. This makes it unnecessary to change the -march and -mcpu compiler flags, but will allow a distinction between "none" and "sse4" (ie neon here) with -D__SSE4_2__.
One important point: I will test on Mac only from the command line with make: no cmake, no xcode. I hope this does not break things...
As for later on: I guess one option would be to change the code based on equivalents of the "prefer vector width" define DMGONGPU_PVW512. For instance:
Long post, feedback more than welcome! Andrea
The new implementation is in WIP PR #425 (which also includes many other Mac patches, not just the handling of NEON SIMD on ARM)
This is a summary table from the tests of all five processes double/float on Mac M1 ARMv8 with and without NEON SIMD in PR #425. One nicely sees factors close to 2x and 4x for double and float https://github.com/madgraph5/madgraph4gpu/blob/b5ef53ca10a77bece14c0814e5d5ef0275dee53e/epochX/cudacpp/tput/summaryTable_macm1.txt
*** FPTYPE=d ******************************************************************
+++ cudacpp REVISION ea28661 +++
On mac-1T0-438.local [CPU: Apple M1] [GPU: none]:
[Apple clang 12.0.5]
HELINL=0 HRDCOD=0
eemumu ggtt ggttg ggttgg ggttggg
[2048/256/12] [2048/256/1] [64/256/1] [64/256/1] [1/256/1]
CPP/none 3.51e+06 5.77e+05 7.02e+04 5.43e+03 2.24e+02
CPP/sse4 7.55e+06 9.79e+05 1.39e+05 1.11e+04 3.65e+02
*** FPTYPE=f ******************************************************************
+++ cudacpp REVISION ea28661 +++
On mac-1T0-438.local [CPU: Apple M1] [GPU: none]:
[Apple clang 12.0.5]
HELINL=0 HRDCOD=0
eemumu ggtt ggttg ggttgg ggttggg
[2048/256/12] [2048/256/1] [64/256/1] [64/256/1] [1/256/1]
CPP/none 3.57e+06 5.76e+05 7.10e+04 5.59e+03 2.61e+02
CPP/sse4 1.50e+07 1.85e+06 2.48e+05 2.16e+04 7.21e+02
Hi @valassi thanks for adding the "none" option, that's certainly useful.
For what concerns multiple architectures, I agree its a bit ugly to use e.g. SSE4_2 for Power, so going to generic names would probably make sense.
Hi @roiser, thanks a lot, good that we agree :-) I have opened (non urgent) issue #426 about this as a reminder.
This was discussed during the June 28 meeting https://indico.cern.ch/event/1053713/ This is mainly about ARMv9 Scalable Vector Extension 2 (SVE2)... of course we need to find a test system first.
What would this give on the AVX512 issue #173 ?
In theory, vector widths up to 2048 bits could be supported? See for instance https://levelup.gitconnected.com/armv9-what-is-the-big-deal-4528f20f78f3