Closed malaterre closed 1 year ago
Hi, I'm not familiar with POWER4, looks like that launched in 2001 :o I'm curious what the use-case is?
Debian supports two big-endian arches with simd:
ppc32 is technically G3, but since it can run on G4 I assumed some altivec code could be "borrowed" for those systems.
ppc64 is supposed to run on G5, so again some recent altivec code could be "borrowed" for this arch.
Got it. It's plausible that the current code could run on G5 with hopefully not too major modifications. If someone is willing to do so and update run_tests.sh with the required qemu/flags, I would consider maintaining that, especially if there is someone who actually wants to run Highway on that arch.
There are some limitations on targets that don't support VSX, such as PowerPC G4/PowerPC G5/POWER4/POWER5/POWER6, including the following:
The Vec128<int64_t, N>, Vec128<uint64_t, N>, Mask128<int64_t, N>, and Mask128<uint64_t, N> types are still needed for Altivec targets that don't support VSX, as the Vec128<uint64_t, N> type is needed for the SumsOf8, StoreInterleaved3, and StoreInterleaved4 operations on Altivec targets.
Vec128<int64_t, N> vectors need to be backed by __vector signed int instead of __vector signed long long on targets that support Altivec but not VSX.
Vec128<uint64_t, N> vectors need to be backed by __vector unsigned int instead of __vector unsigned long long on targets that support Altivec but not VSX.
Mask128<int64_t, N> and Mask128<uint64_t, N> masks need to be backed by __vector __bool int instead of __vector __bool long long on targets that support Altivec but not VSX.
Unaligned loads need to be implemented using vec_ld operations, a lvsl or lvsr operation, and a vec_perm operation.
Floating-point division needs to be implemented by computing an approximate reciprocal using the vec_re operation, then refining the approximate reciprocal through Newton-Raphson refinement steps, and then multiplying the dividend by the refined reciprocal (which is described at https://en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division).
Floating-point square root needs to be implemented by computing an approximate reciprocal square root using the vec_rsqrte operation, then multiplying the source values by the approximate reciprocal square root, and then refining the result using Goldschmidt's algorithm (which is described at https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Goldschmidt%E2%80%99s_algorithm).
I originally included support for targets that supported Altivec but not VSX in the port of Highway to PPC, but I removed the support for Altivec/PPC7 from the hwy/ops/ppc_vsx-inl.h header as some of the Highway unit tests were failing on the Altivec target (but passing on little-endian PPC8/PPC9/PPC10 targets).
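For readers who have not seen the vec_ld / lvsl / vec_perm idiom mentioned above, here is a rough scalar model of it. This is only a sketch of the general technique, not code from the Highway port: AlignedLoad, PermuteShiftLeft, and UnalignedLoad are hypothetical names, and plain byte arrays stand in for the vector registers so it compiles on any host.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using Block = std::array<uint8_t, 16>;  // stand-in for a 128-bit register

// Models vec_ld: the low 4 address bits are ignored, so the enclosing
// 16-byte aligned block is loaded.
Block AlignedLoad(const uint8_t* p) {
  const uintptr_t addr = reinterpret_cast<uintptr_t>(p) & ~uintptr_t{15};
  Block b;
  std::memcpy(b.data(), reinterpret_cast<const uint8_t*>(addr), 16);
  return b;
}

// Models lvsl + vec_perm: selects 16 consecutive bytes, starting at byte
// `shift`, from the 32-byte concatenation of lo followed by hi.
Block PermuteShiftLeft(const Block& lo, const Block& hi, unsigned shift) {
  Block out;
  for (unsigned i = 0; i < 16; ++i) {
    const unsigned idx = shift + i;
    out[i] = idx < 16 ? lo[idx] : hi[idx - 16];
  }
  return out;
}

// Unaligned load built from two aligned loads and one permute.
// Loading the second block at p + 15 (not p + 16) means an already
// aligned p touches only its own block.
Block UnalignedLoad(const uint8_t* p) {
  const unsigned shift =
      static_cast<unsigned>(reinterpret_cast<uintptr_t>(p) & 15);
  const Block lo = AlignedLoad(p);
  const Block hi = AlignedLoad(p + 15);
  return PermuteShiftLeft(lo, hi, shift);
}
```

Note that the aligned loads can read bytes just before p and just after p + 15 inside the same 16-byte blocks, so the model assumes those blocks are readable.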
Here is a gist that has implementations of various int64_t, uint64_t, and float vector operations for Altivec: https://gist.github.com/johnplatts/761fc35054eeb2a83b15cebd4b6ef288
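As a rough illustration of the two refinement schemes mentioned above (Newton-Raphson for division, Goldschmidt for square root), here is a scalar sketch. It is not taken from the gist. ApproxReciprocal and ApproxRsqrt are hypothetical stand-ins for the vec_re and vec_rsqrte estimates; they simply truncate the exact result to 12 mantissa bits, which is an assumption about estimate precision made for this example. Zero, infinity, and NaN inputs are not handled.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Keeps only the top 12 of the 23 mantissa bits to mimic a coarse estimate.
float Truncate12(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFFF800u;
  std::memcpy(&x, &bits, sizeof(x));
  return x;
}

// Hypothetical stand-ins for the hardware estimate instructions.
float ApproxReciprocal(float x) { return Truncate12(1.0f / x); }
float ApproxRsqrt(float x) { return Truncate12(1.0f / std::sqrt(x)); }

// Newton-Raphson division: refine r ~ 1/d via r <- r * (2 - d * r),
// then multiply the dividend by the refined reciprocal.
float DivNewtonRaphson(float n, float d) {
  float r = ApproxReciprocal(d);
  for (int i = 0; i < 2; ++i) {
    r = r * (2.0f - d * r);
  }
  return n * r;
}

// Goldschmidt square root: start from x = s * y and h = y / 2 with
// y ~ 1/sqrt(s), then iterate r = 0.5 - x * h; x += x * r; h += h * r.
// x converges to sqrt(s) and h to 1 / (2 * sqrt(s)).
float SqrtGoldschmidt(float s) {
  const float y = ApproxRsqrt(s);
  float x = s * y;
  float h = 0.5f * y;
  for (int i = 0; i < 2; ++i) {
    const float r = 0.5f - x * h;
    x = x + x * r;
    h = h + h * r;
  }
  return x;
}
```

Both iterations roughly double the number of correct bits per step, which is why two steps are used here for a 12-bit starting estimate. Neither result is guaranteed to be correctly rounded, which may matter for unit tests that expect exact IEEE results.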
Thanks @johnplatts for the list. It looks like some nontrivial effort is required, but might be worthwhile if/when someone actually wants to run on POWER4. Let's wait to see if such a use case arises?
FYI it is legitimate for a Highway implementation to set
#define HWY_HAVE_INTEGER64 0
#define HWY_HAVE_FLOAT16 0
#define HWY_HAVE_FLOAT64 0
This would take care of many of the missing bits, but also be less useful for apps that actually do want to use 64-bit operations.
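To make the trade-off concrete, here is a small sketch of how application code might branch on HWY_HAVE_INTEGER64. The macro name comes from the comment above. Everything else is an assumption for illustration: the fallback #define only exists so the example compiles without Highway, SumU32 is a hypothetical function, and the scalar loops stand in for what would be vector code in a real port.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in so this sketch builds without Highway. In a real build the
// value comes from the Highway headers for the active target.
#ifndef HWY_HAVE_INTEGER64
#define HWY_HAVE_INTEGER64 0
#endif

// Sums 32-bit inputs into a 64-bit total.
uint64_t SumU32(const uint32_t* in, size_t count) {
#if HWY_HAVE_INTEGER64
  // 64-bit lanes available: accumulate directly in 64 bits.
  uint64_t total = 0;
  for (size_t i = 0; i < count; ++i) total += in[i];
  return total;
#else
  // Only 32-bit lanes: keep low and high words separately and propagate
  // the carry by hand.
  uint32_t lo = 0;
  uint32_t hi = 0;
  for (size_t i = 0; i < count; ++i) {
    const uint32_t sum = lo + in[i];
    hi += (sum < lo) ? 1u : 0u;  // unsigned wraparound signals a carry
    lo = sum;
  }
  return (static_cast<uint64_t>(hi) << 32) | lo;
#endif
}
```

The second branch is the kind of extra work that lands on applications when the target reports no 64-bit integer support.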
@jan-wassenberg Would it be acceptable for the time being to simply copy/paste x86_128-inl.h for power4/power5, with help from the gcc rs6000 wrappers (I think clang also has some ppc wrappers)?
Interesting, I didn't know the compiler ships something like that. sse2neon/neon2sse are indeed useful.
If we copy-paste the entire file, that will be a larger maintenance burden. How about we do something like, at the end of highway.h, also including x86_128-inl.h #if HWY_TARGET == HWY_PPC4?
> How about we do something like, at the end of highway.h, also including x86_128-inl.h #if HWY_TARGET == HWY_PPC4?
I believe I misread the documentation. This port is only for powerpc64el, so this will never work for POWER4/POWER5.
Technically it even fails for PPC8 with random inconsistencies (SSE vs AVX...), not sure what gcc is supposed to support here:
[ 1%] Building CXX object CMakeFiles/hwy.dir/hwy/per_target.cc.o
In file included from /home/malat/highway/hwy/highway.h:384,
from /home/malat/highway/hwy/per_target.cc:21:
/home/malat/highway/hwy/ops/x86_128-inl.h: In function 'hwy::N_PPC8::Vec128<unsigned char> hwy::N_PPC8::AESRound(Vec128<unsigned char>, Vec128<unsigned char>)':
/home/malat/highway/hwy/ops/x86_128-inl.h:5883:26: error: '_mm_aesenc_si128' was not declared in this scope; did you mean '_mm_testnzc_si128'?
5883 | return Vec128<uint8_t>{_mm_aesenc_si128(state.raw, round_key.raw)};
| ^~~~~~~~~~~~~~~~
| _mm_testnzc_si128
Ah, OK. So we don't yet have a drop-in solution for PPC4.
It's not surprising they do not support _mm_aesenc_si128 - (efficiently) emulating that in software is a couple hundred lines of tricky code.
We haven't yet heard of potential Highway users on POWER4/5, but please feel free to reopen if that changes :)
POWER8/POWER9 support has been added recently; it would be nice to also have POWER4/POWER5 support.