simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License

ARM SVE implementations of AVX/AVX2 #98

Open nemequ opened 4 years ago

nemequ commented 4 years ago

SVE is ARM's "vector-length agnostic" SIMD extension. AFAICT it's really only available on CPUs geared towards HPC, and not very widely used (yet?), but it works with vector sizes that are a multiple of 128 bits, up to 2048 bits (the actual size depends on the CPU, and may be much less than 2048; I'm not totally clear on that). It should be possible to use it to implement 256-bit and 512-bit vectors (i.e., AVX and AVX-512).

There is an emulator which we may be able to install on CI (though it's only available on AArch64). If anyone is interested in this please let me know and I'll look into that, but I'd be willing to add SVE without it since at least I can still test locally.

It sounds like SVE is a bit limited so coverage of Intel's APIs may be a bit sparse, but SVE2 is supposed to greatly expand support and implementing SVE would put us in a good position to add SVE2 when it's available.
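
To make the "256-bit vectors on top of SVE" idea concrete, here is a minimal sketch of an 8-lane float add, loosely analogous to `_mm256_add_ps`. The name `hyp_add8_f32` is hypothetical and this is not how SIMDe implements anything; it only assumes the ACLE intrinsics from `<arm_sve.h>` and falls back to a scalar loop when the compiler is not targeting SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for 8 floats (256 bits). */
static void hyp_add8_f32(const float *a, const float *b, float *out) {
  assert(a != NULL && b != NULL && out != NULL);
#if defined(__ARM_FEATURE_SVE)
  /* svcntw() is the number of 32-bit lanes per SVE register:
   * 4 on 128-bit hardware, 8 on 256-bit hardware, and so on. */
  for (uint64_t i = 0; i < 8; i += svcntw()) {
    /* Enable only lanes whose index i + lane is below 8, so registers
     * wider than 256 bits never touch memory past the 8th float. */
    svbool_t pg = svwhilelt_b32(i, (uint64_t)8);
    svfloat32_t va = svld1_f32(pg, &a[i]);
    svfloat32_t vb = svld1_f32(pg, &b[i]);
    svst1_f32(pg, &out[i], svadd_f32_x(pg, va, vb));
  }
#else
  /* Portable fallback so the sketch builds without SVE. */
  for (size_t i = 0; i < 8; i++) out[i] = a[i] + b[i];
#endif
}
```

On 128-bit SVE hardware the loop would run twice; on 256-bit or wider hardware it would run once with the predicate masking off any extra lanes. A real port would likely want to avoid the round trip through memory, which this sketch does not attempt.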

Torinde commented 3 months ago

So far, SVE support seems to be:

Apple M4 - SVE2/SME - details unknown?

Torinde commented 3 months ago

Regarding CI: QEMU supports SVE, SVE2 (and even SME) at every vector length from 128-bit to 2048-bit in 128-bit increments, including the non-power-of-2 lengths that were later disallowed (interesting blog on that link).

A few questions about the "scalable" part of SVE:

  1. Will code written for 128-bit just work on 256-bit and bigger SVE CPUs?
  2. Will code written for 128-bit that runs on 256-bit SVE be faster than on 128-bit SVE, or will the speed remain the same?
  3. Will code written for 256-bit just work on 128-bit SVE, only slower than on 256-bit? Or will it not work at all?

Possible answers:

  1. Yes, that's the basic premise of "scalable"
  2. Some (few?), but not all, instructions are Vector Length Agnostic (VLA), so part of the code may execute faster when run on wider SVE hardware. ARMv9.4 brings "Hybrid VLA" (any link to further info about it?).
  3. Will not work. One of the commenters isn't even sure whether the OS can trap that and provide slow emulation if the code itself has no fallback code paths.
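
For reference against the questions above, this is roughly what vector-length-agnostic code looks like with the ACLE intrinsics. It is a sketch under assumptions: `hyp_sve_bits` and `hyp_scale_f32` are hypothetical names, and the non-SVE branch exists only so the example builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: SVE register width in bits at run time,
 * or 0 when the code was not compiled for SVE. */
static unsigned hyp_sve_bits(void) {
#if defined(__ARM_FEATURE_SVE)
  return (unsigned)(svcntb() * 8); /* svcntb() = bytes per register */
#else
  return 0;
#endif
}

/* Multiply n floats by k in place without assuming a register width. */
static void hyp_scale_f32(float *x, size_t n, float k) {
  assert(x != NULL || n == 0);
#if defined(__ARM_FEATURE_SVE)
  /* The step is svcntw() lanes, read from the hardware at run time,
   * so wider registers simply mean fewer loop iterations. */
  for (uint64_t i = 0; i < n; i += svcntw()) {
    /* The predicate masks off lanes at or beyond n in the last pass. */
    svbool_t pg = svwhilelt_b32(i, (uint64_t)n);
    svst1_f32(pg, &x[i], svmul_n_f32_x(pg, svld1_f32(pg, &x[i]), k));
  }
#else
  for (size_t i = 0; i < n; i++) x[i] *= k;
#endif
}
```

Code that hard-codes a width (say, by assuming 8 float lanes per register) could call something like `hyp_sve_bits()` first and take a fallback path when the result is below 256, which is the situation question 3 asks about.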