tc39 / ecmascript_simd

SIMD numeric type for EcmaScript

Add/plan for 256 bit and 512 bit SIMD functions? #43

Closed oscarbg closed 10 years ago

oscarbg commented 10 years ago

Hi, since last year we have a 256-bit (8×int32) SIMD ISA shipping (AVX2) in Haswell processors, and it seems next year we will also have 512-bit SIMD support (i.e. 16×int32) in the form of AVX-512. Executing 128-bit SIMD instructions on a 512-bit-capable processor (Intel Skylake?) is only 25% efficient, which is similar to having no SIMD support at all on an SSE-only processor (pre-2011, before Sandy Bridge). So it seems you should already plan on adding Int32x8 and Int32x16 types, which on processors with only 128-bit SIMD support could be lowered to 2 or 4 int32x4 instructions respectively. Makes sense?
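The lowering oscarbg proposes can be sketched in plain JavaScript. Note that `Int32x8` never existed in SIMD.js; this is a hypothetical illustration in which `Int32Array`s stand in for SIMD registers, showing how an 8-lane add could be split into two independent 4-lane adds on 128-bit hardware:

```javascript
// Hypothetical 4-lane int32 add, as a 128-bit SIMD unit would perform it.
// Int32Arrays stand in for SIMD registers; |0 models int32 wraparound.
function int32x4Add(a, b) {
  const r = new Int32Array(4);
  for (let i = 0; i < 4; i++) r[i] = (a[i] + b[i]) | 0;
  return r;
}

// A hypothetical Int32x8.add, lowered onto 4-lane hardware by issuing
// two 4-lane adds: one for the low half, one for the high half.
function int32x8AddLowered(a, b) {
  const lo = int32x4Add(a.subarray(0, 4), b.subarray(0, 4));
  const hi = int32x4Add(a.subarray(4, 8), b.subarray(4, 8));
  const r = new Int32Array(8);
  r.set(lo, 0);
  r.set(hi, 4);
  return r;
}
```

The two halves are independent, which is why the lowering is mechanically simple; the costs discussed later in the thread (register pressure, code size) are what make it less free than it looks here.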

johnmccutchan commented 10 years ago

Keep in mind that no ARM processor supports 256-bit or 512-bit wide SIMD. The current SIMD-128 specification is designed to run fast on both x86 and ARM chips. It will be difficult to design a portable set of operations for SIMD-256 and SIMD-512 without a second instruction set.

oscarbg commented 10 years ago

Yes, ARM is only 128-bit right now, but can't a 256-bit or 512-bit extension still be lowered onto a 128-bit SIMD ISA with nearly perfect efficiency? I.e. Int32x8 and Int32x16 could be exposed (at least add, multiply, and bitwise operations, which any serious SIMD implementation of any width must expose) using SIMD instructions on ARM processors, although each operation would require two or four 128-bit instructions. I think it makes sense, although it can wait further into the future.

BrendanEich commented 10 years ago

This is a low-level API, meant to map to hardware instructions with O(1) asymptotic complexity. As David Bacon and others like to point out, O(k) for constant k, while formally O(1), is often noticeably different in real-world performance.

My advice matches John's: let's wait for two ISAs to adopt wider vectors before we expend effort and add complexity here.

/be

sbalko commented 9 years ago

That's a pity; doubling the performance in suitable cases seems very significant to me. Not sure if the 64-bit ARM instruction sets will have 256-bit-wide vectors; probably not (yet). But what about GPGPU? With GPUs receiving so much attention on mobile devices these days, and compute shaders available on Android (RenderScript), iOS (Metal), and the most recent incarnation of OpenGL ES, supporting wider vectors seems like a very elegant way of improving JS performance by using the GPU (and without resorting to WebGL hacks like this one: https://github.com/stormcolor/webclgl).

sunfishcode commented 9 years ago

SIMD.js by itself isn't suitable to run on GPUs, and extending it to 256-bit or any other predetermined size won't change that. It's a very CPU-oriented API, and it benefits greatly from the simplicity this affords it.

On the GPU side, the most recent incarnation of OpenGL ES is coming, under the name WebGL 2. It is expected to bring ARB_compute_shader, which will open up the GPU to a much broader audience.

What lies in the future, beyond what we're specifically planning in SIMD.js and WebGL 2? Many things are possible :-).

jfbastien commented 9 years ago

Doubling the vector width that SIMD.js exposes may indeed double performance on architectures that natively support these vector widths, but it won't leave other architectures with the same performance as if the code had used 128-bit SIMD.

Oftentimes, using wider vectors efficiently comes at a cost on narrower hardware: splitting wide vector instructions into smaller ones causes much more register pressure, and either requires the JIT compiler to re-roll loops or incurs significant performance hits. It also means that a tight loop may no longer fit in e.g. x86's loop stream detector, hitting yet another perf cliff.

AFAIK ISA designers such as ARM haven't discussed their plans for wider vectors publicly, and I would strongly advise that they be involved in this discussion.

TL;DR: standardizing wider vector types should be heavily based on implementation-experience and data on multiple architectures.

sunfishcode commented 9 years ago

I agree. Another thing with wider vectors that can't easily be cleaned up in a JIT is that they require more code in cleanup loops to handle cases where array lengths aren't multiples of the SIMD lane count.
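The cleanup-loop cost sunfishcode describes can be sketched as follows. This is an illustrative model, not SIMD.js API: the inner loop stands in for one wide SIMD add, and `laneCount` is a parameter so you can see that the scalar tail grows with vector width (up to `laneCount - 1` leftover iterations):

```javascript
// Element-wise add of two Int32Arrays whose length need not be a
// multiple of the SIMD lane count.
function addArrays(a, b, out, laneCount) {
  const n = a.length;
  const vecEnd = n - (n % laneCount); // last index the vector body covers
  // Vectorized body: each iteration of the outer loop models one
  // wide SIMD add over laneCount elements.
  for (let i = 0; i < vecEnd; i += laneCount) {
    for (let j = 0; j < laneCount; j++) {
      out[i + j] = (a[i + j] + b[i + j]) | 0;
    }
  }
  // Cleanup loop: leftover elements handled one at a time. With 4 lanes
  // this runs at most 3 times; with 16 lanes, up to 15 times.
  for (let i = vecEnd; i < n; i++) {
    out[i] = (a[i] + b[i]) | 0;
  }
  return out;
}
```

For an array of length 10, a 4-lane body covers elements 0..7 and the cleanup loop handles the last 2; a 16-lane body would cover nothing and fall back entirely to the scalar tail, which is part of why wider types are not a free win on short workloads.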

And to correct something I said above: compute shaders will not be in WebGL 2; hopefully they will be added sometime after that. Regardless, I believe compute shaders will eventually serve some subset of the larger problem here.