How will native code port on top of JS-SIMD?

tc39 / ecmascript_simd

SIMD numeric type for EcmaScript

Other

540 stars 71 forks source link

How will native code port on top of JS-SIMD? #59

Open juj opened 9 years ago

juj commented 9 years ago

With Emscripten, we have the capacity to port native C&C++ code to the web. When/if people read tweets along the lines of "JS has SIMD", it will invariably result in a stream of Emscripten developers attempting to port their MMX/SSE1/SSE2/... -based codebases over to JS-SIMD. We need to have an answer to these developers about what the support of mapping these constructs over to JS-SIMD looks like.

In the Emscripten compiler, we already have small bits of such SIMD support available. To chart what this mapping would look like for SSE1 in particular (focusing on just one instruction set spec to start with, and SSE1 is the most interesting one) when completed, I wrote up this spreadsheet: https://docs.google.com/spreadsheets/d/1QAGGf2M2IA6l4cvh8eTXdXGEUcPjdmTe_BLKGn5YCB4/edit?usp=sharing

As one can imagine, comparing the current spec and the set of SSE1 intrinsics listed in the above spreadsheet, there is a large gap. I wonder how this could be resolved?

huningxin commented 9 years ago

@juj , thanks for the spreadsheet. It is very informative! I happened to work on JS-SIMD in emscripten a bit. As my understanding, emscripten generates SIMD.js code from 1) LLVM vector types (<4 x i32>, <4 x f32> and <2 x f64>) and operations; 2) emscripten builtins, say emscripten_float32x4_min etc.,; So to fill the SSE1 intrinscis gap, there could be two ways:

map SSE1 intrinscis to LLVM vector type operations, and if there are no direct mapping, it might need some helper code
expose SIMD.js via emscripten builtins, then map SSE1 intrinscis to these emscripten builtins.

If there is a need to add new SIMD.js API, we need to consider it in JavaScript API design principle, say cross architecture for example, and open a specific issue here for discussion.

Your thoughts?

I found your filed issues (https://github.com/kripken/emscripten/labels/SIMD) in emscripten repo. I am willing to work with you to fill the gap. Let's see how far we can go. :)

juj commented 9 years ago

Looking at the current code, I am very worried about the potential need of such "helper code". Also while assembling the SSE1 support spreadsheet, I could not imagine how to support that API without performance cliffs. The front page talks about a "straw man proposal", so let me try to attack that here, somewhat boldly if that's ok:

Has it been considered that the JS-SIMD spec would directly add the SIMD intrinsics as-is from the instruction set to the spec? That is, after adding the new SIMD types (including int64), we would have SIMD.SSE1.load_ps, SIMD.SSE1.load_ss, SIMD.SSE1.loadh_pi, and so on (or simply via SIMD.load_ps without the extra .SSE1.), and the same for other intrinsic sets and NEON? Then a common "mapping" of overlapping functions would be layered on top, e.g. in the namespace SIMD.common.xxx (or simply document which of the SIMD.xxx are common), for people who want to write one SIMD code to work on both SSE and NEON. This would have the following advantages:

it would provide the functionality as-is like the hardware has it, unchanged.
there will be a good contract of explicit performance. I.e. calling a function, one knows what he will be getting, like with native SIMD. There won't be hidden under-the-hood instructions generated.
developers can reuse existing documentation from Intel and ARM for SIMD intrinsics, since the API is identical.
the API would match 1:1 with features provided by hardware (e.g. SIMD.SSE3.xxx will be available iff the target hardware supports SSE3)
developers will have the power to choose what kind of compatibility to target: if they want their code to run across SSE and NEON, they can limit to the documented common overlapping feature set, or if they want to target a specific SSSE3 or SSE4.1 functionality, they can also choose to do so themselves.
also developers can reuse existing code samples, and transform existing native SIMD code algorithms and snippets over to JS systematically.
the web working group does not have to invent a new SIMD instruction set, and in turn, developers will not need to learn a new SIMD instruction set and the surface area of new required documentation will be much smaller.
it would offer a suitable compile target for Emscripten, and directly usable in existing native code that uses SIMD, like Vorbis and Theora, or games and game engines like Unity3D or Unreal Engine 4.

Developers who wanted to use e.g. SSE2/3 but still have their code work on NEON and not break, we could offer an API like SIMD.allowSoftwareEmulation() which enables all functionality, but implemented in terms of another SIMD instruction set or software in the absence of the real thing. Or alternatively (and perhaps simpler), offer a JS polyfill library on how to implement those functions.

There are hundreds of different domain areas that utilize SIMD in different forms. I worry that if for example some of the SSE1 intrinsics (or MMX, SSE2, or NEON) are left out from the spec, we will need to soft-emulate the functions in code when compiling with Emscripten, which can easily become catastrophically slow. This will lead to the developer to need to adapt his code so that his SIMD only uses the "native" JS-SIMD feature set, which in turn will create a big need to author new "JS-SIMD porting guide and SSE1/SSE2/... emulation tips" documentations on how existing SSE algorithms should be rewritten to JS-SIMD, and what might be supported and what is not. If instead JS-SIMD would run a sequence of one or more of instructions to emulate that the browser needs to do under the hood, it will lead to failure when the performance is not like the developer expected. Currently I see that there is already pressure for JS-SIMD to jump over the fence to cater to domain-specific areas like #58, and if the spec offered the direct hardware instructions, solving a problem like #58 would be easy for the developer to do himself, like he does in the native world. Reading the issues in the tracker, I see that a Mandelbrot code sample has been used as a test, but I think that the real test should be to use different applications, which means that in addition to simple amplified for-loop processing (parallel for, autovectorization), it should stress audio/video decoding (Vorbis/Theora et al.), image processing (RGB<->YUV/RGB888<->RGB565, color->grayscale, gamma adjust, ...), string and block ops, raytracing and games (micro-interleaved SIMD ops), to name a few.

The approach that was taken with WebGL was to "just copy GLES2, and make sure it's safe" and it was very successful. I think the same should be done with SIMD: just copy the intrinsic APIs over, and make sure it's safe. That will give us comfort that all of the abovementioned domain areas will be catered for, since the native world has already proven that, and the performance will be equally as good, since the hardware mapping is explicit. The only purpose of SIMD is performance, and I think that we will fail unless the spec can deliver that uncompromised, in explicitly written down guarantees like "this function will compile down to this SSE/NEON instruction".

This turned out to be much longer writeup than I thought and I'm sure that this discussion has already been out there, so thanks if you managed to read it all the way to the end! Emscripten will be one of the heavy users of SIMD, and we already have more than a dozen codebases that mostly use MMX, SSE and SSE2, and they would happily flip the switch if they could compile over to JS-SIMD, so that's good to keep in mind when the spec evolves!

BrendanEich commented 9 years ago

@juj - thanks for writing that up.

If one believes (as I do) in the extensible web manifesto, then there is a tension between exposing native, hardware-specific capabilities on the one hand; and trying to unify hardware under a common API on the other.

Unification looks good at first blush, but if the portable path is intersection of divergent hardware architectures' low-level APIs or ISAs, then the result will not compete with "Native code", and so it will not advance the Web -- more likely it will hold back the Web vs. Native via indirect (opportunity) and even direct (intersection implementation) costs.

Unification via "union" rather than "intersection" means performance cliffs for low-level interfaces such as SIMD, which are worse than the alternative you outline. No one wants cliffs.

This leaves hardware- or ISA-specific APIs. Developers can then adapt higher-level libraries based on what is available (good old "object detection"). Just as native code developers have always done. And similar to how JS developers have coped with browsers and hardware across time and space.

This is where I land, too. Comments from others more than welcome. We probably need an es-discuss thread or three to really thrash this out, but I'm happy to start here.

/be

sunfishcode commented 9 years ago

I'm open to the idea of ISA-specific APIs. That's an interesting conversation to have. However, it still makes sense to have an "intersection"-ish API to serve as a common shared base, which is roughly the current SIMD spec that's in progress here today. I can see both styles of APIs coexisting, and even complementing each other. Developers could choose to use the portable API when they want to run well everywhere and don't need platform-specific features, and the hardware-specific API when they feel that's appropriate, or mix the two to make their own tradeoffs.

And so, I'd also like to continue to make progress on this "intersection"-ish API we have here, regardless of the direction of the "union"-ish API conversations.

kripken commented 9 years ago

I agree the intersection, of stuff that we want to guarantee runs well across all major SIMD implementations, is most important here. But I also see the motivation for something like SIMD.SSE1 etc. So what about this as a possible "compromise":

We spec and implement the intersection (what we are already doing). This is going to be fast on all major CPUs, in as guaranteed a way as we can do on the web.
We implement a semi-official JS "polyfill" for SIMD.SSE1. It uses the specced SIMD API under the hood (where possible). This means that it works in all browsers, as it is "just" a polyfill. However, by having this be a semi-official way to represent SSE1 etc. operations, browsers might take care to optimize it well. In the limit, a browser could make sure that those patterns are actually optimized down to the relevant SSE1 operation, when SSE1 is available, because semantically those patterns are identical to an SSE1 operation.

This does lack a guarantee of actually getting SSE1 when you ask for it, but then you wouldn't get it if your website happens to run on an ARM phone either...

BrendanEich commented 9 years ago

@kripken: the problem with 2 is that object detection, even without fallback, is better than a perf cliff. The app that doesn't also code for NEON will just fail to start (it should have arranged to be Intel-only before running, anyway).

I like common portable APIs, don't get me wrong -- so I agree with @sunfishcode that where hardware has a viable intersection semantics, we should have a generically namespaced API. That has no perf cliff problem.

Is there a non-cliff, a "perf hill", that you think could be tolerable enough to be preferable to no-service on the "wrong" arch?

/be

mhaghigh commented 9 years ago

Great discussion all, and thanks to @juj for the initial post.

Since the processors will evolve in time, in future we might see SIMD capabilities (some evolutionary, some more radical) that do not exist on any of the processors today. Inevitably, SIMD.JS needs to evolve as well. One option would be to bring in a group of new capabilities in each generation of SIMD.JS. For instance, now we may start with a set from the existing common capabilities of the processors as well as judicious select instructions/capabilities that are justified because of their dramatic performance impact for certain application domains (do not want to miss them and they do not pose performance cliffs). This would be analogous to the first version of SSE in the native world. Of course, for the first generation of SIMD.JS, we are not restricted to SSE, SSE2, etc. In other words, at each new generation of SIMD.JS, through the collective agreement of the community, we bring in new capabilities that are considered necessary, very helpful, etc.

Now, we are bringing in the SIMD object. Later on, we can add SIMD2 object, and so on. Ideally, it would be required to have backward compatibility: SIMD_n implies availability of SIMD_m for m < n. That way, object detection would also be very practical. Again, there may not be any connection between SIMD_n and SSE_n.

This is approximately the way the native world of each CPU platform works today and it seems a plausible approach for the web.

So, now we should decide what should come in at the first stage. This seems better be application domain driven. I don't consider all SIMD instructions equally important, some are more equal ;)

-moh

kripken commented 9 years ago

@BrendanEich , not sure I follow you? There is always going to be a perf cliff in some case here. If someone writes code specifically for one CPU's SIMD, and it runs on another, the polyfill or the browser will have to implement the right semantics in a likely slower manner. For that reason it seems risky to put one CPU's specific operations in an official spec. But a semi-official polyfill that browsers are free to implement or not, is within the realm of normal optimization unpredictability on the web. Or do you mean a different type of perf cliff here?

BrendanEich commented 9 years ago

@kripken: It would help if I defined "perf cliff": I mean the code works everywhere, but terribly slowly on some platforms, untenable slowness (4x slowdown counts? I think so).

The cases I see, to repeat in case I was unclear (quite possible!) are:

Portable intersection semantics (no perf cliffs, but intersection could be too small a set).
Portable union semantics (emulations with perf cliffs).
Non-portable union among top desktop+mobile SIMD ISAs (no perf cliffs, see below).

Obviously combinations are possible, and good. As with WebGL = OpenGL ES2 (currently), a strong enough (1) wins for many cases.

But SIMD and desktop/mobile divergence make me think (1) + (3) is strictly better, and worth the risk of non-portable JS being written. Let the github hordes help us discover the future (1), rolling up what wins from (3) and co-evolving with the hardware.

I'm assuming hardware vendors pay attention to what developers do with (1)+(3). I'm also giving devs the advantage, since web devs number ~10M vs. ~500K native devs. Check my numbers!

/be

kripken commented 9 years ago

It's possible the perf cliffs would be small if enough people make sure to support top SIMD ISAs, yes. That leaves new ISAs, but as you say, hardware vendors are likely aware of this stuff. But, the main concern is if people just write to one CPU. If it's a github library, then collaboration can fill in the holes, but in a specific app, they may well just focus on their main market (one CPU/browser/OS maybe).

This seems unavoidable anyhow, though. Safari's FTL uses LLVM which can autovectorize, and that approach may increase in parallel to the SIMD.js API. Autovectorization will always have such perf cliffs. So for that reason I am not too worried about adding non-portable things. However, I do feel that putting such non-portable things in a spec is troubling - for that reason I was suggesting it be in a semi-official library on the side. A library or autovectorization can also lead to perf cliffs, but are less problematic from a standards perspective.

Overall I think there is little difference between our positions. Perhaps I am focusing too much on small details.

BrendanEich commented 9 years ago

@kripken: it's true, unless a particular instruction available only in one arch were on a super-critical path, the cliff might be much less than 4x for the macro-benchmark. Hard to say without concrete instruction, emulation, and macro-benchmark.

It would be helpful to me at least to see the NEON version of

https://docs.google.com/spreadsheets/d/1QAGGf2M2IA6l4cvh8eTXdXGEUcPjdmTe_BLKGn5YCB4/edit#gid=0

and then the union and intersection, or at least their sizes. Is anyone doing that?

/be

juj commented 9 years ago

Thanks for all the discussion here!

I filled out the spreadsheet on the SSE1 support page to add a new column on how those SSE1 instructions map to NEON.

If one is looking for a strict set intersection at the intrinsics API level between NEON and SSE1 only, where the semantics are exactly identical (ignoring flush-denormals-to-zero and hardware fp exceptions) then, if I got it right, it is equal to the following functions:

_mm_loadu_ps = vld1q_f32
_mm_set1_ps = vdupq_n_f32
_mm_storeu_ps = vst1q_f32
_mm_add_ps = vaddq_f32
_mm_mul_ps = vmulq_f32
_mm_sub_ps = vsubq_f32
_mm_and_ps = vandq_u32 + vreinterpret_q
_mm_or_ps = vorrq_u32 + vreinterpret_q
_mm_xor_ps = veorq_u32 + vreinterpret_q
_mm_cmpeq_ps = vceqq_f32
_mm_cmpge_ps = vcgeq_f32
_mm_cmpgt_ps = vcgtq_f32
_mm_cmple_ps = vcleq_f32
_mm_cmplt_ps = vcltq_f32

If I missed something, please help complete the chart in the SSE1 spreadsheet page. As one can see following the spreadsheet, the overlap is very small.

That set intersection is perhaps barely suitable for autovectorization, which I see as a very small and uninteresting part of SIMD that at most applies to problems that one could call "embarrasingly SIMDable". Most applications of SIMD are outside that scope, so I think any kind of core minimal required set intersection vs optional SSE/NEON extensions approach will not work.

Dan rather excellently provides an example of my greatest fears in #67 . Thanks Dan for the research there! In #67, we are asking whether we should write the spec of the min() function according to how NEON works (and then x86 suffers a perf hit) or how SSE works (then NEON would suffer a perf hit), so it becomes a question of which one to favor at the expense of the other platform. This is the situation I would like to avoid at all costs: with that kind of interface, the compiler will have to insert instructions under the hood to satisfy the requirements put forth by our JS-SIMD specification. One might think that a few extra instructions is not bad, but in that example, running three instructions instead of just one is a +200% slowdown impact. But it gets even worse, since Dan picked the NEON max instruction there, how would we implement the semantics of _mm_max_ps on top of that api? This would require us to doubly-emulate the semantics, when Emscripten emulates SSE max on top of NEON max, which emulates back to SSE max inside the browser. The slowdown can easily be 10x or more for a single instruction.

Native developers enjoy the following advantages when writing SIMD:

The developer can choose which SIMD set to target by choosing the intrinsic functions to use.
The intrinsics are strictly documented to specify which hardware instruction they will run. (the few that don't, like _mm_set_ps, are helper ones for predictable instruction patterns)
The developer can (and will!) verify what he got by investigating the disassembly of the generated code.

In the native world, the only reason that developers accepted intrinsics and everyone doesn't still write SIMD by hand in assembly is the combination of 2 and 3. These provide a way for the developers to understand where their performance is going. I would argue that in order to deliver to par with native, we need 2 and 3 as well. Currently, web does not have any kind of history with 3, which would be especially important to have if the compiler has to do compatibility emulation like #67 under the hood. Otherwise we might be providing developers with an unpredictable black box one can't reason about, and a "trust us, we picked the fastest sequence for you" argument is outright patronizing. Also, I'm a bit worried about 1. Dan's fastest options for #67 are when AVX or SSE 4.1 is available, and require a fallback on older SSE sets - but the max function was a SSE1 operation to start with!

I agree that it is critical that we have consistently computed results across platforms. The more I think about this, I think we should abort any attempt to merge the SSEx and NEON instruction sets together into a new overarching JS-SIMD API. Instead we should offer all the intrinsics as-is without trying to come up with a merged API, especially if that would mean compromises like #67. To solve #67, we would have SIMD.SSE1._mm_max_ps and SIMD.NEON.vmaxq_f32, which would compute the maximum using the full semantics of either. This way we would not do favors or disfavors to either x86 or ARM by choosing an official instruction, and we would give the guarantee of a direct mapping where available. It would also solve the double-emulation problem from above. The results would be consistent: you can run SIMD.SSE1._mm_max_ps on ARM devices as well, and it will use the best possible SSE-over-NEON sequence we know is possible (either as a polyfill, or as implemented by the browser), using the exact semantics delivered by SSE1. As a bit of extra, we would provide an API which allows querying which SIMD instructions sets the current hardware directly has available, so that user code can choose which path to take.

I see that would be the perfect performance + cross-platform compatibility solution. What kind of arguments are there against this kind of approach?

sunfishcode commented 9 years ago

On the topic of _mm_max_ps in particular:

We won't be mapping _mm_max_ps onto the JS-SIMD max function. _mm_max_ps has defined behavior on NaN, and no matter what we do in #67, it won't make sense to do max + extra stuff when we can just do select(greaterThan(x, y), x, y) (or something very close to that). JITs can even pattern-match that down to a single maxps instruction on x86 if they wish, and even if they don't it's still only about 4 instructions or so (and fewer with SSE4.1+). On ARM this will just be a compare and select, 2 instructions if I'm not missing something, which isn't terrible. There won't be any 10x slowdowns or double emulation for min or max.

On the topic of intersection versus platform-specific API approaches:

As I said above, I'm open to discussing platform-specific APIs. It's bold, and it's good for us implementers to hear from this perspective, and it's a great conversation to have.

However, even if we do do that, there is still significant utility in a common intersection API, which should include all the stuff in your strict intersection list above (thanks for compiling that list, btw!), and also several things that are "pretty close", which I would say includes shuffles, swizzles, min/max, and perhaps some other things. In any kind of real code that can stay within this intersection, I expect average overhead will usually be lower than 200%, because the most common things are all still single-instruction.

I'm aware that there is a class of developers who define success in terms of the percentage of some theoretical peak of the hardware they have pre-selected for the software to run on, and they may feel that they cannot possibly be successful with this API. However there are also developers who would be happy to write code that simply runs several times faster than scalar code on any decent SIMD-capable CPU, present or foreseeable future, and they will find they can do a lot with this API. I'm even hopeful that we can do a good enough job in the intersection to appeal to a fair number of people in the middle of that spectrum as well.

I also don't want to live in a "trust us we picked the best instruction" world. I think part of the answer here is that we should ideally improve our tools for allowing developers to inspect the assembly code generated by the JIT. Part of the answer may be that we have platform-specific SIMD APIs along side the portable API. Part of the answer may be that there will hopefully be some JS-SIMD benchmarks that we can compare across implementations.

sunfishcode commented 9 years ago

As a follow-up, I just added implementations of _mm_max_ps and _mm_min_ps to Emscripten's xmmintrin.h using compare+select as described above. This makes the NaN and -0.0 handling exactly match that of x86, which is what the API wants, and it avoids the double-emulation problem.

juj commented 9 years ago

For reference, here is the commit mentioned to above: https://github.com/kripken/emscripten/commit/8c8c7fd3ac716f20c21a8edee9e2010d672d76d5 . The select(greaterThan(x, y), x, y) set of instructions would directly map to

movaps mask, x
cmpps mask, y, GT
andps x, mask
andnotps mask, y
orps x, y

which is five instructions, and requires one extra temporary register compared to maxps x, y. I don't think that would be good performance in any scenario. The proposed solution that "JITs can even pattern-match that down to a single maxps instruction on x86 if they wish" feels like the wrong direction for the spec, because:

that is the opposite of the "explicit performance guarantees" notion.
Firefox might do the pattern-matching but other browsers might not, leading to a source of performance differences across browsers.
pattern-matching might complicate JS VM development and slow down runtime JIT compilation times.
this might be source material for "arcane magic"-like performance tips guides for JS-SIMD, where people would document paragraphs like If you don't care about NaN handling, for best performance, avoid calling SIMD.max() when running on x86, since it emulates an ARM NEON max instruction. Instead, prefer the sequence SIMD.select(SIMD.greaterThan(x, y), x, y) to match SSE semantics and have a JIT generate the actual SSE max instruction.
A developer will need a JS-SIMD disassembler (point 3 from the earlier post) to be able to confirm if he got the intended sequence or not.

I hope that we would have as few needs to pattern-match in the JIT as possible to deliver performance (except for any usual register allocation that takes place in the compiler). These feel like fixing up after the fact that the interface was not expressive enough. Strictly for Emscripten purposes I think it might work, since we control both sides of the fence and can make sure that they evolve hand-to-hand, but for the general web, I think that would be a disservice.

Would it be possible to assemble a spreadsheet, where all JS-SIMD API instructions are listed in one column, then in another the assembly sequence that they compile down to on x86, and in a third column, the assembly sequence that they compile down to on ARM? I think that would be very important to see, even if it wouldn't end up being an official part of the spec.

BrendanEich commented 9 years ago

I contend that portable but 2x or slower is a non-starter. SIMD-based C/C++ code can't tolerate it when cross-compiled if the resulting JS is to compete with native and provided the slowdown dominates total runtime, or merely is bad enough that users notice and object or seek native code.

Sorry if I'm missing something -- @sunfishcode, please help me see why architecture-independent API with 2x or greater slowdown is worth doing, in a competitive case analysis. I can see that it's better than no SIMD, but I then argue it's not competitive with native.

Telling devs who can't take the slowdown to try the architecture-specific APIs is risky: you probably lure some devs into wasting human coding cycles, finding perf loss unacceptable, and then rewriting. In general "make it right, then make it fast" -- but we are not in a general code regime, we're dealing with (a) SIMD intrinsics in Emscripten source, and (b) winning over low-level hackers who use C/C++ to use JS as well and with the same guarantees.

Of course, a slow portable API could be good enough if the slowdown hits only a small part of the total schedule. But then how important was SIMD to such a program in the native case?

/be

P.S. JIT pattern-matching in competitive regimes works: engines level up to tie or win at benchmarketing and/or "design wins" in sales ($0 but still) settings. But JIT pattern-matching is a sideshow if our goal is to compete with native, where hackers hand-select SIMD instructions to get best performance.

sunfishcode commented 9 years ago

I expect we'll beat 2x in many cases with the portable API. Even though they have dominated the discussion here, min and max are a sideshow compared to add and mul.

That said, I expect I'm not going to be your main challenge to convince about doing a "union" API. I am talking with people I know to learn what people think about the idea, and I encourage everyone interested in this to do the same.

ghost commented 9 years ago

My two cents: unless we choose a truly union API (exposing every SIMD op up to the bleeding edge), there will always exist cases where the best JS-SIMD can do is 4x slower than the best native can do. I think the important thing isn't so much % coverage of the instruction set but % coverage of real world use cases.

Also, I think it makes sense to avoid hidden performance cliffs by not supporting automatic translation for 100% of mmintrin.h but, rather, having an mmintrin.h-derived "emscriptintrin.h" that contained only the ops in JS-SIMD. I think "write new SIMD code for new platforms" is part of the usual porting story for applications and so requiring a rewrite for emscriptintrin.h doesn't seem unreasonable. Also, this will help make it clear what JS-SIMD supports and help collect feedback for future iterations.

I do assume, though, that pretty quickly we'll want ops that are only fast on one arch. In that case, I think we should expose this fact through feature testing. Rather than separating by instruction set, I was thinking perhaps we could use a scheme:

SIMD.{float32x4, ...}.* : ops that are optimal on both SSE/NEON
SIMD.arch.{float32x4, ...}.* : ops that are optimal only on the current device
SIMD.simulated.{float32x4, ...}.* : the union of all SIMD.arch.* ops; not necessarily optimized

Thus, 1 is the intersection, 2 describes the current device and 3 is the union. Only ops in 2 need feature testing and a good portable implementation would start by feature testing SIMD.arch before falling back on an implementation using a mix of 1 and 3. The point of 3 is that, with the full instruction set at its disposal, the JIT should be able to do a better job at simulating ops than JS could in terms of JS-SIMD (still achieving a speedup over plain scalar code).

Applying this to the current situation with min/max, we could consider:

SIMD.float32x4.minNoNaN - undefined what happens with NaN (fast on SSE/NEON)
SIMD.arch.float32x4.{min, minAsymmetric} - the former available on NEON, the latter on SSE
SIMD.simulated.{min, minAsymmetric} - call whichever you want, it's slower on the other arch This gives the programmer maximum control over saying what they want.

But maybe this is overkill, though? Definitely it's overkill if min/max are the only use cases; I think we need more iteration with what we have now, in the intersection API, to know what the situation really is.

There is also the issue that I would expect an intersection API to be much easier to initially get into the standard and implemented (as long as it was shown to have enough ops to be generally useful). Once we have this foot in the door, it seems like we'd be able to iterate quickly on SIMD.arch ops. With feature-testing, browsers would starting getting the fast paths as soon as they implemented the new ops.

ghost commented 9 years ago

One other thought I had, regarding the "WebGL was a success by exactly modeling the underlying API" line of reasoning:

WebGL emulated OpenGL, which itself had already done the work of providing a device/manufacturer-independent graphics hardware abstraction. If WebGL had followed the pattern we're discussing here with SIMD.SSE.x/SIMD.NEON.y, we'd have two similar-but-different interfaces for OpenGL and DirectX and Microsoft would have never optimized for GL.

Similarly, if we can provide the developer a way to test which operations are efficient (1:1 with machine insns) (analogous to WebGL's extension-testing I think?), then it stands to reason that both Intel and ARM might, in the future, evolve to support the others' optimized ops. With feature-testing, we'll just enable these ops after doing cpuid festing and existing code will just run faster. Over time, JS-SIMD could end up being an OpenGL-like force that promotes SIMD feature convergence.

sunfishcode commented 9 years ago

@juj, @andhow, and @kripken and I discussed this earlier today. The conclusion was that if we're going to embark on a bold new strategy here, we'll need some compelling arguments to motivate it, and the best argument for this kind of thing is data. So, when Emscripten+OdinMonkey are ready to rock some intersection-style SIMD together (and this is coming soon!), we'll compile some code and do some hopefully realistic benchmarking and, in general, collect some real data. What works, what doesn't work, what's fast, what's slow, what's easy to fix, and what's a lost cause. Then we'll be able to make a more informed decision, and if we need to do something bold, we'll be able to explain our choices to others with data to back them up.

huningxin commented 9 years ago

So, when Emscripten+OdinMonkey are ready to rock some intersection-style SIMD together (and this is coming soon!), we'll compile some code and do some hopefully realistic benchmarking

So excited about that!

chadaustin commented 9 years ago

This discussion is excellent. Thank you all. @sunfishcode and I have also argued on this topic in #asm.js. Allow me to make my case here too.

The only reason to use SIMD is performance.

The challenge for a great deal of SIMD algorithms is arranging data: gathering, mixing, and splatting into the appropriate lanes, doing a tiny bit of SIMD work, and scattering the register lanes back out into memory.

Sometimes, after taking a scalar algorithm and applying SIMD, you might merely see a 2x performance increase. Perhaps even less. Rarely will you see the maximum of a 4x increase.

Thus, it's likely that any additional instructions emitted for the sake of consistency-under-NaN across ISAs will ENTIRELY offset the gain of going SIMD in the first place.

My recommendation is to leave SIMD semantics under NaN unspecified or implementation-specified for maximum performance.

sunfishcode commented 9 years ago

Hi @chadaustin. I see that you're passionate about this issue, which is great, because we would benefit from some help :-).

One thing that would help would be testcases, preferably code we can run, but pseudo-code or just a description of an algorithm can also be useful. The stronger connection to a real-world use case the better.

The NaN consistency issue is disproportionately represented in min/max, so I'm likely to face someone claiming that the concern is overstated because a real-world testcase would do things other than just min/max. How should I respond to that? A testcase demonstrating a real use case where the NaN consistency issue causes significant slowdown would be a powerful motivator. Thanks!

juj commented 9 years ago

@andhow: I think the important thing isn't so much % coverage of the instruction set but % coverage of real world use cases.

I find that statement a bit objectionable. In the recent meeting, @sunfishcode asked me to come up with real world use cases to motivate why a union api is needed, or why direct SSE support should be added, and while I'll do my best to provide such data, I think it would be presumptuous or outright arrogant if that data will be later on used as an example to separate "this is the important part of SSE" and "this is the part of SSE we don't need to care so much about" categories. If we were designing a new specification, I would agree, but I think here we are following the native world to bring over a feature from the native world that proved successful there. Also, since the real world use cases are based on the native world specifications, we are guaranteed that if we can match the native instruction sets, we will also win over the real world use cases. The number of instructions in the set is very small compared to the number of applications that have been written on top of the instruction sets.

@andhow: Also, I think it makes sense to avoid hidden performance cliffs by not supporting automatic translation for 100% of mmintrin.h but, rather, having an mmintrin.h-derived "emscriptintrin.h" that contained only the ops in JS-SIMD. I think "write new SIMD code for new platforms" is part of the usual porting story for applications and so requiring a rewrite for emscriptintrin.h doesn't seem unreasonable.

In Emscripten community, I am one of the big proponents of adding Emscripten-specific APIs, and I do a lot of the work involved in designing and implementing those (to which @kripken likes to object :), but SIMD is not one of those areas where that makes sense. Asking users to rewrite their SIMD code would make sense if we were dealing with a new platform that actually had new SIMD hardware in place, but we don't. If we have an application that is written to talk SSE, and it is being run on a processor that talks SSE, it will be a very hard sell to tell a developer that on the web these can't connect directly, but he must rewrite the code (assuming it is even possible if the JS-SIMD instruction set is too limited), and the end result he will get won't be as good as what direct SSE-to-SSE in native is.

The title of this issue is specifically "How will native code port on top of JS-SIMD?", and by this, you are proposing that native code should not. I don't see that reasonable. If we do tell developers that their SSE code (or NEON code for that matter) will not apply to the web, we have conceded that JS-SIMD does not support the native code porting use case.

In the webkit mailing list, there was an argument that JS-SIMD should not even exist because SIMD is not performance-portable. The bit about performance-portability is absolutely true. But I see that as a fact which native SIMD developers routinely deal with and there are no problems associated with it. For native developers, it is not a problem that different hardware has different performance characteristics, since the developer has direct access to each hardware and has the tools in his toolbox to design for this:

Native developer recognizes that the problem he is solving is representable in the SSE and NEON intersection (for the set of problem input values he cares about), so he simply aliases the operations under a common interface (SSE: https://github.com/juj/MathGeoLib/blob/master/src/Math/simd.h#L57 , NEON: https://github.com/juj/MathGeoLib/blob/master/src/Math/simd.h#L231), and then writes one algorithm using that common interface, that works on both: https://github.com/juj/MathGeoLib/blob/master/src/Geometry/AABB.cpp#L475 . He will get the 100% native performance on both SIMD instruction sets, since the instructions map 1:1 to the underlying hardware instructions.

Native developer recognizes that the problem requires a different approach for both SIMD instruction sets. He conditions the code to take a branch in a cold part of the code, to jump in the hot path:

function SolveProblem(input)
{
 if (SupportsSSE())
   return RunHotAlgorithmWithSSE(input);
 else if (SupportsNEON())
   return RunHotAlgorithmWithNEON(input);
 else
   return RunHotAlgorithmWithScalar(input);
}

Native developer recognizes that there are too many paths where different approaches are needed for SSE and NEON and that such runtime if-else branches are not feasible to maintain without performance loss. He recompiles the code for both platforms:
```
function SolveProblem(input)
{
#ifdef SUPPORTS_SSE
   return RunHotAlgorithmWithSSE(input);
#elif defined(SUPPORTS_NEON)
   return RunHotAlgorithmWithNEON(input);
#else
   return RunHotAlgorithmWithScalar(input);
#endif
}
```

JS-SIMD is currently trying to specify a set intersection of instructions, along with an emulation layer to make the set intersection absolutely consistent across ARM and x86 hardware. That will lend to partially enabling the first category, but without a direct 100% instruction-to-instruction performance mapping guarantee. We have a native world full of code that already solves the performance portability challenge via second or third categories, but with the current JS-SIMD, we would not be able to reuse those solutions. If the JS-SIMD spec gave direct access to the instruction sets, web developers would be able to reuse the same design tools that native developers have, and the performance-portability problem would be equally manageable problem for the web as it is for native developers today, or perhaps even easier, since 4. with the help of polyfills, web developers would have the extra ability to write SSE code that actually runs on NEON, and vice versa, in which case the browser could emulate the best closest thing. In that case, developers would be happy instead of angry, since they understand that emulation is reasonable: if I wrote an application that talks only SSE, and I'm running it on a NEON chip, of course I can expect performance loss.

With a direct intrinsics level api, the web developer can choose from 1. - 4. to decide which will give the best performance.

@andhow: Similarly, if we can provide the developer a way to test which operations are efficient (1:1 with machine insns) (analogous to WebGL's extension-testing I think?), then it stands to reason that both Intel and ARM might, in the future, evolve to support the others' optimized ops.

This is an argument that I am not capable of predicting, but I think it's fair to agree that we should design the spec for the real world today if we want it to have a practical impact in the real world now. I would rather we solved this problem in the spec ourselves, instead of waiting if perhaps that the hardware industry would change around us to remove this problem.

@chadaustin: My recommendation is to leave SIMD semantics under NaN unspecified or implementation-specified for maximum performance.

If we modelled direct SIMD intrinsics access, we would not have unspecified or implementation-specified behavior. I think that would be better for the web, as both SIMD.SSE1.max and SIMD.NEON.max would be strongly specified with semantics of their own and no uncertainties, but still have maximum performance. That way the user would explicitly know if he is getting different results on x86 and ARM, it must have been due to one or more of if (SupportsSSE()) vs if (SupportsNEON()) paths he wrote himself, which would give a stronger clue to tracking the origin compared to USB in the spec (which he might not have read). Conversely, if the developer did not write a single if (SupportsX()) statements in his code, he would be guaranteed to get identical results across ARM and x86, and performance that is dependent on which variant of SIMD functions his app was written in and what his current execution platform is.

@sunfishcode: One thing that would help would be testcases, preferably code we can run, but pseudo-code or just a description of an algorithm can also be useful. The stronger connection to a real-world use case the better.

I wrote an automated benchmark of the current SSE1 api implementation over the weekend. It is available here: https://github.com/juj/emscripten/commits/sse1 . The way to run it is to check out the code, and then run python tests/benchmark_sse1.py, and the test will automatically run, and generate a results_sse1.html page in the current directory. Here is the results of running the benchmark on my system: http://clb.demon.fi/dump/results_sse1_20140929.html . For anyone looking through that link, please don't yet take home any conclusions from the current numbers, since it is not yet an asm.js-validated run.

What the test does is it stresses each individual function of the SSE1 api and times them. It is synthetic, I know, but I think it is a superset of all the real-world SSE codebases, and therefore will have a stronger connection to real-world use cases than any single real-world codebase itself has. I do think that if it is to be rejected as an invalid test case, the reason should be something else than just the label "synthetic" it comes with. I wrote it specifically as a tool to give data for @sunfishcode and @huningxin to use to work on https://github.com/kripken/emscripten/issues/2793, so I'm hoping you'll be willing to approach it with an open investigative mind. If the test is bad, please point the bad parts out, and how we could fix the test up. I believe that excelling in synthetic tests like this one will be a prerequisite for excelling in real world codebases. It is of course not a replacement for real world codebases, but if I had the capacity to only optimize one test case, it would be this synthetic test. Let me know if I can help you run the benchmark on your own systems. Note also that the https://github.com/juj/emscripten/commits/sse1 branch comes with more implemented SSE1 functions than what the current xmmintrin.h file in upstream is. We should try to merge that in soon.

I'll work on investigating C/C++ codebases that we could build for actual real world benchmarks. Video and Audio codecs and FFT come to mind at first, so I'll probably go for some of that field.

chadaustin commented 9 years ago

Thanks Dan! :)

The major SIMD algorithm from IMVU is already represented in the skinning benchmark at https://github.com/chadaustin/Web-Benchmarks/tree/master/skinning and @huningxin is already looking at that.

I just uploaded another minor one here: https://gist.github.com/chadaustin/0ad326c7e06cda799cf7

There's another one I can't paste publicly but it's basically a blinn-phong lighting calculation with color vectors being accumulated (ambient, diffuse, specular terms), and then saturated to [0,1] with minps and maxps.

Here's a simple triangle-to-depth-buffer rasterizer linked from Fabian's excellent series on optimizing the Intel Software Occlusion Culling demo: https://github.com/rygorous/intel_occlusion_cull/blob/97eae9a8/SoftwareOcclusionCulling/DepthBufferRasterizerSSEMT.cpp#L219

http://fgiesen.wordpress.com/2013/02/10/optimizing-the-basic-rasterizer/

That's all I've got handy at the moment...

I think the problem with saying that minps and maxps are rare is that, while true, any kind of saturating arithmetic inside an inner loop is going to use one of them. Any clamped arithmetic would use both. All just so the spec can precisely-define NaN semantics, which I think it is a bad idea in the first place. :) Then again, I think JavaScript would benefit from a healthy dose of undefined behavior in general. ;)

sunfishcode commented 9 years ago

Thanks @chadaustin, I haven't had time to look at everything in detail, but it looks really helpful!

sunfishcode commented 9 years ago

@juj: The purpose of my request for benchmarks and testcases was to allow me to evaluate how good our current design is. If we get data and it exposes minor things that could easily be fixed, we're going to just fix those things. If it exposes a manageable number of bigger things which could be added to the current design, possibly in the manner that @andhow has outlined above, we're likely to just do that. Your proposal above would have much higher costs for us, so while I'm open to it, the data we get here will need to show that we have major problems likely to hit us in important real-world scenarios that we can't fix in simpler ways before I can adopt it.

sunfishcode commented 9 years ago

@juj: I should also mention that the SSE1 tests you have here look like really great tools, and I'm definitely looking forward to using them. Being synthetic benchmarks, they'll give us lots of data, and with context and interpretation, such data can be very powerful.

juj commented 9 years ago

I've now worked on the quest to produce real world benchmarks to the extent that I think is useful at this point.

First off, here are some places that were looked but rejected:

Ogg Theora: This was my first go-to for a SIMD example, but unfortunately I got to realize that theora is using handwritten MMX assembly, which makes it unsuitable for a benchmark.
Ogg Vorbis: Surprisingly it does not contain SIMD, authors on IRC comment that they want to keep it as a clean "reference implementation", and advised to look for SIMD optimized versions elsewhere.
Opus Codec (http://www.opus-codec.org): has some SIMD, but authors stated they have their SSE2 optimizations currently under way and not yet complete. This is probably a good benchmark once it is done and worth a revisit.

I've assembled the following two codebases for building with JS-SIMD:

Bullet physics (ammo.js):

Worked to patch Emscripten and Bullet in ammo.js repository so that it builds with SSE2 enabled.
The Emscripten repository is here: https://github.com/juj/emscripten/tree/sse2
the ammo.js repository here: https://github.com/juj/ammo.js/tree/sse .
Will likely be a nice benchmark, however there is still some work we need in Emscripten side to be able to get it running (in particular issues https://github.com/kripken/emscripten/issues/3009, https://github.com/kripken/emscripten/issues/2848 and https://github.com/kripken/emscripten/issues/2855 )
Build instructions for your own STR at https://github.com/kripken/emscripten/issues/3009

My own MathGeoLib math library: http://clb.demon.fi/MathGeoLib/nightly/

Has unit test and benchmarks suite.
Online benchmarks suite graphs results like http://clb.demon.fi/dump/MathGeoLib_testresults/index.html?revision=f5646ed848d79f66ea65c22c070967c8e7f1357a
This library has strict build modes for <=SSE1, <=SSE2, <=SSE3 etc. support. Currently trying SSE1 build mode only with Emscripten.
Builds with Emscripten but does not run due to issue at https://github.com/kripken/emscripten/issues/3010 .

This is where my effort got blocked. We are not yet in a state where we could start building actual benchmark projects, since our support for SSE1 and SSE2 is not yet complete enough. To be able to build benchmarks, we need to resolve the following:

merge the SSE1 suite at https://github.com/kripken/emscripten/pull/2792
resolve https://github.com/kripken/emscripten/issues/2855 "SIMD code does not validate as asm.js", because the polyfill impacts correctness as NaN canonicalization is breaking loading and storing bitmasks in SSE registers, see https://github.com/kripken/emscripten/issues/2840 .
resolve the other blocking SIMD limitations at https://github.com/kripken/emscripten/issues?q=is%3Aopen+is%3Aissue+label%3ASIMD+
add SSE2 support and unit test suite similar to https://github.com/kripken/emscripten/pull/2792 does for SSE1.

I also did an audit of the Unreal Engine 4 and Unity 3D codebases for their SSE uses, but the conclusion is that it does not make sense to try to leap to those quite yet because we cannot build smaller examples at the moment. This will be retried again later once our support progresses some more.

I still hold that the synthetic benchmark suite that I wrote at http://clb.demon.fi/dump/results_sse1_20140929.html is the best benchmark we can look at from the spec and porting perspective at the moment, because it is explicitly visualizing what the relative performance between native vs JavaScript and scalar vs SIMD is for each SSE1 instruction. This is the only honest way we have for measuring the performance right now, because it covers the full API.

Going forward, what I would like the working group to decide for the JS-SIMD from the SSE1 intrinsics perspective are the following:

which SSE1 instructions will have direct 1:1 instruction level support in JS-SIMD? (addps, ... ?) What is the language syntax to access those instructions?
which SSE1 instructions will be inaccessible directly and behind an emulated but reasonably fast path (whatever that means - perhaps "some speedup", or "faster, or at least as fast as scalar")? (maxps, ... ?) What is the expected emulation cost for these? (in terms of instructions, clock cycles, or something else that is at least somehow quantizable)
which SSE1 instructions will be put in an expected slow path? (the operation is available in some JS-SIMD expressible way, but there is no performance expected)
which SSE1 instructions will not be accessible in JS-SIMD at all? What is the rationale for leaving those out from the spec? (useless/superceded? not important? too hard/too much to add at this point?) What should a developer who is migrating existing SSE1 code over to JS-SIMD do with such constructs?

SSE1 of course not being in any way special, except that it's only the first SSE instruction set, SSE2, SSE3 and later are sure to follow as we build on the support in Emscripten, but just for sake of keeping a scope for the discussion since the instruction sets are large.

The reason I am asking these kind of questions is not to poke into any extra undesirable exercises, but simply because I know I will be the contact person for dozens of developers who will port their native codebases over to JS-SIMD with Emscripten, and I can anticipate that these types of questions are exactly what they will posing. We will need to provide the necessary support material for such Emscripten developers or these ports will just not happen, so to me it makes sense to ask these already before the spec is finished so that we can say that the spec was designed to have the proper answers for each. Is this something that the working group could do? I think this is similar in scope to the WebGL Specification section 6 "Differences Between WebGL and OpenGL ES 2.0" at https://www.khronos.org/registry/webgl/specs/latest/1.0/#6 .

Also I'd like to see that the designed status of the instructions in the fast vs slow paths is reflected in the synthetic SSE1 benchmark, which it isn't yet. I would like to understand that if we are going for the "set intersection" API for JS-SIMD v1.0, how does it look like in the optimized version of the synthetic SSE1 benchmark. Currently the version I am able to run is ranging between 50x vs 1000x slower than native for some instructions because we don't have a fast asm.js validating path yet available. How fast can we get these in the current state of the spec?

huningxin commented 9 years ago

Side note, I tried the bullet3 SSE path (native) on Linux before. However I didn't get good speedup there. See https://github.com/bulletphysics/bullet3/issues/66 for details.

huningxin commented 9 years ago

@juj, I updated https://docs.google.com/a/intel.com/spreadsheets/d/1QAGGf2M2IA6l4cvh8eTXdXGEUcPjdmTe_BLKGn5YCB4/edit according to emscripten https://github.com/kripken/emscripten-fastcomp/commit/0bae9ad47629268bdd2fb79f1190746c7b17142e. Please take a look. Thanks!

johnmccutchan commented 9 years ago

@juj Thanks for doing this.

huningxin commented 9 years ago

I got some numbers of the @chadaustin 's skinning benchmark https://github.com/chadaustin/Web-Benchmarks/tree/master/skinning compiled with emscripten 1.28.3 of incoming branch https://github.com/kripken/emscripten/tree/9ab0ccaa00d8c58ae316c29b03c03813e7d4f398 running on chromium-m40 SIMD.js prototype (https://drive.google.com/folderview?id=0B9RVWZYRtYFeMTJiMzE5VjlkTWc&usp=sharing):

The tests were conducted on my Intel i7-4770K Linux machine. First, the native numbers are:

nhu@nhu-z87:~/devel/simd.js/Web-Benchmarks/skinning/build$ ./clang-O3-scalar 
Skinned vertices per second: 130528000, blah=0.000000
nhu@nhu-z87:~/devel/simd.js/Web-Benchmarks/skinning/build$ ./clang-O3-simd 
Skinned vertices per second: 219052000, blah=0.000000

The native SIMD speedup to scalar is 1.68X.

The results of JavaScript version are: Scalar:

nhu@nhu-z87:~/devel/simd.js/Web-Benchmarks/skinning/build$ /home/nhu/devel/simd.js/v8-m40/out/ia32.release/d8 --simd-object emscripten-O3-scalar.js
Skinned vertices per second: 44946000, blah=0.000000

SIMD version compiled by Emscripten 1.28.3:

js/v8-m40/out/ia32.release/d8 --simd-object emscripten-O3-simd.js
Skinned vertices per second: 60812000, blah=0.000000

The SIMD speedup to scalar is 1.35X.

It is not as good as native speedup.

According to https://docs.google.com/a/intel.com/spreadsheets/d/1QAGGf2M2IA6l4cvh8eTXdXGEUcPjdmTe_BLKGn5YCB4/edit#gid=0, the SSE1 partial load and store (including _mm_load_ss, _mm_loadh_pi, '_mm_loadl_pi,_mm_store_ss,_mm_storeh_pi,_mm_storel_pi) and_mm_add_ss` are emulated. However, they are used in hot loop of skinning benchmark.

My first try was to optimize partial load and store with SIMD.float32x4.loadX, SIMD.float32x4.loadXY and SIMD.float32x4.storeX and SIMD.float32x4.storeXY. The experimental implementation is in PR https://github.com/kripken/emscripten/pull/3120 and https://github.com/kripken/emscripten-fastcomp/pull/58. The number is:

nhu@nhu-z87:~/devel/simd.js/Web-Benchmarks/skinning/build$ /home/nhu/devel/simd.js/v8-m40/out/ia32.release/d8 --simd-object emscripten-O3-simd.js
Skinned vertices per second: 83608000, blah=0.000000

The SIMD speedup to scalar is 1.86X.

For _mm_add_ss, as SIMD.js doesn't support SIMD.float32x4.addX, emscripten emulates it with a SIMD.float32x4.add and a SIMD.float32x4.shuffle. I found it can be replaced by _mm_add_ps without side effect in skinning benchmark as PR https://github.com/chadaustin/Web-Benchmarks/pull/4. The result of using _mm_add_ps is:

nhu@nhu-z87:~/devel/simd.js/Web-Benchmarks/skinning/build$ /home/nhu/devel/simd.js/v8-m40/out/ia32.release/d8 --simd-object emscripten-O3-simd.js
Skinned vertices per second: 90454000, blah=0.000000

The SIMD speedup to scalar is 2.01X.

Even with above partial load/store PR for emscripten, the _mm_storeh_pi is still emulated by a SIMD.float32x4.shuffle and a SIMD.float32x4.storeXY. Ideally, it could be mapped to movhps by either optimization of JS engine or adding SIMD.float32x4.storeZW into SIMD.js API. I tried to prototype later one and got number as:

nhu@nhu-z87:~/devel/simd.js/Web-Benchmarks/skinning/build$ /home/nhu/devel/simd.js/v8-m40/out/ia32.release/d8 --simd-object emscripten-O3-simd.js
Skinned vertices per second: 94560000, blah=0.000000

The SIMD speedup to scalar is 2.10X.

metabench commented 9 years ago

I think this is a reason why direct equivalents of intrinsics (or intrinsics themselves) would help.

juj commented 9 years ago

There's now been some progress in Emscripten with adding more SIMD support. Previously the SSE1 instruction set was already supported, and now there's a pull request that also adds SSE2 support ( https://github.com/kripken/emscripten-fastcomp/pull/103, https://github.com/kripken/emscripten/pull/3542 ). The implementation has been done against the v0.6 version of the SIMD.js spec at http://littledan.github.io/simd.html . Both SSE1 and SSE2 apis are supported fully (although not necessarily accelerated!), except for rounding modes, floating point exceptions, denormal handling and issues with certain patterns of NaN<->float interaction.

Quantitatively measuring, the analysis is that current SIMD.js support allows native code to get the following:

38/95 (40.0%) of the intrinsics of SSE1 are expected to compile down to the direct native counterpart instruction.
The same amount, 38/95 (40.0%) intrinsics of SSE1 need to be emulated as a sequence of other intrinsics in order to simulate the native hardware instruction that SIMD.js is not able to express.
The remaining 19/95 (20.0%) instructions cannot be emulated as other SIMD.js intrinsics, but are implemented as a scalar fallback. One of these instructions, _mm_prefetch(), is not available and therefore a no-op.

For the SSE2 intrinsics set, the quantitative analysis is as follows:

101/207 (48.8%) of the intrinsics are expected to be in the fast path, and offer 1:1 the direct native counterpart.
48/207 (23.2%) of the intrinsics are sequences of other SIMD.js intrinsics in order to emulate the native SSE2 instruction.
The remaining 58/207 (28.0%) intrinsics are not supported by SIMD.js, and a scalar fallback has been implemented instead. Of these, two intrinsics, _mm_clflush() and _mm_pause() are ignored as no-op operations.

The numbers are my estimates from tallying up the current implementations of the code in Emscripten's xmmintrin.h and emmintrin.h - they may be possible to improve. In total, SIMD.js supports 46.0% of native SSE1+SSE2 instructions.

Unfortunately the internet does not much care about details, and/or the initial messaging was not handled in the best possible way when SIMD.js went public, and this turned into a binary "SIMD is now supported" message (e.g. http://www.reddit.com/r/programming/comments/1tv5ap/javascript_gains_support_for_simd/ ). For us, this has led to having to do some amount of expectation management with developers that we communicate with, since they are not aware that JavaScript does not really have direct access to the hardware SIMD instructions. In Emscripten we have added a new API of intrinsics that maps directly to the SIMD.js counterparts, but there has so far been very little interest from developers to hear about rewriting their SIMD code on top of a new meta-API that does not reflect to the hardware. The expectation seems to be that the SIMD port should be doable by switching (an existing) build flag, and if not, they will not care. For this reason, supporting the exact SSEx instruction sets with bit-to-bit correct behavior is the most important initial goal for SIMD in Emscripten.

Qualitatively, it is yet difficult to say if the coverage rates are large enough to provide a successful API to Emscripten developers, or for how many this will be a performance blocker. It is clear that running off the SIMD path to a scalar fallback will seriously hit performance, but the question is whether overall performance will still be a gain despite the presence of emulated instructions. Currently no browser yet supports enough of the v0.6 version of the SIMD.js to run SSE2, and they fall back to the polyfill. For example, Firefox does not yet have the int8x16, int16x8 and float64x2 types, and hence SSE2 code does not validate as asm.js, so I do not yet have benchmark numbers of how good SSE2 performance is on an "average" codebase (whatever that might mean).

We have already seen most companies come up with their easy (e.g. from perspective to being even autovectorized) SSE1 Mandelbrot tests, and in those, there is an easily achievable performance gain. That does not yet validate the SIMD.js API as a mature spec, but rather it will be the variety of projects that SIMD.js can cater to that will. More complex codecs, image processing libraries and similar are not sufficient with SSE1 only, so now that Emscripten and browser support matures, we are looking to benchmark more complex projects for better experience of how useful SIMD.js already is.

I think that the current SIMD.js spec will be useful for the most linear algebra game math libraries, and game physics engines, that map quite straightforwardly to the float32x4/float64x2 add/sub/mul/div/shuffle ops and don't need much else (and certainly never see nans or infs). Most likely it will be the codebases where developers have really gotten creative in their uses of SSE, abusing the nonsymmetry of the various operations, NaN behavior and propagation, and masking, that are likely to cause headache. NaN canonicalization (in the SIMD<->float boundary) seems to be the number one blocker, since it breaks correctness across the board in different code patterns, though it might be possible to come up with a set of rules and conventions on what kind of code people should be looking out for that will not work in JS. Also the lack of controllable rounding modes might be an issue to some projects.

Nevertheless, I feel quite happy that the SSE1 and SSE2 support code in Emscripten is even in such a good shape that it already is. At first I thought that it would not be possible to support nearly as much of SSE2 that it is, so great job with the spec there! I'm looking forward to seeing more discussion about the SIMD.universe and perhaps in the future with WebAssembly, that might improve the coverage rates for native SSE instructions.

littledan commented 9 years ago

@juj Great to hear the progress of Emscripten on SIMD!

I'm sympathetic to your concerns about getting all developers' code running as fast as possible, but developers need to understand that they are cross-compiling, not compiling directly to native code. As articulated previously in this thread, the initial goal should be portable performance, with platform-specific performance coming in a follow-on SIMD.universe API. I'd argue when thinking about SIMD quantitatively, the relevant measure is the kind of benchmarks you've been doing on existing programs, rather than counting the number of instructions that are supported. Do you know which of the missing parts come up the most often? That can help us prioritize adding them in a way that provides predictable performance.

NaN canonicalization is a platform-specific concern and does not live in the SIMD.js spec. V8 does not canonicalize NaN, and I got language removed from the ES6 spec to prevent the addition of some kind of NaN conversion which otherwise might have infected SIMD.js. The ES6 and SIMD specs are written to leave NaN representation pretty open, allowing implementations to canonicalize or not, as they prefer. This isn't a new issue for SIMD--TypedArrays already let ES6 users observe NaN canonicalization by looking at the binary representation of NaN. At this point, it would be hard to change Javascript to disallow canonicalizing NaN, but SpiderMonkey can always change to the V8 model if it wants to give Emscripten semantics more similar to C.

It's unfortunate to be reminded that the polyfill won't work in asm.js. I hope value types could eventually help with this, though they are still some time off and will probably come after the first version of SIMD.

juj commented 9 years ago

NaN canonicalization is not a new thing, and it's observed in Emscripten for a long time. However, only now with SIMD it is becoming a problem, since if is common in C to have the following code

uint32_t mask = 0xFFFFFFFF;
float f = *(float*)&mask;
...
// later
__m128 m = _mm_set1_ps(f);

and different variations of that. In this simple example, it is possible to refactor to a form

uint32_t mask = 0xFFFFFFFF;
__m128 m = _mm_load1_ps((float*)&mask);

in order to avoid the data ever roundtripping via a float register that would nuke the bits. However there are two problems to this approach:

often times the code pattern is different, and it is very difficult to analyse all call flows where canonizalised NaNs might get introduced
LLVM is eager at doing several kinds of optimizations which break even the second form of code (and slight variants of that), and the optimized version contains a pattern that does hit a path where a float register was used, and canonicalization did happen. It is yet uncertain which all optimizations we should disable to enforce that the compiler will never emit such unsafe code transformations.

johnmccutchan commented 9 years ago

As @littledan pointed out, V8 doesn't canonicalize NaNs. It is not part of the SIMD.js specification and language was removed from ES-6 that would have required it (in general for typed arrays). I suggest you take up NaN canonicalization up with the teams writing JITs that employ it.

littledan commented 9 years ago

I don't think there's much chance that JavaScript will evolve in a way that bans NaN canonicalization. It's a common, core strategy from implementations. There's also a strategy of using NaN payloads to hold pointers (JSC did this at some point, maybe still), which requires that the full spectrum of NaNs is not available to users directly. However, things stored in TypedArrays are not canonicalized. TypedArrays support copyWithin, which copies bytes without canonicalizing when called appropriately, so if you're careful to keep things stored in a TypedArray and refer to pointers, then you might be able to avoid some of the bad effects of canonicalization; obviously, this could hurt performance, though. Maybe this is a case where WebAssembly could help, as it doesn't have any of the baggage of supporting pointers and NaN == semantics. Or, specific JIT optimizations for referring to things in TypedArrays. Or maybe value types could help a user define a float type which doesn't canonicalize in a more efficient way. But I don't see how the SIMD.js spec can do anything about this issue.