Is it possible to use SIMD in p2?

wojciak commented 9 years ago

I was wondering if we can speed up physics calculations by migrating to SIMD (https://01.org/node/1495) where supported for some operations.

It's implemented in FF: https://blog.mozilla.org/javascript/2015/03/10/state-of-simd-js-performance-in-firefox/

Chromium: https://groups.google.com/a/chromium.org/forum/m/#!topic/blink-dev/2PIOEJG_aYY

Chakra: http://channel9.msdn.com/Events/Build/2015/2-763

And the elephant in the room - Crosswalk: https://crosswalk-project.org/documentation/samples/simd.html It would give JS wrapped games a great boost on android and tizen platforms.

schteppe commented 9 years ago

Probably possible...

I guess figuring out a good data layout is a good start. The current position and angle properties in Body could be replaced with a float32x4 instance, filling up 3 of 4 float lanes. Same for velocities. This would make operations like integration easy to parallelize. Constraint solving could probably benefit from a similar layout.
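A minimal sketch of that layout idea, using a plain Float32Array of length 4 as a stand-in for a float32x4 value. The names `PackedBody` and `integratePosition` are hypothetical and not part of p2, and the loop is scalar; it only illustrates how x, y and angle could share one 4-lane vector.

```javascript
// Hypothetical packed layout: lanes are [x, y, angle, unused].
function PackedBody(x, y, angle) {
    this.position = new Float32Array([x, y, angle, 0]);
    // Lanes are [vx, vy, angularVelocity, unused].
    this.velocity = new Float32Array([0, 0, 0, 0]);
}

// Scalar stand-in for one 4-wide multiply-add: position += velocity * dt.
function integratePosition(body, dt) {
    for (var lane = 0; lane < 4; lane++) {
        body.position[lane] += body.velocity[lane] * dt;
    }
}

var body = new PackedBody(1, 2, 0.5);
body.velocity.set([2, 4, 1, 0]);
integratePosition(body, 0.5);
// body.position is now [2, 4, 1, 0]
```

Linear and angular state advance in the same loop, which is what would map onto a single vector operation.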

This would require a major change of the API and a lot of rewriting. Keeping the API as it is would probably not be possible. Or maybe it is? I need a brain massage before I can answer.

wojciak commented 9 years ago

As I thought, this isn't so simple, and indeed I think an API change would be necessary. It's a nice-to-have for the long future ahead ;]

schteppe commented 9 years ago

I made a prototype similar to the p2.js API using SIMD. What I found is that the SIMD version does not run faster than a scalar version. I'm a bit puzzled.

If you run this demo in Firefox Nightly, you'll get the SIMD version. If you run it in some other browser without SIMD, you'll get the scalar version: http://jsfiddle.net/ym720t83/5/

If I set N=500, I get about 6fps in Firefox Nightly (SIMD), but around 20fps in Chrome (non-SIMD).

schteppe commented 9 years ago

Okay, stab 2... A lot more simplified example: http://jsfiddle.net/chdt0hs6/3/

If you get it running in FF Nightly, you'll see something like:

Running test...
NoSIMD: 118ms
SIMD: 393ms

:(

Do you get better results on your machine?

schteppe commented 9 years ago

Stab 3! I managed to get my Body.integrate benchmark 4.5x faster than the scalar one :)

The best option I found so far is to use Float32Arrays of length 4 (to make SIMD happy), then use SIMD.Float32x4.load() and SIMD.Float32x4.store() before and after making the computations inside the body methods. Both the linear and angular things (position, velocity, force) go into the same vector to improve performance.

Prototype SIMD Body class:

var SIMDBody = function(){
    this.invMass = new Float32Array([1,2,3,0]);
    this.position = new Float32Array([1,2,3,0]);
    this.velocity = new Float32Array([1,2,3,0]);
    this.force = new Float32Array([1,2,3,0]);
}
SIMDBody.prototype.integrate = function(dtVec){
    var f = SIMD.Float32x4.load(this.force, 0);
    var v = SIMD.Float32x4.load(this.velocity, 0);
    var x = SIMD.Float32x4.load(this.position, 0);
    var iM = SIMD.Float32x4.load(this.invMass, 0);

    var fhMinv = SIMD.Float32x4.mul(f, iM);
    var fhMinv2 = SIMD.Float32x4.mul(fhMinv, dtVec);
    var v2 = SIMD.Float32x4.add(fhMinv2, v);
    var v_dt = SIMD.Float32x4.mul(v2, dtVec);
    var x2 = SIMD.Float32x4.add(x, v_dt);

    SIMD.Float32x4.store(this.velocity, 0, v2);
    SIMD.Float32x4.store(this.position, 0, x2);
};

Scalar version that I use for comparison:

var Body = function (){
    this.invMass = 1;
    this.position = new Float32Array([0.1,0.2,0.3]);
    this.velocity = new Float32Array([0.1,0.2,0.3]);
    this.force = new Float32Array([0.1,0.2,0.3]);
}
Body.prototype.integrate = function(dt){
    this.velocity[0] += this.force[0] * dt * this.invMass;
    this.velocity[1] += this.force[1] * dt * this.invMass;
    this.velocity[2] += this.force[2] * dt * this.invMass;
    this.position[0] += this.velocity[0] * dt;
    this.position[1] += this.velocity[1] * dt;
    this.position[2] += this.velocity[2] * dt;
};

If I run the SIMD code using the polyfill, it is slower than the scalar version. It uses 2x as much computation time, probably because of garbage collection and other stuff.
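One way to keep the scalar path where SIMD is not available natively is to pick the implementation once, at load time. This is only a sketch: `integrateScalar` and `integrateSIMD` are hypothetical names, the bodies are assumed to use length-4 arrays for every property, and the SIMD branch uses only the SIMD.Float32x4 calls that appear in this thread.

```javascript
// Scalar integrate over length-4 arrays (lane 3 is padding).
function integrateScalar(body, dt) {
    for (var i = 0; i < 4; i++) {
        body.velocity[i] += body.force[i] * body.invMass[i] * dt;
        body.position[i] += body.velocity[i] * dt;
    }
}

// Same math through SIMD.Float32x4; only called when the global exists.
function integrateSIMD(body, dt) {
    var F = SIMD.Float32x4;
    var dtVec = F.splat(dt);
    var accel = F.mul(F.load(body.force, 0), F.load(body.invMass, 0));
    var v2 = F.add(F.load(body.velocity, 0), F.mul(accel, dtVec));
    var x2 = F.add(F.load(body.position, 0), F.mul(v2, dtVec));
    F.store(body.velocity, 0, v2);
    F.store(body.position, 0, x2);
}

// Choose once instead of testing for SIMD on every call.
var integrate = (typeof SIMD !== 'undefined') ? integrateSIMD : integrateScalar;

var testBody = {
    invMass: new Float32Array([1, 1, 1, 0]),
    force: new Float32Array([1, 2, 3, 0]),
    velocity: new Float32Array(4),
    position: new Float32Array(4)
};
integrate(testBody, 0.5);
```

Note that loading a polyfill also defines the `SIMD` global, so this check alone would not avoid the slow polyfill path.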

Initially I stored the SIMD.Float32x4 objects on the Body instance itself, in the hope of improving performance. But that made everything slower. I wonder why...

I also tried to store the positions of all bodies in a consecutive typed array (same for the velocities and forces, etc etc) but that was not very much faster. I call this "structure of arrays". The prototype SIMD code above (array of structures) was 1.4x faster.
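For comparison, the "structure of arrays" variant could look roughly like this. It is a scalar sketch under assumptions: `BodyPool` is a hypothetical name, and each body is assumed to own 4 consecutive floats in every array so that one 4-wide load would pick up a whole body.

```javascript
// Structure of arrays: one typed array per property, 4 floats per body.
function BodyPool(count) {
    this.count = count;
    this.invMass = new Float32Array(count * 4);
    this.position = new Float32Array(count * 4);
    this.velocity = new Float32Array(count * 4);
    this.force = new Float32Array(count * 4);
}

// Scalar integration over the whole pool in one pass.
BodyPool.prototype.integrate = function (dt) {
    var n = this.count * 4;
    for (var i = 0; i < n; i++) {
        this.velocity[i] += this.force[i] * this.invMass[i] * dt;
        this.position[i] += this.velocity[i] * dt;
    }
};

var pool = new BodyPool(2);
pool.invMass.fill(1);
pool.force.set([1, 2, 3, 0], 4); // force on the second body only
pool.integrate(0.5);
// Second body: velocity [0.5, 1, 1.5, 0], position [0.25, 0.5, 0.75, 0]
```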

schteppe commented 8 years ago

Ported that stuff to jsperf, and added some solver tests. Results:

[Image: p2-simd benchmark results]

integrate is a simplified scalar version of Body.prototype.integrate. integrate-simd is the same method but using SIMD.

solve-ish is a simplified version of a few combined core methods in GSSolver.js, which handle the contact and constraint solving in p2. simd solve-ish is the SIMD version.

These tests show that the data layout is good for operating both on single bodies, and on body pairs. One could of course structure the data so that 4 bodies can be operated on at the same time, but my gut feeling is that it's going to be much more work to get that going.
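The alternative mentioned here, operating on 4 bodies at the same time, would put the same component of four different bodies into one vector. A scalar sketch of that layout, where `quad` and `integrateQuad` are hypothetical names:

```javascript
// Hypothetical layout: lane k of every array belongs to body k.
var quad = {
    x:  new Float32Array([0, 1, 2, 3]),
    y:  new Float32Array([0, 0, 0, 0]),
    vx: new Float32Array([1, 1, 1, 1]),
    vy: new Float32Array([2, 4, 6, 8])
};

// Scalar stand-in for two 4-wide multiply-adds, one per component.
function integrateQuad(q, dt) {
    for (var k = 0; k < 4; k++) {
        q.x[k] += q.vx[k] * dt;
        q.y[k] += q.vy[k] * dt;
    }
}

integrateQuad(quad, 0.5);
// quad.x is now [0.5, 1.5, 2.5, 3.5], quad.y is [1, 2, 3, 4]
```

All four lanes do useful work in this layout, but bodies would have to be gathered into groups of four first, which is part of the extra work.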

Good results :+1:

Now just need to write everything using SIMD.

schteppe commented 8 years ago

First stab at a p2 SIMD shim. SIMD Body integration! https://github.com/schteppe/p2.js/commit/84dbcf7c1e8419420fc508402e4fea43626c040f

englercj commented 8 years ago

:+1: Really cool research and results @schteppe, keep up the good work!

schteppe commented 8 years ago

Thanks @englercj! I just hope SIMD.js will get some traction soon

schteppe commented 8 years ago

Today I tried another (simpler) approach to SIMD + p2 in Firefox Nightly. Unfortunately it didn't give very much.

At the end of the vector math file (src/math/vec2.js), I added SIMD shims (see code below), making all vec2s/Float32Arrays have length 4. It also shims some of the most used math methods in p2, so they use SIMD if available. In theory this can make the code run twice as fast; the methods do indeed run at 2x speed on their own:

JSPerf for vec2.add
JSPerf for vec2.rotate

I tried running the circles demo with 450 circles using this approach, but I see no performance gain. Rather the opposite; I lost a few FPS...

Looking at the profiles, I see that the island splitting and the TupleDictionary are eating a lot of performance; those should really be optimized. And the demo rendering is a bit expensive too. Turning these things off and increasing the number of circles to 1500, I still get no performance gain with SIMD :(

if(typeof SIMD !== 'undefined'){
    vec2.create = function(){
        return new Float32Array(4);
    };
    vec2.clone = function(a) {
        var out = vec2.create();
        out[0] = a[0];
        out[1] = a[1];
        return out;
    };
    vec2.fromValues = function(x, y) {
        var out = vec2.create();
        out[0] = x;
        out[1] = y;
        return out;
    };
    vec2.add = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.load(b, 0);
        var sout = SIMD.Float32x4.add(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };
    vec2.subtract = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.load(b, 0);
        var sout = SIMD.Float32x4.sub(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };
    vec2.sub = vec2.subtract;
    vec2.multiply = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.load(b, 0);
        var sout = SIMD.Float32x4.mul(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };
    vec2.mul = vec2.multiply;
    vec2.divide = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.load(b, 0);
        var sout = SIMD.Float32x4.div(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };
    vec2.div = vec2.divide;
    vec2.scale = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.splat(b);
        var sout = SIMD.Float32x4.mul(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };
    vec2.rotate = function(out,a,angle){
        var sa = SIMD.Float32x4.load(a, 0);

        var c = Math.cos(angle),
            s = Math.sin(angle);

        var cs = SIMD.Float32x4(c,s,0,0);
        var sc = SIMD.Float32x4(-s,c,0,0);

        var xx = SIMD.Float32x4.swizzle(sa,0,0,2,3);
        var yy = SIMD.Float32x4.swizzle(sa,1,1,2,3);

        var sout = SIMD.Float32x4.add(
            SIMD.Float32x4.mul(cs,xx),
            SIMD.Float32x4.mul(sc,yy)
        );

        SIMD.Float32x4.store(out, 0, sout);
    };
}

jtenner commented 8 years ago

What's the reason?

Is it because the function calls are slow, or because it requires a lot more memory consumption (and thus more gc that slows down the browser?)

Edit:

    vec2.add = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.load(b, 0);
        var sout = SIMD.Float32x4.add(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };

It might also be helpful to do this instead, since 3 variables per add call will add to GC and might reduce FPS:

    vec2.add = function(out, a, b) {
        SIMD.Float32x4.store(out, 0, 
           SIMD.Float32x4.add(
              SIMD.Float32x4.load(a, 0), 
              SIMD.Float32x4.load(b, 0)
           )
        );
        return out;
    };

Calling the .store function to put the data back inside the TypedArrays probably ruins the whole point of having SIMD objects. The objects are immutable and forcing JS to push values from an immutable object every add is also a waste in my opinion.

I can't come up with a better solution, however.

schteppe commented 8 years ago

Memory consumption and GC might actually be the case. When I compared, I used Float32Arrays of length 2 vs length 4. Should have thought of that. Will test again with length 4 for both cases (simd vs no simd), when there's time. Thanks!
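A rough sketch of such a re-test, keeping everything constant except the array length. `addScalar` and `timeAdd` are hypothetical helpers, not part of p2, and the timings will vary by machine and engine.

```javascript
// Scalar add that only touches the first two lanes, whatever the length.
function addScalar(out, a, b) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    return out;
}

// Time `iterations` calls of addScalar on arrays of the given length.
function timeAdd(length, iterations) {
    var a = new Float32Array(length);
    var b = new Float32Array(length);
    var out = new Float32Array(length);
    a[0] = 1; a[1] = 2;
    b[0] = 3; b[1] = 4;
    var start = Date.now();
    for (var i = 0; i < iterations; i++) {
        addScalar(out, a, b);
    }
    return { ms: Date.now() - start, out: out };
}

var len2 = timeAdd(2, 1e6);
var len4 = timeAdd(4, 1e6);
console.log('length 2: ' + len2.ms + 'ms, length 4: ' + len4.ms + 'ms');
```

This loop reuses its arrays, so it isolates the per-call cost; measuring allocation pressure would need a variant that creates arrays inside the loop.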

jtenner commented 8 years ago

@schteppe I'm talking about adding 3 extra variables per method call, I think the example you're talking about isn't a problem.

Check the GC rate with my modified function calls and see if it makes a small difference.

jtenner commented 8 years ago

   //3 variable references, 3 objects created, 4 function calls
   vec2.add = function(out, a, b) {
        SIMD.Float32x4.store(out, 0, 
           SIMD.Float32x4.add(
              SIMD.Float32x4.load(a, 0), 
              SIMD.Float32x4.load(b, 0)
           )
        );
        return out;
    };
  //vs
    //6 variable references, 3 objects created, 4 function calls
    vec2.add = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0);
        var sb = SIMD.Float32x4.load(b, 0);
        var sout = SIMD.Float32x4.add(sa,sb);
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };

GC slows down when it has to clean up 2x as many references as necessary.

The memory itself is cheap, it's the cleanup that might be an issue.

Edit: I'm glad I took up this issue, because now I have a better understanding of SIMD. I'm going to do some research to see if I can help with the memory issues. -Josh

jtenner commented 8 years ago

Things that I understand now:

- SIMD.Float32x4(1,2,3,4) === SIMD.Float32x4(1,2,3,4) returns false because even though the data is immutable, it creates 2 different objects in the process. I can't compare vectors with object references. This seriously hampers the reasons to use immutable data.
- CPU utilization mid-function goes down, and memory consumption goes WAY up because of object creation. There is no way to pre-allocate memory or object references for math purposes, because every SIMD function returns a new object and you have to perform the math at runtime.
- Reducing the variable references the incremental GC has to clean up (like I previously stated in my suggestions) might be the best solution for optimizing the SIMD functions, however, it's definitely not going to compare to re-using objects like @schteppe designed p2 to do. Float32Array reigns supreme in data formats for repeated use.
- If there were native function calls to modify object references like the normal vec2.add(byref, operand) function does, native vs js would be no comparison.
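The first point can be illustrated with typed arrays, which also compare by reference rather than by contents. This is an analogy only, and `lanesEqual` is a hypothetical helper, not a SIMD or p2 function.

```javascript
var a = new Float32Array([1, 2, 3, 4]);
var b = new Float32Array([1, 2, 3, 4]);

// Two distinct objects, so reference equality fails despite equal lanes.
var sameReference = (a === b); // false

// Compare lane by lane instead.
function lanesEqual(u, v) {
    if (u.length !== v.length) { return false; }
    for (var i = 0; i < u.length; i++) {
        if (u[i] !== v[i]) { return false; }
    }
    return true;
}

var sameLanes = lanesEqual(a, b); // true
```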

> When I compared, I used Float32Arrays of length 2 vs length 4.

This will not matter in the slightest; what you should be worried about is the 3 objects created, as seen in the example below:

    vec2.add = function(out, a, b) {
        var sa = SIMD.Float32x4.load(a, 0); //object creation
        var sb = SIMD.Float32x4.load(b, 0); //object creation
        var sout = SIMD.Float32x4.add(sa,sb); //object creation
        SIMD.Float32x4.store(out, 0, sout);
        return out;
    };

3 x n where n = number of additions needed to be performed per integration.
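The 3 x n count can be checked with a small model. `Vec4` below is a hypothetical immutable 4-lane value that counts every instance it creates; it models the assumption that each load and add returns a fresh object, and says nothing about what a real engine allocates after optimization.

```javascript
// Hypothetical immutable 4-lane value; `created` counts every instance.
var created = 0;
function Vec4(x, y, z, w) {
    this.lanes = [x, y, z, w];
    created++;
}
Vec4.load = function (arr, i) {
    return new Vec4(arr[i], arr[i + 1], arr[i + 2], arr[i + 3]);
};
Vec4.add = function (u, v) {
    return new Vec4(u.lanes[0] + v.lanes[0], u.lanes[1] + v.lanes[1],
                    u.lanes[2] + v.lanes[2], u.lanes[3] + v.lanes[3]);
};
Vec4.store = function (arr, i, u) {
    for (var k = 0; k < 4; k++) { arr[i + k] = u.lanes[k]; }
};

// Same shape as the vec2.add shim: two loads, one add, one store.
function add(out, a, b) {
    Vec4.store(out, 0, Vec4.add(Vec4.load(a, 0), Vec4.load(b, 0)));
    return out;
}

var n = 1000;
var p = new Float32Array([1, 2, 0, 0]);
var q = new Float32Array([3, 4, 0, 0]);
var r = new Float32Array(4);
for (var i = 0; i < n; i++) { add(r, p, q); }
// created is now 3 * n, i.e. 3000
```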

jtenner commented 8 years ago

I have come up with exactly one unreasonable recommendation to remedy the memory usage.

Converting the property data types on the Shapes/Bodies to SIMD data types would be nearly 4x as efficient per operation but would come at the cost of a huge refactor.

Perhaps a simple loop over the bodies in question at the post-integration step to load the data from the SIMD objects would be the final step. I cannot even fathom a rewrite and refactor of this nature and highly suggest avoiding SIMD data for this reason.

E.G.

body._SIMDPosition = SIMD.Float32x4.add(body._SIMDPosition, movementSIMDVector);
//post integration: copy the lanes back out (assumes body.position has length 4)
SIMD.Float32x4.store(body.position, 0, body._SIMDPosition);

I hope I'm wrong, SIMD is very cool and I want to use it very badly.

schteppe commented 8 years ago

Wow, thanks for your hard work!

Do you really think that the SIMD objects add to GC? Aren't they optimized away to nothing? I have no idea how these things work really, you're probably right.

This is frustrating.

jtenner commented 8 years ago

> Do you really think that the SIMD objects add to GC?

Absolutely. I did some performance testing on 100 matrix math operations per frame and saw memory usage shoot up by 200,000,000 bytes. This is terrible for animation loops.

They aren't optimized away because they are immutable, and a new object is created every time a function is called.

Just like when we allocate a new Float32Array, there is an up-front cost. Now we are allocating 4x arrays per add, subtract, etc... However, if we don't use them every frame and only use them up front, the cost is minimal.
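Paying the allocation cost once and reusing the result every frame could look like this minimal sketch. `tmp` and `addScaled` are hypothetical names, not taken from p2.

```javascript
// Scratch vector allocated once, up front.
var tmp = new Float32Array(2);

// out = a + b * s, written without allocating anything per call.
function addScaled(out, a, b, s) {
    tmp[0] = b[0] * s;
    tmp[1] = b[1] * s;
    out[0] = a[0] + tmp[0];
    out[1] = a[1] + tmp[1];
    return out;
}

var position = new Float32Array([1, 2]);
var velocity = new Float32Array([4, 8]);
for (var frame = 0; frame < 3; frame++) {
    addScaled(position, position, velocity, 0.25);
}
// position is now [4, 8]
```

Nothing inside the frame loop creates an object, so the loop itself gives the garbage collector nothing new to clean up.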

jtenner commented 8 years ago

I also think the reason why your JSPerf was so fast was because it wasn't taking GC into account. CPU usage for SIMD objects is fast, at the cost of memory usage.

schteppe / p2.js

Is it possible to use SIMD in p2? #158