Open wojciak opened 9 years ago
Probably possible...
I guess figuring out a good data layout is a good start. The current position and angle properties in Body could be replaced with a float32x4 instance, filling up 3 of 4 float lanes. Same for velocities. This would make operations like integration easy to parallelize. Constraint solving could probably benefit from a similar layout.
This would require a major change of the API and a lot of rewriting. Keeping the API as it is would probably not be possible. Or maybe it is? I need a brain massage before I can answer.
As I thought, this isn't so simple, and indeed I think an API change would be necessary. It's a nice-to-have for the long future ahead ;]
I made a prototype similar to the p2.js API using SIMD. What I found is that the SIMD version does not run faster than a scalar version. I'm a bit puzzled.
If you run this demo in Firefox Nightly, you'll get the SIMD version. If you run it in some other browser without SIMD, you'll get the scalar version: http://jsfiddle.net/ym720t83/5/
If I set N=500, I get about 6fps in Firefox Nightly (SIMD), but around 20fps in Chrome (non-SIMD).
Okay, stab 2... A lot more simplified example: http://jsfiddle.net/chdt0hs6/3/
If you get it running in FF Nightly, you'll see something like:
Running test...
NoSIMD: 118ms
SIMD: 393ms
:(
Do you get better results on your machine?
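For anyone who wants to reproduce timings like the ones above, here is a minimal sketch of a timing harness. The names timeIt and integrateScalar are made up for illustration and are not taken from the fiddles; the workload is a plain scalar integration step, so it runs in any engine regardless of SIMD support.

```javascript
// Hypothetical helper: run `fn` `iterations` times and return elapsed milliseconds.
function timeIt(fn, iterations) {
  // Warm-up calls give the JIT a chance to optimize before the measured loop.
  for (let i = 0; i < 1000; i++) fn();
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return Date.now() - start;
}

// Example workload (an assumption, not the code from the fiddle):
// scalar integration of one 2D body.
const position = new Float32Array([0, 0]);
const velocity = new Float32Array([1, 2]);
function integrateScalar() {
  position[0] += velocity[0] * 0.01;
  position[1] += velocity[1] * 0.01;
}

const elapsed = timeIt(integrateScalar, 1e6);
console.log('NoSIMD: ' + elapsed + 'ms');
```

Numbers from a micro-benchmark like this vary a lot between runs and engines, so it is only useful for rough comparisons of two implementations on the same machine.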
Stab 3! I managed to get my Body.integrate benchmark 4.5x faster than the scalar one :)
The best option I've found so far is to use Float32Arrays of length 4 (to make SIMD happy), then use SIMD.Float32x4.load() and SIMD.Float32x4.store() before and after making the computations inside the body methods. Both the linear and angular things (position, velocity, force) go into the same vector to improve performance.
Prototype SIMD Body class:
var SIMDBody = function(){
this.invMass = new Float32Array([1,2,3,0]);
this.position = new Float32Array([1,2,3,0]);
this.velocity = new Float32Array([1,2,3,0]);
this.force = new Float32Array([1,2,3,0]);
}
SIMDBody.prototype.integrate = function(dtVec){
var f = SIMD.Float32x4.load(this.force, 0);
var v = SIMD.Float32x4.load(this.velocity, 0);
var x = SIMD.Float32x4.load(this.position, 0);
var iM = SIMD.Float32x4.load(this.invMass, 0);
var fhMinv = SIMD.Float32x4.mul(f, iM);
var fhMinv2 = SIMD.Float32x4.mul(fhMinv, dtVec);
var v2 = SIMD.Float32x4.add(fhMinv2, v);
var v_dt = SIMD.Float32x4.mul(v2, dtVec);
var x2 = SIMD.Float32x4.add(x, v_dt);
SIMD.Float32x4.store(this.velocity, 0, v2);
SIMD.Float32x4.store(this.position, 0, x2);
};
Scalar version that I use for comparison:
var Body = function (){
this.invMass = 1;
this.position = new Float32Array([0.1,0.2,0.3]);
this.velocity = new Float32Array([0.1,0.2,0.3]);
this.force = new Float32Array([0.1,0.2,0.3]);
}
Body.prototype.integrate = function(dt){
this.velocity[0] += this.force[0] * dt * this.invMass;
this.velocity[1] += this.force[1] * dt * this.invMass;
this.velocity[2] += this.force[2] * dt * this.invMass;
this.position[0] += this.velocity[0] * dt;
this.position[1] += this.velocity[1] * dt;
this.position[2] += this.velocity[2] * dt;
};
If I run the SIMD code using the polyfill, it is slower than the scalar version. It uses 2x as much computation time, probably because of garbage collection and other stuff.
Initially I stored the SIMD.Float32x4 objects on the Body instance itself, hoping to improve performance. But that made everything slower. I wonder why...
I also tried to store the positions of all bodies in a consecutive typed array (same for the velocities and forces, etc etc) but that was not very much faster. I call this "structure of arrays". The prototype SIMD code above (array of structures) was 1.4x faster.
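As a rough illustration of the "structure of arrays" layout described above, here is a sketch in plain JavaScript without any SIMD calls. The names createWorld and integrateAll and the 4-floats-per-body stride are assumptions made for this example, not the code that was benchmarked.

```javascript
// Assumed layout: 4 floats per body (x, y, angle, padding), all bodies back to back.
const LANES = 4;

// Hypothetical factory: one big typed array per quantity instead of one per body.
function createWorld(numBodies) {
  return {
    numBodies: numBodies,
    position: new Float32Array(numBodies * LANES),
    velocity: new Float32Array(numBodies * LANES),
    force: new Float32Array(numBodies * LANES),
    invMass: new Float32Array(numBodies * LANES)
  };
}

// Integrate every body in one pass over the contiguous arrays.
function integrateAll(world, dt) {
  const position = world.position;
  const velocity = world.velocity;
  const force = world.force;
  const invMass = world.invMass;
  for (let i = 0; i < position.length; i++) {
    velocity[i] += force[i] * invMass[i] * dt;
    position[i] += velocity[i] * dt;
  }
}

// Usage: two bodies, a force of 2 along x on the second one.
const world = createWorld(2);
world.invMass.fill(1);
world.force[1 * LANES + 0] = 2;
integrateAll(world, 0.5);
console.log(world.position[1 * LANES + 0]);
```

With this layout a body is just an index into the shared arrays, which is part of why it would change the public API.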
Ported that stuff to jsperf, and added some solver tests. Results:
integrate is a simplified scalar version of Body.prototype.integrate.
integrate-simd is the same method but using SIMD.
solve-ish is a simplified version of a few combined core methods in GSSolver.js, which handle the contact and constraint solving in p2.
simd solve-ish is the SIMD version.
These tests show that the data layout is good for operating both on single bodies, and on body pairs. One could of course structure the data so that 4 bodies can be operated on at the same time, but my gut feeling is that it's going to be much more work to get that going.
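For comparison, the "4 bodies at a time" layout mentioned above could look something like the sketch below. This is only an assumption about how such a layout might be organized (createBatch and integrateBatch are invented names): each array holds one component for four bodies, so each inner loop corresponds to what a single four-lane operation would do.

```javascript
// Hypothetical batch: each array holds ONE component for FOUR bodies (one body per lane).
function createBatch() {
  return {
    x: new Float32Array(4),
    y: new Float32Array(4),
    vx: new Float32Array(4),
    vy: new Float32Array(4),
    fx: new Float32Array(4),
    fy: new Float32Array(4),
    invMass: new Float32Array(4)
  };
}

// Plain-loop emulation: every statement in the loop body is applied to all four lanes,
// which is the shape a four-lane SIMD implementation would have.
function integrateBatch(b, dt) {
  for (let lane = 0; lane < 4; lane++) {
    b.vx[lane] += b.fx[lane] * b.invMass[lane] * dt;
    b.vy[lane] += b.fy[lane] * b.invMass[lane] * dt;
    b.x[lane] += b.vx[lane] * dt;
    b.y[lane] += b.vy[lane] * dt;
  }
}

// Usage: push only the third body (lane 2) along x.
const batch = createBatch();
batch.invMass.fill(0.5);
batch.fx[2] = 4;
integrateBatch(batch, 0.5);
console.log(batch.x[2]);
```

The extra work hinted at above shows up as soon as code needs two specific bodies (a contact pair, for instance), because their components then live in different lanes of different batches.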
Good results :+1:
Now just need to write everything using SIMD.
First stab at a p2 SIMD shim. SIMD Body integration! https://github.com/schteppe/p2.js/commit/84dbcf7c1e8419420fc508402e4fea43626c040f
:+1: Really cool research and results @schteppe, keep up the good work!
Thanks @englercj! I just hope SIMD.js will get some traction soon
Today I had another (simpler) approach at SIMD + p2 in Firefox Nightly. Unfortunately it didn't give very much.
At the end of the vector math file (src/math/vec2.js), I added SIMD shims (see code below), making all vec2s/Float32Arrays have length 4. It also shims some of the most used math methods in p2, so they use SIMD if available. In theory this can make the code run twice as fast; the methods do indeed run at 2x speed on their own:
JSPerf for vec2.add
JSPerf for vec2.rotate
I tried running the circles demo with 450 circles using this approach, but I see no performance gain. Rather the opposite; I lost a few FPS...
Looking at the profiles, I see that the island splitting and the TupleDictionary are eating a lot of performance, those should really be optimized. And the demo rendering is a bit expensive too. Turning these things off and increasing the number of circles to 1500, I still get no performance gain with SIMD :(
if(typeof SIMD !== 'undefined'){
vec2.create = function(){
return new Float32Array(4);
};
vec2.clone = function(a) {
var out = vec2.create();
out[0] = a[0];
out[1] = a[1];
return out;
};
vec2.fromValues = function(x, y) {
var out = vec2.create();
out[0] = x;
out[1] = y;
return out;
};
vec2.add = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.load(b, 0);
var sout = SIMD.Float32x4.add(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
vec2.subtract = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.load(b, 0);
var sout = SIMD.Float32x4.sub(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
vec2.sub = vec2.subtract;
vec2.multiply = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.load(b, 0);
var sout = SIMD.Float32x4.mul(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
vec2.mul = vec2.multiply;
vec2.divide = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.load(b, 0);
var sout = SIMD.Float32x4.div(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
vec2.div = vec2.divide;
vec2.scale = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.splat(b);
var sout = SIMD.Float32x4.mul(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
vec2.rotate = function(out,a,angle){
var sa = SIMD.Float32x4.load(a, 0);
var c = Math.cos(angle),
s = Math.sin(angle);
var cs = SIMD.Float32x4(c,s,0,0);
var sc = SIMD.Float32x4(-s,c,0,0);
var xx = SIMD.Float32x4.swizzle(sa,0,0,2,3);
var yy = SIMD.Float32x4.swizzle(sa,1,1,2,3);
var sout = SIMD.Float32x4.add(
SIMD.Float32x4.mul(cs,xx),
SIMD.Float32x4.mul(sc,yy)
);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
}
What's the reason?
Is it because the function calls are slow, or because it requires a lot more memory consumption (and thus more gc that slows down the browser?)
Edit:
vec2.add = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.load(b, 0);
var sout = SIMD.Float32x4.add(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
It might also be helpful to do this instead; 3 variables per add call will add to GC and might reduce FPS:
vec2.add = function(out, a, b) {
SIMD.Float32x4.store(out, 0,
SIMD.Float32x4.add(
SIMD.Float32x4.load(a, 0),
SIMD.Float32x4.load(b, 0)
)
);
return out;
};
Calling the .store function to put the data back inside the TypedArrays probably ruins the whole point of having SIMD objects. The objects are immutable, and forcing JS to push values out of an immutable object on every add is also a waste, in my opinion.
I can't come up with a better solution, however.
Memory consumption and GC might actually be the case. When I compared, I used Float32Arrays of length 2 vs length 4. Should have thought of that. Will test again with length 4 for both cases (simd vs no simd), when there's time. Thanks!
@schteppe I'm talking about adding 3 extra variables per method call, I think the example you're talking about isn't a problem.
Check the GC rate with my modified function calls and see if it makes a small difference.
//3 variable references, 3 objects created, 4 function calls
vec2.add = function(out, a, b) {
SIMD.Float32x4.store(out, 0,
SIMD.Float32x4.add(
SIMD.Float32x4.load(a, 0),
SIMD.Float32x4.load(b, 0)
)
);
return out;
};
//vs
//6 variable references, 3 objects created, 4 function calls
vec2.add = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0);
var sb = SIMD.Float32x4.load(b, 0);
var sout = SIMD.Float32x4.add(sa,sb);
SIMD.Float32x4.store(out, 0, sout);
return out;
};
GC slows down when it has to clean up 2x as many references as necessary.
The memory itself is cheap, it's the cleanup that might be an issue.
Edit: I'm glad I took up this issue, because now I have a better understanding of SIMD. I'm going to do some research to see if I can help with the memory issues. -Josh
Things that I understand now:
SIMD.Float32x4(1,2,3,4) === SIMD.Float32x4(1,2,3,4) returns false because, even though the data is immutable, it creates 2 different objects in the process. I can't compare vectors with object references. This seriously hampers the reasons to use immutable data.
[...] p2 to do.
Float32Array reigns supreme in data formats for repeated use.
[...] vec2.add(byref, operand) function does, native vs js would be no comparison.
When I compared, I used Float32Arrays of length 2 vs length 4.
This will not matter in the slightest, you are worried about the 3 objects created as seen in the example below:
vec2.add = function(out, a, b) {
var sa = SIMD.Float32x4.load(a, 0); //object creation
var sb = SIMD.Float32x4.load(b, 0); //object creation
var sout = SIMD.Float32x4.add(sa,sb); //object creation
SIMD.Float32x4.store(out, 0, sout);
return out;
};
3 x n, where n = number of additions needed to be performed per integration.
I have come up with exactly one unreasonable recommendation to remedy the memory usage.
Converting the property data types on the Shapes/Bodies to SIMD data types would be nearly 4x as efficient per operation but would come at the cost of a huge refactor.
Perhaps a simple loop over the bodies in question at the post-integration step to load the data from the SIMD objects would be the final step. I cannot even fathom a rewrite and refactor of this nature and highly suggest avoiding SIMD data for this reason.
E.G.
body._SIMDPosition = SIMD.Float32x4.add(body._SIMDPosition, movementSIMDVector);
//post integration
SIMD.Float32x4.store(body.position, 0, body._SIMDPosition);
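To make the "sync after integration" idea a bit more concrete, here is a sketch that uses a plain Float32Array as a stand-in for the internal SIMD value, so it runs without SIMD support. The _state property and the createBody, move and syncPositions functions are hypothetical names for this example, not p2 API.

```javascript
// Hypothetical body: `position` is the public 2-component array,
// `_state` is an internal 4-lane working copy (stand-in for a SIMD value).
function createBody(x, y) {
  return {
    position: new Float32Array([x, y]),
    _state: new Float32Array([x, y, 0, 0])
  };
}

// During the step, only the internal state is touched.
function move(body, dx, dy) {
  body._state[0] += dx;
  body._state[1] += dy;
}

// Post-integration pass: copy the internal state back to the public arrays once per step.
function syncPositions(bodies) {
  for (const body of bodies) {
    body.position[0] = body._state[0];
    body.position[1] = body._state[1];
  }
}

// Usage
const body = createBody(1, 2);
move(body, 0.5, -1);
const xBeforeSync = body.position[0]; // public value is stale until the sync pass
syncPositions([body]);
console.log(xBeforeSync, body.position[0], body.position[1]);
```

The downside is visible in the usage: anything that reads body.position in the middle of a step sees stale data, which is one reason the refactor would be so invasive.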
I hope I'm wrong, SIMD is very cool and I want to use it very badly.
Wow, thanks for your hard work!
Do you really think that the SIMD objects add to GC? Aren't they optimized away to nothing? I have no idea how these things work really, you're probably right.
This is frustrating.
Do you really think that the SIMD objects add to GC?
Absolutely. I did some performance testing on 100 matrix math operations per frame and saw memory usage shoot up 200,000,000 bytes. This is terrible for animation loops.
They aren't optimized away because they are immutable, and a new object is created every time a function is called.
Just like when we allocate a new Float32Array, there is an up-front cost. Now we are allocating 4x arrays per add, subtract, etc... However, if we don't use them every frame and only use them up front, the cost is minimal.
I also think the reason why your JSPerf was so fast was because it wasn't taking GC into account. CPU usage for SIMD objects is fast, at the cost of memory usage.
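The allocation argument can be illustrated without SIMD at all. The sketch below counts typed-array allocations for an add that returns a fresh vector versus one that writes into a caller-supplied out vector. The counter and the function names are made up for this example; whether a real engine heap-allocates SIMD values or optimizes them away depends on its JIT, so treat this as an analogy rather than a measurement.

```javascript
let allocations = 0;

// Hypothetical allocator wrapper so allocations can be counted.
function createVec() {
  allocations++;
  return new Float32Array(4);
}

// Allocating style: every call produces a new result object (like an immutable value type).
function addAllocating(a, b) {
  const out = createVec();
  for (let i = 0; i < 4; i++) out[i] = a[i] + b[i];
  return out;
}

// Out-parameter style: the caller owns and reuses the result storage.
function addInto(out, a, b) {
  for (let i = 0; i < 4; i++) out[i] = a[i] + b[i];
  return out;
}

const a = createVec();
const b = createVec();
a[0] = 1;
b[0] = 2;

allocations = 0;
for (let i = 0; i < 1000; i++) addAllocating(a, b);
const allocatingCount = allocations;

allocations = 0;
const scratch = createVec();
for (let i = 0; i < 1000; i++) addInto(scratch, a, b);
const reusingCount = allocations;

console.log(allocatingCount, reusingCount);
```

A micro-benchmark that only times the arithmetic would not see the difference between the two styles; it shows up later as garbage collector work.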
I was wondering if we can speed up physics calculations by migrating to SIMD (https://01.org/node/1495) where supported for some operations.
It's implemented in FF: https://blog.mozilla.org/javascript/2015/03/10/state-of-simd-js-performance-in-firefox/
Chromium: https://groups.google.com/a/chromium.org/forum/m/#!topic/blink-dev/2PIOEJG_aYY
Chakra: http://channel9.msdn.com/Events/Build/2015/2-763
And the elephant in the room - Crosswalk: https://crosswalk-project.org/documentation/samples/simd.html It would give JS-wrapped games a great boost on Android and Tizen platforms.