toji / gl-matrix

Javascript Matrix and Vector library for High Performance WebGL apps
glmatrix.net
MIT License

Performance is now extremely slow on modern browsers #359

Open krisnye opened 5 years ago

krisnye commented 5 years ago

This library used to be fast, but modern browsers are now allocating simple objects so fast that they are effectively free. Array allocation is still much slower and ArrayBuffer allocation is extremely slow.

A new math library is needed to take advantage of this. It should use simple classes; all functions can return new instances instead of using awkward out parameters; and conversion to ArrayBuffers should happen only as a final step when assigning to WebGL. (Technically those ArrayBuffers can be reused, as WebGL doesn't retain references to them after you assign.)
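The proposed style could look something like this minimal sketch. The `Vec3` class and its methods are hypothetical illustrations of the idea, not an existing API:

```javascript
// Sketch of the proposal: a plain class whose methods return new
// instances, with a typed-array conversion only at upload time.
// Vec3, add, and toFloat32Array are illustrative names, not gl-matrix API.
class Vec3 {
  constructor(x, y, z) {
    this.x = x; this.y = y; this.z = z;
  }
  add(v) {
    // Returns a fresh instance instead of writing into an out parameter.
    return new Vec3(this.x + v.x, this.y + v.y, this.z + v.z);
  }
  // Convert only when handing data to WebGL; the out parameter lets the
  // caller reuse one Float32Array across frames.
  toFloat32Array(out = new Float32Array(3)) {
    out[0] = this.x; out[1] = this.y; out[2] = this.z;
    return out;
  }
}

const sum = new Vec3(1, 2, 3).add(new Vec3(4, 5, 6));
```

The typed array would then be passed to a call like `gl.uniform3fv`, and could be reused for every upload since WebGL copies the data on assignment.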

Check the performance of different approaches here:

https://jsperf.com/webgl-math-library-comparisson/20

stefnotch commented 5 years ago

https://github.com/toji/gl-matrix/issues/358#issuecomment-477527304

I'll take a closer look at the JSPerf later

stefnotch commented 5 years ago

I added a new test case where I'm caching glMatrix's Float32Arrays, which greatly improves performance; however, vanilla JS objects are still faster: https://jsperf.com/webgl-math-library-comparisson/21

However, when using those rather limited test cases, glMatrix seems to outperform vanilla JS objects: https://jsperf.com/glmatrix-getclosesttonline/ Feel free to expand them with more libraries. I'd love it if you could prove me wrong.

A WebAssembly port of glMatrix is planned, however WebAssembly still has a few limitations.

krisnye commented 5 years ago

This is weird. Do you have any idea why the results in the first comparison are so different from the second?

krisnye commented 5 years ago

@stefnotch OK, I witnessed the previous performance of your sample with literal objects being slower. I updated the sample to use a small Vector class, and the performance is back to what I witnessed in the other samples. Maybe Chrome/Firefox optimize more aggressively when they can see functions being called with instances that share a consistent prototype. (I'm guessing that the object-literal samples end up using a slow lookup dictionary internally in Chrome, while the newer class instances get efficient hidden classes, and the function is optimized with them in mind.)

Vector classes are doing 235K ops/sec vs. 78K for the optimized gl-matrix version. That's even though the objects are purely functional and perform two new allocations within that function (in the .subtract implementation).

On Firefox the difference is even more significant: 1.2 million ops per second (which makes me think it may be optimizing the whole function away; that's why I added a side effect to the other samples).

https://jsperf.com/glmatrix-getclosesttonline/12

Please check it out and see if you can replicate my findings.

krisnye commented 5 years ago

I added back your older object test so we can compare the difference between literal objects and class instances, and I added Array as well, which is also very fast.

https://jsperf.com/glmatrix-getclosesttonline/15

stefnotch commented 5 years ago

Thank you very much for taking the time to write those tests. I can replicate your findings. Apparently objects are now significantly faster than Float32Arrays.

I guess that, as it stands, glMatrix will have to improve, either by using objects or by using WebAssembly. Since I currently don't have time to rewrite this library with objects, and other libraries that use objects already exist, I suppose I'll wait for WebAssembly to become better.

krisnye commented 5 years ago

@stefnotch I actually wrote a C implementation AND a WebAssembly implementation of the add/cross/dot sample, both with and without linear memory access (to make it more realistic), and my findings indicate that going to WebAssembly will NOT gain you any significant performance benefit (except in Safari, which is an outlier with excellent Wasm but bad everything else).

Here is the document. I can send you my source code for the C and WebAssembly versions, and any of the other Node versions, if you like.

https://docs.google.com/spreadsheets/d/1SQPU4OwV5QeA8_peuTk49PDvT_djVVrq1TGd-OvWtzA/edit?usp=sharing

krisnye commented 5 years ago

A quick improvement you could make, or recommend to others, is to set the array type to normal arrays instead of Float32Arrays, converting to Float32Arrays only before passing data to a WebGL API that expects them. Beyond that, not much can be done without an entirely new API. If you do write one, I recommend treating all objects as immutable (while not actually calling Object.freeze, because that's super slow).

Or don't convert to Float32Array at all, since setUniform seems fine with normal arrays.
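A sketch of that workflow: plain arrays for all intermediate math, and one reused scratch Float32Array for uploads. The `asFloat32` helper is a hypothetical name, and the commented-out WebGL call assumes a `gl`/`location` from the surrounding application:

```javascript
// Do all intermediate math with cheap plain arrays, and reuse a single
// scratch Float32Array for the rare WebGL call that actually needs one.
const scratch = new Float32Array(3);

function asFloat32(v) {
  scratch.set(v); // copies the plain array's values into the typed array
  return scratch;
}

const position = [1, 2, 3]; // plain array, cheap to allocate
// gl.uniform3fv(location, asFloat32(position));
// ...or, as noted above, gl.uniform3fv often accepts the plain array directly.
```

Reusing one scratch array is safe here because WebGL copies the data during the call and does not retain a reference to it.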

karhu commented 5 years ago

I recently did a similar evaluation, but approached it from the memory-consumption side rather than the performance perspective. My findings also point toward using normal arrays instead of Float32Arrays as the backing store.

In short, new Float32Array(3) has a lot of overhead: I measured 200 bytes in Chrome, as opposed to an unrealistic but optimal 12 bytes (or 20 bytes if you account for a pointer to the data). This number is confirmed by a V8 developer here: https://stackoverflow.com/questions/45803829/memory-overhead-of-typed-arrays-vs-strings#45808835

In contrast, I measured ~96 bytes for a normal array.

karhu commented 5 years ago

A small update to the above: a class with x, y & z fields uses ~56 bytes, which also matches other sources.

krisnye commented 5 years ago

@karhu There is a lot of overhead in allocating a typed array. You would not want to create them for small objects, as gl-matrix currently does. You STILL want to use them, though, when you are allocating a very large array of data, as they keep the data much more tightly packed than JavaScript objects do.

So use idiomatic JavaScript classes for your vectors and matrices, but then store the results in a large typed buffer (which you are probably going to use as a WebGL vertex buffer).
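That split might look like the following sketch. `Vec3` and `writeVertex` are illustrative names, and the buffer layout is assumed to be tightly packed xyz triples:

```javascript
// Idiomatic class for the math, one large Float32Array as the packed
// destination (e.g. a future WebGL vertex buffer). Names are illustrative.
class Vec3 {
  constructor(x, y, z) { this.x = x; this.y = y; this.z = z; }
  scale(s) { return new Vec3(this.x * s, this.y * s, this.z * s); }
}

const vertexCount = 2;
const vertices = new Float32Array(vertexCount * 3);

// Pack one vector into slot i of the tightly packed buffer.
function writeVertex(buf, i, v) {
  buf[i * 3 + 0] = v.x;
  buf[i * 3 + 1] = v.y;
  buf[i * 3 + 2] = v.z;
}

writeVertex(vertices, 0, new Vec3(1, 2, 3).scale(2));
writeVertex(vertices, 1, new Vec3(4, 5, 6).scale(1));
```

The class instances stay short-lived; only the packed buffer survives to be handed to `gl.bufferData`.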

karhu commented 5 years ago

@krisnye Yes, of course; that's how we mainly use them in our code base. Unfortunately they tend to lead to less idiomatic and maintainable code, so I started to investigate alternatives ;) I was just a bit surprised that typed arrays had that much overhead and figured it might be useful information for the discussion here.

tobx commented 5 years ago

Do you have any information about garbage collection? Creating many objects led to animation hiccups when the garbage collector stepped in, at least in older browser versions. Of course we could also reuse objects, but I would still like to know whether garbage collection of objects is as free as their allocation in modern browsers.

krisnye commented 5 years ago

I didn't do any testing of garbage collection in older browsers. As the title says, "performance is now extremely slow on modern browsers". I think most of us writing high performance graphics or games are primarily concerned about modern browsers.

tobx commented 5 years ago

Agreed, but I was asking whether "garbage collection of objects is as free as the allocation of those in modern browsers". The fact that allocation is faster does not imply that garbage collection is faster too. I am not asking about typed-array vs. object allocation performance, but about your statement that "all functions can return new instances instead of using awkward out parameters".

The question is: if you do not reuse your objects, arrays, or typed arrays, the garbage collector has to clean them up. So even if allocation is super fast, the garbage collector kicking in every few seconds in an asynchronous game loop could (theoretically) create hiccups; at least it was like that in older browsers. Do you know whether this is no longer a problem in new browsers if you create, e.g., thousands of objects each frame for particle effects?

krisnye commented 5 years ago

The timings I was seeing led me to suspect that they either optimized away the allocations entirely or allocated the objects as if they were structs, right on the stack.

It's a good question though. When I get a chance I will re-run some of the tests over a longer period of time and inspect the memory usage.

maierfelix commented 5 years ago

Just wanna throw in a small test regarding memory allocation: https://jsperf.com/typed-array-allocation-pool

The fastest approach seems to be to pre-allocate a chunk of memory, then generate Float32Array sub-views on it and cache them. One problem is that once the pool runs out of memory, all sub-views etc. have to be refreshed.

I'm not sure that moving to normal arrays is a good idea, as uploading them would duplicate memory usage. AFAIK V8 can take the pointer to the raw memory an ArrayBuffer holds and send it directly to OpenGL.

Edit:

Using a pool allocator would allow using a class based interface, e.g.:

class Vec3 {
  constructor() {
    this._memory = MemoryPool.allocatePortion(Vec3.byteLength);
  }
  get raw() {
    return this._memory;
  }
  get x() {
    return this._memory[0];
  }
  set x(v) {
    this._memory[0] = v;
  }
  ..
}

Vec3.byteLength = 3 * Float32Array.BYTES_PER_ELEMENT;

Vec3.prototype.add = function(b) {
  let memA = this._memory;
  let memB = b._memory;
  memA[0] += memB[0];
  memA[1] += memB[1];
  memA[2] += memB[2];
};

let a = new Vec3();
let b = new Vec3();
a.add(b);

console.log(a.x);
console.log(a.raw);

stefnotch commented 4 years ago

Another way to avoid the overhead of creating an ArrayBuffer every time is to write functions like this: https://github.com/toji/gl-matrix/blob/fbaca33fdf6a35de58b7a090fa70373b27ccde87/src/quat.js#L716

I wonder how this solution fares performance-wise.
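The pattern in the linked source boils down to allocating temporaries once at module scope and reusing them on every call. A hedged sketch of that idea, with an illustrative function (not the actual quat.js code):

```javascript
// Module-scope scratch storage, allocated exactly once at load time.
const tmpDir = new Float32Array(3);

// Move point a toward point b by at most `step`, clamping at b.
// tmpDir holds the intermediate direction vector, so no allocation
// happens per call. Inputs are any indexable xyz triples.
function moveToward(out, a, b, step) {
  tmpDir[0] = b[0] - a[0];
  tmpDir[1] = b[1] - a[1];
  tmpDir[2] = b[2] - a[2];
  const len = Math.hypot(tmpDir[0], tmpDir[1], tmpDir[2]);
  if (len <= step || len === 0) {
    out[0] = b[0]; out[1] = b[1]; out[2] = b[2];
    return out;
  }
  const s = step / len;
  out[0] = a[0] + tmpDir[0] * s;
  out[1] = a[1] + tmpDir[1] * s;
  out[2] = a[2] + tmpDir[2] * s;
  return out;
}
```

The trade-off is that such functions are not reentrant (the shared scratch makes them unsafe to call recursively or from concurrent contexts), which is usually acceptable in a single-threaded render loop.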

kevzettler commented 4 years ago

@stefnotch that is the pattern I generally use when working with gl-matrix.

The out arguments were not intended to be re-instantiated and allocated each time the function is called. Assume we have this code in some high-frequency game loop:

// bad
const vec1 = [0,1,0];
setInterval(() => {
  const scaledVec = vec3.scale([], vec1, 5);
}, 60)

They are intended to be pooled and reused to restrict allocations, which minimizes the garbage-collection hits that would disrupt the frame rate.

const vec1 = [0,1,0];
const scaleOut = vec3.create();
// This will reuse the pooled array 
setInterval(() => {
  vec3.scale(scaleOut, vec1, 5);
}, 60);

This thread is confusing, as it argues that gl-matrix performance is slow while actually pointing at allocation performance. The gl-matrix interface, though awkward, discourages rapid and redundant allocations by using the out parameter. That is the whole purpose of the out parameters: use them as an allocation cache. Allocate your out parameters first and reuse them, and treat any call to gl-matrix with a fresh vanilla array allocation as a red flag.

mreinstein commented 3 years ago

@kevzettler is exactly right. A high-frequency game/simulation loop that allocates a lot of memory absolutely thrashes the garbage collector, whether you're talking about today's V8 or the one from 4 years ago.

Ditto for objects and proxy getters/setters. These syntaxes are not "free"; they carry very real performance costs.

The web is filled with a lot of toy math libraries that don't understand this, and that's why a lot of games built on them suffer horrendous GC pauses.

It's definitely a good question whether plain or typed JavaScript arrays offer better performance, but if we're really concerned about perf, making everything an expensive object is going to hurt, not help.

I don't want to simply trash the concept; I would definitely appreciate another performance pass.

Ideally we would have some kind of vanilla js syntax for static allocations:

function doStuff () {
   static const mv = vec2.create()   // non-existent feature in today's javascript
   // ...  
}

doStuff(); doStuff(); doStuff();  // despite being called thrice, mv would only allocate once

But since this doesn't exist, some kind of built-in, optional pooling module would be nice (right now I have my own very naive but effective implementation here for vec2s: https://github.com/mreinstein/vec2-gap/blob/main/pool.js)
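In the absence of such a language feature, a naive pool along these lines could be sketched as follows. The API here (`createVec2Pool`, `alloc`, `release`) is hypothetical and not the actual vec2-gap interface:

```javascript
// Tiny vec2 pool: preallocate slots up front, hand them out and take
// them back without creating garbage in the hot loop.
function createVec2Pool(size) {
  const free = [];
  for (let i = 0; i < size; i++) free.push(new Float32Array(2));
  return {
    alloc() {
      // Grow lazily if the pool is exhausted rather than failing.
      return free.length > 0 ? free.pop() : new Float32Array(2);
    },
    release(v) {
      free.push(v);
    },
  };
}

const pool = createVec2Pool(16);
const tmp = pool.alloc();
tmp[0] = 1; tmp[1] = 2;
// ...use tmp for this frame's math...
pool.release(tmp); // slot goes back to the pool; no GC pressure
```

The caller is responsible for releasing every allocated slot at the end of the frame, which is the usual discipline (and the usual bug source) with manual pooling.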

krisnye commented 3 years ago

A high frequency game/simulation loop that allocates a lot of mem absolute thrashes the garbage collector

@mreinstein

Can you confirm that this is the case when allocating short-lived objects? These vectors etc. aren't designed to be retained; they are just allocated for calculations and then discarded. I've run more extensive tests, including examining memory allocation and garbage collection, and this pattern doesn't appear to put any significant stress on the garbage collector. The objects don't seem to make it past the first stage of garbage collection.

I think it would be helpful if you can prove your assertions more definitively. I won't argue with good evidence, and I would like to know either way.

tobx commented 3 years ago

I made a very simple benchmark that shows 2 things:

  1. @mreinstein Objects seem to be faster than arrays (look up V8 hidden classes to understand why). I have not tried Float64Array yet, though.

  2. @krisnye Reusing objects or arrays is faster than creating new instances.

Here it is: https://jsben.ch/FgKVi

This is where my research has brought me. Did I miss something?

Note: the biggest speed advantages do not help if your game stutters every 5 seconds when garbage collection kicks in. Also, vector math is probably not the bottleneck of most games.

Anyway, I think a class is the cleanest and most readable solution. If it really has the best performance nowadays, shouldn't it be the way to go?

tobx commented 3 years ago

I also want to point out that there is a similar discussion for three.js's vector math. It explains for example why Float64Array should be faster than Float32Array in JavaScript:

https://github.com/mrdoob/three.js/issues/8661#issuecomment-501603693

mreinstein commented 3 years ago

@tobx the (reuse) benchmarks are definitely interesting and unexpected.

A few notes on the way the benchmark is set up:

// assumed helpers for this sketch, e.g.:
// const randomFloat = () => Math.random() * 100;
// class Vec3 { constructor(x, y, z) { this.x = x; this.y = y; this.z = z; } }
const count = 10000;
const inputsArr = new Array(count);
const inputsObj = new Array(count);

// realistic test data setup
for (let i=0; i < count; i++) {
    const x = randomFloat();
    const y = randomFloat();
    const z = randomFloat();

    inputsArr[i] = [ x, y, z ];
    inputsObj[i] = new Vec3(x, y, z);
}

// tests here...

Anyway, none of these points "refute" your benchmark; they're just additional points of interest. I'd welcome feedback on them. @kevzettler I'd definitely be curious to see what you think of both the benchmark and these follow-up points.

kevzettler commented 3 years ago

@mreinstein

@tobx the (reuse) benchmarks are definitely interesting and unexpected.

I'm confused about what's unexpected here. The (reuse) benchmarks performed better? That's what the conversation has been about until this point; why is that unexpected? You've even shared this opinion in a previous post. Unless you're talking about array vs. object reuse, in which case I think it's important to compare the two add functions:

// array reuse
function add(out, a, b) {
  out[0] = a[0] + b[0];
  out[1] = a[1] + b[1];
  out[2] = a[2] + b[2];
}
// object reuse
add(vector) {
  this.x += vector.x;
  this.y += vector.y;
  this.z += vector.z;
}

This indicates there is some overhead to the out[index] array assignment compared with the direct property assignment this.x +=.

Separately, I agree the benchmark is not realistic given the consistent adding of the same values. In fact, I believe the V8 JIT compiler may be able to detect that and optimize it away into a cached operation. The benchmark you proposed with the randomFloats would be much more useful. Additionally, the benchmark doesn't even use gl-matrix methods, so we can't use it as an example with 100% certainty.

mreinstein commented 3 years ago

I'm confused whats unexpected here, the (reuse) benchmarks performed better?

No, that part is expected. Obviously not allocating in a tight loop is going to be faster.

Ignoring the non-reuse cases, what is unexpected (to me at least) is that the Vector-based implementation would run that much faster than the array-based version.

there is some over head to the out[index] = index array assignment vs direct value assignment this.x +=

I agree with that, but I would be surprised if that difference were largely responsible for the significant perf gap shown in the benchmark.

kevzettler commented 3 years ago

OK, I agree with that. I was also surprised to see that perf gap between out[index] = and this.x +=, and I'm not sure why it's so large. I imagine it might be a JIT optimization; perhaps the JIT won't inline operations that use array index accessors, or something similar. It would take some deeper investigation.

tobx commented 3 years ago

@mreinstein

  • having a single class and a single array that stores the input data is not a very realistic usage example.

  • this benchmarks only writing operations.

Good points, I will give that a try soon.

  • this benchmarks only the addition operation. Also curious about other operations.

Hm, it would be rather simple to replace all summations with multiplications, but I don't think this should affect array vs. object performance at all, do you? But yes, one should never assume the results; measure them instead. Maybe give it a try.

  • I wonder if the shape of the API between add(out, a, b) vs Vector.add(a) has any impact on the benchmark.

Good point. I expected different results and got them in the first run, but after a few more runs it seems this has no performance impact at all; see:

https://jsben.ch/Anheh

tobx commented 3 years ago

I made the benchmark proposed by @mreinstein:

https://jsben.ch/vIcnR

On my system, reusing objects is still the fastest after multiple runs, but the difference is smaller. In this version every test has more array accesses anyway, so I guess the difference between object and array becomes less relevant.

EDIT: @kevzettler The glMatrix add method is almost the same as in my first bench; the only difference is that it returns out. I wanted to include it (the function is still in there, but unused; I forgot to remove it), but it also doesn't use +=, so I wanted to focus on array vs. object as closely as possible.

tobx commented 3 years ago

Array reading is slightly faster in Chrome, but slower in Firefox (macOS):

https://jsben.ch/9qQua

More interestingly (for me), Object allocation is slightly faster than Array allocation (Chrome & Firefox):

https://jsben.ch/J54f7

EDIT: I did not find enough information about this, was interested in it, and thought I'd share my results. I would be very happy, though, if the people with more experience in this topic would adapt the benchmarks into more realistic versions.

mreinstein commented 3 years ago

I'm starting to think this discussion is a bit of a moot point. The fact that this library is specifically designed to interact with WebGL means that, regardless of how Float32Arrays perform, it probably makes the most sense to use them, because that is what WebGL uses, and it's more natural to avoid doing conversions everywhere.

@stefnotch I know you've mentioned hesitation about offloading caching/pooling to the end user. I'd be curious to hear what you think of this very simple pool implementation I've got here: https://github.com/mreinstein/vec2-gap#pool I'm using it in some of my own projects; it has a few tests and both single and bulk free interfaces.

krisnye commented 3 years ago

The fact that this library is specifically designed to interact with webgl means that regardless of how Float32Arrays perform, it probably makes the most sense to use them, because that is what webgl is using, and it's more natural to avoid having to do conversions everywhere

It is true that you eventually have to write your results to a Float32Array, but that final conversion is trivial. Math libraries are designed as an intermediate representation for doing calculations, and the intermediate format needs to be fast and easy to use.

All of the tests I have done recently show that immutable classes are very fast, and mutable classes are only about 10% faster.

Because of this, I use an immutable math library: it's semantically clean, and I haven't yet seen GC pressure from quick calculations with it. I don't know enough about the Chrome runtime, but it seems it might as well be allocating these on the stack in some cases, or realizing it can eliminate them from some functions entirely.

stefnotch commented 3 years ago

There are a few things that would currently be interesting to benchmark, but I sadly don't have a whole lot of time at the moment to look into any of them.

StEvUgnIn commented 3 years ago

I have been having issues with my setup, and I still haven't gotten around to using as-bind to build an encapsulated interface that allows full compatibility with gl-matrix.js in the WebAssembly port.