Implement batch drawing on WebGL

agmcleod commented 9 years ago

This should reduce heavy buffer usage and the number of draw calls required, leading to a performance boost.

parasyte commented 9 years ago

The referenced commit is not batching. All I did was reuse buffers in drawImage. The same buffer reuse is needed for fillRect.

The process of batching involves keeping a record of everything you want to draw which uses the same texture, then constructing the triangle vertices/texture coordinates/indices, and sending the whole thing in a single drawElements call.

This will make tile layers very fast (an entire map layer can be drawn in a single command), and sprites that use the same texture atlas will get the same benefit.

agmcleod commented 9 years ago

Ah yes, I did misunderstand your original post. But regardless, this will definitely help :)

parasyte commented 9 years ago

It's a good first step, I agree.

Thinking this morning about how to structure the batching API, it seems natural to use me.TextureAtlas. The whole purpose of this data structure is to store texture coordinates. The coordinate range is a bit different between Canvas and WebGL (pixels vs float between 0.0 .. 1.0 inclusive) but they require identical behavior.

So I want to rename me.TextureAtlas to me.CanvasRenderer.Texture, and then extend it as me.WebGLRenderer.Texture. That takes care of the texture coordinates. To build batching on top of that, we have multiple options! yay! My favorite so far is lazy batching:

For each me.video.renderer.drawImage(texture, index, x, y, w, h) where index is an atlas index name or number, a reference to the last texture is remembered:
- If the last reference is the same as the new reference, add index, x, y, w, h to a Batcher object.
- Else flush the Batcher object (draw the batch, then reset itself) and add index, x, y, w, h
At the end of frame drawing, blitSurface is called. This function will flush the Batcher object (draw the batch, then reset)
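
A minimal sketch of the lazy batching flow described above. The LazyBatcher class, its add/flush methods, and the draw callback are hypothetical names used for illustration; none of them are existing melonJS APIs.

```javascript
// Hypothetical lazy batcher: queues quads until the texture changes.
class LazyBatcher {
    constructor(draw) {
        this.draw = draw;     // callback that would issue the real draw call
        this.texture = null;  // last texture reference seen
        this.items = [];      // queued { index, x, y, w, h } records
    }

    // Called once per drawImage(texture, index, x, y, w, h)
    add(texture, index, x, y, w, h) {
        if (this.texture !== null && this.texture !== texture) {
            this.flush();     // texture changed: draw the pending batch first
        }
        this.texture = texture;
        this.items.push({ index, x, y, w, h });
    }

    // Draw the batch, then reset. Also called at the end of the frame.
    flush() {
        if (this.items.length > 0) {
            this.draw(this.texture, this.items);
        }
        this.texture = null;
        this.items = [];
    }
}

// Usage: three sprites spread over two textures
const calls = [];
const batcher = new LazyBatcher((texture, items) => {
    calls.push({ texture, count: items.length });
});
const atlasA = { name: "A" };
const atlasB = { name: "B" };
batcher.add(atlasA, "hero", 0, 0, 32, 32);
batcher.add(atlasA, "coin", 40, 0, 16, 16);
batcher.add(atlasB, "tile", 0, 64, 32, 32);
batcher.flush(); // end of frame
```

With this ordering the two atlasA sprites share one draw and the atlasB sprite gets its own, which is why packing everything into one atlas matters so much for this scheme.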

My second favorite is explicit batching, where something has to create a new Batcher, add items to draw, then flush the Batcher when ready. IMHO, this will be harder to get working with sprite objects, since there's really no "dividing line" between how sprites should be grouped. TMX has ObjectGroup for this purpose, but there's no guarantee that each object uses the same texture.

The implementation of this Batcher class is TBD, but basically it just needs to take the x, y, w, h screen coordinates, and assemble the vertices for these, appending them to the vertex list. It also uses the texture atlas index to fetch the texture coordinates, and appends them to the texture coordinate list for the batch. The index list (for drawElements) is also updated likewise. It has a second method to flush these lists to the GPU before resetting.
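
Since the implementation is TBD, here is only a rough sketch of what such a class might look like. QuadBatcher, the uvLookup table, its [u0, v0, u1, v1] layout, and the upload callback are all assumptions made for this example.

```javascript
// Hypothetical quad batcher: builds the three lists that a single
// drawElements(gl.TRIANGLES, ...) call would consume.
class QuadBatcher {
    constructor(uvLookup) {
        this.uvLookup = uvLookup; // atlas index -> [u0, v0, u1, v1] (assumed layout)
        this.reset();
    }

    reset() {
        this.vertices = [];
        this.texCoords = [];
        this.indices = [];
        this.quads = 0;
    }

    add(index, x, y, w, h) {
        const [u0, v0, u1, v1] = this.uvLookup[index];
        // Four corners: top-left, top-right, bottom-left, bottom-right
        this.vertices.push(x, y, x + w, y, x, y + h, x + w, y + h);
        this.texCoords.push(u0, v0, u1, v0, u0, v1, u1, v1);
        // Two triangles per quad, sharing the top-right/bottom-left edge
        const base = this.quads * 4;
        this.indices.push(base, base + 1, base + 2, base + 2, base + 1, base + 3);
        this.quads++;
    }

    // Hand the lists to the caller (which would send them to the GPU), then reset
    flush(upload) {
        upload(
            new Float32Array(this.vertices),
            new Float32Array(this.texCoords),
            new Uint16Array(this.indices)
        );
        this.reset();
    }
}

// Usage
const quadBatcher = new QuadBatcher({ hero: [0, 0, 0.5, 0.5], coin: [0.5, 0, 1, 0.5] });
quadBatcher.add("hero", 0, 0, 32, 32);
quadBatcher.add("coin", 40, 0, 16, 16);
let uploaded = null;
quadBatcher.flush((vertices, texCoords, indices) => {
    uploaded = { vertices, texCoords, indices };
});
```

Plain arrays are used here for brevity; a version that avoids garbage collection would write into preallocated typed arrays instead.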

As you might tell from this last description, the texture batching will actually "undo" a lot of the bufferData reuse work that I just did! :wink: But it was a super useful exercise that helped me get more familiar with WebGL.

@agmcleod @obiot Please weigh in with your thoughts, especially in regard to how I want to redesign the Texture Atlas.

agmcleod commented 9 years ago

My only concern with the second one is it could be limiting in some way. The first method is the one a developer at Big Viking Games used, adding a WebGL renderer to their canvas-like API. Similar to what we're doing really.

parasyte commented 9 years ago

Implicit batching is by far easier to manage. And if you cram everything into a single texture atlas (all tile sets, all sprites) then you can theoretically get the best possible performance by executing a single drawElements call per frame, without any extra coding, or other special setup.

agmcleod commented 9 years ago

Yep exactly. Been trying to get better practice at doing so. Nicer to see a shorter list of files getting uploaded when I SCP stuff to my site.

parasyte commented 9 years ago

:yum:

Also there's something to be said about at least exposing an API to allow custom batching operations. Just in case someone actually wants to manage it themselves for some reason.

agmcleod commented 9 years ago

https://github.com/agmcleod/melonjs-spine/issues/1 :)

parasyte commented 9 years ago

I'll start this one. ;)

parasyte commented 9 years ago

The TextureAtlas constructor now accepts a new "internal" texture atlas format (for now it's just the same as TexturePacker with "melonJS" in the meta.app field.)
The atlas regions each have a uvMap property that specifies texture coordinates (in WebGL triangles) for the region. These can be passed directly to WebGL. For batching, these need to be concatenated together and used with drawElements(gl.TRIANGLES, ...)
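
To illustrate the idea, here is a sketch of turning a TexturePacker-style pixel frame into normalized texture coordinates and concatenating several regions for a batch. The function names and the four-corner layout are assumptions for this example; the actual uvMap layout in the commit may differ.

```javascript
// Convert a pixel-space frame into normalized (0.0 .. 1.0) texture coordinates.
// Assumed corner order: top-left, top-right, bottom-left, bottom-right.
function frameToUVMap(frame, atlasWidth, atlasHeight) {
    const u0 = frame.x / atlasWidth;
    const v0 = frame.y / atlasHeight;
    const u1 = (frame.x + frame.w) / atlasWidth;
    const v1 = (frame.y + frame.h) / atlasHeight;
    return new Float32Array([u0, v0, u1, v0, u0, v1, u1, v1]);
}

// Concatenate several uvMaps into one array, ready to upload in one call
function concatUVMaps(maps) {
    const total = maps.reduce((sum, map) => sum + map.length, 0);
    const out = new Float32Array(total);
    let offset = 0;
    for (const map of maps) {
        out.set(map, offset);
        offset += map.length;
    }
    return out;
}

// Usage: two regions from a 256x128 atlas
const heroUV = frameToUVMap({ x: 64, y: 0, w: 64, h: 32 }, 256, 128);
const coinUV = frameToUVMap({ x: 0, y: 64, w: 32, h: 32 }, 256, 128);
const batchUV = concatUVMaps([heroUV, coinUV]);
```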

obiot commented 9 years ago

But merge "ticket-620" into master first ;)

On 21 Dec. 2014, at 03:29, Jay Oster notifications@github.com wrote:

I'll start this one. ;)


parasyte commented 9 years ago

The next step is adding the batcher to the WebGL Renderer. I think this won't be too much trouble. It just needs to do a little bit of memory accounting to avoid GC. I'll start with ~1KB memory buffers, and do the usual growth by 2x pattern (never shrinking).
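
The memory accounting could look roughly like the sketch below: start at about 1KB, double when full, never shrink. GrowableFloatBuffer and its method names are hypothetical.

```javascript
// Hypothetical growable float buffer: doubles capacity when full and
// keeps the allocation across frames so steady-state drawing allocates nothing.
class GrowableFloatBuffer {
    constructor(initialBytes = 1024) {
        this.data = new Float32Array(initialBytes / Float32Array.BYTES_PER_ELEMENT);
        this.length = 0; // number of floats currently in use
    }

    push(value) {
        if (this.length === this.data.length) {
            // Grow by 2x and copy the existing contents over
            const grown = new Float32Array(this.data.length * 2);
            grown.set(this.data);
            this.data = grown;
        }
        this.data[this.length++] = value;
    }

    // Reset for the next batch without releasing memory
    clear() {
        this.length = 0;
    }

    // View of the used portion (shares memory with the backing array)
    view() {
        return this.data.subarray(0, this.length);
    }
}

// Usage
const stream = new GrowableFloatBuffer();
for (let i = 0; i < 257; i++) {
    stream.push(i);
}
const capacityAfterGrowth = stream.data.length;
const usedAfterGrowth = stream.view().length;
stream.clear();
```

Growth only happens while the scene is getting bigger; once the buffer has reached its high-water mark, each frame reuses the same memory.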

Second, I don't want to change the function signature for drawImage, since that's already well-established. Instead I'll add a public bindTexture method to the Renderer API (there's one right now, but it needs to be replaced) that will remember the provided Texture region, and use it for batching. Using the method will look something like this:

me.video.renderer.bindTexture(texture.getRegion(name)).drawImage(texture, x, y, ...);

That's the best thing I can think of...

Otherwise it should be really straightforward.

parasyte commented 9 years ago

On the other hand ... The drawImage method can just introspect the first argument with instanceof. For Image it will use the original signature. And for me.video.renderer.Texture it will expect the region info as the second parameter. Kind of weird, but easy enough!
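
A sketch of what that introspection might look like. The Texture class here is a stand-in written for this example, and resolveSource is a hypothetical helper; only the instanceof dispatch itself comes from the idea above.

```javascript
// Stand-in for a texture atlas class; getRegion returns a pixel-space frame
class Texture {
    constructor(regions) {
        this.regions = regions;
    }
    getRegion(name) {
        return this.regions[name];
    }
}

// Hypothetical helper: picks source coordinates based on the first argument.
//   (image, dx, dy)               -> original signature, whole image
//   (texture, regionName, dx, dy) -> new signature, one atlas region
function resolveSource(source, ...args) {
    if (source instanceof Texture) {
        const [regionName, dx, dy] = args;
        const region = source.getRegion(regionName);
        return { sx: region.x, sy: region.y, sw: region.w, sh: region.h, dx, dy };
    }
    const [dx, dy] = args;
    return { sx: 0, sy: 0, sw: source.width, sh: source.height, dx, dy };
}

// Usage
const atlas = new Texture({ hero: { x: 64, y: 0, w: 32, h: 48 } });
const fromAtlas = resolveSource(atlas, "hero", 10, 20);
const fromImage = resolveSource({ width: 100, height: 50 }, 5, 6);
```

Only the coordinate selection differs between the two branches; everything after it can share one code path.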

agmcleod commented 9 years ago

Nice work on this so far, Jason. Do you think using instanceof and checking for different parameter types might duplicate/increase logic in the method? Could potentially refactor it into two methods if you think it would be worth it.

parasyte commented 9 years ago

It won't cause any duplicate logic, the code to select source coordinates will be different. It will also be the cleanest interface, IMHO. (This is probably the design I imagined when I proposed the batcher. I've just forgotten about the details.)

Anyway, let's try it with drawImage getting a new fourth function signature, and see how it works.

parasyte commented 9 years ago

This part of the work is actually quite involved (more than I had imagined!) It will require some shader rewrites. The best case scenario is the batcher uploads all of the textures, vertex buffers, UV maps, index buffers, etc to the GPU on the first frame, then only needs to provide transformation matrices for all of the sprites on each frame update.

Some new points need to be addressed:

The batcher will need to be reset on each state change (e.g. LoadingScreen -> PlayScreen) to remove old texture state.
The GPU has limited texture memory, so fewer textures are better than many. Power-of-two sizing is also important. We need to hard-code the fragment shader to expect a maximum number of textures, like 8. GPUs that have more texture units could benefit from dynamically modifying the shader code before compiling. The number of available texture units is in gl.MAX_COMBINED_TEXTURE_IMAGE_UNITS. The batcher will throw an exception if too many textures are used simultaneously.
The fragment shader needs a texture index to select the correct texture. This can be done with an if-statement, similar to the one described here: http://webglsamples.org/sprites/readme.html We will use a single u8 instead of a vec4; the value of the u8 selects the texture.
The vertex shader needs to accept an array of 2D transformation matrices for each sprite. (The matrix allows arbitrary translation/rotation/scaling operation order outside of the shader.) Some optimizations to this data structure could be used, to reduce the total amount of data sent to the GPU each frame. This presentation has some great info on the subject: https://docs.google.com/presentation/d/12AGAUmElB0oOBgbEEBfhABkIMCL3CUX7kdAPLuwZ964/edit#slide=id.i177 e.g. each float element can be packed into a u16 fixed point format, and unpacked by the shader, reducing GPU bandwidth requirements by half.
We need to remove this magic code: https://github.com/melonjs/melonJS/blob/83304d2d573234973cd241ec0ff485199b9f6115/src/video/webgl/shader.js#L194-L195 to support hardware accelerated image repeating for me.ImageLayer (see: #603) -- requires power-of-two textures.
Allowing custom shaders also means allowing a custom batcher; since the batcher is an API for the shaders. Custom shaders may use the same API, but we should provide the flexibility to use a custom API.
One day melonJS will host 3D games with the right set of shaders and batcher.
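
The u16 fixed-point packing mentioned in the list above can be sketched as follows. The 8.8 split and the bias are assumptions picked for this example; a real format would need a range that suits each matrix element (translations in pixels can easily exceed it).

```javascript
// Assumed format: unsigned 16-bit, 8 fractional bits, biased so that
// values in roughly [-128, 128) are representable in steps of 1/256.
const FRACTION_BITS = 8;
const SCALE = 1 << FRACTION_BITS; // 256
const BIAS = 32768;

function packFixed(values) {
    const packed = new Uint16Array(values.length);
    for (let i = 0; i < values.length; i++) {
        const fixed = Math.round(values[i] * SCALE) + BIAS;
        packed[i] = Math.min(65535, Math.max(0, fixed)); // clamp to u16 range
    }
    return packed;
}

// The shader would do the equivalent of this to recover the float
function unpackFixed(packed) {
    const values = new Float32Array(packed.length);
    for (let i = 0; i < packed.length; i++) {
        values[i] = (packed[i] - BIAS) / SCALE;
    }
    return values;
}

// Usage
const original = new Float32Array([1, 0, -0.5, 100.25]);
const packedValues = packFixed(original);
const restored = unpackFixed(packedValues);
```

The packed array occupies half the bytes of the Float32Array it came from, which is where the bandwidth saving comes from.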

parasyte commented 9 years ago

The transformation matrices are way too bulky to send to the GPU for every vertex. The right solution is transforming the vertex in JavaScript, and streaming the result to the GPU!

Each 2D matrix encodes three vectors. It is most efficient to send the matrices once at startup if they rarely change, and let the GPU multiply them repeatedly. WebGL is a state machine; any state you set will only be changed by you.

I have finished merging the two matrix classes into a single class. This is a nice win for code footprint, and fixes the translate method. Our Matrix2d class is sparse; it doesn't do full matrix multiplication. The "hidden row" is never multiplied into (as an optimization). This is fine for most purposes, but will give incorrect results when the hidden row is not 0, 0, 1. We don't have any matrices like this anyway. :) And affine transformations require it to be 0, 0, 1 for homogeneous coordinates; projecting a 2D plane in 3D space.
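
As an illustration of the sparse multiplication and of transforming vertices in JavaScript, here is a sketch that stores only the six meaningful values. The [a, b, c, d, tx, ty] layout and the function names are assumptions for this example and are not taken from the Matrix2d source.

```javascript
// Affine 2D matrix stored as [a, b, c, d, tx, ty], representing
//   | a  c  tx |
//   | b  d  ty |
//   | 0  0  1  |   <- hidden row, assumed constant and never multiplied
function multiply(m, n) {
    return [
        m[0] * n[0] + m[2] * n[1],
        m[1] * n[0] + m[3] * n[1],
        m[0] * n[2] + m[2] * n[3],
        m[1] * n[2] + m[3] * n[3],
        m[0] * n[4] + m[2] * n[5] + m[4],
        m[1] * n[4] + m[3] * n[5] + m[5]
    ];
}

// Transform a vertex on the CPU; the result is what gets streamed to the GPU
function apply(m, x, y) {
    return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

function translation(tx, ty) {
    return [1, 0, 0, 1, tx, ty];
}

function scaling(sx, sy) {
    return [sx, 0, 0, sy, 0, 0];
}

// Usage: scale first, then translate
const combined = multiply(translation(10, 20), scaling(2, 3));
const vertex = apply(combined, 1, 1);
```

Skipping the hidden row saves three multiply-adds per output element compared with a full 3x3 product, at the cost of only being correct for affine matrices.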

parasyte commented 9 years ago

There are two problems to solve with the TextureCache:

Object hashing. We need a way to hash objects to a unique ID. One option is overriding Object.prototype.toString to return a string with a unique ID. A second option is adding a unique ID to the object itself. Another option is to use an array instead of hash table for O(n) lookups vs O(1). A fourth option is using a different data structure, such as the hash map provided by mori or ES6 Map.
All Textures ever created need to be placed into the TextureCache. The only way to do this unobtrusively is for the me.Texture constructor to add itself to the cache.
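
Here is one way the ES6 Map option could work, combined with a constructor that registers itself. The class shapes and method names are hypothetical and only meant to show the idea.

```javascript
// Hypothetical texture cache keyed by object identity using an ES6 Map,
// so no hashing or toString override is needed.
class TextureCache {
    constructor(maxUnits) {
        this.maxUnits = maxUnits;
        this.units = new Map(); // texture object -> texture unit number
    }

    getUnit(texture) {
        let unit = this.units.get(texture);
        if (unit === undefined) {
            if (this.units.size >= this.maxUnits) {
                throw new Error("Texture cache overflow: " + this.maxUnits + " texture units available");
            }
            unit = this.units.size;
            this.units.set(texture, unit);
        }
        return unit;
    }

    // Would be called on state changes to drop old textures
    reset() {
        this.units.clear();
    }
}

const cache = new TextureCache(2);

// Stand-in texture class whose constructor adds itself to the cache
class Texture {
    constructor(name) {
        this.name = name;
        this.unit = cache.getUnit(this);
    }
}

// Usage
const tiles = new Texture("tiles");
const sprites = new Texture("sprites");
const tilesAgain = cache.getUnit(tiles);
```

A Map keeps lookups at O(1) without touching the objects themselves, which is the main attraction over the other three options.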

parasyte commented 9 years ago

TODO:

[x] Rename "Batcher" to "Compositor", or something better
[x] Fill a local Float32Array and copy its contents directly to the WebGL buffer with a single bufferData call. (Replace multiple bufferSubData calls.)
[x] Move the gl.clear call into the compositor, so that it is not executed out of order. (Flush the buffer, then run gl.clear)
[x] Move the remaining GL initialization into the compositor. (Clear color, blend function, etc.)
[ ] Implement fonts (hard), lines (easy), stroke (hard)
[x] Figure out the weird rendering bug that causes the last few rects to be duplicated
[ ] Create a packet inspector for debugging

These are just a few thoughts, there is still a lot more to be done...

parasyte commented 9 years ago

As I was writing my next blog article on the new WebGL Compositor, I came up with a set of additional TODOs:

[x] Separate the "static" and "dynamic" attributes into two different attribute arrays. "Static" attributes change infrequently (texture unit, color) and "dynamic" attributes change often (vertex, texture coordinates). The static attribute array will be updated as necessary with bufferSubData, and the dynamic attribute array will be sent in a single bufferData call.
[x] A prerequisite to implementing a static attribute buffer is remembering where (in the attribute array) draw operations were in the last frame. This won't work well when objects are added/removed dynamically, or the draw order is changed. But these buffers can be implemented in a smart way to do truncation and refills to handle such cases.
[ ] Pack multiple attributes into each float. E.g. texture coordinates can be packed into a single float, since each axis will never be bigger than 65,536 pixels (fits into 16-bits). The packing algorithms will provide less than 16-bits of precision, probably 15-bits, which is still plenty for texture coordinates. Another candidate is colors, and it is a much bigger win: pack a vec4 into a single float with almost no loss in precision!
[ ] Enable mipmapping on power-of-two textures.

parasyte commented 9 years ago

This last series of patches was to resolve performance issues spotted with the Chrome profiler. It represents an overall improvement in CPU usage from 35% originally, down to about 13% now. Here are some profiler screenshots for comparison:

Before: [profiler screenshot: before optimizations]

After: [profiler screenshot: after optimizations]

Both profile snapshots were taken over a ~10 second period. The first screen shows that the majority of CPU time is spent pulling objects out of the pool! That's how I spotted the first HUGE win. Notice the little :warning: icon. Its tooltip reveals information about why the JIT compiler failed to optimize the function.

The second screen shows the incredible performance improvement which puts the Compositor and WebGL bufferSubData at the top of the profile. These things should be addressed by the TODOs above.

The takeaway here is that with these low-hanging fruit out of the way, any changes to the Compositor will now have a greater (or lesser!) effect on total performance. In other words, it will now be much easier to spot the actual performance gains by implementing each TODO item. And even better, it will make it more obvious when any of these items actually causes a performance regression!

parasyte commented 9 years ago

About the save/restore performance, it will be better to move these operations into the methods on the Renderer class. I'll provide an example and describe my thoughts on the issue and how this change will make it better.

If we look at the particle class, we can see right away that each particle must save and restore the context state around every drawImage call: https://github.com/melonjs/melonJS/blob/640e5038ccad75f0c4edbca5c8408f2fc0f99bab/src/particles/particle.js#L133-L155 This is just one example, but it highlights a worst-case scenario where hundreds of images are being drawn, each saving and restoring the context. This is actually necessary for Canvas 2D operations to work correctly, since it only has a shared context. However, WebGL does not have this limitation, and we shouldn't emulate it!

Instead, we can push any context information we want to the GPU with WebGL on a per-triangle basis. It is therefore more efficient to send the required contextual information as part of the drawImage call, rather than maintaining a stack and globally shared context.

The drawImage method signature will change like this:

drawImage(Image|Object, sx, sy, sw, sh, dx, dy, dw, dh)

Where Object is a key-value pair that provides additional contextual information like blend color, transformations, and whether to update or preserve the global context:

{
  "image" : image,
  "transform" : transform,
  "update" : false
}

update=false is default, which will cause the CanvasRenderer to wrap the drawImage call in save/restore. And in WebGLRenderer, nothing special will happen. update=true is the opposite; nothing happens in CanvasRenderer, and the global context gets updated by WebGLRenderer.
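
To make the update flag concrete, here is a sketch of how a Canvas-side implementation might behave. The drawWithSettings function and the recording context are hypothetical; the transform is passed as a plain six-number array instead of a Matrix2d purely to keep the example self-contained.

```javascript
// Hypothetical Canvas-side handling of the settings object.
// ctx is anything with the Canvas 2D methods used below.
function drawWithSettings(ctx, settings, dx, dy) {
    const update = settings.update === true; // update=false is the default
    if (!update) {
        ctx.save(); // keep the shared context untouched for the caller
    }
    if (settings.transform) {
        const [a, b, c, d, e, f] = settings.transform;
        ctx.transform(a, b, c, d, e, f);
    }
    if (typeof settings.alpha === "number") {
        ctx.globalAlpha = settings.alpha;
    }
    ctx.drawImage(settings.image, dx, dy);
    if (!update) {
        ctx.restore();
    }
}

// Fake context that records which methods were called, for demonstration
function makeRecordingContext() {
    const log = [];
    return {
        log,
        globalAlpha: 1,
        save() { log.push("save"); },
        restore() { log.push("restore"); },
        transform() { log.push("transform"); },
        drawImage() { log.push("drawImage"); }
    };
}

// Usage
const image = { width: 16, height: 16 };
const preserved = makeRecordingContext();
drawWithSettings(preserved, { image, transform: [1, 0, 0, 1, 5, 5] }, 0, 0);

const updated = makeRecordingContext();
drawWithSettings(updated, { image, transform: [1, 0, 0, 1, 5, 5], update: true }, 0, 0);
```

The caller never touches save/restore; the renderer decides whether they are needed, which is the point of the proposal.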

Here's a list of settings available:

Required:

texture : a Texture instance OR...
- image : an Image instance

Optional:

update : a Boolean that causes the global context to be updated with the following values:
transform : a Matrix2d instance OR...
- pos : a Vector2d instance AND...
- angle : a Number [0..2pi] AND...
- scale : a Vector2d instance
color : a Color instance OR...
- alpha : a Number [0..1]

We will need the same kind of API update for the fill* and stroke* methods.

This proposed change will reduce unnecessary overhead in WebGLRenderer by removing the global context emulation. Updating the global context is opt-in, and will no longer require a stack for general purpose image rendering. With these changes, we can get rid of some silly workarounds like setting the global context color to white at the end of drawing: https://github.com/melonjs/melonJS/commit/080c87d8219d6e2521ebb7515534295881389bcb

obiot commented 9 years ago

It's hard to provide negative comments, since you have been on your own on this one for a couple of weeks now, but to be honest I'm not sure I really like your last idea. In my opinion, you should provide the same API for both canvas and webgl, and that should stay transparent for the end user.

However, I'm not sure why you propose this change, as today the save/restore functions are called through the renderers and could then be managed from there (with the webgl one being basically an empty function).

parasyte commented 9 years ago

@obiot Sorry for not being clear. The API will be the same on both renderers. The internal workings will be quite different, though. What I really want to do here is remove the need to call save/restore at all, even by code like the particle emitter (linked above) and the animation sheet class.

Preserving the semantic of a global context state is a bad idea, in my honest opinion. WebGL doesn't need it, and we can hide it entirely for the CanvasRenderer. In other words, the renderer API should reflect the best case scenario; it should expose everything we can do with WebGL, and not abstract it away to look like Canvas 2D.

obiot commented 9 years ago

oh I see, so then yes I think it makes sense :P

obiot commented 9 years ago

FYI, under FF 28 (and higher) , the shader compilation fails with the following message :

me.video.Error: ERROR: 0:10: break disallowed outside switch/loop body
ERROR: 0:17: break disallowed outside switch/loop body
ERROR: 0:24: break disallowed outside switch/loop body
ERROR: 0:31: break disallowed outside switch/loop body
ERROR: 0:38: break disallowed outside switch/loop body
ERROR: 0:45: break disallowed outside switch/loop body
ERROR: 0:52: break disallowed outside switch/loop body
ERROR: 0:59: break disallowed outside switch/loop body
ERROR: 0:66: break disallowed outside switch/loop body
ERROR: 0:73: break disallowed outside switch/loop body
ERROR: 0:80: break disallowed outside switch/loop body
ERROR: 0:87: break disallowed outside switch/loop body
ERROR: 0:94: break disallowed outside switch/loop body
ERROR: 0:101: break disallowed outside switch/loop body
ERROR: 0:108: break disallowed outside switch/loop body
ERROR: 0:115: break disallowed outside switch/loop body

ldd commented 9 years ago

Under Firefox 34 on Windows 8.1, everything works after removing the comments in the shaders. With the comments there, I get a similar error.

parasyte commented 9 years ago

Yeah, there are problems in this shader. The issue with the comment is in the grunt-replace task; it doesn't remove comments properly on Windows: https://github.com/melonjs/melonJS/blob/389e6435e75c67c2099e9441b898f27f3fafcdc5/Gruntfile.js#L87-L88 I suspect that adding $ to the double-slash side of the RegExp will fix that, but I am unable to test on Windows.

The problem with Firefox 28 (really old browser, BTW!) looks like it is optimizing the GLSL by unrolling the loop (GOOD!) but the break statement confuses it in the unrolled output.

I get similar weird behavior with this shader when passing it to glsl-optimizer. It fails to compile unless I replace the "dynamic index" with a constant. This is because glsl-optimizer does not unroll the loop (BAD!)

The obvious thing would be to run it through an optimizer that unrolls the loop and strips comments and whitespace, etc. glsl-optimizer isn't doing that for us, so I might have to just write a dumb parser specifically for this.

Also take note of the comment in the fragment shader; to get the most out of the GPU, the loop needs to be sized appropriately for the GPU. I have it hardcoded to use the number of texture units in my laptop's GPU at the moment. Unrolling the loop would have to be done at runtime to make that happen.
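
One possible shape for runtime unrolling is to generate the shader source as a string, with one branch per texture unit and no loop or break at all. The sketch below is an assumption about how that could be done; the uniform and varying names are made up, and this is not the shader from the repository.

```javascript
// Hypothetical generator: emits an if/else-if chain instead of a loop,
// sized to the number of texture units reported by the GPU at runtime.
function buildFragmentShader(maxTextures) {
    const lines = [
        "precision mediump float;",
        "uniform sampler2D uSampler[" + maxTextures + "];",
        "varying vec2 vRegion;",
        "varying float vTextureId;",
        "void main(void) {"
    ];
    for (let i = 0; i < maxTextures; i++) {
        const keyword = i === 0 ? "if" : "else if";
        // Compare against i + 0.5 so interpolation error cannot pick a neighbor
        lines.push("    " + keyword + " (vTextureId < " + (i + 0.5).toFixed(1) + ") {");
        // Each sampler is indexed with a constant, never a dynamic index
        lines.push("        gl_FragColor = texture2D(uSampler[" + i + "], vRegion);");
        lines.push("    }");
    }
    lines.push("}");
    return lines.join("\n");
}

// Usage: in a browser the count would come from a gl.getParameter(...) query
const shaderSource = buildFragmentShader(4);
```

Because every sampler index is a literal, this avoids both the dynamic-index problem seen with glsl-optimizer and the break statement that Firefox rejected.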

obiot commented 9 years ago

Indeed FF28 is old, but I had the same issue after updating to 31. On my side it was on OSX though (but that certainly does not change anything).

parasyte commented 9 years ago

@obiot The fragment shader will be replaced. But FTR what GPU is in your Mac?

obiot commented 9 years ago

MBA Intel HD5000 :)

obiot commented 9 years ago

I just tried on my Windows machine (FF34, Intel HD4000) and Firefox just crashed when I tried to open the platformer :P:P:P

parasyte commented 9 years ago

Lots of stuff happening here. :smiley: Finally I replaced the multiple bufferSubData calls with a single call. Next I will split the stream buffer into two buffers; one that is updated often (vertices and texture coordinates) and one that is updated infrequently (color and texture index). Technically the color will change often when objects are fading and such. So we'll have to do some tuning on this stuff. But this should be a good start.

parasyte commented 9 years ago

I added the separate buffers patch to a new branch: https://github.com/melonjs/melonJS/compare/experimental/WebGL_static_buffer (See previous commit log for details)

This new code requires some additional CPU time for hashing the static buffer. Hashing should be the fastest way to check whether the static attributes for a quad have changed... Although to be honest I haven't tried a naïve approach with a ton of if-statements for each array element. ;) Something tells me that won't perform quite as well.
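
The hash function used in the branch is not shown in this thread, so the sketch below swaps in an FNV-1a-style hash applied per 32-bit word (not per byte, as standard FNV-1a does) just to show the change-detection idea. StaticAttributeTracker and the five-floats-per-quad layout are assumptions.

```javascript
// FNV-1a-style hash over 32-bit words (a stand-in, not the branch's hash)
function hashWords(words, offset, count) {
    let hash = 0x811c9dc5; // FNV 32-bit offset basis
    for (let i = 0; i < count; i++) {
        hash ^= words[offset + i];
        hash = Math.imul(hash, 0x01000193); // FNV 32-bit prime
    }
    return hash >>> 0;
}

// Remembers last frame's hash per quad and reports whether it changed
class StaticAttributeTracker {
    constructor(floatsPerQuad) {
        this.floatsPerQuad = floatsPerQuad;
        this.hashes = [];
    }

    // Returns true when the quad's static attributes need re-uploading
    changed(quadIndex, floats) {
        // Reinterpret the float bits as integers so they can be hashed
        const words = new Uint32Array(floats.buffer, floats.byteOffset, floats.length);
        const hash = hashWords(words, quadIndex * this.floatsPerQuad, this.floatsPerQuad);
        const dirty = this.hashes[quadIndex] !== hash;
        this.hashes[quadIndex] = hash;
        return dirty;
    }
}

// Usage: two quads, each with r, g, b, a, textureUnit (assumed layout)
const staticData = new Float32Array([1, 1, 1, 1, 0, 1, 1, 1, 1, 2]);
const tracker = new StaticAttributeTracker(5);
const firstFrame = tracker.changed(1, staticData);  // never seen before
const secondFrame = tracker.changed(1, staticData); // identical data
staticData[8] = 0.5;                                // quad 1 starts fading
const thirdFrame = tracker.changed(1, staticData);
```

Only quads reported as changed would be sent with bufferSubData; the hashing loop is the extra CPU cost being measured.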

The additional CPU time required is:

0.25% with 140 quads.
0.56% with 256 quads.
1.02% with 496 quads.

The scale is roughly linear, so expect 32% additional CPU usage for 16,000 quads (the current limit for a single batch operation). It's unclear if this additional CPU overhead is worth the GPU bandwidth reduction. I'll maintain this new branch in parallel with master. At least until it can be determined if it's useful or not.
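
The extrapolation can be reproduced from the three measurements above by averaging the per-quad cost; the helper name is made up for this example.

```javascript
// Measurements quoted above: additional CPU percentage per quad count
const samples = [
    { quads: 140, cpu: 0.25 },
    { quads: 256, cpu: 0.56 },
    { quads: 496, cpu: 1.02 }
];

// Average cost per quad, assuming the relationship stays linear
const perQuad = samples.map((sample) => sample.cpu / sample.quads);
const averagePerQuad = perQuad.reduce((sum, value) => sum + value, 0) / perQuad.length;

function estimateCpu(quads) {
    return averagePerQuad * quads;
}

const estimate = estimateCpu(16000);
console.log(estimate.toFixed(1) + "% additional CPU for 16,000 quads");
```

Each sample works out to about 0.002% per quad, which is where the figure of roughly 32% for a full batch comes from.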

parasyte commented 9 years ago

@ldd The fragment shader should now build properly on your environment. The trick was using the preserveOrder option in the grunt-replace task. :smiley:

obiot commented 9 years ago

yep good job ! I confirm it fixes the build issue at least on my MBA (using Firefox), will try tomorrow on my windows machine.

Unrelated, but since I did an npm install, the jasmine task is now failing:

Running "jasmine:src" (jasmine) task
Testing jasmine specs via PhantomJS

>> Error caught from PhantomJS. More info can be found by opening the Spec Runner in a browser.
Warning: SyntaxError: Parse error Use --force to continue.

Aborted due to warnings.

and I don't see anything in the spec runner. Do you also have that issue ?

parasyte commented 9 years ago

Yeah, I do see that now. It didn't happen in my earlier tests last night. But I got the build failure notice from Travis-CI this morning.

Working on it!

parasyte commented 9 years ago

Alright, I think it's finally in a pretty stable state! The stuff I did tonight focuses on customizability of the WebGL environment. I don't want to tie any users down in regards to how they use WebGL. So now it's possible to use an entirely custom Compositor class by passing the compositor option to me.video.init()! I don't think I'll be writing a different compositor any time soon, but I like that we can provide this flexibility for others who wish to experiment.

Another important change is that the me.video.shader.createShader() method is now independent of the singleton, allowing it to compile multiple shader programs. This I will be using in our default compositor for the line-rendering (e.g. fillStroke) shader program. The compositor just needs to flush when switching between shader programs.

Most of my TODO lists are already done, which is exciting! And I heard today from @ldd that his tests show a nice improvement in rendering speed. Apparently in his tests, CanvasRenderer is capable of 142 objects max and WebGLRenderer is capable of over 500. This is a good start, but I want more! :)

parasyte commented 9 years ago

With the last few commits, stroke (line rendering) is finally in place. It's not efficient, though. At first glance, it appears that the depth buffer can make it very efficient; we just need a way to get the Z-coordinate information into the compositor. That will likely depend on the work in #637

In the meantime, there are a few FIXME comments that need to be addressed (especially with how the uniform variables are set, and the attribute bindings are handled).

Second to that, getting fonts working (and in particular replacing the RTT thing in the debugPanel) is a priority for release. There's also a weird ghosting effect seen on the debugPanel with WebGLRenderer. That needs to be investigated further.

parasyte commented 9 years ago

Started working on font support in WebGL. The hack in the branch is pretty ugly, but it does make the me.Font API consistent! (Solves #619)

It's currently very slow with the font_text example, because it spends most of its time creating and uploading massive textures. :laughing: The secondary texture cache (proposed in the commit) will help that a little bit.

A better way to support fonts in WebGL will be important long-term, but this will work for 2.1!

agmcleod commented 9 years ago

Awesome! For now we can recommend keeping usage of the me.Font api simple, or to use Canvas instead :)

obiot commented 9 years ago

@parasyte if you don't mind, could you maybe create one or several small tickets to better identify what's left to be done for this one ?

parasyte commented 9 years ago

Everything left to do is a task here and here

obiot commented 9 years ago

oh sorry, missed that, but in my defense this ticket is super long now ;P

obiot commented 9 years ago

did you guys see that ? http://patriciogonzalezvivo.com/2015/thebookofshaders/

parasyte commented 9 years ago

Closing this. Followup ticket is #637

melonjs / melonJS

Implement batch drawing on WebGL #591