agmcleod closed this issue 9 years ago.
The referenced commit is not batching. All I did was reuse buffers in `drawImage`. The same buffer reuse is needed for `fillRect`.

The process of batching involves keeping a record of everything you want to draw which uses the same texture, then constructing the triangle vertices/texture coordinates/indices, and sending the whole thing in a single `drawElements` call.
This will make tile layers very fast (an entire map layer can be drawn in a single command), and sprites that use the same texture atlas will get the same benefit.
Ah yes, I did misunderstand your original post. But regardless, this will definitely help :)
It's a good first step, I agree.
Thinking this morning about how to structure the batching API, it seems natural to use `me.TextureAtlas`. The whole purpose of this data structure is to store texture coordinates. The coordinate range is a bit different between Canvas and WebGL (pixels vs. floats between 0.0 and 1.0 inclusive), but they require identical behavior.

So I want to rename `me.TextureAtlas` to `me.CanvasRenderer.Texture`, and then extend it as `me.WebGLRenderer.Texture`. That takes care of the texture coordinates. To build batching on top of that, we have multiple options! yay! My favorite so far is lazy batching:
`me.video.renderer.drawImage(texture, index, x, y, w, h)`, where `index` is an atlas index name or number. A reference to the last `texture` is remembered:

- If the texture is the same as the last call, append `index, x, y, w, h` to a `Batcher` object.
- If the texture is different, flush the Batcher first, then append `index, x, y, w, h`.
- Finally, `blitSurface` is called. This function will flush the Batcher object (draw the batch, then reset).

My second favorite is explicit batching, where something has to create a new Batcher, add items to draw, then flush the Batcher when ready. IMHO, this will be harder to get working with sprite objects, since there's really no "dividing line" between how sprites should be grouped. TMX has ObjectGroup for this purpose, but there's no guarantee that each object uses the same texture.
The implementation of this `Batcher` class is TBD, but basically it just needs to take the `x, y, w, h` screen coordinates and assemble the vertices for these, appending them to the vertex list. It also uses the texture atlas `index` to fetch the texture coordinates, and appends them to the texture coordinate list for the batch. The index list (for `drawElements`) is also updated likewise. It has a second method to flush these lists to the GPU before resetting.
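A minimal sketch of what such a `Batcher` might look like (names and buffer layout here are illustrative, not the final implementation):

```javascript
// Illustrative sketch only: a quad batcher that accumulates vertices,
// texture coordinates, and element indices, then flushes in one call.
function Batcher(texture) {
    this.texture = texture;   // atlas shared by every quad in this batch
    this.vertices = [];       // x,y pairs (screen space)
    this.uvs = [];            // u,v pairs from the atlas
    this.indices = [];        // element indices for gl.drawElements
    this.quadCount = 0;
}

// Append one quad: 4 vertices, 4 UV pairs, 6 indices (two triangles).
// "region.uvs" is an assumed shape: [u0, v0, u1, v1] in the 0.0..1.0 range.
Batcher.prototype.add = function (region, x, y, w, h) {
    this.vertices.push(
        x, y,   x + w, y,   x + w, y + h,   x, y + h
    );
    var u0 = region.uvs[0], v0 = region.uvs[1];
    var u1 = region.uvs[2], v1 = region.uvs[3];
    this.uvs.push(u0, v0,  u1, v0,  u1, v1,  u0, v1);
    var base = this.quadCount * 4;
    this.indices.push(base, base + 1, base + 2, base, base + 2, base + 3);
    this.quadCount++;
};

// Draw the whole batch with a single drawElements call, then reset
Batcher.prototype.flush = function (gl) {
    // ... upload this.vertices / this.uvs / this.indices here, then:
    // gl.drawElements(gl.TRIANGLES, this.indices.length, gl.UNSIGNED_SHORT, 0);
    this.vertices.length = this.uvs.length = this.indices.length = 0;
    this.quadCount = 0;
};
```

The key property is that `add` is cheap (array appends only) and all GPU work is deferred to `flush`.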
As you might tell from this last description, the texture batching will actually "undo" a lot of the `bufferData` reuse work that I just did! :wink: But it was a super useful exercise that helped me get more familiar with WebGL.
@agmcleod @obiot Please weigh in with your thoughts, especially in regard to how I want to redesign the Texture Atlas.
My only concern with the second one is that it could be limiting in some way. The first method is the one a developer at Big Viking Games used when adding a WebGL renderer to their Canvas-like API. Similar to what we're doing, really.
Implicit batching is by far easier to manage. And if you cram everything into a single texture atlas (all tile sets, all sprites), then you can theoretically get the best possible performance by executing a single `drawElements` call per frame, without any extra coding or other special setup.
Yep, exactly. I've been trying to get better practice at doing so. It's nicer to see a shorter list of files getting uploaded when I SCP stuff to my site.
:yum:
Also there's something to be said about at least exposing an API to allow custom batching operations. Just in case someone actually wants to manage it themselves for some reason.
I'll start this one. ;)
(… the `meta.app` field.) The `uvMap` property specifies texture coordinates (as WebGL triangles) for the region. These can be passed directly to WebGL. For batching, they need to be concatenated together and used with `drawElements(gl.TRIANGLES, ...)`.
But merge "ticket-620" into master first ;)
The next step is adding the batcher to the WebGL Renderer. I think this won't be too much trouble. It just needs to do a little bit of memory accounting to avoid GC. I'll start with ~1KB memory buffers, and do the usual growth by 2x pattern (never shrinking).
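The memory accounting described above might look something like this (a sketch with illustrative names; the real implementation details are TBD):

```javascript
// Illustrative sketch: grow a typed-array buffer by 2x when full,
// starting at ~1KB, and never shrink -- avoiding per-frame GC churn.
function GrowableFloatBuffer() {
    this.data = new Float32Array(256); // 256 floats * 4 bytes = 1KB
    this.length = 0;                   // floats currently in use
}

// Make sure there is room for `extra` more floats
GrowableFloatBuffer.prototype.ensure = function (extra) {
    var needed = this.length + extra;
    if (needed <= this.data.length) {
        return;
    }
    var capacity = this.data.length;
    while (capacity < needed) {
        capacity *= 2;                 // the usual doubling pattern
    }
    var grown = new Float32Array(capacity);
    grown.set(this.data);              // copy old contents over
    this.data = grown;
};

// Append an array-like of floats to the end of the buffer
GrowableFloatBuffer.prototype.push = function (values) {
    this.ensure(values.length);
    this.data.set(values, this.length);
    this.length += values.length;
};
```

Resetting between frames is then just `buffer.length = 0`, which keeps the allocated capacity.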
Second, I don't want to change the function signature for `drawImage`, since that's already well-established. Instead I'll add a public `bindTexture` method to the Renderer API (there's one right now, but it needs to be replaced) that will remember the provided `Texture` region and use it for batching. Using the method will look something like this:
me.video.renderer.bindTexture(texture.getRegion(name)).drawImage(texture, x, y, ...);
That's the best thing I can think of...
Otherwise it should be really straightforward.
On the other hand ... the `drawImage` method can just introspect the first argument with `instanceof`. For `Image` it will use the original signature. And for `me.video.renderer.Texture` it will expect the region info as the second parameter. Kind of weird, but easy enough!
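That dispatch could look roughly like this (a sketch only; `Texture` stands in for `me.video.renderer.Texture`, and the string return values are placeholders for the two real code paths):

```javascript
// Sketch of dispatching on the first argument's type inside drawImage.
// Texture is a local stand-in for me.video.renderer.Texture.
function Texture(name) { this.name = name; }

function drawImage(source /* , ...rest of the arguments */) {
    // In the browser, Image is the DOM image constructor; the typeof
    // guard keeps this sketch runnable outside a browser too.
    if (typeof Image !== "undefined" && source instanceof Image) {
        return "plain-image";     // original Canvas-style signature path
    }
    if (source instanceof Texture) {
        return "texture-region";  // region info expected as 2nd parameter
    }
    throw new Error("drawImage: unsupported source type");
}
```

The `instanceof` checks keep the public signature unchanged for existing callers while opening a second path for batched texture regions.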
Nice work on this so far, Jason. Do you think using `instanceof` and checking for different parameter types might duplicate/increase logic in the method? We could potentially refactor it into two methods if you think it would be worth it.
It won't cause any duplicate logic; only the code to select source coordinates will be different. It will also be the cleanest interface, IMHO. (This is probably the design I imagined when I originally proposed the batcher; I've just forgotten the details.)
Anyway, let's try it with `drawImage` getting a new fourth function signature, and see how it works.
This part of the work is actually quite involved (more than I had imagined!) It will require some shader rewrites. The best case scenario is the batcher uploads all of the textures, vertex buffers, UV maps, index buffers, etc to the GPU on the first frame, then only needs to provide transformation matrices for all of the sprites on each frame update.
Some new points need to be addressed:

- `gl.MAX_COMBINED_TEXTURE_IMAGE_UNITS`: the batcher will throw an exception if too many textures are used simultaneously.
- `me.ImageLayer` (see: #603) requires power-of-two textures.
- The transformation matrices are way too bulky to send to the GPU for every vertex. The right solution is transforming the vertices in JavaScript, and streaming the results to the GPU!
Each 2D matrix encodes three vectors. It is most efficient to send the matrices once at startup if they rarely change, and let the GPU multiply them repeatedly. WebGL is a state machine; any state you set will only be changed by you.
I have finished merging the two matrix classes into a single class. This is a nice win for code footprint, and it fixes the translate method. Our `Matrix2d` class is sparse; it doesn't do full matrix multiplication. The "hidden row" is never multiplied into (as an optimization). This is fine for most purposes, but will give incorrect results when the hidden row is not `0, 0, 1`. We don't have any matrices like this anyway. :) And affine transformations require it to be `0, 0, 1` for homogeneous coordinates; projecting a 2D plane in 3D space.
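To illustrate the sparse multiplication (a sketch of the idea, not the actual `Matrix2d` code):

```javascript
// A 2D affine matrix stored as [a, b, c, d, tx, ty], with the hidden
// third row implicitly fixed at (0, 0, 1):
//   | a  c  tx |
//   | b  d  ty |
//   | 0  0  1  |   <- never stored, never multiplied (the optimization)
function applyMatrix(m, x, y) {
    return [
        m[0] * x + m[2] * y + m[4],  // a*x + c*y + tx
        m[1] * x + m[3] * y + m[5]   // b*x + d*y + ty
    ];
}
```

This is exactly the per-vertex work that can be done in JavaScript before streaming the transformed vertices to the GPU.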
There are two problems to solve with the TextureCache:

- Objects can't be used directly as hash-table keys. One option is overriding `Object.prototype.toString` to return a string with a unique ID. A second option is adding a unique ID to the object itself. Another option is using an array instead of a hash table, trading O(1) lookups for O(n). A fourth option is using a different data structure, such as the hash map provided by mori or an ES6 `Map`.
- Allowing the `me.Texture` constructor to add itself to the cache.

TODO:

- Use a `Float32Array` and copy its contents directly to the WebGL buffer with a single `bufferData` call. (Replace multiple `bufferSubData` calls.)
- Move the `gl.clear` call into the compositor, so that it is not executed out of order. (Flush the buffer, then run `gl.clear`.)

These are just a few thoughts; there is still a lot more to be done...
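The single-`bufferData` TODO above could be sketched like this (illustrative only; the actual attribute layout is up to the compositor):

```javascript
// Sketch: assemble per-quad floats into one preallocated Float32Array,
// so the whole stream can go to the GPU in a single bufferData call
// instead of many bufferSubData calls.
function assembleStream(quads, out) {
    var offset = 0;
    for (var i = 0; i < quads.length; i++) {
        out.set(quads[i], offset);   // each quad is an array of floats
        offset += quads[i].length;
    }
    // Upload the used portion with a single call (in the renderer):
    // gl.bufferData(gl.ARRAY_BUFFER, out.subarray(0, offset), gl.STREAM_DRAW);
    return out.subarray(0, offset);  // view over the used portion
}
```

`subarray` returns a view (no copy), so nothing is allocated per frame once `out` is sized appropriately.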
As I was writing my next blog article on the new WebGL Compositor, I came up with a set of additional TODOs:

- The static attribute array will be sent with `bufferSubData`, and the dynamic attribute array will be sent in a single `bufferData` call.
- Pack each color `vec4` into a single float with almost no loss in precision!

This last series of patches was to resolve performance issues spotted with the Chrome profiler. It represents an overall improvement in CPU usage from 35% originally down to about 13% now. Here are some profiler screenshots for comparison:
Before: (profiler screenshot)
After: (profiler screenshot)
Both profile snapshots were taken over a ~10 second period. The first screen shows that the majority of CPU time is spent pulling objects out of the pool! That's how I spotted the first HUGE win. Notice the little :warning: icon. Its tooltip reveals information about why the JIT compiler failed to optimize the function.
The second screen shows the incredible performance improvement, which puts the `Compositor` and WebGL `bufferSubData` at the top of the profile. These things should be addressed by the TODOs above.
The takeaway here is that with these low-hanging fruit out of the way, any changes to the `Compositor` will now have a greater (or lesser!) effect on total performance. In other words, it will now be much easier to spot the actual performance gains by implementing each TODO item. And even better, it will make it more obvious when any of these items actually causes a performance regression!
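Circling back to the TODO above about packing a color `vec4` into a single float: one common trick (a sketch of mine, not necessarily what melonJS will ship) is quantizing each channel to 6 bits, so all four channels fit exactly within a float32's 24-bit mantissa:

```javascript
// Sketch: quantize each RGBA channel (0.0..1.0) to 6 bits (0..63) so
// the packed value stays below 2^24 and is therefore exactly
// representable in a 32-bit float -- "almost no loss" vs 8-bit channels.
function packColor(r, g, b, a) {
    var r6 = Math.round(r * 63);
    var g6 = Math.round(g * 63);
    var b6 = Math.round(b * 63);
    var a6 = Math.round(a * 63);
    return ((r6 * 64 + g6) * 64 + b6) * 64 + a6;
}

// Inverse of packColor; the same mod/divide arithmetic can be done in
// the shader with mod() and floor().
function unpackColor(packed) {
    var a6 = packed % 64; packed = (packed - a6) / 64;
    var b6 = packed % 64; packed = (packed - b6) / 64;
    var g6 = packed % 64; packed = (packed - g6) / 64;
    return [packed / 63, g6 / 63, b6 / 63, a6 / 63];
}
```

Sending one float attribute instead of four cuts the per-vertex color bandwidth to a quarter.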
About the save/restore performance, it will be better to move these operations into the methods on the Renderer class. I'll provide an example and describe my thoughts on the issue and how this change will make it better.
If we look at the particle class, we can see right away that each particle must save and restore the context state around every `drawImage` call: https://github.com/melonjs/melonJS/blob/640e5038ccad75f0c4edbca5c8408f2fc0f99bab/src/particles/particle.js#L133-L155 This is just one example, but it highlights a worst-case scenario where hundreds of images are being drawn, each saving and restoring the context. This is actually necessary for Canvas 2D operations to work correctly, since it only has a shared context. However, WebGL does not have this limitation, and we shouldn't emulate it!
Instead, we can push any context information we want to the GPU with WebGL on a per-triangle basis. It is therefore more efficient to send the required contextual information as part of the `drawImage` call, rather than maintaining a stack and a globally shared context.
The `drawImage` method signature will change like this:
drawImage(Image|Object, sx, sy, sw, sh, dx, dy, dw, dh)
where `Object` is a key-value pair that provides additional contextual information like blend color, transformations, and whether to update or preserve the global context:
```javascript
{
    "image" : image,
    "transform" : transform,
    "update" : false
}
```
`update=false` is the default, which will cause the CanvasRenderer to wrap the `drawImage` call in `save`/`restore`; in WebGLRenderer, nothing special will happen. `update=true` is the opposite: nothing happens in `CanvasRenderer`, and the global context gets updated by WebGLRenderer.
Here's a list of the available settings:

Required:

- `texture`: a Texture instance, OR...
- `image`: an Image instance

Optional:

- `update`: a Boolean that causes the global context to be updated with the following values:
  - `transform`: a Matrix2d instance, OR...
  - `pos`: a Vector2d instance, AND...
  - `angle`: a Number [0..2pi], AND...
  - `scale`: a Vector2d instance
  - `color`: a Color instance, OR...
  - `alpha`: a Number [0..1]

We will need the same kind of API update for the `fill*` and `stroke*` methods.
This proposed change will reduce unnecessary overhead in WebGLRenderer by removing the global context emulation. Updating the global context is opt-in, and will no longer require a stack for general purpose image rendering. With these changes, we can get rid of some silly workarounds like setting the global context color to white at the end of drawing: https://github.com/melonjs/melonJS/commit/080c87d8219d6e2521ebb7515534295881389bcb
It's hard to provide negative comments, since you have been on your own on this one for a couple of weeks now, but to be honest I'm not sure I really like your last idea. In my opinion, you should provide the same API for both Canvas and WebGL, and it should keep being transparent for the final end user.

However, I'm not sure why you propose this change, as today the `save`/`restore` functions are called through the renderers, and that should then be managed from there (with the WebGL one being basically an empty function).
@obiot Sorry for not being clear. The API will be the same on both renderers. The internal workings will be quite different, though. What I really want to do here is remove the need to call save/restore at all, even by code like the particle emitter (linked above) and the animation sheet class.
Preserving the semantics of a global context state is a bad idea, in my honest opinion. WebGL doesn't need it, and we can hide it entirely inside the CanvasRenderer. In other words, the renderer API should reflect the best-case scenario; it should expose everything we can do with WebGL, and not abstract it away to look like Canvas 2D.
oh I see, so then yes I think it makes sense :P
FYI, under FF 28 (and higher), shader compilation fails with the following message:
```
me.video.Error: ERROR: 0:10: break disallowed outside switch/loop body
ERROR: 0:17: break disallowed outside switch/loop body
ERROR: 0:24: break disallowed outside switch/loop body
ERROR: 0:31: break disallowed outside switch/loop body
ERROR: 0:38: break disallowed outside switch/loop body
ERROR: 0:45: break disallowed outside switch/loop body
ERROR: 0:52: break disallowed outside switch/loop body
ERROR: 0:59: break disallowed outside switch/loop body
ERROR: 0:66: break disallowed outside switch/loop body
ERROR: 0:73: break disallowed outside switch/loop body
ERROR: 0:80: break disallowed outside switch/loop body
ERROR: 0:87: break disallowed outside switch/loop body
ERROR: 0:94: break disallowed outside switch/loop body
ERROR: 0:101: break disallowed outside switch/loop body
ERROR: 0:108: break disallowed outside switch/loop body
ERROR: 0:115: break disallowed outside switch/loop body
```
Under Firefox 34 on Windows 8.1, everything works (after removing the comments in the shaders); with the comments there, I get a similar error.
Yeah, there are problems in this shader. The issue with the comments is in the grunt-replace task; it doesn't remove comments properly on Windows: https://github.com/melonjs/melonJS/blob/389e6435e75c67c2099e9441b898f27f3fafcdc5/Gruntfile.js#L87-L88 I suspect that adding `$` to the double-slash side of the RegExp will fix that, but I am unable to test on Windows.
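For illustration, a CRLF-tolerant stripping pass might look like this (a sketch only; the actual fix belongs in the Gruntfile regex linked above):

```javascript
// Sketch: strip GLSL comments in a way that tolerates Windows CRLF
// line endings by excluding \r as well as \n from line comments.
function stripGlslComments(src) {
    return src
        .replace(/\/\*[\s\S]*?\*\//g, "")    // /* block comments */
        .replace(/[ \t]*\/\/[^\r\n]*/g, ""); // // line comments
}
```

The important detail is `[^\r\n]` instead of `[^\n]` or `.*$`: matching up to (but not including) either line-ending character means a trailing `\r` never gets swallowed into the comment on Windows files.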
The problem with Firefox 28 (a really old browser, BTW!) looks like it is optimizing the GLSL by unrolling the loop (GOOD!), but the break statement confuses it in the unrolled output.
I get similar weird behavior with this shader when passing it to glsl-optimizer. It fails to compile unless I replace the "dynamic index" with a constant. This is because glsl-optimizer does not unroll the loop (BAD!)
The obvious thing would be to run it through an optimizer that unrolls the loop and strips comments and whitespace, etc. glsl-optimizer isn't doing that for us, so I might have to just write a dumb parser specifically for this.
Also take note of the comment in the fragment shader; to get the most out of the GPU, the loop needs to be sized appropriately for the GPU. I have it hardcoded to use the number of texture units in my laptop's GPU at the moment. Unrolling the loop would have to be done at runtime to make that happen.
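A runtime-unrolling approach could generate the GLSL source as a string, sized by `gl.getParameter(gl.MAX_TEXTURE_IMAGE_UNITS)` instead of a hardcoded constant (the names `vTexId` and `uSampler` below are illustrative, not the shader's actual variables):

```javascript
// Sketch: build an unrolled sampler-selection snippet at runtime, so no
// loop (and no break statement) ever reaches the GLSL compiler.
function buildSamplerSelect(maxUnits) {
    var lines = [];
    for (var i = 0; i < maxUnits; i++) {
        var cond = (i === 0) ? "if" : "else if";
        lines.push(
            cond + " (vTexId == " + i + ".0) { " +
            "gl_FragColor = texture2D(uSampler[" + i + "], vTexCoord); }"
        );
    }
    return lines.join("\n");
}
```

Because GLSL ES 1.00 only allows constant-index-expressions into sampler arrays, unrolling to literal indices like this also sidesteps the dynamic-index problem that trips up glsl-optimizer.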
Indeed, FF28 is old, but I had the same issue after updating to 31. On my side it was on OS X, though (but that probably doesn't change anything).
@obiot The fragment shader will be replaced. But FTR what GPU is in your Mac?
MBA Intel HD5000 :)
I just tried on my Windows machine (FF34, Intel HD4000) and Firefox just crashed when I tried to open the platformer :P:P:P
Lots of stuff happening here. :smiley: I finally replaced the multiple `bufferSubData` calls with a single call. Next I will split the stream buffer into two buffers: one that is updated often (vertices and texture coordinates) and one that is updated infrequently (color and texture index). Technically the color will change often when objects are fading and such, so we'll have to do some tuning on this stuff. But this should be a good start.
I added the separate buffers patch to a new branch: https://github.com/melonjs/melonJS/compare/experimental/WebGL_static_buffer (See previous commit log for details)
This new code requires some additional CPU time for hashing the static buffer. Hashing should be the fastest way to check whether the static attributes for a quad have changed... Although to be honest I haven't tried a naïve approach with a ton of if-statements for each array element. ;) Something tells me that won't perform quite as well.
The additional CPU time required is:
The scale is roughly linear, so expect 32% additional CPU usage for 16,000 quads (the current limit for a single batch operation). It's unclear if this additional CPU overhead is worth the GPU bandwidth reduction. I'll maintain this new branch in parallel with master. At least until it can be determined if it's useful or not.
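A change-detection hash over the static attributes could be as simple as FNV-1a (an illustrative sketch, not necessarily what the branch actually uses):

```javascript
// Sketch: FNV-1a over the bytes of a Float32Array, used to detect
// whether a quad's static attributes changed since the last frame.
function hashFloat32(arr) {
    var bytes = new Uint8Array(arr.buffer, arr.byteOffset, arr.byteLength);
    var hash = 0x811c9dc5;                         // FNV offset basis
    for (var i = 0; i < bytes.length; i++) {
        hash ^= bytes[i];
        hash = Math.imul(hash, 0x01000193) >>> 0;  // FNV prime, 32-bit
    }
    return hash;
}
```

`Math.imul` keeps the multiply in 32-bit integer space (a plain `*` would lose precision past 2^53), which is also why this stays cheap enough to run per quad.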
@ldd The fragment shader should now build properly on your environment. The trick was using the `preserveOrder` option in the grunt-replace task. :smiley:
Yep, good job! I confirm it fixes the build issue, at least on my MBA (using Firefox); I will try tomorrow on my Windows machine.
Unrelated, but since I did an `npm install`, the jasmine task is now failing:
```
Running "jasmine:src" (jasmine) task
Testing jasmine specs via PhantomJS
>> Error caught from PhantomJS. More info can be found by opening the Spec Runner in a browser.
Warning: SyntaxError: Parse error Use --force to continue.
Aborted due to warnings.
```
and I don't see anything in the spec runner. Do you also have that issue ?
Yeah, I do see that now. It didn't happen in my earlier tests last night. But I got the build failure notice from Travis-CI this morning.
Working on it!
Alright, I think it's finally in a pretty stable state! The stuff I did tonight focuses on customizability of the WebGL environment. I don't want to tie any users down in regards to how they use WebGL. So now it's possible to use an entirely custom Compositor class by passing the `compositor` option to `me.video.init()`! I don't think I'll be writing a different compositor any time soon, but I like that we can provide this flexibility for others who wish to experiment.
Another important change is that the `me.video.shader.createShader()` method is now independent of the singleton, allowing it to compile multiple shader programs. I will be using this in our default compositor for the line-rendering (e.g. `fillStroke`) shader program. The compositor just needs to flush when switching between shader programs.
Most of my TODO lists are already done, which is exciting! And I heard today from @ldd that his tests show a nice improvement in rendering speed. Apparently in his tests, CanvasRenderer is capable of 142 objects max and WebGLRenderer is capable of over 500. This is a good start, but I want more! :)
With the last few commits, stroke (line rendering) is finally in place. It's not efficient, though. At first glance, it appears that the depth buffer can make it very efficient; we just need a way to get the Z-coordinate information into the compositor. That will likely depend on the work in #637
In the meantime, there are a few FIXME comments that need to be addressed (especially with how the uniform variables are set, and the attribute bindings are handled).
Second to that, getting fonts working (and in particular replacing the RTT thing in the debugPanel) is a priority for release. There's also a weird ghosting effect seen on the debugPanel with WebGLRenderer. That needs to be investigated further.
Started working on font support in WebGL. The hack in the branch is pretty ugly, but it does make the me.Font API consistent! (Solves #619)
It's currently very slow with the font_text example, because it spends most of its time creating and uploading massive textures. :laughing: The secondary texture cache (proposed in the commit) will help that a little bit.
A better way to support fonts in WebGL will be important long-term, but this will work for 2.1!
Awesome! For now we can recommend keeping usage of the me.Font api simple, or to use Canvas instead :)
@parasyte if you don't mind, could you maybe create one or several small tickets to better identify what's left to be done for this one?
Oh sorry, I missed that. But in my defense, this ticket is super long now ;P
Did you guys see that? http://patriciogonzalezvivo.com/2015/thebookofshaders/
Closing this. Followup ticket is #637
This should reduce heavy buffer usage and the number of draw calls required, leading to a performance boost.