Optimize draw batches by using VAOs, UBOs, buffering, single buffer uploads and separate thread for rendering

soywiz commented 1 year ago

While the new SDF-based UI looks really beautiful, it introduces new performance requirements that weren't there before. The previous UI used a 9-patch; it was less flexible, but it worked well alongside bitmap fonts, since the current KorGE batcher supports up to 4 textures in a single batch, so we could render buttons and text in a single batch call. Some optimizations were introduced to cache whole sub-graphics, so we can, for example, cache a whole UI window, or all or part of the UI, so that it doesn't affect the rest of the game's updating. Still, we want to be faster.

Now, rendering the button background and rendering its text are two separate batches. We need to set the vertices, the uniforms and the data for the background, and then switch back to text rendering.

Each batch is slow because it requires setting the vertex data, attributes and uniforms. Profiling with VisualVM identifies the hot spots here: [Screenshot 2022-11-16 at 14:43:16]

Since we are already using lists that would allow executing work in parallel, we can start optimizing some things while staying future-proof for other backends. The idea here is to use proper VAOs and UBOs, and also to buffer everything to the end of the list. This will require either the relevant extensions to be available, or WebGL >= 2 and OpenGL ES >= 3.0. You can read more about this here: https://webgl2fundamentals.org/webgl/lessons/webgl2-whats-new.html

On desktop, we should have that functionality already.
On iOS, it has been there for ages.
On Android, 6.7% of devices still only have OpenGL ES 2.0 available: https://developer.android.com/about/dashboards
On Web, WebGL 2 is at 93.74% (https://caniuse.com/webgl2) versus 98.12% for WebGL 1 (https://caniuse.com/webgl).

On Discord, people are heavily in favor of going forward with this:

[Screenshot 2022-11-17 at 08:52:40]

So, for example, if we know the layout of the attributes and uniforms beforehand, we can construct a single vertex buffer and a single uniform buffer containing everything, upload them once to the GPU, and then execute really small commands that select memory areas in those buffers to do small render batches. This should substantially increase the number of batches we can do per frame. In addition, if we can keep the game code in a separate thread (K/N should now support that) and have the rendering code in the UI thread consume commands, performance should improve a lot. We can probably gain one to two orders of magnitude, which will pave the way for future needs, especially 3D later.
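A minimal sketch of that packing idea, on the CPU side only. Every name here (`DrawCommand`, `PackedFrame`, `FramePacker`, `exampleFrame`) is hypothetical and not part of KorGE's batcher API, and the alignment is counted in floats purely for illustration. In real OpenGL, the offset passed to `glBindBufferRange` for a uniform buffer must be a multiple of `GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT`, which is why each uniform block is padded before it is appended.

```kotlin
// Hypothetical names throughout; this is not KorGE's actual batcher API.

/** A small draw command that only selects ranges inside the shared buffers. */
data class DrawCommand(
    val vertexOffset: Int,   // index of the first vertex in the shared vertex buffer
    val vertexCount: Int,    // number of vertices to draw
    val uniformOffset: Int,  // float offset of this batch's uniform block
    val uniformSize: Int     // number of floats in the uniform block
)

/** Everything needed for one frame: two buffers to upload once, plus the commands. */
class PackedFrame(
    val vertices: FloatArray,
    val uniforms: FloatArray,
    val commands: List<DrawCommand>
)

/** Accumulates batches on the CPU so the GPU upload can happen once per frame. */
class FramePacker(private val floatsPerVertex: Int, private val uniformAlignment: Int = 1) {
    private val vertexData = ArrayList<Float>()
    private val uniformData = ArrayList<Float>()
    private val commands = ArrayList<DrawCommand>()

    fun addBatch(vertices: FloatArray, uniforms: FloatArray) {
        require(vertices.size % floatsPerVertex == 0) { "vertex data ends mid-vertex" }
        // Pad so that every uniform block starts at an aligned offset (assumed, in floats).
        while (uniformData.size % uniformAlignment != 0) uniformData.add(0f)
        commands.add(
            DrawCommand(
                vertexOffset = vertexData.size / floatsPerVertex,
                vertexCount = vertices.size / floatsPerVertex,
                uniformOffset = uniformData.size,
                uniformSize = uniforms.size
            )
        )
        for (v in vertices) vertexData.add(v)
        for (u in uniforms) uniformData.add(u)
    }

    fun build(): PackedFrame =
        PackedFrame(vertexData.toFloatArray(), uniformData.toFloatArray(), commands.toList())
}

/** Packs a button background and its text as two batches sharing the same buffers. */
fun exampleFrame(): PackedFrame {
    val packer = FramePacker(floatsPerVertex = 2, uniformAlignment = 4)
    // Background: one triangle, one RGBA color as uniforms.
    packer.addBatch(floatArrayOf(0f, 0f, 1f, 0f, 0f, 1f), floatArrayOf(1f, 0f, 0f, 1f))
    // Text: two triangles, an RGBA color plus one extra float (for example an SDF smoothing value).
    packer.addBatch(FloatArray(12) { it.toFloat() }, floatArrayOf(1f, 1f, 1f, 1f, 0.5f))
    return packer.build()
}
```

With this shape, the per-batch work at draw time reduces to binding a range of the uniform buffer and issuing a draw with a vertex offset and count, while `vertices` and `uniforms` are each uploaded once per frame.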

This is an epic that will require a lot of small tasks to complete. Since this is an optimization, it shouldn't affect other areas of work. Even if the code is slow now, this will allow it to eventually be faster, as we are already doing indirect buffered rendering through rendering lists.
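The split between a thread that builds the rendering list and a thread that consumes it could look roughly like the sketch below. It is JVM-only (it relies on `java.util.concurrent`, so a K/N target would need a different queue), and `RenderCommand`, `Draw`, `EndOfFrame` and `renderFrame` are hypothetical names chosen for illustration, not KorGE types.

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

// Hypothetical command types; a real rendering list would also carry state changes.
sealed class RenderCommand
data class Draw(val vertexOffset: Int, val vertexCount: Int) : RenderCommand()
object EndOfFrame : RenderCommand()

/** Runs one frame: a worker thread produces commands, the calling thread consumes them. */
fun renderFrame(batchCount: Int, execute: (Draw) -> Unit) {
    val queue = LinkedBlockingQueue<RenderCommand>()

    // Producer: stands in for game code building the rendering list off the render thread.
    val producer = thread(name = "render-list-builder") {
        for (i in 0 until batchCount) {
            queue.put(Draw(vertexOffset = i * 6, vertexCount = 6))
        }
        queue.put(EndOfFrame)
    }

    // Consumer: the thread that owns the GL context drains commands in order.
    loop@ while (true) {
        when (val cmd = queue.take()) {
            is Draw -> execute(cmd)
            EndOfFrame -> break@loop
        }
    }
    producer.join()
}
```

Calling `renderFrame(3) { draw -> /* issue the GL draw call here */ }` runs the callback on the calling thread, in the order the commands were queued, so all GPU calls stay on one thread.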

soywiz commented 1 year ago

Already prepared for UBOs. Separate-thread rendering is not implemented, but deferred. Single buffer uploads are already implemented in the batcher.

soywiz commented 1 year ago

UBOs & VAOs implemented here: https://github.com/korlibs/korge/pull/1503

Optimize draw batches by using VAOs, UBOs, buffering, single buffer uploads and separate thread for rendering #1116