mikke89 / RmlUi

RmlUi - The HTML/CSS User Interface library evolved
https://mikke89.github.io/RmlUiDoc/
MIT License

Batch rendering #458

Open Warhate opened 1 year ago

Warhate commented 1 year ago

Hello,

I am integrating RmlUi into my custom engine. RmlUi looks great, and as far as I'm concerned there is no real alternative to it.

When I load the benchmark sample, there is a huge performance drop.

How I draw it:

Everything works except the performance: I end up with about 2.5k calls to DrawIndexed. This can be optimized by keeping all geometry in shared buffers, like one buffer for all vertices and one for all indices, which improves performance, but the 2.5k DrawIndexed calls remain, and that number is far too big.

The other solution is batch rendering.

How I draw it batched:

I separate geometry based on texture id (TextureHandle): if two pieces of geometry get the same texture id during CompileGeometry, I put them into the same batch, and then render those batches. With this approach I have 8 calls to DrawIndexed instead of 2.5k, and much better performance.

But this approach has one big issue: the draw order can no longer be preserved, so some elements are rendered under others.

For instance:

struct GuiBatch
{
     VertexBuffer v;   // vertices of all geometry sharing this batch's texture
     IndexBuffer i;    // indices rebased into this batch's vertex buffer
     int indexCount;   // number of indices to draw for the whole batch
     int textureID;    // batch key: geometry with the same TextureHandle
};

This can be resolved by batching geometry by textureID + orderLayer. All geometry is sorted by order layer, and many pieces of geometry can share the same layer: for instance, a background gets orderLayer = 0 and a button gets orderLayer = 1. With that, batch rendering works without ordering issues.
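
For illustration, a rough sketch of what such a composite batch key could look like (GuiBatchKey and orderLayer are made-up names here, not RmlUi API):

#include <map>

// Hypothetical batch key: geometry is grouped by texture *within* an order
// layer, so ordering across layers is preserved while geometry sharing both
// a layer and a texture still collapses into a single draw call.
struct GuiBatchKey
{
    int textureID;
    int orderLayer; // e.g. background = 0, button = 1, ...

    bool operator<(const GuiBatchKey& other) const
    {
        // Sort by layer first, so lower layers always draw first.
        if (orderLayer != other.orderLayer)
            return orderLayer < other.orderLayer;
        return textureID < other.textureID;
    }
};

// Batches can then live in a std::map<GuiBatchKey, GuiBatch> and be drawn
// by iterating the map in key order.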

Another solution is adding a z-coordinate to the translation and using the depth buffer. It requires some additional steps on the rendering side, but it should also work well. (While writing this, one idea came to mind: I can increase the z-coordinate manually during RenderCompiledGeometry, and that should work. I will reply with more updates, if any.)
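
For what it's worth, a rough sketch of that idea (the member names and how z reaches the shader are assumptions; only the RenderCompiledGeometry override is part of the RmlUi render interface):

void MyRenderInterface::RenderCompiledGeometry(Rml::CompiledGeometryHandle geometry,
                                               const Rml::Vector2f& translation)
{
    // Give each submitted geometry a slightly larger z, so that with the
    // depth test enabled, the depth buffer reproduces the submission order.
    current_z += z_step; // e.g. z_step = 1.0f / expected_draws_per_frame
    // ... store (translation.x, translation.y, current_z) in the per-draw
    // constants and record this draw for the batched submission.
    // current_z must be reset at the start of every frame.
}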

Could you please advise whether there is any way to get this additional ordering information, or whether there is a better way to batch-render geometry for better performance?

Warhate commented 1 year ago

The approach with manual z-ordering works so far. Maybe there will be other issues with it, but for now it works.

Perhaps this issue will be helpful to someone.

mikke89 commented 1 year ago

Hey, nice to hear that you like the library!

I've experimented with batching myself, and I find this topic quite interesting. Please also see this discussion I posted here: https://github.com/mikke89/RmlUi/discussions/440.

First of all, I want to say that CSS has a very specific and quite complex render order. The render commands are assumed to be rendered in the order submitted. Anything messing with this order will probably produce incorrect results in at least some situations.

Due to transparent objects, we cannot use depth buffers, since the render order is decisive when using alpha blending.

What I've found with the OpenGL 3 renderer is that a lot of the performance issues come from generating and switching vertex array objects (VAOs). Instead, I tried to stream all the geometry every frame: submit all geometry in one single call, then submit each of the render commands (indices into this buffer) individually, switching textures and other state as necessary. These changes alone made it much, much faster. From there on, it would be possible to submit all the geometry with the same state in single draw calls.
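
Roughly, the streaming approach could look like this in OpenGL (the buffer and command names here are illustrative, not the actual backend code):

// Accumulate all vertices/indices for the frame into two CPU-side arrays,
// upload them once, then issue one draw per render command via offsets.
glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);
glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(Rml::Vertex),
             vertices.data(), GL_STREAM_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, stream_ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, indices.size() * sizeof(GLuint),
             indices.data(), GL_STREAM_DRAW);

for (const DrawCommand& cmd : commands)
{
    glBindTexture(GL_TEXTURE_2D, cmd.texture); // plus other state as needed
    glDrawElementsBaseVertex(GL_TRIANGLES, cmd.index_count, GL_UNSIGNED_INT,
                             (const void*)(cmd.first_index * sizeof(GLuint)),
                             cmd.base_vertex);
}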

A further step would be to bind all the textures used simultaneously, and then use an index to fetch the correct texture. For example, in OpenGL you could use an array texture, though this has some limitations to be worked around. Then you can submit all the geometry in one call regardless of texture (and possibly use a white texture for non-textured calls). The same goes for other state; for example, translations and transforms could be placed in look-up arrays. Clipping and stenciling would probably still need state changes, but that should be rare enough not to cause any issues.
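
To sketch the array texture variant (illustrative only; all layers must share the same dimensions, which is one of the limitations to work around):

// One GL_TEXTURE_2D_ARRAY holds every UI texture as a layer; each vertex
// carries a layer index so a single draw call can mix textures.
glBindTexture(GL_TEXTURE_2D_ARRAY, texture_array);
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, width, height, num_layers,
             0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
// Upload each UI texture into its own layer:
glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, layer, width, height, 1,
                GL_RGBA, GL_UNSIGNED_BYTE, pixels);

// In the fragment shader (GLSL), the layer index selects the texture;
// layer 0 can be a plain white texture for non-textured geometry:
//   uniform sampler2DArray ui_textures;
//   frag_color = vertex_color * texture(ui_textures, vec3(uv, layer));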

I'm also looking into experimenting a bit with alternatives to streaming the geometry. For example, in OpenGL we can use a separate attribute format to decouple the geometry format from the data buffers. I'm curious to see if we can get some of the same speed-ups with this approach as with streaming. It's not fully clear, but it sounds like you're working with DirectX; I'm not sure how this specifically translates there.
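
For reference, a sketch of that separated attribute format, assuming ARB_vertex_attrib_binding (core since OpenGL 4.3; the attribute locations are assumptions):

// The vertex *format* is declared once on the VAO; switching geometry then
// only rebinds a buffer, instead of building and switching a VAO per geometry.
glVertexAttribFormat(0, 2, GL_FLOAT, GL_FALSE, offsetof(Rml::Vertex, position));
glVertexAttribFormat(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, offsetof(Rml::Vertex, colour));
glVertexAttribFormat(2, 2, GL_FLOAT, GL_FALSE, offsetof(Rml::Vertex, tex_coord));
for (GLuint attrib = 0; attrib < 3; ++attrib)
{
    glVertexAttribBinding(attrib, 0); // all attributes read from binding 0
    glEnableVertexAttribArray(attrib);
}

// Per compiled geometry: no VAO switch, just point binding 0 at its buffer.
glBindVertexBuffer(0, geometry_vbo, 0, sizeof(Rml::Vertex));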

wh1t3lord commented 1 year ago

@mikke89 For that you need glVertexAttribPointer, but fundamentally the speedup comes from general streaming, i.e. sending all collected geometry in one draw call, using one buffer for the vertices and one for the indices.

Regarding textures: have you thought about generating an atlas? Beyond that, you can't do anything better than uploading the required textures on demand, and of course only for the elements visible to the user.

But some textures have to be cached, to minimize disk operations and avoid loading/unloading them repeatedly. Textures for buttons, grids, and the like should always stay loaded, since they are likely used throughout page rendering.

Other custom images are better streamed on a separate thread, so that the main thread is not blocked by IO operations.

It would be ideal if we could generate low-resolution textures, though the user should also be able to provide them (and arguably should!). It is a valuable feature when we talk about GPUs with a low memory budget, like 32 MB (a realistic range is 256 MB to 1 GB).

With that much memory you can't handle a page with 4K images in one visible region. It is worth keeping these things in mind, because they motivate a universal streaming approach that adapts to the memory budget the user has.

Otherwise you repeat the modern trend of writing lazy software, where loading a simple page consisting of two buttons requires at least 6 GB of VRAM. I am exaggerating, but you get the point. (Sadly, the point still holds to some approximation; real examples can exist...)

Warhate commented 1 year ago

@mikke89 Thank you for your reply.

#440 sounds interesting; some of my thoughts about it:

As I understand it, there will be a single buffer, but many RenderCommandGeometry entries with offsets. It is the same approach as before, except we no longer need separate buffers, and we can do something like:

// Bind the shared vertex/index buffers once, then issue one draw per
// geometry using its index count and offset into the shared buffers.
gfxApi->SetVB(sharedVertexBuffer);
gfxApi->SetIB(sharedIndexBuffer);

for (auto& geometry : m_renderData)
{
    gfxApi->DrawIndexed(geometry.indexCount, geometry.indexOffset);
}

This will help minimize data transfer, but the issue of many draw calls will still be there. However, maybe it can be compiled further on top of that.

I use a gfx-api-agnostic abstraction (for now the backend is DirectX, but there will be others).

The heaviest performance hit for me is the number of draw calls. Yes, part of it is possibly due to switching geometry buffers, but I think it will still be there even with a single buffer.

How it works (almost): there is still the issue with alpha blending, but I think it can be resolved in some way (maybe with order-independent transparency).

On compile geometry, I create/update batch structures (I need to keep the vertex/index data on the CPU side, but it is a small amount and should not be an issue for now). I add or remove vertices/indices in those CPU buffers and update all indices in them according to the offsets.
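
As a rough sketch of that bookkeeping (hypothetical names; the vertex/index arguments are the ones passed to CompileGeometry):

// Append the new geometry to its batch and rebase the incoming indices by
// the batch's current vertex count, so they index into the shared buffer.
GuiBatch& batch = batches[key];
const int base_vertex = (int)batch.cpu_vertices.size();
batch.cpu_vertices.insert(batch.cpu_vertices.end(),
                          vertices, vertices + num_vertices);
for (int i = 0; i < num_indices; ++i)
    batch.cpu_indices.push_back(indices[i] + base_vertex);
batch.dirty = true; // re-upload this batch's GPU buffers before drawing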

Also, I need to handle translation on the CPU side (though it is possible to do it on the GPU as well, later, via the look-up tables you mentioned. It should be measured, though; there may not be a big difference between managing it on the CPU and on the GPU).

Additionally, there is an issue with rendering when some part of a batch should not be drawn in a given call. For now I handle that in a somewhat hacky way: I simply move it behind the screen with the z-coordinate. But it works well.
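
A sketch of that hack (names are made up; it assumes vertices with clip-space z beyond the far plane get clipped away):

// Hypothetical: to hide part of a batch for a frame without rebuilding its
// buffers, push those vertices outside the clip volume so they are clipped.
for (int i = first_vertex; i < first_vertex + num_vertices; ++i)
    batch.cpu_vertices[i].z = 10.0f; // beyond the far plane: never rasterized
batch.dirty = true;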

At the current moment, the integration covers all my needs, but as further steps I see the following:

Thank you again, and I am looking forward to the new rendering API =)