Performances and display lists

natinusala commented 7 years ago

I've been trying to run nanovg on a Raspberry Pi 3 B to see how it performs (GLES2). Well, it's very slow (<10fps). I made a profile of the example: https://pastebin.com/rY3LfLAF

You can easily see that almost 50% of all the calls are from drawing and calculation functions. I already implemented the AFD tessellation function from issue #328, with no great improvement.

This leads me to display lists. I saw that multiple issues mention them; are they still a thing? I believe they would fix the performance issues on embedded devices such as Android or the Rpi.

ytsedan commented 7 years ago

Let me share some notes and observations on performance, as these kinds of questions pop up here regularly.

Display lists are still a thing if your performance problems are CPU bound. They will help if you render the same shape over multiple frames and only want to apply affine transforms to it (note: heavy scaling will be noticeable, as edges become blurry when the aa fringe is scaled up, and you might see the tessellation error). However, if you have lots of dynamic shapes that change each frame, display lists will not help at all, as you still need to tessellate the shape again and again. On a side note, if you want to render the same static shape multiple times in the same frame (as you often do in 2D tile-based games), the shape will be transferred to the GPU multiple times, which is costly too. For this use case it’s a good idea to render the shape to a bitmap and then draw the bitmap multiple times. You can check out my attempt at display lists in my fork of this project.
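
A minimal sketch of the display-list idea described above, in C: tessellate once, keep the vertices, and on later frames only run them through an affine transform. The `Vert` and `DisplayList` types and `display_list_transform` are hypothetical names for illustration, not nanovg API. The 6-float layout `[a b c d e f]` follows the convention nanovg uses for its transforms. In a real renderer the transform would more likely be applied on the GPU; the point here is only that no re-tessellation happens.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not part of nanovg's API. */
typedef struct { float x, y; } Vert;

typedef struct {
    const Vert *verts;  /* tessellated once, reused on every frame */
    size_t count;
} DisplayList;

/* Apply the affine transform t = [a b c d e f] to the cached vertices:
 *   x' = a*x + c*y + e
 *   y' = b*x + d*y + f
 * out must have room for dl->count vertices. */
static void display_list_transform(const DisplayList *dl, const float t[6], Vert *out)
{
    size_t i;
    assert(dl != NULL && t != NULL && out != NULL);
    for (i = 0; i < dl->count; i++) {
        float x = dl->verts[i].x;
        float y = dl->verts[i].y;
        out[i].x = t[0] * x + t[2] * y + t[4];
        out[i].y = t[1] * x + t[3] * y + t[5];
    }
}
```

As noted above, this only pays off for shapes that stay the same between frames; a shape whose path changes every frame still has to be tessellated again.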

Regarding lower performance devices, I did some optimizations to improve rendering performance on older iOS devices (iPad 3 generation). These devices especially have poor pixel shader performance; not sure what the problem is with raspberries, though. Here are some things you might try to improve GPU performance:

1) Split the Übershader into multiple smaller shaders. nanovg uses a single shader with multiple branches for different tasks; splitting it into separate shaders and switching shaders between draw calls helped a lot for me.

2) Scissoring. nanovg does scissoring right in the pixel shader; the advantage is that scissor rects do not have to be axis-aligned. However, removing this code from the shader and just using glScissor did help performance in my use case (drawback: you have to get along with axis-aligned clipping, which was fine for me).

3) In my app I’m mainly drawing axis-aligned rectangles (for a UI system), and for this the whole nanovg tessellation is kind of overkill. So I added support for drawing simple rectangles that just boil down to two triangles (no anti-aliasing) and a solid color (or textured) pixel shader. However, nanovg will still create a draw call for each of these rectangles, so there is room for improvement here.
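
To illustrate point 3 and the remaining draw-call overhead, here is a rough sketch in C of expanding axis-aligned rectangles into two triangles each and collecting them in one vertex buffer. `RectVert`, `RectBatch`, and `rect_batch_push` are hypothetical names, not code from nanovg or from the fork mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vertex layout: position plus texture coordinates. */
typedef struct { float x, y, u, v; } RectVert;

/* Caller-owned buffer that accumulates rectangles for a single draw. */
typedef struct {
    RectVert *verts;
    size_t count;  /* vertices written so far */
    size_t cap;    /* capacity in vertices */
} RectBatch;

static RectVert rect_vert(float x, float y, float u, float v)
{
    RectVert r;
    r.x = x; r.y = y; r.u = u; r.v = v;
    return r;
}

/* Append one axis-aligned rectangle as two triangles (6 vertices, no
 * anti-aliasing fringe). Returns 1 on success, 0 if the batch is full. */
static int rect_batch_push(RectBatch *b, float x, float y, float w, float h)
{
    RectVert *v;
    float x1 = x + w, y1 = y + h;
    assert(b != NULL && b->count <= b->cap);
    if (b->cap - b->count < 6) return 0;
    v = b->verts + b->count;
    v[0] = rect_vert(x,  y,  0.0f, 0.0f);  /* top-left     */
    v[1] = rect_vert(x1, y,  1.0f, 0.0f);  /* top-right    */
    v[2] = rect_vert(x1, y1, 1.0f, 1.0f);  /* bottom-right */
    v[3] = rect_vert(x,  y,  0.0f, 0.0f);  /* top-left     */
    v[4] = rect_vert(x1, y1, 1.0f, 1.0f);  /* bottom-right */
    v[5] = rect_vert(x,  y1, 0.0f, 1.0f);  /* bottom-left  */
    b->count += 6;
    return 1;
}
```

Once a frame's rectangles are in the batch, the whole buffer could in principle be submitted with a single `glDrawArrays(GL_TRIANGLES, 0, count)` call, provided they all share the same shader and texture.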

A new project has appeared on GitHub that addresses these issues by basically re-implementing most of nanovg. It’s not finished yet, but it looks very promising and you might want to keep an eye on it. https://github.com/jdryg/vg-renderer

natinusala commented 7 years ago

Thanks for your very detailed answer!

Here, I think that the bottleneck is the CPU, as you can see on the profile. The Rpi is capable of rendering PS1 games @60FPS, I guess that the GPU is good enough for nanovg.

Your fork of nanovg doesn't compile for me, because of a C99 compliance issue on a for loop. The premake4 might be outdated, since your fork isn't up to date with the original repository.

I will try the improvements you suggested tho, but my skills in graphics programming are not very good x)

olliwang commented 7 years ago

@ytsedan Thanks for your hints. Would you mind sharing the code you modified?

lieff commented 7 years ago

I think CPU usage can be heavily reduced by using a cache of tessellated shapes. I.e. before nvgFill/Stroke we call something like nvgGetTesselatedShape, then reuse the result on the next frame if the shape has not changed (or even apply a matrix transform to it).
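
One way such a cache could detect "not changed" is to hash the path commands and compare the hash with the one stored alongside the cached tessellation. The sketch below is only an assumption about how this might look: `ShapeCacheEntry`, `hash_path`, and `shape_cache_hit` are made-up names, and the hash is plain 64-bit FNV-1a over the raw float bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache slot; the tessellated vertices would live next to key. */
typedef struct {
    uint64_t key;  /* hash of the path that produced the cached tessellation */
    int valid;     /* 0 until the slot has been filled once */
} ShapeCacheEntry;

/* 64-bit FNV-1a over the raw bytes of the path command floats. */
static uint64_t hash_path(const float *cmds, size_t n)
{
    const unsigned char *p = (const unsigned char *)cmds;
    uint64_t h = 14695981039346656037ULL;
    size_t i;
    for (i = 0; i < n * sizeof(float); i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if the cached tessellation can be reused. Returns 0 if the
 * path changed; the slot is then re-keyed and the caller is expected to
 * tessellate again and store the new result. */
static int shape_cache_hit(ShapeCacheEntry *e, const float *cmds, size_t n)
{
    uint64_t h;
    assert(e != NULL && cmds != NULL);
    h = hash_path(cmds, n);
    if (e->valid && e->key == h) return 1;
    e->key = h;
    e->valid = 1;
    return 0;
}
```

Hashing raw bytes treats `0.0f` and `-0.0f` as different paths, and a hash match is not a strict proof of equality, so a stricter version might compare the stored commands as well.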

olliwang commented 7 years ago

I have an app that usually renders more than 30 thousand lines per frame, and it seems the CPU part is not the problem at all.

Here's a simple benchmark where I called all drawing functions followed by nvgCancelFrame(). [image: screen shot 2017-06-21 at 17 59 40]

And here's the same code followed by nvgEndFrame(). [image: screen shot 2017-06-21 at 17 58 53]

SpinyOwl commented 6 years ago

Hello guys, do you have any updates?

lieff commented 6 years ago

@ShchAlexander I ran some experiments and found that a tessellated cache consumes too much memory. We could still optimize by storing shapes in objects and reusing them instead of re-sending the data through nvgLineTo/nvgBezierTo/nvgQuadTo, but I do not expect much benefit.

Walther commented 5 years ago

Hello! I was profiling VCV Rack for its performance hot spots and one branch of my search ended up here: nvgEndFrame calls take quite a significant amount of the processing time, even with a fairly simple scene loaded.

Are there any low-hanging fruits on improving this? Is there any way I could try to help here?

edit: possibly related, or tangential? https://github.com/memononen/nanovg/issues/451

memononen commented 5 years ago

End frame is where the data is sent to the GPU for rendering. What does a fairly simple scene look like? How many shapes, and what kind?

Walther commented 5 years ago

I just realized that "simple" is probably a highly relative term 😄 Screenshot from Rack below.

memononen commented 5 years ago

@Walther Yeah :) That is a huge amount of geometry. Those rails with holes alone are a lot of geometry to pass to the GPU.

I think the only way to speed that up is to reduce the amount of data that is passed to the GPU each frame. In practice that means rendering as much as you can to textures and then just drawing rectangles. Things like those wires would be rendered using nvg on top.

One option, for example, could be to draw each module to a texture. Maybe a module could have a texture which is updated when the user interacts with the UI (i.e. not every frame), and then some elements (i.e. blinking LEDs, things that animate each frame) would be drawn on top.

Alternatively you could cache at component level too. It might require some testing to see what is the right spot for caching. Anyways, rendering to textures is the way to speed things up here.
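
The per-module caching described above boils down to a dirty flag: redraw the module's vector graphics into its texture only when something changed, and otherwise just draw the cached texture. Below is a rough sketch in C. `ModuleCache`, `module_cache_draw`, `module_cache_invalidate`, and the `count_render` stand-in are hypothetical names, and the actual render-to-texture and textured-rectangle steps are left to the callback and the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that redraws a module's vector graphics into its texture. */
typedef void (*RenderFn)(void *user);

/* Hypothetical per-module cache state. */
typedef struct {
    int dirty;  /* nonzero when the cached texture is out of date */
} ModuleCache;

/* Mark the cache stale, e.g. after the user interacts with the module. */
static void module_cache_invalidate(ModuleCache *c)
{
    assert(c != NULL);
    c->dirty = 1;
}

/* Per-frame entry point: re-render into the texture only when dirty.
 * Drawing the textured rectangle itself is left to the caller. */
static void module_cache_draw(ModuleCache *c, RenderFn render_to_texture, void *user)
{
    assert(c != NULL && render_to_texture != NULL);
    if (c->dirty) {
        render_to_texture(user);  /* expensive: full vector drawing */
        c->dirty = 0;
    }
}

/* Stand-in for the expensive render; it only counts how often it runs. */
static void count_render(void *user)
{
    ++*(int *)user;
}
```

With this in place, a module that is not touched for many frames is rendered with nvg once and then costs one textured rectangle per frame; per-frame animations such as LEDs would still be drawn on top.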

Walther commented 5 years ago

Thank you so much for the help! <3

mgood7123 commented 1 year ago

should this be closed?

memononen / nanovg

Performances and display lists #371