lekoder opened this issue 6 years ago
macOS 10.14.1 (18B75), AMD FirePro D300 2048 MB
End-demo scene: 0.9-1.5 FPS, GPU utilization 99%.
In-demo: 36.4 FPS (average), 21.1 (lowest), GPU utilization 8-9%.
According to the Xcode Time Profiler and OpenGL Profiler, the heaviest function (34% CPU weight, 97% GPU weight) is RasterizerStorageGLES3::_particles_process -> glEndTransformFeedback.
Keep in mind the AMD Windows driver offers less-than-ideal OpenGL performance compared to NVIDIA or even Intel. This should be solved when the renderer is rewritten to use Vulkan, which performs more consistently across drivers and platforms.
@Calinou I'm aware of that, but right now it means that a game released on 3.1 simply won't be playable on Radeon cards (the last scene is a benchmark - it uses everything I intend to throw at the GPU). That's 14% of the audience, and it would probably be a massive hit on the ratings.
also, many older devices simply don't run Vulkan (including many mobile devices)
The GLES2 renderer targets lower-end devices, so it should perform well on AMD graphics cards on Windows anyway.
...? You just said OpenGL performance is worse on AMD than NVidia?
This mostly affects GLES3. GLES2 is supposed to work fine - except that for my game it does not work at all.
@Zireael07 My point is that the GLES2 renderer is less demanding, so AMD's OpenGL driver being slower on Windows should not be much of an issue. The GLES3 renderer we currently have is far heavier than the GLES2 one.
Even then, framerates between 1 and 10 FPS on @lekoder's 2D game are really low, this should be debugged further.
Fire up RenderDoc and see what the situation is?
Shouldn't it be checked before 3.1 release?
If it's already running fine for everybody who's working on it then it's not going to show anything out of the ordinary. Whoever is seeing huge performance issues however could take a look on their machine and report what they're seeing as far as the elapsed time for various GL calls.
I can't really ask a player to do this kind of in-depth debug, but @bruvzg seems to experience the problem and already shared some data above.
Sure, this is a problem that should be solved by people who are more closely involved with Godot.
According to the Xcode Time Profiler and OpenGL Profiler, the heaviest function (34% CPU weight, 97% GPU weight) is RasterizerStorageGLES3::_particles_process -> glEndTransformFeedback.
Does your game even use particles?
Here are some RenderDoc captures (.rdc + text files with call durations) from Windows 10 on the same hardware: https://mega.nz/#!VsZgmS4C!21G0HJLz_KVVjw94xKBNppSeWuIabLAHeiRBaep_t3g
It's clearly lights causing problems.
Does your game even use particles?
I have no idea how _particles_process is involved. The CPU weight is probably accurate; the GPU weight may be the sum of all glEndTransformFeedback calls, not just the calls from _particles_process.
I think I have something.
The "end demo" scene sprites are using a shader with uniform static branching. There is "cheap" version of the shader and "expensive" one, with a uniform if between them. Uniform if's in the shaders is supposed to be optimized away by shader compiler, and it does indeed happen on NVidia cards.
Multiple light are probably exposing it by calling the fragment shader often enough to expose it. With the expensive part of the shader commented out, the game works as expected (except it looses features).
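For illustration, here is a minimal GLSL-style sketch of that kind of static uniform branch - the uniform name and the body of the "expensive" branch are invented, not the game's actual shader:

```glsl
#version 300 es
precision mediump float;

uniform sampler2D sprite_tex;
// Hypothetical toggle between the "cheap" and "expensive" paths;
// the real game shader uses its own uniform and heavier code here.
uniform bool use_expensive_path;

in vec2 uv;
out vec4 frag_color;

void main() {
    vec4 base = texture(sprite_tex, uv);
    if (use_expensive_path) {
        // Stand-in for the heavy per-pixel work; the expectation is that
        // the driver effectively skips this branch when the uniform is false.
        vec4 accum = vec4(0.0);
        for (int i = 0; i < 16; i++) {
            accum += texture(sprite_tex, uv + vec2(float(i) * 0.001, 0.0));
        }
        frag_color = base + accum * 0.01;
    } else {
        // Cheap path: plain textured sprite.
        frag_color = base;
    }
}
```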
I think the most important thing to determine right now is whether the behaviour on Radeon cards is indeed a bug, or whether it is intended behaviour that we should be working around (i.e. by manually switching shader materials). I can provide arguments for both sides.
Removing the branching from the shader indeed improves the FPS significantly, but the Asus Strix Vega 64 on which it was tested barely reaches 50 FPS and uses 220 W of power to render the "Demo End" scene. My 1070 gets 100 FPS rendering the same thing. Even with static uniform shader branching eliminated, there still seems to be significantly worse performance on Radeon cards.
I think I figured out what the problem is:
Looking at the RenderDoc captures: yes, lights are taking up a huge amount of the render time, and it appears to be because each sprite is drawn multiple times, once for each little light. The sprites making up the scene are HUGE, covering massive portions of the viewport/framebuffer, and even though each light only illuminates a tiny area of each sprite, the whole sprite is redrawn anyway.
I took the liberty of showing just what is happening with the individual sprite draw calls. This is what rendering the frame looks like (with the big slow sprites); notice how many draw calls don't seem to have any effect whatsoever on the framebuffer, other than eating frame time:
What you can do right now to speed things up is break your sprites up into smaller ones that don't have so much empty space (which costs precious fill rate) and don't cover huge sections of the framebuffer/viewport, where each light causes the whole sprite to be redrawn when only the area affected by the light needs to be. If a light is near a small sprite, then only that tiny sprite is drawn in a tiny area, leaving the rest of the framebuffer alone like it should be. According to RenderDoc, the small sprites draw significantly faster, I think because they aren't huge swaths of the framebuffer being recalculated for each little light. They're also within the radius of influence of fewer lights in total compared to the huge sprites, so they incur fewer draw calls and less total fill rate consumption both ways.
As far as the engine itself is concerned, I would think that individual draw calls for each and every light are not ideal. IMO there should be a single draw call for each sprite - one and only one call per sprite - with a uniform array or buffer containing all of the lights, to be looped through in the fragment shader. That would be way faster than individual draw calls for each light, I would imagine. At the very least, the shaders should be able to handle sets of lights at a time, so that Godot isn't doing one single light per draw call and could instead eat through a handful of lights per draw call. Perhaps it could scale with the size of the sprite in question: a big sprite gets drawn with more lights at a time, or some number of lights proportional to its size, and a fullscreen sprite draws once with all lights passed in for that single draw call.
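As a rough sketch of that idea - the light data layout, names, and attenuation model here are placeholders for illustration, not Godot's actual 2D light pipeline:

```glsl
#version 300 es
precision mediump float;

// Sketch only: assumed batch size and light parameters, not engine code.
#define MAX_LIGHTS 16

uniform sampler2D sprite_tex;
uniform int light_count;
uniform vec2 light_pos[MAX_LIGHTS];     // light centers, in the same space as frag_pos
uniform vec3 light_color[MAX_LIGHTS];
uniform float light_radius[MAX_LIGHTS];

in vec2 uv;
in vec2 frag_pos;
out vec4 frag_color;

void main() {
    vec4 base = texture(sprite_tex, uv);
    vec3 lit = vec3(0.0);
    // One pass over all lights instead of one draw call per light.
    for (int i = 0; i < MAX_LIGHTS; i++) {
        if (i >= light_count) {
            break;
        }
        float d = distance(frag_pos, light_pos[i]);
        float atten = clamp(1.0 - d / light_radius[i], 0.0, 1.0);
        lit += light_color[i] * atten;
    }
    frag_color = vec4(base.rgb * lit, base.a);
}
```

A fixed light cap per draw call is the usual trade-off with this approach; a scene with more lights than the cap would still need extra passes, but far fewer than one per light.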
It does look like the number of draw calls varies per sprite, so lights that are far from a sprite do not trigger that sprite to be drawn with that light illuminating it, which prevents completely wasteful draw calls. But there are clearly still plenty of draws going on that aren't actually doing anything to the framebuffer texture output. That much is evident in RenderDoc too.
It looks like what is happening is that lights that aren't overlapping the viewport are still triggering sprite draws - lights that must be outside the viewport but still overlapping/intersecting the offscreen portion of the sprite and triggering a draw? Something funky is going on, that's for sure, but if that's the case then using smaller sprites should prevent that from happening nearly as much too.
It could be lights that are just totally underneath/behind the sprite, which maybe should be checked for too? I guess it depends on whether the normalmap should show light like that.
Anyway...
This sprite could be cropped more closely to its actual visible edges (the transparent area costs just as much to render as the opaque parts), and could be broken into ~3x3=9 separate pieces that would reduce wasted fill rate significantly:
...and this one could be cropped better, and broken up into maybe 4x4=16 separate pieces:
The total number of draw calls will probably remain mostly the same with more sprites that are smaller, but it's clear that big sprites are running way slower than smaller ones by orders of magnitude, and that the actual number of draw calls doesn't have as much of an impact as the amount of fill rate being wasted by the huge sprites redrawing dozens of times over half the framebuffer. If the radius of illumination of your lights were bigger then this problem would exist whether you had big or small sprites making up your scene.
I don't know exactly why AMD's hardware/drivers perform worse in this scenario - I don't have problems with branching fragment shaders on my RX 460 or RX 570 - but I imagine it would perform way worse than one would think it should on mobile GPUs. I think this is more a case of alpha blending being enabled while huge sprites are drawing. I've seen massive slowdowns for decades whenever there are dozens of big blended sprites on the screen. I guess Nvidia solved that problem better, or has some kind of early-out detection, who knows.
Thanks for the in-depth post @DEF7.
Rendering every single sprite for each light sounds very strange to me. At least some form of light batching has to be implemented (as is usually done in traditional forward rendering).
Deferred lighting would also be much faster, as it keeps the number of drawn pixels much lower than what you showed. It's tricky when it comes to transparency (usually handled with a hybrid approach), but it would definitely be an option.
@DEF7 Thank you for a detailed post. I will certainly be optimizing this stage, but this is not the core of the issue.
With the previous, branched shaders (static uniform branches) I observed a performance gap of over 2000% between similar NVidia and Radeon cards. This is highly unexpected.
I'm telling you what RenderDoc is showing as being very time-consuming and what can be done to remedy the situation beyond removing branching code. AMD being slower than Nvidia is outside the scope of Godot; the "core" of the issue is not something that can be fixed engine-side. Either the huge sprites are done away with - so that massive portions of the framebuffer aren't drawn hundreds of times just to generate a single frame - or AMD cards will forever run the game slowly. Those are the choices. Nobody is going to be able to fix the problem from within Godot.
EDIT: Not unless the whole lighting system is re-done so that all lights are passed to the shader at once, instead of as individual draw calls. In the meantime, simply breaking up the huge sprites will fix AMD performance, period.
This may also have to do with the fact that lighting in Godot 2D is done in multiple passes. My plan after 3.1 is to move it to a single pass, but this is not currently possible with OpenGL due to the lack of separation between samplers and textures.
@DEF7 That's why I asked whether this is perceived as a problem that will be fixed by the engine. And yes, the engine could do that - even if simply by splitting a sprite internally into multiple smaller sprites, detecting whether they are empty and removing them if they are. Which is - by the way - exactly how I will attempt to work around the problem.
@reduz so we are talking Vulkan & Godot 3.2, right? I can live with that.
If the problem is fill rate because of small lights overlapping large sprites, it should be possible to clip the sprites to the light area. There's also glScissor, although manually clipping may be better because I'm not sure about the setup cost of each call to glScissor.
I've just done a test running glScissor for the clipped intersection between the light and the render item; it speeds up rendering in my simple test scene, and it would probably solve the problem in @DEF7's scene.
Still a couple of questions with this approach:
As I said above, the other possibility is to clip the rects to this area, but that would be more difficult code-wise, with UVs to deal with and multiple primitives. So if scissor isn't too expensive, it might be a good solution.
This should be fixed by the latest AMD drivers (22.9.x), which have brought significant OpenGL performance improvements. Can anyone test this on Windows? Ideally, please test on both the old driver version (such as 22.5.x) and a new driver version if you can.
Note that this will not make Light2D rendering as fast as it would be if a single-pass approach was used in 3.x[^1]. However, this should make performance more comparable to an equivalent NVIDIA GPU.
[^1]: In 4.0.beta, 2D light rendering is done in a single pass in both Vulkan and OpenGL renderers. This makes rendering several lights faster.
Godot version: 3.1 c025f526c
OS/device including version: Windows 10: AMD Radeon RX Vega 64
Windows 7: AMD Radeon HD 6870
MacOSX: Radeon Pro 450 2G
Issue description: I got multiple reports from players claiming poor performance. A common factor seems to be the use of Radeon cards. According to the specs, these cards should be able to run my game at >60 FPS. Actual real-life performance:
AMD Radeon RX Vega 64: ~10 FPS
AMD Radeon HD 6870: ~1 FPS
AMD Radeon Pro 450 2G: ~2 FPS
Steps to reproduce: Grab the demo from Steam, start the game, and fly to the left to trigger the end-demo sequence.
Might be related to #4151 - I do use a lot of Light2D nodes in that scene.