Add a project setting to enable deferred rendering

clayjohn commented 1 year ago

Describe the project you are working on

Performance improvements for Godot

Describe the problem or limitation you are having in your project

We are finding two places where we are losing a significant amount of performance:

- In the depth prepass (which requires normals + roughness when using GI).
- In the forward pass, due to high VGPR usage.

Both of these performance problems are inherent to using a Forward renderer and are typically solved by very aggressively hand optimizing your shaders to reduce VGPR usage. This typically means cutting out features to just the bare essentials and pre-baking as much as possible.

Unfortunately, since Godot aims to be flexible (and allows users to write shaders), we can't reduce VGPR count / features much more than we already have (although we will continue looking for ways to reduce VGPR count and improve performance).

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Add a project setting to use deferred shading instead of forward shading in cases where users are willing to sacrifice flexibility for performance.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

The following is copy-pasted from a technical document prepared by @reduz. We have already discussed it among a few rendering contributors and are now posting it here more publicly to get wider feedback before going ahead:

- Single rasterization pass (for opaque materials), with the depth pre-pass no longer needed.
- Simpler shaders during the rasterization pass, significantly improving shader occupancy.
- Single pass for lighting and GI, which can be significantly simplified thanks to using compute.

Implementation

Given that so much would be shared between deferred and forward, the most likely scenario is that the rasterization shader code is renamed to just clustered and contains both the deferred and the forward passes.

The deferred pass would just add a few more variants to the shader, mostly to skip all lighting, decal, and fog computation and write the material values to the G-Buffer.

The C++ side (for clustered rendering) may not be entirely reused. It should probably be designed so that a base class (RenderClustered) is created, and then Forward and Deferred are derived from it, so it reuses as much as possible (especially for shadow passes, GI passes, and other code shared between both).
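The class layout described above could look roughly like the following C++ sketch. Only the name RenderClustered comes from the text; the derived class names, the `render_frame` and `render_opaque` methods, and the pass names are hypothetical and only illustrate how the shared passes could live in the base class while the opaque step differs.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: only the name RenderClustered comes from the proposal.
class RenderClustered {
public:
    virtual ~RenderClustered() = default;

    // Shared frame skeleton. Shadow and GI passes live in the base class.
    std::vector<std::string> render_frame() {
        std::vector<std::string> passes;
        passes.push_back("shadows");
        passes.push_back("gi");
        render_opaque(passes);            // the only step that differs
        passes.push_back("transparent");  // forward clustered in both cases
        return passes;
    }

protected:
    virtual void render_opaque(std::vector<std::string> &passes) = 0;
};

class RenderForwardClustered : public RenderClustered {
protected:
    void render_opaque(std::vector<std::string> &passes) override {
        passes.push_back("depth_prepass");
        passes.push_back("forward_opaque");
    }
};

class RenderDeferredClustered : public RenderClustered {
protected:
    void render_opaque(std::vector<std::string> &passes) override {
        passes.push_back("gbuffer");
        passes.push_back("compute_shading");
    }
};
```

The design choice shown here is a template method: the base class fixes the frame order, and each renderer overrides only the opaque step.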

G Buffer format

We want to find a compact G-Buffer format that is flexible enough for what we need. Remember we can take advantage of bit-packing to pack as much as we can. The following is proposed and used for opaque rendering. Existing shaders used in the clustered renderer are used for transparency.

Required Buffers

The base shader always writes to these.

- Albedo / Metallic / AO buffer: R32, bit packed: 22 bits for RGB (7-8-7), 5 for Metallic, 5 for AO (they don't need a lot of resolution).
- Normal / Roughness buffer: This one should be shared with forward clustered, since many post processes need it. The normal should ideally use octahedral encoding in 24 bits (the forward code should be changed to octahedral as well, as the current encoding is the source of some problems with SSR and GI specular), with roughness in 8 bits.
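The two required buffers above can be sketched as 32-bit packing routines. This is a minimal C++ illustration, not the proposed shader code: the bit order inside each word, the 12+12 split of the 24 normal bits, and all function names are assumptions. The octahedral mapping itself is the standard one (project onto the L1 unit octahedron, fold the lower hemisphere).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize a value in [0, 1] to an unsigned integer with `bits` bits.
static uint32_t quantize(float v, int bits) {
    assert(bits > 0 && bits < 32);
    const float max_value = float((1u << bits) - 1u);
    const float clamped = std::min(std::max(v, 0.0f), 1.0f);
    return uint32_t(std::lround(clamped * max_value));
}

static float dequantize(uint32_t q, int bits) {
    return float(q) / float((1u << bits) - 1u);
}

// Assumed bit order (not specified in the proposal):
// R: bits 0-6, G: bits 7-14, B: bits 15-21, metallic: 22-26, AO: 27-31.
uint32_t pack_albedo_metallic_ao(float r, float g, float b, float metallic, float ao) {
    return quantize(r, 7) | (quantize(g, 8) << 7) | (quantize(b, 7) << 15) |
           (quantize(metallic, 5) << 22) | (quantize(ao, 5) << 27);
}

struct Vec3 { float x, y, z; };

static float sign_not_zero(float v) { return v >= 0.0f ? 1.0f : -1.0f; }

// Octahedral normal in bits 0-23 (12 bits per axis), roughness in bits 24-31.
// `n` must be normalized.
uint32_t pack_normal_roughness(Vec3 n, float roughness) {
    const float inv_l1 = 1.0f / (std::fabs(n.x) + std::fabs(n.y) + std::fabs(n.z));
    float u = n.x * inv_l1;
    float v = n.y * inv_l1;
    if (n.z < 0.0f) {  // fold the lower hemisphere over the diagonals
        const float fu = (1.0f - std::fabs(v)) * sign_not_zero(u);
        const float fv = (1.0f - std::fabs(u)) * sign_not_zero(v);
        u = fu;
        v = fv;
    }
    return quantize(u * 0.5f + 0.5f, 12) | (quantize(v * 0.5f + 0.5f, 12) << 12) |
           (quantize(roughness, 8) << 24);
}

Vec3 unpack_normal(uint32_t packed) {
    const float u = dequantize(packed & 0xFFFu, 12) * 2.0f - 1.0f;
    const float v = dequantize((packed >> 12) & 0xFFFu, 12) * 2.0f - 1.0f;
    Vec3 n{u, v, 1.0f - std::fabs(u) - std::fabs(v)};
    const float t = std::max(-n.z, 0.0f);  // non-zero only in the lower hemisphere
    n.x += n.x >= 0.0f ? -t : t;
    n.y += n.y >= 0.0f ? -t : t;
    const float len = std::sqrt(n.x * n.x + n.y * n.y + n.z * n.z);
    return {n.x / len, n.y / len, n.z / len};
}
```

With 12 bits per axis the round-trip error of the normal stays well below one part in a thousand, which is the main reason octahedral encoding is attractive at this bit budget.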

Optional Buffers

These require additional render targets that are not always needed, so they should be written in a different render pass combination.

- Emissive Buffer: RGBE (32 bits). Not all materials write emissive; it could be argued that in most games most don't. It may be good to split the opaque render pass in two: first opaque materials, then materials with emissive. This should reduce bandwidth significantly. If using lightmaps, this buffer (and shader variant permutation) is also used, as the lightmap must write to emissive. In short, the defines that enable this permutation also enable lightmapping (it's the same permutation, even if lightmapping or emissive are not used).
- Specialization: R32. This is also an optional target: if the material uses a specific specialization, or if the specular constant differs from the default, a late render pass can be used to write these values to this buffer. It is bit-mapped:
  - Bits 0-7: Specular constant (mapped to 0 - 2.0).
  - Bits 8-10: Material type: None, SSS, Aniso, RIM, Clearcoat, Backlight.
  - Bits 11-31: Arguments for material type.
- Motion Vectors (16/32 bits?): Written when using TAA, otherwise not written. An important optimization is to not write motion vectors for objects that did not move (or that do not write to motion vectors, such as a texture with moving UV that needs to use this custom logic). An invalid value has to be written on clear; those values are then computed at some later point by simply using the previous and current camera positions. This saves a lot of bandwidth and vertex execution. (As a note, this optimization has to be done in the forward renderer too.)
- Visibility Mask: A UINT32 (32-bit) buffer containing the visibility layer mask for each pixel, so lights, decals and reflection probes can be properly masked. AFAIK Godot uses "1" by default, so this buffer could just be cleared to 1, and only objects that write something other than 1 would need to enable this mask.
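As an illustration of the Specialization target, here is a minimal C++ sketch of the bit-mapping. It assumes the argument field starts at bit 11, directly after the three material-type bits, which leaves 21 bits for arguments. The enum values and function names are hypothetical; only the field order and the 0 - 2.0 specular range come from the list above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Order taken from the list above; the numeric values are an assumption.
enum class MaterialType : uint32_t { None = 0, SSS, Aniso, Rim, Clearcoat, Backlight };

uint32_t pack_specialization(float specular, MaterialType type, uint32_t args) {
    assert(specular >= 0.0f && specular <= 2.0f);
    assert(args < (1u << 21));  // 21 bits remain after specular (8) and type (3)
    const uint32_t spec_q = uint32_t(std::lround(specular / 2.0f * 255.0f));
    return spec_q | (uint32_t(type) << 8) | (args << 11);
}

float unpack_specular(uint32_t packed) {
    return float(packed & 0xFFu) / 255.0f * 2.0f;
}

MaterialType unpack_material_type(uint32_t packed) {
    return MaterialType((packed >> 8) & 0x7u);
}

uint32_t unpack_args(uint32_t packed) {
    return packed >> 11;
}
```

Three bits are enough for the six listed material types and leave two spare values.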

Ultimately, this means there are 17 shader permutations for deferred (base + 16 permutations of special versions).

Rendering logic

Remember that this is still a clustered renderer. The fog effects and the transparent pass require access to light clustering, so this is not going away. The rendering code is almost the same as forward clustered, the main difference is the opaque pass being deferred.

Step 1: Opaque rendering

As described above, opaque rendering writes to the G-Buffers in one or multiple passes (depending on what needs to be written). The pass with the fewest buffers usually happens first (because it will be the most common), followed by the specialized passes.

Step 2: Post opaque effects

This is when effects such as SSAO and SSGI can be computed. SSGI probably depends on a reprojection of the previous frame's diffuse+ambient buffer.

Step 3: Shading

Shading will be performed by a compute shader, which will do the following:

- Compute global radiance
- Process decals from the cluster
- Process positional lights from the cluster
- Process directional lights from the cluster
- Process reflection probes from the cluster
- Process GI (voxel, SDFGI)
- Process fog

This code is pretty much the same as the one found in the clustered renderer, so not much change is needed. Shader includes will need to be reorganized for better reuse. As in the forward renderer, attention has to be paid to subgroup operations when reading the cluster to maximize SGPR usage, but this should be simpler in compute.
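To make the per-pixel loop concrete, here is a CPU-side C++ sketch of the positional-light step combined with a per-pixel visibility mask. It is a simplification under stated assumptions: the cluster is a single 32-bit word with one bit per light, `Light::intensity` stands in for the real lighting computation, and all names are hypothetical. The real shader would run in compute and use subgroup operations.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Light {
    uint32_t cull_mask;  // layers this light affects
    float intensity;     // stand-in for the real lighting computation
};

// `cluster_bits` has bit i set when lights[i] overlaps the pixel's cluster.
// `pixel_layer_mask` is the value read from the visibility mask buffer.
float shade_pixel(uint32_t cluster_bits, uint32_t pixel_layer_mask,
                  const std::vector<Light> &lights) {
    float result = 0.0f;
    while (cluster_bits != 0) {
        // Find the index of the lowest set bit.
        std::size_t index = 0;
        while (((cluster_bits >> index) & 1u) == 0) {
            ++index;
        }
        cluster_bits &= cluster_bits - 1;  // clear the lowest set bit

        // Skip lights that do not share a layer with this pixel.
        if (index < lights.size() && (lights[index].cull_mask & pixel_layer_mask) != 0) {
            result += lights[index].intensity;
        }
    }
    return result;
}
```

In this sketch a pixel whose mask is 0 matches no light and therefore receives no lighting.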

There is one exception, though, which is that some code relies on geometric normals (especially the shadow biasing). As such, geometric normals will need to be computed from depth in the compute shader. Here is an article on how to do this:

https://wickedengine.net/2019/09/22/improved-normal-reconstruction-from-depth/ (also https://atyuwen.github.io/posts/normal-reconstruction/)

If SSS or SSR are used, the ambient+diffuse and specular buffers need to be written separately for post-processing, then merged. The reason for this is that reflections can interfere with both the subsurface scattering and the screen-space GI information. Otherwise, a single buffer can be written.
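The geometric normals needed for shadow biasing can be reconstructed from depth roughly as in the following C++ sketch. It works in the spirit of the linked articles but is not taken from them. It assumes view-space positions have already been reconstructed from depth for the pixel and its four neighbors, and that the camera looks down -Z so the returned normal faces the camera. All names are hypothetical.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

static Vec3 normalize(Vec3 v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};
}

// Inputs are view-space positions for the pixel and its four neighbors.
Vec3 reconstruct_normal(Vec3 center, Vec3 left, Vec3 right, Vec3 down, Vec3 up) {
    // Per axis, use the neighbor that is closer in depth to the center.
    // This avoids taking a derivative across a depth discontinuity.
    const Vec3 ddx = std::fabs(right.z - center.z) < std::fabs(center.z - left.z)
                             ? sub(right, center)
                             : sub(center, left);
    const Vec3 ddy = std::fabs(up.z - center.z) < std::fabs(center.z - down.z)
                             ? sub(up, center)
                             : sub(center, down);
    return normalize(cross(ddx, ddy));
}
```

Picking the neighbor with the smaller depth difference is what keeps normals stable at object silhouettes, where a plain central difference would mix foreground and background.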

Step 4: Post shading effects

Here is where subsurface scattering is processed (of course check that any material is using this, otherwise skip this step like we do on forward rendering).

Step 5: Transparency pass

From now on this is the same as the forward clustered renderer.

If this enhancement will not be used often, can it be worked around with a few lines of script?

It can't be worked around

Is there a reason why this should be core and not an add-on in the asset library?

It is a core enhancement

Ansraer commented 1 year ago

Ok, I just finished reading this for the first time and have some initial feedback. So far, this looks like a fairly typical deferred renderer implementation. While I personally would have been more interested in a VisBuffer variant, reduz has pointed out (on RC and the GPU renderer gist) that it would not support all possible use cases (without some complex workarounds). I personally still believe that, despite those shortcomings, the superior performance makes it an attractive option for many projects, but can also understand the decision to go with something more traditional instead.

Something I have to wonder while reading this proposal is why we want to keep the forward option alive. Right now, the Forward+ renderer is only really used on platforms that could easily support the higher bandwidth requirements of a deferred renderer, so I can't really think of any reason to use the forward option once this has been implemented. I suppose small scenes with simple geometry could be a bit faster with a forward renderer, but given the numbers I saw the last time I profiled Godot, I doubt that it would be significant enough to justify maintaining another rendering path.

Also, what happens when people want to render something without using the deferred light logic? I would imagine that this is a valid use case once you start writing advanced shaders, so I kind of expected to see a mask bit somewhere to deactivate the compute shader for certain fragments. Alternatively, maybe the proposed compositor could be used to run a forward pass after the deferred renderer?

Finally, what happened to the depth prepass? I would imagine that it still happens before the deferred passes, but didn't see it mentioned above. What I would really like to know is if we plan to continue writing normal + roughness there if GI is enabled.

EDIT: I just realized that I accidentally mentioned VisBuffer while writing this comment. Everyone else, please do NOT start arguing for/against it here. We had a lot of that discussion already over the last few weeks, and the team has valid reasons for the proposed architecture. Please try to keep this proposal focused on the deferred renderer in question; if you wish to discuss VisBuffer rendering, feel free to open a second proposal.

clayjohn commented 1 year ago

> Something I have to wonder while reading this proposal is why we want to keep the forward option alive. Right now, the Forward+ renderer is only really used on platforms that could easily support the higher bandwidth requirements of a deferred renderer, so I can't really think of any reason to use the forward option once this has been implemented. I suppose small scenes with simple geometry could be a bit faster with a forward renderer, but given the numbers I saw the last time I profiled Godot, I doubt that it would be significant enough to justify maintaining another rendering path.

Forward+ is way more flexible than deferred. Users can write varyings between shader stages (vertex->fragment, fragment->light), users can use MSAA which for many games is still preferable to TAA, certain effects like fog are much easier and more customizable in Forward rendering pipelines.

> Also, what happens when people want to render something without using the deferred light logic? I would imagine that this is a valid use case once you start writing advanced shaders, so I kind of expected to see a mask bit somewhere to deactivate the compute shader for certain fragments. Alternatively, maybe the proposed compositor could be used to run a forward pass after the deferred renderer?

I think we could add that in the visibility mask (i.e. 0 would result in no lighting being added anyway)

> Finally, what happened to the depth prepass? I would imagine that it still happens before the deferred passes, but didn't see it mentioned above. What I would really like to know is if we plan to continue writing normal + roughness there if GI is enabled.

There won't be a depth prepass. Normal/roughness will be written during the rasterization pass.

Ansraer commented 1 year ago

> There won't be a depth prepass. Normal/roughness will be written during the rasterization pass.

Are you sure that's a good idea? I fear that this could really tank the rendering performance of alpha scissor materials such as foliage.

clayjohn commented 1 year ago

> > There won't be a depth prepass. Normal/roughness will be written during the rasterization pass.

> Are you sure that's a good idea? I fear that this could really tank the rendering performance of alpha scissor materials such as foliage.

If foliage is such a problem, we could add an optional depth-prepass. But I highly doubt rendering every opaque object twice would give you enough benefit to justify the cost. Currently the depth prepass is a significant drain on performance (even when just rendering depth) due to our heavy vertex shader.

Ansraer commented 1 year ago

I have spent a lot of time looking into foliage rendering in the past, and one thing I learned was that having a depth prepass is absolutely essential. Using it and VK_COMPARE_OP_EQUAL allows you to abort before the full fragment shader has even been started, which is slightly faster than loading & starting it only to discard two lines later. And with many alpha scissor surfaces rendered on top of one another, this adds up.
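As a rough illustration of the point above, the following C++ toy model counts how often the full fragment shader would start for a single pixel covered by several alpha-scissor fragments. It is not a GPU simulation. It assumes fragments arrive in submission order, that the full shader starts for every fragment passing a LESS depth test when there is no prepass, and that with a prepass the main pass uses an EQUAL depth test against the resolved depth. All names are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

struct Fragment {
    float depth;        // smaller is closer
    bool passes_alpha;  // result of the alpha scissor test
};

// No prepass: depth test LESS, discard happens inside the full shader.
std::size_t full_invocations_without_prepass(const std::vector<Fragment> &frags) {
    float depth_buffer = std::numeric_limits<float>::infinity();
    std::size_t invocations = 0;
    for (const Fragment &f : frags) {
        if (f.depth < depth_buffer) {
            ++invocations;  // the shader starts, then may discard
            if (f.passes_alpha) {
                depth_buffer = f.depth;
            }
        }
    }
    return invocations;
}

// Prepass: a cheap alpha-only pass resolves depth first, then the full
// shader runs only where the fragment depth equals the resolved depth.
std::size_t full_invocations_with_prepass(const std::vector<Fragment> &frags) {
    float resolved_depth = std::numeric_limits<float>::infinity();
    for (const Fragment &f : frags) {
        if (f.passes_alpha && f.depth < resolved_depth) {
            resolved_depth = f.depth;
        }
    }
    std::size_t invocations = 0;
    for (const Fragment &f : frags) {
        if (f.depth == resolved_depth) {
            ++invocations;
        }
    }
    return invocations;
}
```

The model ignores the cost of the prepass itself, which is the other side of the trade-off discussed in this thread.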

Maybe we could use the compositor to have an optional opaque pass with a prepass before we shade the GBuffer?

Edit: I couldn't sleep and kept thinking about what other benefits (GI & AO maybe?) a depth prepass could have vs the drawbacks. In the end I came across https://interplayoflight.wordpress.com/2020/12/21/to-z-prepass-or-not-to-z-prepass/ which points out a number of possible use cases. Right now I think that if we want to support depth prepass (and we do, foliage really needs it!) we should do it properly and build the rest of the pipeline around it.

I am not that concerned about the vertex shader right now since I am fairly certain that we could optimize it a fair bit.

Koalamana9 commented 1 year ago

A deferred renderer would be much appreciated. It essentially eliminates performance issues caused by a high number of draw calls, but any transparent surface has to be rendered separately in a forward renderer and then blended on top of the generated frame buffer; this is how rendering in GTA V works.

Calinou commented 1 year ago

Will this project setting work when using the Mobile rendering method? I've heard of some mobile games starting to adopt deferred rendering lately.

clayjohn commented 1 year ago

> Will this project setting work when using the Mobile rendering method? I've heard of some mobile games starting to adopt deferred rendering lately.

We could implement a version for the mobile renderer. Deferred rendering on mobile is still pretty tricky. You have to be super careful about bandwidth, TAA is too slow, so you are stuck with FXAA, and you typically don't need that many lights anyway. But it still could be worth looking into, as I'm sure it would still be a net win for some use cases.

Koalamana9 commented 1 year ago

Is this planned for 4.3? Or is there no specific deadline for the deferred renderer yet?

clayjohn commented 1 year ago

> Is this planned for 4.3? Or is there no specific deadline for the deferred renderer yet?

No specific deadline. This is a proposal to discuss the feature and implementation details. We won't assign a milestone until there is a PR ready to merge.

jams3223 commented 10 months ago

I've been thinking about this; it seems like a deferred renderer is the way to go, but will the deferred renderer do tile shading?

clayjohn commented 10 months ago

> I've been thinking about this; it seems like a deferred renderer is the way to go, but will the deferred renderer do tile shading?

No, we will still do clustered shading

hayahane commented 10 months ago

> Will this project setting work when using the Mobile rendering method? I've heard of some mobile games starting to adopt deferred rendering lately.

Nowadays, games launched on both PC and mobile platforms are trying to use both forward and deferred shading to get nice-looking and performant NPR, for example, Genshin Impact and ZZZ, a new game in development. They use forward to get NPR characters rendered easily and combine it with a deferred-rendered environment. Another open-source engine, Bevy, released a renderer using a similar process combining both forward and deferred in version 0.12. Maybe we could have a look at their approach, since the flexible render graph sounds quite promising.

jams3223 commented 10 months ago

> > Will this project setting work when using the Mobile rendering method? I've heard of some mobile games starting to adopt deferred rendering lately.

> Nowadays, games launched on both PC and mobile platforms are trying to use both forward and deferred shading to get nice-looking and performant NPR, for example, Genshin Impact and ZZZ, a new game in development. They use forward to get NPR characters rendered easily and combine it with a deferred-rendered environment. Another open-source engine, Bevy, released a renderer using a similar process combining both forward and deferred in version 0.12. Maybe we could have a look at their approach, since the flexible render graph sounds quite promising.

That's a cool idea.

Ansraer commented 10 months ago

Since nobody else has mentioned it, something deferred rendering could really help with is lane usage. Basically, modern GPUs don't run a shader for every single pixel but instead run shaders for groups of 2x2 pixels, a process that is called quad fragment shading. This is because some shader operations, such as MIP calculation for texture lookups, require information about what the neighboring pixels would look like if they had the same shader, even if that isn't actually the case. This means that in the worst case, you have only one active lane (a pixel that is actually rendered) and 3 helper/passive lanes (pixels that are calculated but not actually needed for the final output), so 3/4 of your computation results are thrown away immediately.

If we used deferred rendering the entire lighting logic would be removed from the scene shader, which means that when rendering a frame the helper lanes are only active for a shorter amount of time, and we thus make better use of the GPU.

Do note that I really haven't looked at the performance of the fragment stage at all, so I am uncertain how much Godot is currently affected by this. Usually, lanes only become relevant when you have very small triangles (less than 10 px on average), but given how bad our register usage is, I wouldn't be surprised if we noticed it a bit earlier.
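The quad-shading overhead described above can be estimated with a small C++ sketch. Given the pixels a triangle actually covers, it counts the 2x2 quads those pixels touch and returns the fraction of launched lanes that are real pixels rather than helper lanes. The function name is hypothetical, and the model assumes unique, non-negative pixel coordinates with quads aligned to even coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Fraction of launched lanes that correspond to covered pixels.
// Each touched 2x2 quad launches 4 lanes regardless of coverage.
double quad_lane_utilization(const std::vector<std::pair<int, int>> &covered_pixels) {
    if (covered_pixels.empty()) {
        return 1.0;  // nothing launched, nothing wasted
    }
    std::set<std::pair<int, int>> quads;
    for (const auto &p : covered_pixels) {
        assert(p.first >= 0 && p.second >= 0);
        quads.insert({p.first / 2, p.second / 2});
    }
    return double(covered_pixels.size()) / double(quads.size() * 4);
}
```

A single covered pixel gives 0.25, the worst case mentioned above. Two adjacent pixels that straddle a quad boundary, such as (1, 1) and (2, 1), also give 0.25 because two quads are launched.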

AresDevult commented 6 months ago

I hope the development is not abandoned

Calinou commented 6 months ago

> I hope the development is not abandoned

This feature is still planned, but after 4.3.

Ikaroon commented 6 months ago

> Forward+ is way more flexible than deferred. Users can write varyings between shader stages (vertex->fragment, fragment->light), users can use MSAA which for many games is still preferable to TAA, certain effects like fog are much easier and more customizable in Forward rendering pipelines.

Just to add this to the discussion but deferred shading can actually have various lighting solutions in the same image and give the developer enough flexibility. E.g. Breath of the Wild on Nintendo Switch is actually using deferred shading for all opaque objects, including the characters that have a complete cartoony look, even the water is using deferred shading despite being transparent. They do this by having something called a Material Mask Buffer that tells the lighting shader how to shade this area.

I definitely think Godot should go deferred. It would be very interesting though to give the developer control over which buffers are used, even allowing for more than the default buffers for custom effects that we cannot predict.

jams3223 commented 4 months ago

@clayjohn Hey, I'm back! I was just thinking, instead of going with deferred rendering, why don't we give triangle visibility buffer rendering a shot? It could open up the possibility of implementing software VRS, leading to significant performance boosts in scenes with numerous lights.

https://diaryofagraphicsprogrammer.blogspot.com/2018/03/triangle-visibility-buffer.html
http://filmicworlds.com/blog/visibility-buffer-rendering-with-material-graphs/
http://filmicworlds.com/blog/software-vrs-with-visibility-buffer-rendering/

Many studies have demonstrated the advantages of utilizing this rendering method, such as detailed geometry that surpasses Nanite in performance. https://t.co/qkGxug0Wz3

and-rad commented 3 months ago

I am also getting the impression that deferred rendering is slowly falling out of favor, mainly because the limitations that caused deferred to become popular 10, 15 years ago are not the limitations that we often have to worry about today.

jams3223 commented 3 months ago

> I am also getting the impression that deferred rendering is slowly falling out of favor, mainly because the limitations that caused deferred to become popular 10, 15 years ago are not the limitations that we often have to worry about today.

I found out from @Calinou that Godot has a unique forward+ renderer implementation that is similar to visibility buffer rendering. They've revamped their rendering backend to support texture and mesh streaming down the line, which could pave the way for a custom software variable rate shading implementation for mesh streaming later on. It also means that we can use MSAA.

Mautar55 commented 2 weeks ago

I think this is the right place to consider a post-post-processing (or post canvas effects) rendering pass so there can be a solid foundation for #2138. This render pass should draw 3D objects after AA and tone mapping.
