godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License

Baking Lights freezes godot when using Denoiser #84664

Open verypleasentusername opened 1 year ago

verypleasentusername commented 1 year ago

Godot version

4.1 branch [2d3b2ab], compiled locally

System information

windows 10, 4.1 stable engine version, Forward+

Issue description

If the denoiser is enabled, the bake progress reaches 68% and the editor freezes indefinitely. The console shows these errors: [image]

Steps to reproduce

Open scene_3d.tscn, select the LightmapGI node in the scene tree, and click "Bake Lightmaps".

Minimal reproduction project

LightmapError.zip

AThousandShips commented 1 year ago

Can you please try this with a supported version, like 4.1.3 (only the latest patch version is supported, and it might already be fixed)

verypleasentusername commented 1 year ago

Can you please try this with a supported version, like 4.1.3 (only the latest patch version is supported, and it might already be fixed)

isn't the 4.1 branch on github the 4.1.3 one? if not i will surely try out 4.1.3.

AThousandShips commented 1 year ago

Yes but you said "4.1 stable", which means "4.1.0", an old version, if you are referencing a branch please add the commit hash as per the instructions in the bug report form 🙂

verypleasentusername commented 1 year ago

Yes but you said "4.1 stable", which means "4.1.0", an old version, if you are referencing a branch please add the commit hash as per the instructions in the bug report form 🙂

updated

AThousandShips commented 1 year ago

(Do add the commit hash, right now it is [2d3b2ab], but what ever you have, as "latest" will be outdated in the future and can make it harder to check this)

AThousandShips commented 1 year ago

Since you are compiling locally, would you mind testing on master as well? Either way is okay but would see if it is a bug that has been solved and might be cherry picked

verypleasentusername commented 1 year ago

(Do add the commit hash, right now it is [2d3b2ab], but what ever you have, as "latest" will be outdated in the future and can make it harder to check this)

did i make it right? sorry, my first big issue here

AThousandShips commented 1 year ago

You did 🙂 This might be a known issue, can't remember what other issue report it was, might be a different cause or issue though as I couldn't find it right now

verypleasentusername commented 1 year ago

Since you are compiling locally, would you mind testing on master as well? Either way is okay but would see if it is a bug that has been solved and might be cherry picked

The problem is that it will probably work on master, but only because the error itself is pretty random. Just now, I was able to make it work by deleting and then re-adding meshes. Even if it does work on master, there is no reason to think the error was fixed, although I hope so. I will try a newer release and then post the results here.

verypleasentusername commented 1 year ago

Quick update: I tried again and it failed again, but this time at 62% (baking probes). The console showed Vulkan errors again, but now there were also a couple of errors saying "Out of Memory!", which makes me think that Godot's LightmapGI is trying to push the bake faster than the hardware can handle and goes beyond the memory limit. That's weird, because a rendering process should never be forced to go faster, and in most software it isn't. It's all just speculation, though. [image]

Calinou commented 1 year ago

Which graphics card model are you using?

Also, please upload a minimal reproduction project[^1] to make this easier to troubleshoot.

[^1]: A small Godot project which reproduces the issue, with no unnecessary files included. Be sure to not include the .godot folder in the archive (but keep project.godot).

Drag and drop a ZIP archive to upload it. Do not select another field until the project is done uploading.

Note for C# users: If your issue is not Mono-specific, please upload a minimal reproduction project written in GDScript or VisualScript. This will make it easier for contributors to reproduce the issue locally as not everyone has a Mono setup available.

verypleasentusername commented 1 year ago

Which graphics card model are you using?

Also, please upload a minimal reproduction project to make this easier to troubleshoot.

My graphics core is AMD Radeon(TM) Vega 8 Graphics. The issue might be related to computing power; in that case the name of the issue should be changed. [image]

I finally managed to recreate the issue in a minimal reproduction project and updated the initial comment. Again, it might work fine on your computer. In that case it's 99% related to computing power.

Calinou commented 1 year ago

In the MRP, lightmaps bake in 1 second with the denoiser enabled on my i9-13900K + RTX 4090 setup, so this is definitely hardware-specific. It could also be a driver bug.

How much system RAM do you have? The amount of video memory you can use with integrated graphics is determined by the amount of system RAM.

verypleasentusername commented 1 year ago

In the MRP, lightmaps bake in 1 second with the denoiser enabled on my i9-13900K + RTX 4090 setup, so this is definitely hardware-specific. It could also be a driver bug.

Uhh.. what is MRP? google shows weird answers.

How much system RAM do you have? The amount of video memory you can use with integrated graphics is determined by the amount of system RAM.

4 GB. It might seem funny to try baking on such a low-spec computer, but light-related operations in software such as Blender aren't forced and work fine for me. They are just slow, as baking lights in Godot should be. I was able to avoid crashes by tweaking the settings under Render->Lightmapper, and it might be meant to be that way. In that case it's yet another issue.

Calinou commented 1 year ago

Uhh.. what is MRP? google shows weird answers.

MRP stands for Minimal reproduction project.

4 GB. It might seem funny to try baking on such a low-spec computer, but light-related operations in software such as Blender aren't forced and work fine for me. They are just slow, as baking lights in Godot should be. I was able to avoid crashes by tweaking the settings under Render->Lightmapper, and it might be meant to be that way. In that case it's yet another issue.

Which exact settings did you tweak to get it to work?

saierXP commented 1 year ago

Same GPU (Vega 8), but with 8 GB of system memory, on Godot v4.2.beta5. Baking the MRP provided by the author reports the following errors and freezes at 50% (during direct light baking):

Vulkan: Device lost!
ERROR: Condition "err" is true.
   at: local_device_push_command_buffers (drivers/vulkan/vulkan_context.cpp:2796)
ERROR: Condition "!ld->waiting" is true.
   at: local_device_sync (drivers/vulkan/vulkan_context.cpp:2803)
ERROR: Condition "err" is true. Returning: ERR_CANT_CREATE
   at: _update_swap_chain (drivers/vulkan/vulkan_context.cpp:2135)
ERROR: Vulkan: Cannot submit graphics queue. Error code: VK_ERROR_DEVICE_LOST
   at: (drivers/vulkan/vulkan_context.cpp:2536)

Vulkan: Device lost!
ERROR: Condition "err" is true.
   at: local_device_push_command_buffers (drivers/vulkan/vulkan_context.cpp:2796)
ERROR: Condition "!ld->waiting" is true.
   at: local_device_sync (drivers/vulkan/vulkan_context.cpp:2803)
10 Times

ERROR: Condition "err" is true.
   at: local_device_push_command_buffers (drivers/vulkan/vulkan_context.cpp:2796)
ERROR: Condition "!ld->waiting" is true.
   at: local_device_sync (drivers/vulkan/vulkan_context.cpp:2803)
ERROR: Condition "err" is true. Returning: ERR_CANT_CREATE
   at: _update_swap_chain (drivers/vulkan/vulkan_context.cpp:2135)

After testing, changing the angular distance of the first DirectionalLight node from 15° back to the default value of 0° makes it work. [image]

Calinou commented 1 year ago

After testing, changing the angular distance of the first DirectionalLight node from 15° back to the default value of 0°, it works.

My guess is that this (very high) angular distance causes too many rays to be thrown or allocated. There should probably be an upper clamp on the angular distance in the inspector and/or the lightmapper, or the sample count should be clamped to a maximum value so that higher values can be used without using too much memory (at the cost of some visible banding).

Typical angular distance values are between 0° and 3° for real world renderings.

cc @DarioSamo

verypleasentusername commented 1 year ago

My guess is that this (very high) angular distance causes too many rays to be thrown or allocated. There should probably be an upper clamp on the angular distance in the inspector and/or the lightmapper, or the sample count should be clamped to a maximum value so that higher values can be used without using too much memory (at the cost of some visible banding).

Typical angular distance values are between 0° and 3° for real world renderings.

cc @DarioSamo

These are the settings I changed to avoid the freezes (they even made baking about twice as fast): [image]

(The first one was lowered by just one (it was 5 initially); the other two were halved.)

Also note that I used a high angular distance to mimic light scattering in the clouds and to get blobby shadows.

My guess is that this (very high) angular distance causes too many rays to be thrown or allocated

Aren't rays emitted in a fixed amount per pixel (regardless of settings such as angular distance)? Please feel free to correct me if I'm wrong.

verypleasentusername commented 1 year ago

Same GPU (Vega 8), but with 8 GB of system memory, on Godot v4.2.beta5. Baking the MRP provided by the author reports the following errors and freezes at 50% (during direct light baking) [...]

After testing, changing the angular distance of the first DirectionalLight node from 15° back to the default value of 0° makes it work.

Can you test the settings changes I showed above (with the angular distance at 15°)? It would be helpful to know whether the fix can be reproduced.

DarioSamo commented 1 year ago

Running long compute jobs on weak hardware is pretty much a recipe for trouble if you don't disable the TDR. "Region Size" is probably the setting you want to mess with to significantly reduce the amount of work that will be dispatched on each compute call to run below the timeout threshold.

About anything else mentioned so far sounds pretty irrelevant to me to be honest, it just sounds like you're on the very edge of the timeout so it'll randomly work or not depending on the complexity of the scene.
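
To make the "Region Size" point concrete, here is a minimal sketch of how a smaller region trades fewer texels per compute dispatch for more dispatches overall. It assumes the lightmap is square and is covered by square tiles of `region_size` texels per side; whether Godot's baker tiles exactly this way is an assumption, and both function names are hypothetical.

```python
import math


def region_dispatches(lightmap_size: int, region_size: int) -> int:
    """Number of square regions needed to cover a square lightmap."""
    per_axis = math.ceil(lightmap_size / region_size)
    return per_axis * per_axis


def texels_per_dispatch(lightmap_size: int, region_size: int) -> int:
    """Upper bound on texels processed by a single region dispatch."""
    side = min(lightmap_size, region_size)
    return side * side


if __name__ == "__main__":
    # Illustrative sizes only, not values taken from the thread.
    for region in (512, 256, 128):
        print(
            f"region_size={region}: "
            f"{region_dispatches(2048, region)} dispatches, "
            f"up to {texels_per_dispatch(2048, region)} texels each"
        )
```

Halving the region size cuts the work per dispatch to roughly a quarter, which is what helps each dispatch finish before the driver timeout, at the cost of four times as many dispatches.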

verypleasentusername commented 1 year ago

Running long compute jobs on weak hardware is pretty much a recipe for trouble if you don't disable the TDR. "Region Size" is probably the setting you want to mess with to significantly reduce the amount of work that will be dispatched on each compute call to run below the timeout threshold.

About anything else mentioned so far sounds pretty irrelevant to me to be honest, it just sounds like you're on the very edge of the timeout so it'll randomly work or not depending on the complexity of the scene.

Why does the timeout exist anyway? And why do compute calls have one?

DarioSamo commented 1 year ago

Why does the timeout exist anyway? And why do compute calls have one?

The timeout is at the driver level and pretty much for any GPU work, not just compute. We don't really control it from Godot's side, we can just make some estimates as to how much work should be dispatched, but obviously the amount we choose isn't gonna take the same on all hardware.

verypleasentusername commented 1 year ago

Why does the timeout exist anyway? And why do compute calls have one?

The timeout is at the driver level and pretty much for any GPU work, not just compute. We don't really control it from Godot's side, we can just make some estimates as to how much work should be dispatched, but obviously the amount we choose isn't gonna take the same on all hardware.

Makes sense. So it's not an issue anymore? Maybe the documentation should be changed, as it's not very clear that the Lightmapper settings need to be configured individually for a specific computer.

Calinou commented 1 year ago

The most optimal way to avoid the issue is to increase the TDR duration, but this requires editing the registry with administrator privileges. We can provide a .reg file for doing so (or even make Godot execute the required task using PowerShell code when requested), but it won't be usable in every case.

This approach is also used by software like Substance Painter, which warns you on startup if the TDR isn't increased.

PS: This is a non-issue on Linux (and possibly macOS), since they don't have a concept of TDR in the first place. Drivers can happily hang forever there 🙂
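
As a rough sketch of what such a .reg file could contain, the snippet below builds the text for raising `TdrDelay` under the `GraphicsDrivers` registry key, which is where Windows reads its TDR settings. The function name and the 60-second value are illustrative assumptions; importing the result still needs administrator privileges and a reboot, and this is not an official Godot-provided file.

```python
def tdr_reg_file(delay_seconds: int) -> str:
    """Return .reg file text that sets the TDR delay to `delay_seconds`."""
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # DWORD values in .reg files are written as 8 hexadecimal digits.
        f'"TdrDelay"=dword:{delay_seconds:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 60 seconds is an example value, not a recommendation.
    print(tdr_reg_file(60))
```

The function only produces text, so the result can be reviewed before anything is written to the registry.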

saierXP commented 1 year ago

Running long compute jobs on weak hardware is pretty much a recipe for trouble if you don't disable the TDR. "Region Size" is probably the setting you want to mess with to significantly reduce the amount of work that will be dispatched on each compute call to run below the timeout threshold.

After setting TdrLevel to 0 and TdrDelay to 1000, then rebooting the PC, baking does complete with the angular distance at 15°.

PS: Disable TDR link

DarioSamo commented 1 year ago

After setting TdrLevel to 0 and TdrDelay to 1000, then rebooting the PC, baking does complete with the angular distance at 15°.

PS: Disable TDR link

Yep, sounds about what I expected.

While this is an issue I think we're safe to close this and boil it down to some general proposal instead of how we could handle this behavior. There are some different approaches that could work (e.g. running small benchmarks with incremental region sizes until it reaches a safe amount of time) but it's very much an area where there's no universal solution to fix it due to the APIs giving no control over this timeout.
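
The incremental-benchmark idea could look roughly like the sketch below: time one region at increasing sizes and keep the largest size that stays within a time budget. Everything here is hypothetical (`pick_region_size`, `run_region`, the candidate sizes and the budget); it only illustrates the search loop and is not how Godot implements anything.

```python
import time
from typing import Callable, Optional, Sequence


def pick_region_size(
    run_region: Callable[[int], None],
    candidates: Sequence[int],
    budget_seconds: float,
    clock: Callable[[], float] = time.perf_counter,
) -> int:
    """Return the largest candidate whose benchmark run fits the budget.

    Candidates are tried smallest first, and the search stops at the first
    size that is too slow, so much larger sizes are never attempted.
    """
    best: Optional[int] = None
    for size in sorted(candidates):
        start = clock()
        run_region(size)  # hypothetical: bake one region of this size
        elapsed = clock() - start
        if elapsed > budget_seconds:
            break
        best = size
    if best is None:
        raise RuntimeError("even the smallest region exceeded the budget")
    return best
```

Starting from the smallest size matters: a benchmark dispatch that itself runs past the driver timeout would cause the very device loss it is trying to avoid, so the budget should sit well below the timeout.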

Calinou commented 10 months ago

While this is an issue I think we're safe to close this and boil it down to some general proposal instead of how we could handle this behavior. There are some different approaches that could work (e.g. running small benchmarks with incremental region sizes until it reaches a safe amount of time) but it's very much an area where there's no universal solution to fix it due to the APIs giving no control over this timeout.

Could we default to a lower region size on integrated graphics automatically? Vulkan reports the device type via RenderingServer.get_video_adapter_type().

I suppose this will be best implemented once we have support for a low_end_gpu feature tag, so it can be added as a project setting feature tag override. This is something I discussed with reduz recently, so it should be good to implement.
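
A minimal sketch of that default-selection logic follows. The enum mirrors the order of Godot's `RenderingDevice.DeviceType` values as an assumption, and the base region size of 512 and the divisor of 4 are illustrative numbers, not engine behavior.

```python
from enum import IntEnum


class DeviceType(IntEnum):
    # Assumed to mirror RenderingDevice.DeviceType in Godot 4.
    OTHER = 0
    INTEGRATED_GPU = 1
    DISCRETE_GPU = 2
    VIRTUAL_GPU = 3
    CPU = 4


def default_region_size(
    device_type: DeviceType, base: int = 512, low_end_divisor: int = 4
) -> int:
    """Pick a smaller default region size for low-end device types."""
    if device_type in (DeviceType.INTEGRATED_GPU, DeviceType.CPU):
        return max(1, base // low_end_divisor)
    return base


if __name__ == "__main__":
    for kind in DeviceType:
        print(kind.name, default_region_size(kind))
```

A feature-tag override would make the same decision declaratively in the project settings rather than in code.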

verypleasentusername commented 10 months ago

Could we default to a lower region size on integrated graphics automatically? Vulkan reports the device type via RenderingServer.get_video_adapter_type().

I want to note that I did set region_size to a really small number ([image], to be exact). The output was still full of errors and it crashed.

When I tried a region_size of 2, the same thing happened. Deleting the .godot cache folder did not change the result. Either the region_size property doesn't really work, or the error is not caused by the region_size value at all. I must note, though, that the bake progress got slower and slower as I decreased the region_size value.

The trouble might be caused by a corrupted model, or by more than two identical models being baked. Also, what does the src_tex error mean anyway?

verypleasentusername commented 10 months ago

I was able to successfully render a complex scene without crashes, which makes me think my problem is absolutely not related to computing power. [image] Baking the four house models sometimes works and sometimes doesn't, but changing region_size does not help.

DarioSamo commented 10 months ago

I was able to successfully render a complex scene without crashes, which makes me think my problem is absolutely not computing-power related.

You're not approaching this test the right way in that case. You should try disabling TDR to confirm whether it's computing-power related. If you're getting the same error the other user reported, then it's absolutely related to that. If it's something else, then you'd get a different kind of error than just device lost.

What errors are you getting on output when it crashes?

verypleasentusername commented 10 months ago

You're not approaching this test the right way in that case. You should try disabling TDR to confirm whether it's computing-power related. If you're getting the same error the other user reported, then it's absolutely related to that. If it's something else, then you'd get a different kind of error than just device lost.

What errors are you getting on output when it crashes?

I see why TDR is brought up again, but I want to clarify that I did set it to 60 and the random freezes no longer happen. I also want to clarify that the "DEVICE LOST" error is nowhere to be seen, and I'm on Godot 4.2.2 now. I opened a new issue about all these changes and how lightmapping errors still appear, but I was redirected to this issue by @Calinou, which is fair.

What happens now is errors, and the progress stops because of them. These errors are: [image] and "src_tex" is missing [image]

A moment later, tons of UNIFORM SET errors appear. Sometimes they don't, and then no crash happens, but the progress still stops. I still don't know why this error happens with the house models but not with a subdivided cube model. Maybe something in them makes Godot hiccup, I don't know. I've lost all my nerves on this error.

DarioSamo commented 10 months ago

@verypleasentusername Yeah it seems your original post lacked that information so it wasn't possible to determine what the actual cause was.

-2 is VK_ERROR_OUT_OF_DEVICE_MEMORY. Do you perhaps not have enough video memory to run the baker?

Admittedly it can do a better job at deleting resources if it doesn't need them while it's baking, but that is sounding like the cause here.
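
For reference, the numeric codes in these logs are `VkResult` values from the Vulkan specification. The small lookup below covers only the first few, including the two seen in this thread; the helper name is hypothetical.

```python
# A few VkResult codes as defined by the Vulkan specification.
VK_RESULT_NAMES = {
    0: "VK_SUCCESS",
    -1: "VK_ERROR_OUT_OF_HOST_MEMORY",
    -2: "VK_ERROR_OUT_OF_DEVICE_MEMORY",
    -3: "VK_ERROR_INITIALIZATION_FAILED",
    -4: "VK_ERROR_DEVICE_LOST",
}


def describe_vk_result(code: int) -> str:
    """Translate a numeric VkResult into its symbolic name when known."""
    return VK_RESULT_NAMES.get(code, f"unknown VkResult ({code})")


if __name__ == "__main__":
    print(describe_vk_result(-2))
    print(describe_vk_result(-4))
```

The distinction matters for diagnosis here: -4 (device lost) points at the driver timeout discussed earlier, while -2 points at running out of video memory.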

verypleasentusername commented 10 months ago

@verypleasentusername Yeah it seems your original post lacked that information so it wasn't possible to determine what the actual cause was.

-2 is VK_ERROR_OUT_OF_DEVICE_MEMORY. Do you perhaps not have enough video memory to run the baker?

Admittedly it can do a better job at deleting resources if it doesn't need them while it's baking, but that is sounding like the cause here.

It looks like the errors are actually caused by texture size. When I set the lightmap size of the (4) subdivided cubes to 2000x2000 pixels, it crashed with the same error.

Is there any way to avoid this? Use a pre-made texture, maybe? Or at least, is there a way to automatically divide the texture size by some number (if I remember correctly, @Calinou had a similar idea somewhere in the proposals, something like a preview bake)? Or is max_texture_size in LightmapGI what is causing the error? Can't it be set higher than 16384?

DarioSamo commented 10 months ago

is there any way to avoid this?

The mesh's texel size controls directly what the size of the resulting lightmap will be, you can look up the documentation of Lightmap of how to change this.

I feel whatever problem you're running into means you're just hitting the upper bound of what your video memory allows while baking. Like I said, there might be ways into looking to minimize this as much as possible, but that'd depend on how reasonable the costs are here vs what the system allows. What's your total VRAM?

verypleasentusername commented 10 months ago

The mesh's texel size controls directly what the size of the resulting lightmap will be, you can look up the documentation of Lightmap of how to change this.

I feel whatever problem you're running into means you're just hitting the upper bound of what your video memory allows while baking. Like I said, there might be ways into looking to minimize this as much as possible, but that'd depend on how reasonable the costs are here vs what the system allows. What's your total VRAM?

Not... much (second line): [image] I thought baking lights used RAM instead.

DarioSamo commented 10 months ago

i thought baking light uses RAM instead.

Nope, it's a GPU baker, so it uses compute shaders and video memory textures. I don't think you'll get very far with it if you're limited on resources like that.

There could be more work on the engine's side to minimize the amount of memory used (and also be clearer what the true cause of the error is), but that is a pretty painfully low amount of memory to work with.

verypleasentusername commented 10 months ago

i thought baking light uses RAM instead.

Nope, it's a GPU baker, so it uses compute shaders and video memory textures. I don't think you'll get very far with it if you're limited on resources like that.

There could be more work on the engine's side to minimize the amount of memory used (and also be clearer what the true cause of the error is), but that is a pretty painfully low amount of memory to work with.

Wow, that's awful. I thought integrated graphics were somewhat OK with it? At least I have not had trouble working in programs like Blender, but maybe the problem is really about texture type optimization and error clarity.

I still don't know whether keeping the textures on disk and using VRAM only to render rays is an option. I know it's too small a feature for the devs to bother with, but I want to know if this is even possible. Or is there any way to extend VRAM by memory swapping, or should I ask other people to render my scenes?

DarioSamo commented 10 months ago

I still don't know whether keeping the textures on disk and using VRAM only to render rays is an option. I know it's too small a feature for the devs to bother with, but I want to know if this is even possible. Or is there any way to extend VRAM by memory swapping, or should I ask other people to render my scenes?

Keeping it on disk would not work, it only loads what it requires at a time for rendering the lightmaps, which consists of quite a few versions of the same texture at full size (diffuse, normals, light accumulation, etc). All of that has to be in VRAM to be able to render it. It can probably be looked into at some point if there's some potential savings or inefficiencies, but you can also probably just increase the amount of RAM your system dedicates to video memory on your BIOS.
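
A back-of-the-envelope estimate shows why several full-size copies add up quickly. The sketch assumes 4 full-size buffers at 8 bytes per texel (for example, four 16-bit float channels); both numbers are illustrative assumptions, not measurements of Godot's baker.

```python
def lightmap_vram_bytes(
    width: int, height: int, buffers: int = 4, bytes_per_texel: int = 8
) -> int:
    """Estimate bytes needed to hold `buffers` full-size lightmap textures."""
    return width * height * buffers * bytes_per_texel


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    for side in (2000, 4096, 16384):
        estimate = lightmap_vram_bytes(side, side)
        print(f"{side}x{side}: about {to_mib(estimate):.0f} MiB")
```

Because the cost grows with the square of the lightmap side length, lowering the texel density on large meshes is the most direct way to bring a bake back within a small video memory budget.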