godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License

C# Compute Shaders run significantly slower than GDScript compute shaders. (Godot 4.3 RC3) #95521

Open NilOmniscient opened 4 weeks ago

NilOmniscient commented 4 weeks ago

Tested versions

System information

EndeavourOS Linux (Arch based). CPU - Intel i7-1165G7 (iGPU). Drivers - vulkan-intel/mesa

Issue description

The project uses a GPU Poisson Disk Sampling shader. When run via the C# version of rd.Submit(), each shader pass takes 300-500ms. When called from a GDScript version (even from inside a C# program), each shader pass runs in < 5ms.

Huge discrepancy, and not entirely sure if this is a bug, or just a current limitation of the C# implementation.

Steps to reproduce

Load up MRP, in Main, toggle "Gd Script Shader" export variable on and off. Time Elapsed in ms is posted to Output terminal. GDScript version of Shader Code runs in < 150ms total whereas the C# version of the Shader code takes almost 3.5s.

Both use the same glsl file.

Minimal reproduction project (MRP)

Archive.zip

tetrapod00 commented 4 weeks ago

I consistently get ~1100ms with C# and ~380ms with GDScript.

Godot v4.3.rc3.mono - Windows 10.0.19045 - Vulkan (Mobile) - dedicated NVIDIA GeForce GTX 1660 Ti with Max-Q Design (NVIDIA; 32.0.15.5612) - AMD Ryzen 7 3750H with Radeon Vega Mobile Gfx (8 Threads)

--- Debugging process started ---
Godot Engine v4.3.rc3.mono.official.03afb92ef - https://godotengine.org
Vulkan 1.3.278 - Forward Mobile - Using Device #0: NVIDIA - NVIDIA GeForce GTX 1660 Ti with Max-Q Design

C# Time Elapsed: 1113
--- Debugging process stopped ---
Set GDScriptShader
--- Debugging process started ---
Godot Engine v4.3.rc3.mono.official.03afb92ef - https://godotengine.org
Vulkan 1.3.278 - Forward Mobile - Using Device #0: NVIDIA - NVIDIA GeForce GTX 1660 Ti with Max-Q Design

GDScript Time Elapsed: 377
--- Debugging process stopped ---

NilOmniscient commented 4 weeks ago

Godot Engine v4.3.rc3.mono.official.03afb92ef - https://godotengine.org
Vulkan 1.3.278 - Forward Mobile - Using Device #0: Intel - Intel(R) Xe Graphics (TGL GT2)

C# Time Elapsed: 3312ms
Set GDScriptShader
Godot Engine v4.3.rc3.mono.official.03afb92ef - https://godotengine.org
Vulkan 1.3.278 - Forward Mobile - Using Device #0: Intel - Intel(R) Xe Graphics (TGL GT2)

GDScript Time Elapsed: 153ms

This is what I normally get. I think it's interesting that your C# is faster than mine, but the GDScript is slower for you.

clayjohn commented 4 weeks ago

I don't have dotnet set up on this device. But looking through the code, I can see that this is measuring not just the time to run the compute shader, but the time to create the rendering device, compile the shader, load/create all the resources, and read the data back from the shader afterwards.

As a next step, someone will need to do some more fine-grained profiling to see where the difference is coming from. My gut tells me that it won't be from running the compute shader; it's more likely going to come from reading the storage buffer back from the GPU. My guess is that C# ends up doing more memory allocations and copies the memory around more times.
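The kind of per-stage profiling suggested here can be sketched as follows. This is a language-neutral illustration in Python, not Godot code: the stage names, the `timed` helper, and the `time.sleep` placeholders are all hypothetical stand-ins for the real setup, submit/sync, and readback calls, and the pass count of 9 is taken from the reporter's description of the algorithm.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per stage, in milliseconds.
stage_ms = defaultdict(float)

@contextmanager
def timed(stage):
    """Add the elapsed time of the enclosed block to stage_ms[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage] += (time.perf_counter() - start) * 1000.0

def run_once(passes=9):
    # Placeholder work only: each sleep stands in for a real engine call.
    with timed("setup"):          # device creation, shader compile, buffers
        time.sleep(0.002)
    for _ in range(passes):
        with timed("submit+sync"):  # one compute pass
            time.sleep(0.001)
    with timed("readback"):       # copying the storage buffer back
        time.sleep(0.001)

run_once()
for stage, ms in stage_ms.items():
    print(f"{stage}: {ms:.1f} ms")
```

Timing each stage separately, rather than the whole run, is what would show whether the gap sits in setup, in the submit/sync pair, or in the readback.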

NilOmniscient commented 4 weeks ago

In my initial Issue Description, I do also list the individual runtimes for just the rd.Submit() and rd.Sync(). I did some testing before I submitted the bug to make sure I wasn't causing most of my headache due to porting the code badly.

The actual runtimes on my machine for each rd.Submit() and rd.Sync() combo, no other things: ~5ms for GDScript, and ~300-500ms for C#.

It loops 9 times (the algorithm is based on a 2008 talk by Li-Yi Wei about parallel Poisson Disk Sampling, which involves processing on the GPU in 9 steps), so the vast majority of the discrepancy should be there.
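As a rough sanity check on that claim, the per-pass figures can be multiplied out and compared with the totals logged earlier in the thread. This is a back-of-the-envelope sketch in Python, assuming every one of the 9 passes costs about the same; the variable names are illustrative only.

```python
PASSES = 9  # GPU steps per run, as described above

def total_ms(per_pass_ms, passes=PASSES):
    """Estimated time spent in submit/sync across all passes."""
    return per_pass_ms * passes

# Per-pass figures reported on the Intel Xe machine.
csharp_low, csharp_high = total_ms(300), total_ms(500)
gdscript_upper = total_ms(5)

# Whole-run totals logged on the same machine.
csharp_measured = 3312
gdscript_measured = 153

print(f"C# estimate: {csharp_low}-{csharp_high} ms (measured {csharp_measured} ms)")
print(f"GDScript estimate: < {gdscript_upper} ms (measured {gdscript_measured} ms)")
```

Under that assumption the C# estimate (2700-4500 ms) brackets the measured 3312 ms, while the GDScript passes account for at most 45 ms of its 153 ms total, with the remainder going to everything else the timer covers.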

clayjohn commented 4 weeks ago

@NilOmniscient Regardless of what language you are using, the shader runs exactly the same. It's definitely not the shader that is running differently, which is why we need to profile the entire thing and see where the discrepancy is coming from.

NilOmniscient commented 4 weeks ago

@clayjohn I get that. The shader runs on the GPU, completely separate from everything else.

I was simply pointing out that when I originally had more logging in there, whatever happens inside rd.Submit() and rd.Sync() seemed to be what was causing the largest gaps, and was hoping that might help narrow things down more.

clayjohn commented 4 weeks ago

Thank you for the clarification! That result is extremely weird as there should be no difference between calling submit and sync from C# or GDScript. In both cases you are just making a call directly into an internal engine function.

Maybe @raulsntos has some ideas about how performance could be affected in such a case?

NilOmniscient commented 4 weeks ago

Let me add back the extra logging and resubmit the MRP (and add logs from my machine) just in case I'm remembering wrong.

Bear in mind I'm on a different machine right now, so it'll probably be more in line with tetrapod's results.

NilOmniscient commented 4 weeks ago

This is a Windows machine with a Ryzen 5600X CPU and an RX 6700 XT GPU. New logs follow. The biggest time difference is in rd.Sync(). I'm uploading the project with more detailed logging inside each version's ShaderHelper.RunShader().

Archive_BetterLogs.zip

--- Debugging process started ---
Godot Engine v4.3.stable.mono.official.77dcf97d8 - https://godotengine.org
Vulkan 1.3.280 - Forward Mobile - Using Device #0: AMD - AMD Radeon RX 6700 XT

C# Uniform Set Creation: 1ms
C# Pipeline Creation: 0ms
C# Compute List Creation: 0ms
C# rd.Submit(): 0ms
C# rd.Sync(): 40ms
C# rd.FreeRid(pipeline) && rd.FreeRid(uniformSet): 0ms
Time Elapsed: 446
Set GDScriptShader
--- Debugging process stopped ---
--- Debugging process started ---
Godot Engine v4.3.stable.mono.official.77dcf97d8 - https://godotengine.org
Vulkan 1.3.280 - Forward Mobile - Using Device #0: AMD - AMD Radeon RX 6700 XT

GDScript Pipeline creation: 0ms
GDScript Uniform Set Creation: 1ms
GDScript Compute List Creation: 0ms
GDScript rd.submit(): 0ms
GDScript rd.sync(): 1ms
GDScript rd.free_rid(pipeline) && rd.free_rid(uniform_set): 0ms
Time Elapsed: 117