Open Vipitis opened 1 week ago
Not sure if it helps to solve this, but it might help investigate options... `faulthandler.enable()`
is my go-to stdlib utility for debugging hard crashes. It will tell you exactly where and when the crash happened.
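For reference, enabling it is a one-liner; a minimal sketch of how it might be dropped into a script (the placement comment is a suggestion, not wgpu-py convention):

```python
import faulthandler
import sys

# Enable as early as possible, before importing the suspect code. On a hard
# crash (SIGSEGV, SIGFPE, SIGABRT, ...) the handler dumps the Python-level
# traceback of every thread to stderr just before the process dies.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... import wgpu and run the suspect shader code after this point ...
```

Note the dump stops at the last Python frame, so for a crash inside wgpu-native it points at the ffi call that entered Rust land, not at the Rust code itself.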
Aside from that, it's not technically possible to catch a segfault in the current process (threads or not). You can only detect that a subprocess has (likely) segfaulted by examining its exit code.
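To illustrate the exit-code route: on POSIX, a child killed by a signal exits with the negated signal number, so a segfault shows up as `-signal.SIGSEGV` (-11 on Linux). A hypothetical helper (`run_isolated` is not wgpu-py API):

```python
import signal
import subprocess
import sys

def run_isolated(code: str) -> tuple[bool, int]:
    # Run a Python snippet in its own interpreter, so a hard crash
    # kills the child instead of the current process.
    proc = subprocess.run([sys.executable, "-c", code])
    # POSIX convention: killed-by-signal -> returncode == -signum.
    return proc.returncode == -signal.SIGSEGV, proc.returncode
```

Windows reports crashes differently (large positive status codes such as `0xC0000005`), so the check above is POSIX-only.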
The "proper" fix would be to modify the upstream Rust libraries so that they handle their internal exceptions without panicking or segfaulting.
Are you looking for bad shader code? Or anything that makes wgpu crash and burn? And if we find something, how do we send it to you? I know of some issues with RenderBundles.
exactly, I am concerned with shadercode that is buggy enough that it will crash an otherwise working program. There will be plenty of ways to get wgpu to reach a panic, but assume you are just accepting user/generated shadercode, nothing else. You want to avoid crashing your scripts, no matter how bad or even malicious the shadercode is.
There are two examples in the diff right now, with one or two more coming. For example, right now I am trying to root-cause and minimize the problem with this shadertoy, and another one I am trying to figure out is here, which just exits somewhere in `create_render_pipeline` with Vulkan but works in DX12.
`result = ob(*args)`
This is where we make a call into the library, i.e. Rust land. Indeed, if something bad happens there (like a panic) there's no way to catch that.
Not sure if this helps, but maybe the test script can run the failing examples in a subprocess, so that pytest itself is still alive and can e.g. check the return code, and Rust tracebacks.
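Sketching that idea (the snippet and test name are made up, not actual wgpu-py test code): each fatal example runs in a child interpreter, and the test only inspects the exit status and stderr.

```python
import subprocess
import sys

# Stand-in for a crashing wgpu example; a real test would run the
# actual failing script instead.
FATAL_SNIPPET = "import sys; sys.stderr.write('panicked at ...'); sys.exit(101)"

def test_fatal_example_is_contained():
    proc = subprocess.run(
        [sys.executable, "-c", FATAL_SNIPPET],
        capture_output=True,
        text=True,
    )
    # The child dying (signal -> negative code on POSIX, Rust panic ->
    # exit code 101 by convention) leaves pytest itself alive to assert.
    assert proc.returncode != 0
    # Rust panic messages land on stderr and can be matched here.
    assert "panicked" in proc.stderr
```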
This is a WIP; its purpose is to fail badly... so don't merge.
I am compiling all failure cases that cause the Python process to crash or silently exit, to think about how to handle them as Python exceptions. Debuggers will also just exit, so you have to step through the code quite a lot to figure out which part breaks it. Where possible I link the appropriate upstream issues and minimal reproducers. Some cases are dependent on the backend; I usually default to Vulkan, but test DX12 for comparison too.
Motivation
My main use case is running a large number of shaders to sort out which work and which don't. This is part of my thesis work on evaluating (generated) shadercode. So there can be all kinds of messed-up code. I am not interested in how to write working code; this is about being able to handle errors.
Previous attempts
GPUValidationError in `create_shader_module`
Plan
collect cases
I started a `test_wgpu_errors_fatal.py` file as part of the test suite (mostly reusing other test code). Running it will also crash pytest, so maybe we skip it by default. It would be great to add some more cases and also note down how they panic and where in our code (usually when calling the C function). Please contribute any kind of crashes you encounter, even if you don't have a minimal reproducer... I spent a few nights hunting bugs really far down, so I might see something. WGSL vs. GLSL doesn't really matter, since it's always translated to WGSL, so I just went with that.
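One way to skip the whole file by default (the marker and reason string are just a suggestion, not something the suite defines):

```python
import pytest

# Module-level mark: pytest still collects the file but skips every test
# in it, so a normal `pytest` run survives. Comment this out (or run the
# cases manually) to reproduce the crashes.
pytestmark = pytest.mark.skip(reason="these cases crash the Python process")
```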
find a solution
Problems will eventually be fixed upstream and make it into wgpu-py, and that will be the best solution... but it can take months, and some issues haven't been fixed upstream for nearly a year.
This is the case where I am sorta lost myself. Maybe the changes to the device-lost logging could lead to a raised exception, as tried in #547; otherwise, changes to how the SafeLibCalls work, so the Python instance is not dropped when the Rust code reaches `panic!`. Usually the last line that gets executed is this: https://github.com/pygfx/wgpu-py/blob/cf59eb012d87ac384e62744385c3df0ce9dddad4/wgpu/backends/wgpu_native/_helpers.py#L305