hoffstadt / DearPyGui

Dear PyGui: A fast and powerful Graphical User Interface Toolkit for Python with minimal dependencies
https://dearpygui.readthedocs.io/en/latest/
MIT License

Deadlocks when DPG is called from a (heavy loaded) user thread #2053

Open · v-ein opened this issue 1 year ago

v-ein commented 1 year ago

Version of Dear PyGui

Version: 1.8.0
Operating System: Windows 10

My Issue/Question

If there are several threads calling DPG functions, there's a chance they get into a deadlock where one thread is holding the GIL and waiting for the DPG mutex, while the other holds the mutex and is waiting for the GIL. For example, a user thread may acquire the GIL and then attempt to lock the mutex at the very moment DPG internals, already holding the mutex, need the GIL back to proceed.

The issue never occurs if DPG is only run on the main thread and the callback thread - the renderer always releases the GIL before attempting to lock the mutex. The rest of the code, however, holds the GIL when locking the mutex.

As soon as there's a DPG call in a thread other than those two "standard" threads, there's a chance for a deadlock. A call to dpg.mutex() is not necessary - the same deadlock may happen with any DPG call; dpg.mutex() just increases the chances of bumping into it.
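
To illustrate the interleaving, here's a minimal sketch with two plain Python locks standing in for the GIL and the DPG mutex (an analogy, not DPG's actual internals; all names are illustrative):

import threading

gil = threading.Lock()        # stands in for the Python GIL
dpg_mutex = threading.Lock()  # stands in for the internal DPG mutex

def dpg_internals():
    # E.g. the renderer: takes the mutex first (having released the GIL),
    # then needs the GIL again to touch Python objects.
    with dpg_mutex:
        with gil:
            pass

def user_thread():
    # A user thread calling a DPG function: it already holds the GIL
    # and then asks for the DPG mutex.
    with gil:
        with dpg_mutex:
            pass

# If dpg_internals() grabs dpg_mutex while user_thread() holds gil, each
# thread blocks forever on the lock the other one owns - a classic
# lock-order inversion. Whether a given run deadlocks depends on timing.
threading.Thread(target=dpg_internals).start()
threading.Thread(target=user_thread).start()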

To Reproduce

Note: to reproduce this, one needs PR #2004 to be merged first. Without that, the mutex is not held by DPG callers.

Steps to reproduce the behavior in general:

  1. Add a callback that's called pretty often (e.g. the "visible" or "hover" callback) and calls a DPG function.
  2. Start a new Python thread.
  3. Lock the DPG mutex in that thread and start doing a heavy job - not necessarily calling DPG functions; any job will do as long as it lasts longer than 5 ms.
  4. Wait for the application to hang.

With the example below:

  1. Hover the "hover me" text and make sure it shows quickly changing numbers.
  2. Leave the checkbox unset and click the button to start a new thread. You'll see how CPU usage goes up.
  3. Hover the text again and wait for a while. Make sure it doesn't hang.
  4. Close the application and restart it.
  5. Set the checkbox and click the button again.
  6. Hover the button and the checkbox to see that DPG is still functioning.
  7. Hover the text and wait for a while. Eventually it hangs (well, it should actually hang right away; 5 ms is not something visible to the naked eye).

Expected behavior

No deadlocks.


Standalone, minimal, complete and verifiable example

#!/usr/local/bin/python3

from random import randrange
from threading import Thread
import dearpygui.dearpygui as dpg

dpg.create_context()
dpg.create_viewport(title="Test", width=500, height=400)

def heavy_task():
    lock_it = dpg.get_value("lock-mutex")
    while True:
        count = 0
        if lock_it:
            dpg.lock_mutex()
        # Simulate CPU-heavy work while (optionally) holding the DPG mutex.
        for _ in range(1000):
            count += 1
        if lock_it:
            # Give other threads a chance to run DPG stuff (including the DPG renderer).
            dpg.unlock_mutex()

dpg.setup_dearpygui()
with dpg.window(pos=(100, 100), width=350, height=200, no_title_bar=False):
    with dpg.item_handler_registry() as handlers:
        dpg.add_item_hover_handler(callback=lambda: dpg.set_value("my-text", f"{randrange(100000):5}"))
    dpg.add_text("Start a thread then hover me!", tag="my-text")
    dpg.bind_item_handler_registry(dpg.last_item(), handlers)

    dpg.add_checkbox(label="Lock mutex in the user thread", tag="lock-mutex")
    dpg.add_button(label="Start thread", callback=lambda: Thread(target=heavy_task).start())

dpg.show_viewport()
dpg.start_dearpygui()
dpg.destroy_context()
v-ein commented 1 year ago

Got a fix for this, going to open a PR later.

Breakthrough commented 11 months ago

@v-ein did you ever get a fix for this uploaded? I'm running into the same issue which makes it difficult to use modern versions of DPG for async video decoding.

This seems like a pretty critical issue for any application that relies on threading to offload CPU-heavy tasks. Are there any suggested workarounds? All the guidance I can find suggests using a render callback, which is no longer supported. I also found that swapping textures in the render loop seems to trigger the deadlock more frequently in some cases.

E.g. I run into this issue when I update a texture in a background thread while also dragging a slider quickly. The application suddenly deadlocks indefinitely.

Edit: Thank you for your work on the project as well!

v-ein commented 11 months ago

No, not yet. I'll see if I get some spare time to do that.

I've got more fixes related to synchronization, and hoped to get a chance some day to combine them into one and push as a single PR. Most of them are massive edits and thus are difficult to review, so my idea was to reduce the grind for reviewers. I'll see what I can do.

v-ein commented 11 months ago

Offloading CPU-heavy tasks is probably best done using a background thread, like you suggested. However, DPG is not entirely thread-safe, and won't be even if we fix all the sync issues. It maintains state (dpg.last_item() and the container stack, at the very least). Since that state is global, i.e. shared between threads, certain DPG functions are unsafe and may cause race conditions; see the sketch below.
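
A hedged sketch (mine, not an official example) of the kind of race that global state implies - two threads each add an item and then ask dpg.last_item(), which may by then refer to the other thread's item:

from threading import Thread
import dearpygui.dearpygui as dpg

dpg.create_context()

with dpg.window(tag="win"):
    pass

def add_and_query(label):
    dpg.add_text(label, parent="win")
    # Another thread may add its own item between the call above and the
    # call below, in which case last_item() names the *other* thread's item.
    print(label, "->", dpg.get_value(dpg.last_item()))

threads = [Thread(target=add_and_query, args=(s,)) for s in ("A", "B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

dpg.destroy_context()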

Each alternative has its own drawback:

I hope one day I can make DPG fully thread-safe by making its state thread-local. This is a distant goal, and might not be 100% doable, so don't count on it.

axeldavy commented 1 month ago

The first alternative you describe is the easiest to put in place, but indeed, if events pile up, we can end up skipping important callbacks.

One improvement to this alternative could be a flag telling the handler to 'skip' an event if a newer one is already available.

For example, a mouse handler might prefer to skip old mouse positions and go directly to the most recent one.

Imagine a plot with heavy rendering in a callback (drawing lines, circles, etc.). By heavy, I mean in terms of CPU, not GPU. Doing the rendering in a callback is a good solution, as the mouse can still interact smoothly with the plot. The rendering can be performed on a hidden layer, and when it is done, that layer is shown and the previous one hidden (double buffering).

DearPyGui is ideal for the above scenario. The only issue is if mouse/keyboard events pile up. What do you think of introducing handler configuration options to skip redundant events in the callback queue?
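
For what it's worth, something along those lines can already be approximated on the user side today. Below is a sketch (my own, with heavy_redraw as a hypothetical placeholder) where the handler merely overwrites the newest payload and a worker thread only ever processes the most recent one, so stale positions are dropped for free:

import threading

_latest = None
_lock = threading.Lock()
_wakeup = threading.Event()

def heavy_redraw(pos):
    # Hypothetical placeholder for the expensive CPU-side rendering.
    print("redrawing at", pos)

def on_mouse_move(sender, app_data):
    # Cheap handler callback: just overwrite the stored payload, so any
    # older, not-yet-processed position is silently dropped.
    global _latest
    with _lock:
        _latest = app_data
    _wakeup.set()

def worker():
    global _latest
    while True:
        _wakeup.wait()
        _wakeup.clear()
        with _lock:
            pos, _latest = _latest, None
        if pos is not None:
            heavy_redraw(pos)

threading.Thread(target=worker, daemon=True).start()

The handler would be registered the usual way, e.g. with dpg.add_mouse_move_handler(callback=on_mouse_move) inside a dpg.handler_registry() block.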

v-ein commented 1 month ago

What do you think of introducing handler configuration options to skip redundant events in the callback queue?

It's an interesting idea, though it doesn't fit well with the current DPG API. I'll need to think about it. It can probably be done in Python as well, but DPG itself could surely skip events better (because it can forward-check the queue before running the handler, whereas on the Python side we'd have to run all handlers first, and only then actually perform the task associated with this kind of event).

The handler could probably have some kind of a "skip repetitive events" flag, but we need to clearly define what a repetitive event is. Also, I wonder whether the mouse handler is the only handler where we can implement such filtering.

v-ein commented 1 month ago

BTW, it can also be done in Python with manual callback management, but #2208 will need to be fixed first.

axeldavy commented 1 month ago

By releasing the GIL in dpg.mutex(), and wrapping all DPG calls in dpg.mutex(), I was able to avoid any deadlocks despite having several threads submitting DPG commands. This can serve as a simple workaround for anyone affected.
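
In practice the workaround looks like the sketch below (it assumes a DPG build where dpg.mutex() releases the GIL before locking, as discussed above; "my-text" is the tag from the repro script earlier in this thread):

import dearpygui.dearpygui as dpg

def update_from_worker(value):
    # Every DPG call made from a non-standard thread goes inside one
    # mutex scope, so it cannot interleave with the renderer mid-call.
    with dpg.mutex():
        dpg.set_value("my-text", value)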

v-ein commented 1 month ago

Exactly, that was my solution too. Most of the time when locking the mutex, DPG also needs to release the GIL before acquiring the mutex.

axeldavy commented 1 month ago

Even with this solution I ended up with hangs. It looks like the more keyboard and mouse handlers I add, the more likely they become... In the end I had to resort to not using the frame callback, only running the default rendering loop, and doing all drawing in callbacks on a single thread.

v-ein commented 1 month ago

I hope to eventually (in a couple of months) push some fixes for the synchronization issues - who knows, it might help. It's not going to be soon though (the fixes are complete in my local repo, but I'll need to revisit the code and also take #2275 into account).