matplotlib / matplotlib

matplotlib: plotting with Python
https://matplotlib.org/stable/
20.04k stars 7.59k forks source link

[Bug]: `Tcl_AsyncDelete: async handler deleted by the wrong thread` despite being on the main thread #27713

Open bluenote10 opened 8 months ago

bluenote10 commented 8 months ago

Bug summary

I recently started running into the infamous Tcl_AsyncDelete: async handler deleted by the wrong thread crash quite frequently despite using matplotlib strictly from the main thread.

This occurs every time when using a pytorch DataLoader in multi-processing mode in combination with doing some plotting. I'm aware that matplotlib itself is not thread safe. Note however that the multiprocessing usage is internal to pytorch. The usage of matplotlib plotting is purely from the main process/thread, so there shouldn't be such an interference between pytorch and matplotlib leading to crashes.

Also, the usual work-around of enforcing the agg backend is not applicable in my use cases: In some cases I do want to do some interactive plotting during training neural networks, because it offers more possibilities to inspect what a neural model is doing exactly. (And not using the DataLoader in multi-processing mode slows down the training a lot.)

The pytorch version for reproduction is 2.1.2.

Code for reproduction

"""
This example needs pytorch, i.e., `pip install torch==2.1.2`.

The example reflects a minimal version of the use case pattern
that is causing the crashes.
"""

import os
import threading

import matplotlib
import matplotlib.pyplot as plt
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    def __len__(self) -> int:
        return 1000

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        return dict(
            X=torch.zeros(10),
        )

def plot_something(interactive: bool = False) -> None:
    print(f"[plot] pid = {os.getpid()}, thread id = {threading.get_ident()}")

    fig, ax = plt.subplots(1, 1, figsize=(14, 10))
    ax.plot([0.0] * 1000)
    fig.savefig("/tmp/test.png")

    if interactive:
        plt.show()

    plt.close(fig)

print("Matplotlib backend:", matplotlib.get_backend())
print(f"[main] pid = {os.getpid()}, thread id = {threading.get_ident()}")

dataset = DummyDataset()
data_loader = DataLoader(dataset, batch_size=16, num_workers=4)

while True:
    for batch in data_loader:
        X = batch["X"]

    # Most of the time the plot should run headless (plot to file), but under
    # certain conditions the plot should run interactively (using `.show()`).
    some_condition = False
    plot_something(some_condition)

Actual outcome

The snippets runs for a while and eventually crashes. The crashes are non-deterministic in nature, but the snippet crashes for me typically after half a minute or so. The output is (truncated the messages up to the crash):

Matplotlib backend: TkAgg
[main] pid = 64946, thread id = 139642307104768
[plot] pid = 64946, thread id = 139642307104768
[plot] pid = 64946, thread id = 139642307104768
<...>
[plot] pid = 64946, thread id = 139642307104768
[plot] pid = 64946, thread id = 139642307104768
Exception ignored in: <function Image.__del__ at 0x7f0049027250>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 4056, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Image.__del__ at 0x7f0049027250>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 4056, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x7f0048f97ac0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/tkinter/__init__.py", line 388, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Tcl_AsyncDelete: async handler deleted by the wrong thread
Tcl_AsyncDelete: async handler deleted by the wrong thread
Traceback (most recent call last):
  File "/home/fabian/git/repo/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/home/me/git/repo/venv/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 67648) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/me/git/repo/_debug_matplotlib_crash.py", line 40, in <module>
    for batch in data_loader:
  File "/home/me/git/repo/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/me/git/repo/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/home/me/git/repo/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1294, in _get_data
    success, data = self._try_get_data()
  File "/home/me/git/repo/venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 67648) exited unexpectedly

Note that the reported pid / thread id for the plotting code is obviously just the main pid / thread id.

Expected outcome

It should not crash.

Additional information

Has this worked in earlier versions?

I'm not sure, but it may have worked, because I'm using this pattern for a long time, and it only started crashing recently. However due to the non-deterministic nature of the crash it is hard to verify. It may just be that my previous usages of the pattern where not plotting "aggressively" enough.

Operating system

Ubuntu 22.04

Matplotlib Version

3.8.2

Matplotlib Backend

TkAgg

Python version

Python 3.10.12

Jupyter version

none

Installation

pip

tacaswell commented 8 months ago

Although you are only using Matplotlib from the main thread, some of the objects that Matplotlib creates are getting garbage collected on the background threads (note all of the tk related code is in __del__). The weakness of this explanation as to the problem is that it implies any mixing of Tk with threads runs the risk of this issue which seems a very broad swath of buggy programs.

A possible test to of this would be to add

def really_plot(*args, **kwargs):
    import gc
    plot_something(*args, **kwargs)
    gc.collect()

and then call that function in your loop instead.

bluenote10 commented 8 months ago

A possible test to of this would be to add

Indeed, with an explicit garbage collection this pattern seems to work fine, pretty cool, thanks!

Feel free to close the issue if there isn't anything actionable on the matplotlib side. Explicit garbage collection sounds like a reasonable work-around. For instance, I fancy just writing a small @with_garbage_collection decorator that I could wrap around my plotting functions.

A few random ideas what could be done on the matplotlib side (wild guesses):

tacaswell commented 8 months ago

Calling gc.collect() can be very expensive when there are a lot of user-space long lived collection like objects so we can not really call gc.collect(). To ammeliorate that we called gc.collect(1), however that lead to some pathological cases where cyclic gc never ran and we had effectively unbounded memory leakage (https://github.com/matplotlib/matplotlib/issues/23701#issuecomment-1223390883 and https://github.com/matplotlib/matplotlib/pull/23712).

We do have a destroy method on the FigureManager called from https://github.com/matplotlib/matplotlib/blob/ff0497cc042332313afd3d2e9ba8f949c3857896/lib/matplotlib/_pylab_helpers.py#L66 and https://github.com/matplotlib/matplotlib/blob/ff0497cc042332313afd3d2e9ba8f949c3857896/lib/matplotlib/backends/_backend_tk.py#L560-L584 is the implementation for tk. There is a bunch of complexity there to delay tearing things down to avoid other tk issues (but the way it is delayed should still put it on the main thread because it is run via a callback on the event loop).

However, I suspect that there is more we can do in there to manually break any reference cycles. Probably something like https://github.com/matplotlib/mpl-gui/blob/aec794a5f5f639208c6504d05a32bffe433b42a9/mpl_gui/__init__.py#L321-L326 to break the cycle between the canvas and the Figure (unfortunately we have had a hard cycle there ~forever and I am quite sure that making either one a weakref will break some users).

litchipi commented 1 month ago

If I may bring 2 cents here, I was using tkinter for some totally unrelated project to yours, and got the same behaviour (same error, same fix with gc.collect)

The problem for me was that I was calling the quit() function before destroy() function, reversing the order worked like a charm.

Hope that'll help

tacaswell commented 1 month ago

attn @richardsheridan Do we need to audit how we stop the main loop again?

richardsheridan commented 1 month ago

This is really a tkinter problem, releasing resources that need an event loop in the __del__ method. If you've got cycles, and we do, sometimes that can cause the items to be GC'd from a thread. So, if you're mixing tkinter and threads, you can't be stopping and starting the event loop. There is a small amount of wiggle room, which is probably why this is so nondeterministic:

https://github.com/python/cpython/blob/833eb106f5ebbac258f236d50177712d98a36715/Modules/_tkinter.c#L375-L393

If you hit one of these calls in a thread, you have one second to re-activate the event loop or it will error out. I've advocated for removing this grace period as being more harmful than helpful.

I do an always-on GUI and event loop, and have tasks like computation and data loading running in background threads or processes which call back to the event loop, so I never see things like this.

The only way matplotlib could help is by guaranteeing that its tkinter object structure is entirely cleaned up before plt.show() returns in the default case. plt.pause() is right out. Even then I'm sure there will be edge cases.