Open 1ad76835-25ac-4c6a-ae44-e68e5f87e208 opened 3 years ago
I've recently encountered a weird bug. To make a long story short, I’ve been using an external package to make sure my threads stop working in time, and appearently it’s been using ctypes.pythonapi.PyThreadState_SetAsyncExc. There is a slight chance, that this code that is literally injecting an exception in the thread, will throw an exception in the handle function of the logger, right after the acquire. This will make the whole process deadlocked (only the thread with the exception will be able to continue logging) once the other threads try to log.
While I totally understand that this is not a casual use case, I think that the logger should be able to handle situations such as this... The good news is that I’ve created a test (attached) that once you run it with pytest it’ll reproduce constantly (even though it’s a statistical bug), and the solution for this is relatively easy. Once we change the lock to use context manager the whole action is much more atomic, making the test pass constantly.
I’d be happy to help solve it, and replace locks to context manager usage throughout the file (or whatever the maintainers see fit for this).
A few thoughts:
We normally say "all bets are off" when it comes to manipulating Python internals with ctypes. This issue is just one of many ways to wreak havoc.
The programming pattern of killing running threads at arbitrary points in their execution is inherently fragile. I don't think other modules are obliged to try to support it.
The pattern of "acquire(); try: something(); finally release();" is common. Logging is only of many modules in the ecosystem that use this pattern.
ISTM we can't assume that the lock is a context manager because createLock() is a public method and users can do anything with it.
That said, we could add our own __enter and __exit methods to the Handler class.
Lastly, I have a vague memory that at one point there was a rejected proposal to switch logging to use the with-statement throughout. We should research that to see if the reasoning is still applicable or not.
Feel free to submit a PR to use context managers, though I'm not sure it solves the problem in a guaranteed way.
Nick, what is the official word wrt. interruptibility of context managers?
But also, please ask the "external package" that uses PyThreadState_SetAsyncExc to clear up its documentation and mentions the pitfalls.
Ha! Already gave them my analysis, and stopped using that package since. While I certainly think this is not the trivial, basic usage, and that the pattern used for the lock is common (and part of some PEP if I remember correctly), I'm kind of puzzled on that fact that context manager solves it. The test will pass constantly once the change is made (even though it looks like RLock __enter__ is just calling the acquire function), and I overall think this is a good change to do in the Handler class, even it's just for self.lock (can we find and understand why the last proposal was rejected?).
I'll submit a PR in a few days, but was wondering on the usage of the lock, and that it doesn't always exists on self
. Anyway, I assume these are questions that we can continue discussing in the PR if you also think this is a good direction..
I'm kind of puzzled on that fact that context manager solves it.
It "solves" the problem because of the happenstance rather than because of language guarantees.
The current implementation of the with-statement in CPython is to generate a SETUPWITH opcode that calls both \_enter() and PyFrameBlockSetup() within a single opcode while the GIL is held. And because RLlock.\_enter() is implemented in C there are no pure python steps before the block frame setup.
If we added a pure Python __enter and __exit to the Handler class (necessary because a user can override createLock), then pure python steps would occur before the frame block setup and the problem would reappear.
If RLock were implemented in pure python, the problem would also reappear.
If the opcodes were changed so that the call to __enter__() were in a different opcode than setting up the block frame, the problem would reappear.
I'll submit a PR in a few days
Let's wait to hear from Vinay and Nick before deciding to work on a PR. Since you are no longer using the package, you no longer have the problem to solve. The programming practice that gave rise to the problem is inherently fragile. The proposed mitigation only helps one module and doesn't solve the problem in general. As described above, the proposed mitigation is fragile and implementation specific. Lastly, it overrides an old decision to not use context managers in the logging module for reasons I no longer remember.
Hey folks, any update on this issue? Or is there any recommended workaround to interrupt a running thread safely? Thanks.
Not from a logging perspective, for the reasons noted above.
Hi All,
is there any workaround or fix version, I met this also and do not know how to deal with this, for now I just remove log part but I'd like keep that so I can know what happend with my program.
Note that we now use context managers in the locking code, see #109462 so it might be worth trying the test code to see what the current status is with main
branch code.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/vsajip' closed_at = None created_at =
labels = ['type-bug', 'library', '3.10']
title = 'Deadlock in logging'
updated_at =
user = 'https://github.com/DaRoee'
```
bugs.python.org fields:
```python
activity =
actor = 'rhettinger'
assignee = 'vinay.sajip'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'DaRoee'
dependencies = []
files = ['49983']
hgrepos = []
issue_num = 43939
keywords = []
message_count = 6.0
messages = ['391862', '391874', '391886', '391888', '392129', '392158']
nosy_count = 5.0
nosy_names = ['rhettinger', 'vinay.sajip', 'ncoghlan', 'pitrou', 'DaRoee']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue43939'
versions = ['Python 3.10']
```