python / cpython

The Python programming language
https://www.python.org
Other
63.16k stars 30.24k forks source link

Deadlock in logging #88105

Open 1ad76835-25ac-4c6a-ae44-e68e5f87e208 opened 3 years ago

1ad76835-25ac-4c6a-ae44-e68e5f87e208 commented 3 years ago
BPO 43939
Nosy @rhettinger, @vsajip, @ncoghlan, @pitrou, @DaRoee
Files
  • reproduction_test.py: reproduction test, should run with pytest for proper reproduction
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/vsajip' closed_at = None created_at = labels = ['type-bug', 'library', '3.10'] title = 'Deadlock in logging' updated_at = user = 'https://github.com/DaRoee' ``` bugs.python.org fields: ```python activity = actor = 'rhettinger' assignee = 'vinay.sajip' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'DaRoee' dependencies = [] files = ['49983'] hgrepos = [] issue_num = 43939 keywords = [] message_count = 6.0 messages = ['391862', '391874', '391886', '391888', '392129', '392158'] nosy_count = 5.0 nosy_names = ['rhettinger', 'vinay.sajip', 'ncoghlan', 'pitrou', 'DaRoee'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue43939' versions = ['Python 3.10'] ```

    1ad76835-25ac-4c6a-ae44-e68e5f87e208 commented 3 years ago

    I've recently encountered a weird bug. To make a long story short, I’ve been using an external package to make sure my threads stop working in time, and appearently it’s been using ctypes.pythonapi.PyThreadState_SetAsyncExc. There is a slight chance, that this code that is literally injecting an exception in the thread, will throw an exception in the handle function of the logger, right after the acquire. This will make the whole process deadlocked (only the thread with the exception will be able to continue logging) once the other threads try to log.

    While I totally understand that this is not a casual use case, I think that the logger should be able to handle situations such as this... The good news is that I’ve created a test (attached) that once you run it with pytest it’ll reproduce constantly (even though it’s a statistical bug), and the solution for this is relatively easy. Once we change the lock to use context manager the whole action is much more atomic, making the test pass constantly.

    I’d be happy to help solve it, and replace locks to context manager usage throughout the file (or whatever the maintainers see fit for this).

    rhettinger commented 3 years ago

    A few thoughts:

    pitrou commented 3 years ago

    Feel free to submit a PR to use context managers, though I'm not sure it solves the problem in a guaranteed way.

    Nick, what is the official word wrt. interruptibility of context managers?

    pitrou commented 3 years ago

    But also, please ask the "external package" that uses PyThreadState_SetAsyncExc to clear up its documentation and mentions the pitfalls.

    1ad76835-25ac-4c6a-ae44-e68e5f87e208 commented 3 years ago

    Ha! Already gave them my analysis, and stopped using that package since. While I certainly think this is not the trivial, basic usage, and that the pattern used for the lock is common (and part of some PEP if I remember correctly), I'm kind of puzzled on that fact that context manager solves it. The test will pass constantly once the change is made (even though it looks like RLock __enter__ is just calling the acquire function), and I overall think this is a good change to do in the Handler class, even it's just for self.lock (can we find and understand why the last proposal was rejected?).

    I'll submit a PR in a few days, but was wondering on the usage of the lock, and that it doesn't always exists on self. Anyway, I assume these are questions that we can continue discussing in the PR if you also think this is a good direction..

    rhettinger commented 3 years ago

    I'm kind of puzzled on that fact that context manager solves it.

    It "solves" the problem because of the happenstance rather than because of language guarantees.

    The current implementation of the with-statement in CPython is to generate a SETUPWITH opcode that calls both \_enter() and PyFrameBlockSetup() within a single opcode while the GIL is held. And because RLlock.\_enter() is implemented in C there are no pure python steps before the block frame setup.

    If we added a pure Python __enter and __exit to the Handler class (necessary because a user can override createLock), then pure python steps would occur before the frame block setup and the problem would reappear.

    If RLock were implemented in pure python, the problem would also reappear.

    If the opcodes were changed so that the call to __enter__() were in a different opcode than setting up the block frame, the problem would reappear.

    I'll submit a PR in a few days

    Let's wait to hear from Vinay and Nick before deciding to work on a PR. Since you are no longer using the package, you no longer have the problem to solve. The programming practice that gave rise to the problem is inherently fragile. The proposed mitigation only helps one module and doesn't solve the problem in general. As described above, the proposed mitigation is fragile and implementation specific. Lastly, it overrides an old decision to not use context managers in the logging module for reasons I no longer remember.

    youtian001 commented 2 years ago

    Hey folks, any update on this issue? Or is there any recommended workaround to interrupt a running thread safely? Thanks.

    vsajip commented 2 years ago

    Not from a logging perspective, for the reasons noted above.

    Peng-wei-Yu commented 11 months ago

    Hi All,

    is there any workaround or fix version, I met this also and do not know how to deal with this, for now I just remove log part but I'd like keep that so I can know what happend with my program.

    vsajip commented 11 months ago

    Note that we now use context managers in the locking code, see #109462 so it might be worth trying the test code to see what the current status is with main branch code.