I'm splitting a discussion buried in the way-too-long thread of https://bugs.python.org/issue6721 out into its own specific issue, to treat as a 3.7 release blocker for a rollback-or-repair decision before 3.7.4.
https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d
I believe that change, released in 3.7.1, is leading to a behavior regression for an application (the Fedora installer's libreswan kvmrunner?). Full details can be found in the messages of the other issue, starting with: https://bugs.python.org/issue6721#msg329474
TL;DR - logging.Handler instances each have their own threading.RLock. libreswan implements at least one logging.Handler subclass. That subclass's custom emit() implementation directly calls potentially many other sub-handlers' emit() methods. Some of those emit() methods (such as logging.StreamHandler's) call flush(), which acquires the handler's lock.
So they've got a dependency between these two locks: the first must be acquired before the second.
But the logging module APIs have no concept of sub-handlers and lock ordering.
I see many flaws with the libreswan code's design (I'm already ignoring the futility of threading + fork) but this still caused a behavior regression in the stable 3.7 release.
(more comments coming as followups to avoid a wall of text with too many topics)
That custom DebugHandler's emit() implementation that calls into one or more sub-handlers suggests that libreswan _might_ be able to fix it in the custom DebugHandler by implementing custom acquire() and release() methods... BUT that approach is fundamentally flawed: it requires acquiring multiple locks in a single atomic operation, OR guaranteeing that they will always be acquired in a single specific order.
Given the logging module's fork-time code maintains no order, even implementing a custom acquire() and release() method in your custom DebugHandler would not be sufficient, as sub-handler locks could be acquired first during fork. Also, that assumes your sub-handlers are not also used directly elsewhere within all possible coexisting logger configurations.
A _gross workaround_ that libreswan, or anything else in this boat, could implement to satisfy that constraint would be to force all potentially related handlers to share the same single RLock instance. :(
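A minimal sketch of that workaround (the DebugHandler name and fan-out wiring are illustrative stand-ins for the libreswan-style setup, not its actual code):

```python
import logging

class DebugHandler(logging.Handler):
    """A handler that fans records out to several sub-handlers."""

    def __init__(self, sub_handlers):
        super().__init__()
        self.sub_handlers = list(sub_handlers)
        # The workaround: replace each sub-handler's private lock with
        # this handler's own RLock so that only one lock exists and no
        # acquisition-ordering problem is possible at fork time.
        for sub in self.sub_handlers:
            sub.lock = self.lock

    def emit(self, record):
        # A sub-handler's emit()/flush() re-acquires the same shared
        # RLock; RLocks are reentrant, so this cannot deadlock.
        for sub in self.sub_handlers:
            sub.emit(record)
```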
Within the logging module we could go beyond using a WeakSet and maintain an ordering. But we'd need to allow a way for Handler subclasses to indicate what ordering they need; that'd require a new API (not suitable for 3.7).
An ordering itself may be insufficient, as arbitrary Handler code can create cyclic graphs of handlers calling directly into other handlers, such that no possible order could ever satisfy them all.
This pulls us back into "stop having multiple locks" for any code that creates relationships between handlers. Which is good advice for Handler authors and something we should document if it isn't already.
Side note: We should also document that Handler.emit() as an API MUST be called with the Handler's lock acquired via Handler.acquire() first. That is the undocumented contract; logging.Logger.handle()'s code always calls into a handler's emit() that way. The libreswan code calling other handlers' emit() methods without doing that smells wrong. But that has no relation to their current problem: even if it did acquire first, it'd still deadlock, just sooner, rather than within the other handler's emit() -> flush() -> acquire() call chain.
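For reference, a simplified version of what logging.Handler.handle() does around every emit() call:

```python
# Simplified from logging.Handler.handle(record):
if handler.filter(record):
    handler.acquire()       # take this handler's own lock first...
    try:
        handler.emit(record)
    finally:
        handler.release()   # ...and always release it, even on error
```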
Logging library users are better off never directly calling another handler. Put messages onto a queue, with a dedicated thread processing those into the relevant handlers. Only block waiting for those queued items to have been processed _after_ releasing your own lock, if you want to maintain the synchronicity of logging API calls returning only after the message has been handled.
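The stdlib already has the pieces for that pattern; a minimal sketch (the handler choices and file name are placeholders):

```python
import logging
import logging.handlers
import queue

log_queue = queue.SimpleQueue()

# The handlers doing real I/O are driven by one dedicated listener thread.
file_handler = logging.FileHandler('debug.log')
console_handler = logging.StreamHandler()
listener = logging.handlers.QueueListener(
    log_queue, file_handler, console_handler)
listener.start()

# The only handler attached to the logger merely enqueues each record, so
# no handler lock is ever held while another handler's lock is acquired.
logging.getLogger().addHandler(logging.handlers.QueueHandler(log_queue))
```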
Now to back up: Why was this application using fork() in a threaded application at all? That is a fundamental flaw. Should we be doing work to support things that so far merely _happen_ to work on such broken designs?
Another alternative is for the DebugHandler implementation, in whatever code implements it, to de-register itself in its own constructor from the logging library's private WeakSet of handlers whose locks are acquired at fork time. We don't have a public API for this, so the workaround on 3.7.1 - 3.7.3 would look like:
```python
def __init__(self, *args, **kwargs):
    super().__init__(*args, **kwargs)
    if hasattr(logging, '_at_fork_acquire_release_weakset'):
        logging._at_fork_acquire_release_weakset.discard(self)
```
This means it'd still have the bug the code already had on all versions of Python prior to this logging library "acquire the locks before fork" fix: namely, if the forked child tries to log, it could deadlock due to forking while the inherited logging handler lock was held.
A stdlib alternative to this whole mess would be to avoid acquiring the logging locks before fork() as we currently do, and instead just blindly re-initialize all of them afterwards, under the assumption that they "can't" be protecting anything in a newly forked child process. Handlers that need specific resource synchronization around fork would then be required to deal with their own os.register_at_fork() calls (ex: to avoid multiple processes writing to the same file or fd at once).
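For a handler that does need such coordination, the opt-in could look roughly like this (SharedFileHandler is a hypothetical name; os.register_at_fork() is the real 3.7+ API):

```python
import logging
import os

class SharedFileHandler(logging.FileHandler):
    def __init__(self, filename):
        super().__init__(filename)
        # Manage fork ourselves: quiesce writes in the parent around the
        # fork, and give the child a brand-new, unheld lock.  Note that
        # registering bound methods keeps this handler alive forever.
        os.register_at_fork(
            before=self.acquire,
            after_in_parent=self.release,
            after_in_child=self.createLock,
        )
```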
Two separate applications have tripped up on this. For the first, I don't know any further details. The second is libreswan's kvmrunner: kvmrunner uses pexpect; pexpect uses ptyprocess; ptyprocess uses pty.fork() + os.exec*(). But, hey, who knew! I didn't until this week. This seems to be a long-standing concern: https://github.com/pexpect/ptyprocess/issues/43

The libreswan logging code's own comment block explains the intent:
```python
# Implement log-level inversion.
#
# Ref: https://docs.python.org/2/howto/logging.html#logging-flow
#
# By default, a parent (root) logger, regardless of its log-level,
# will log all the records logged by a child.  For instance, if a
# child logger is logging at DEBUG-level, then the parent will log
# (display on the console) those DEBUG-level records even when it has
# been configured to log only INFO-level records.  This is because the
# default log-level ("Logger enabled for level of call?") check is
# only applied once, at record-creation.
#
# This code allows DEBUG-level logging to a file, while simultaneously
# (the inversion) restricting console log records to just INFO-level,
# say.
```
The logging.Handler's lock ensures that the two sub-streams are kept ordered; it prevents threads interleaving their writes.
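(For reference, the stock way to get this kind of DEBUG-to-file / INFO-to-console split is per-handler levels rather than a custom fan-out handler; a minimal sketch with illustrative targets:)

```python
import logging

log = logging.getLogger('demo')
log.setLevel(logging.DEBUG)            # create records at DEBUG and above

file_handler = logging.FileHandler('debug.log')
file_handler.setLevel(logging.DEBUG)   # the file receives everything

console = logging.StreamHandler()
console.setLevel(logging.INFO)         # the console stays at INFO

log.addHandler(file_handler)
log.addHandler(console)
```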
Always first understand the problem.
FYI, below is a simpler, but related, test. _at_fork_acquire_release_weakset doesn't seem to be locked:
```
Ignoring exception from logging atfork <Handler (NOTSET)> release method: cannot release un-acquired lock
Exception ignored in: <function _after_at_fork_weak_calls at 0x7f7307550378>
Traceback (most recent call last):
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 269, in _after_at_fork_weak_calls
    _at_fork_weak_calls('release')
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 254, in _at_fork_weak_calls
    for instance in _at_fork_acquire_release_weakset:
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "./btc.py", line 11, in lockie
    h = logging.Handler()
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 824, in __init__
    self.createLock()
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 847, in createLock
    _register_at_fork_acquire_release(self)
  File "/home/python/v3.7.3/lib/python3.7/logging/__init__.py", line 250, in _register_at_fork_acquire_release
    _at_fork_acquire_release_weakset.add(instance)
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 83, in add
    self._commit_removals()
  File "/home/python/v3.7.3/lib/python3.7/_weakrefset.py", line 56, in _commit_removals
    discard(l.pop())
IndexError: pop from empty list
```
The test script (btc.py):

```python
import logging
import os
import sys
import threading

def lockie():
    while True:
        # this adds the handler to _at_fork_acquire_release_weakset
        #sys.stdout.write(".")
        #sys.stdout.flush()
        h = logging.Handler()

threading.Thread(target=lockie).start()

for n in range(0, 100000):
    if n % 100 == 0:
        sys.stdout.write("%d" % n)
        sys.stdout.flush()
    pid = os.fork()
    if pid != 0:
        os.wait()
    else:
        os._exit(0)
```
Here's a PR that goes the re-initialization route, rather than attempting to acquire/release anything other than the single module-level _lock.
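Roughly sketched, the approach is (this is not the actual patch; the WeakSet name is illustrative):

```python
import os
import weakref

_handlers = weakref.WeakSet()   # every Handler would register itself here

def _reinit_locks_in_child():
    # In the forked child, no thread can legitimately still hold a handler
    # lock, so replace each lock with a fresh, unheld one.
    for handler in _handlers:
        handler.createLock()

os.register_at_fork(after_in_child=_reinit_locks_in_child)
```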
I think the only pragmatic solution here is to add an optional parameter to logging.basicConfig() that specifies that the logger should use a single global lock; and then start documenting that thread locks and fork() don't work well together.
And note that this solution is pragmatic, not correct (@dhr, when we discussed this offline, pointed out that since Python is using threads, a "correct" solution would be to use some sort of inter-process lock).
For instance, if I were to implement emit() as something contrived like:
```
with lock:
    s = record.to-string()
    for i in 1 .. s'length:
        b[i] = s[i]
    for i in 1 .. b'length:
        stdout.write(b[i]).flush()
    b = []
```
then when fork() breaks 'lock', the guarantee that the code is atomic is also broken. While a global lock would help mitigate the first case, it really isn't a "correct" fix.
The logging library has never guaranteed that it wouldn't interleave or duplicate buffered I/O output when fork() is involved. We should not start trying to claim that it does; that is too complicated and fragile. It can't. Users who want that should implement their own single logging handler that is well behaved in whatever regards they need. There are a ton of options for them.
A basicConfig setting to switch the mode of operation to a single lock instead of per-handler locks could only happen in 3.8 (new feature). I'll ponder whether it is even feasible to add that after the re-init change goes in. Even with such a feature, user-authored Handler subclasses would always be free to violate it. We can't protect people from themselves.
I pointed out two issues with breaking the locks.
Your response addressed the less important issue:
Since the documentation seems to be silent, the guarantee (or expectation) is implied - logging systems emit records atomically - and if this isn't true across fork() then the exception should be documented.
but it didn't address the more fundamental problem:
i.e., breaking the lock will allow the child to access data in an undefined state. This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?).
For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang.
""" Since the documentation seems to be silent the guarantee (or expectation) is implied - logging systems emit records atomically - if this isn't true across fork() then the exception should be documented. """
We can add a note to the docs. As a general rule: documentation being silent means there are zero guarantees. Behavior of the existing implementations over time is the only source of truth.
""" but it didn't address the more fundamental problem:
i.e., breaking the lock will allow the child to access data in an undefined state. This will result in either core dumps or hangs (presumably the motivation for the original change was to prevent this?). """
Nope. The contrived emit() pseudocode from msg339650 never defined b in the first place. That code, if massaged into Python syntax, would raise an UnboundLocalError no matter who executed it; a fork() from another thread would not change that. There is no concept of undefined/uninitialized state from the Python perspective.
The motivation for the original fix was entirely deadlock avoidance in code that uses fork. It failed, introducing an alternate form of deadlock in code that had been lucky up until that point. Thus my new PR, which proposes to fix the regression by making that not fail while still preventing locks from starting out held in the child process.
Wow! It is pseudo-code, right? (I figured using Ada-like attributes was a bit of a giveaway.) Nitpicking it achieves nothing. (If you feel that your language needs 'b' pre-defined then, hey!, pre-define it; I'm pretty sure that won't fix the problem of operations such as b[i] = 2 not being atomic, let alone thread-safe, in Python.)
Let's instead focus on my point that breaking the locks won't fix the problem. For instance, above. Or: "For instance, if the fork() happens while a thread is holding the <stdio.h> FILE lock on a log file, then the child trying to access that FILE will hang."
BTW, a non-visible change might be to use a global readers-writer lock, where fork() is the writer.
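There is no readers-writer lock in the stdlib, but the idea would look roughly like this (a sketch only; it has no writer priority, so a steady stream of logging calls could starve fork()):

```python
import threading

class ForkRWLock:
    """Logging calls are the 'readers'; fork() is the sole 'writer'."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):              # wrap each logging call
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):             # os.register_at_fork(before=...)
        # Holding the condition's lock blocks new readers, except while
        # wait() temporarily releases it, hence the starvation caveat.
        self._cond.acquire()
        while self._readers:
            self._cond.wait()            # wait out in-flight readers

    def release_write(self):             # ...(after_in_parent=...)
        self._cond.release()
```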
The problem I am fixing is merely making Python's logging library not the source of this deadlock, because it was not the source in the past. I am solving a regression in CPython behavior between 3.7.0 and 3.7.1 that led to a logging.Handler lock-related deadlock in two identified problematic applications.
Breaking the logging.Handler-owned locks does solve that, unless someone has concrete proof from these specific applications otherwise.
It would be helpful if someone *willing to actually be constructive* would test them with my PR's patch applied to their 3.7.3 interpreter as confirmation.
This deadlock is a release blocker for 3.8.0. FTR, I don't consider it blocking alphas and betas.
New changeset 64aa6d2000665efb1a2eccae176df9520bf5f5e6 by Gregory P. Smith in branch 'master': bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) https://github.com/python/cpython/commit/64aa6d2000665efb1a2eccae176df9520bf5f5e6
New changeset 3b4b28efbde63502709bede7c5f9403ec6f37428 by Gregory P. Smith in branch '3.7': [3.7] bpo-36533: Reinit logging.Handler locks on fork(). (GH-12704) (GH-13170) https://github.com/python/cpython/commit/3b4b28efbde63502709bede7c5f9403ec6f37428
The regression should be fixed. It'd be helpful if owners of applications that were running into this could test their applications with this specific change.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/gpshead'
labels = ['3.7', '3.8', 'type-bug', 'library', 'release-blocker']
title = 'logging regression with threading + fork are mixed in 3.7.1rc2 (deadlock potential)'
user = 'https://github.com/gpshead'
```
bugs.python.org fields:
```python
actor = 'kevans'
assignee = 'gregory.p.smith'
closed = True
closer = 'gregory.p.smith'
components = ['Library (Lib)']
creator = 'gregory.p.smith'
dependencies = []
files = []
hgrepos = []
issue_num = 36533
keywords = ['patch', '3.7regression']
message_count = 19.0
messages = ['339472', '339474', '339475', '339480', '339481', '339503', '339505', '339529', '339650', '339688', '339760', '339825', '339863', '339864', '339882', '341543', '341768', '341815', '341816']
nosy_count = 9.0
nosy_names = ['rhettinger', 'gregory.p.smith', 'ned.deily', 'lukasz.langa', 'cagney', 'kevans', 'hugh', 'ebizzinfotech', 'koubaa']
pr_nums = ['12704', '13170', '21986', '22205', '22651']
priority = 'release blocker'
resolution = 'fixed'
stage = 'commit review'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue36533'
versions = ['Python 3.7', 'Python 3.8']
```