
Signal tripped flags need memory barriers #75302

Open · njsmith opened 7 years ago

njsmith commented 7 years ago
BPO 31119 (migrated from bugs.python.org; type: behavior; affects Python 3.7; status: open)
Nosy: @arigo, @pitrou, @vstinner, @njsmith

njsmith commented 7 years ago

Sometimes, CPython's signal handler (signalmodule.c:signal_handler) runs in a different thread from the main thread. On Unix this is rare but it does happen (esp. for SIGCHLD), and on Windows it happens 100% of the time. It turns out that there is a subtle race condition that means such signals can be lost.

To notify the main thread that a signal has been received, and which signal it is, CPython's signal handler sets two global variables:

  Handlers[sig_num].tripped = 1;
  is_tripped = 1;

And then PyErr_CheckSignals does:

  if (is_tripped) {
      is_tripped = 0;
      for (i = 1; i < NSIG; i++) {
          if (Handlers[i].tripped) {
              ... run signal handler ...
          }
      }
  }

As some comments in the source note, this logic depends strongly on the assumption that when the signal handler sets the two flags, they are immediately visible to PyErr_CheckSignals, and that the two assignments become visible in the given order. For example, suppose that is_tripped became visible before Handlers[i].tripped. Then we could have the following sequence:

  1. Thread A sets is_tripped = 1
  2. Thread B runs PyErr_CheckSignals
  3. Thread B notices is_tripped == 1, and sets it to 0
  4. Thread B scans through all the Handlers, sees that none are set, and carries on
  5. Thread A sets Handlers[i].tripped = 1

In this case the signal is lost until another signal arrives, which may not happen until an arbitrary amount of time has passed.

Also, the fix for bpo-30038 (making sure that we set the flags before writing to the wakeup fd) assumes that setting the flags before writing to the wakeup fd means that, well, the flags will be written before the fd.

However, when you have code running in separate threads, none of this is actually guaranteed. In general, the compiler is free to re-order writes, and more perniciously, the hardware memory hierarchy is also free to arbitrarily delay when writes to RAM made by one CPU become visible to another CPU, which can also create arbitrary reorderings.

In an attempt to avoid these issues, signalmodule.c declares the signal flags as 'volatile sig_atomic_t'. However, the best one can hope for from this is that it'll stop the *compiler* from reordering writes. It does nothing to stop the *hardware* from reordering writes. Using volatile like this might be sufficient if the signal handler runs in the main thread, but not if it runs in another thread.

The correct way to handle this is to use the primitives in pyatomic.h to get/set the tripped flags.
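For illustration, here is a minimal sketch of that idea using C11 <stdatomic.h> directly rather than pyatomic.h; all the names here (sig_tripped, any_tripped, check_signals, MAX_SIG) are made up for the example, not CPython's actual code. The release store in the handler pairs with the acquire exchange in the consumer, so the consumer can never observe the summary flag without also observing the per-signal flag:

```c
#include <signal.h>
#include <stdatomic.h>

#define MAX_SIG 64  /* illustrative bound standing in for NSIG */

/* atomic_int is lock-free on mainstream platforms, which is what makes
   it safe to touch from a signal handler. */
static atomic_int sig_tripped[MAX_SIG];
static atomic_int any_tripped;

/* Signal handler: set the per-signal flag first, then publish it with a
   release store. A consumer that sees any_tripped == 1 is then
   guaranteed to also see sig_tripped[sig_num] == 1. */
static void handler(int sig_num)
{
    atomic_store_explicit(&sig_tripped[sig_num], 1, memory_order_relaxed);
    atomic_store_explicit(&any_tripped, 1, memory_order_release);
}

/* The PyErr_CheckSignals role: clear the summary flag with an acquire
   exchange *before* scanning, so every per-signal store made before the
   matching release store is visible during the scan. A handler that
   fires mid-scan re-sets any_tripped, so at worst there is one spurious
   rescan later -- the signal is never lost. */
static void check_signals(void)
{
    if (atomic_exchange_explicit(&any_tripped, 0, memory_order_acquire)) {
        for (int i = 1; i < MAX_SIG; i++) {
            if (atomic_exchange_explicit(&sig_tripped[i], 0,
                                         memory_order_relaxed))
            {
                /* ... run the registered handler for signal i ... */
            }
        }
    }
}
```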

----

I noticed this because of the same test case that was failing due to bpo-30038... this thing has been frustrating me for months because it's intermittent and never seems to happen on my local VM. But specifically what I'm seeing at this point is:

Thread A (the thread where the signal handler runs):

A1. takes a lock.
A2. with the lock held, calls the C function raise(SIGINT), which directly calls the CPython C-level signal handler. (Verified by consulting the Windows C runtime source.)
A3. writes to the wakeup fd & sets both tripped flags, in whatever order.
A4. releases the lock.

Simultaneously, in thread B (the main thread):

B1. observes that the wakeup fd has been written to.
B2. takes the lock.
B3. releases the lock.
B4. calls repr(), which unconditionally calls PyObject_Repr, which unconditionally calls PyErr_CheckSignals.

I expect the call in B4 to run the Python-level signal handler, but this does not happen. Instead it runs some time later, causing an unexpected KeyboardInterrupt.

My reasoning is: the lock guarantees that no matter what the internal details of the C-level signal handler actually are, they have to all be complete before my code checks for signals. In excruciating detail: everything the signal handler does in A3 is complete before the lock release in A4; the lock acquisition in B2 cannot succeed until A4 has happened; and B4 runs after B2. So by the time PyErr_CheckSignals is called in B4, both tripped flags should be set and the Python-level handler should run.

Yet this does not happen, so *something* really weird is going on. The hypothesis that it's the lack of memory barriers explains both how this could happen -- specifically, I think thread B may be running PyErr_CheckSignals as part of the wakeup-fd reading logic, and it's clearing is_tripped before Handlers[i].tripped becomes visible, like in my sequence above -- and why it happens in like 1/50 runs on AppVeyor, but never on my local Windows 10 VM that only has 1 virtual CPU.

But regardless of whether this is the cause of my particular weird bug, it clearly *is* a bug and should be fixed.

njsmith commented 7 years ago

On further investigation (= a few hours staring at the ceiling last night), it looks like there's another explanation for my particular bug... which is good, because on further investigation (= a few hours squinting at google results) it looks like this probably can't happen on x86/x86-64 (I think? maybe?).

It's still true though that you can't just throw in a 'volatile' and expect cross-thread synchronization to work -- that's not C's semantics, and it's not true on popular architectures like ARM.
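To make that concrete, here is a minimal message-passing sketch (illustrative, not from CPython) showing why a volatile flag alone is not enough on weakly ordered hardware:

```c
/* 'volatile' constrains the compiler, not the CPU. On a weakly ordered
   architecture such as ARM, thread B below can observe ready == 1 and
   still read a stale data == 0, because the two stores in thread A may
   become visible to other cores in either order. */
#include <signal.h>

static volatile sig_atomic_t ready = 0;
static int data = 0;

static void thread_a(void)
{
    data = 42;   /* plain store */
    ready = 1;   /* volatile store: the compiler keeps program order,
                    but the hardware may still reorder the two stores
                    as seen from another core */
}

static int thread_b(void)
{
    while (!ready)
        ;             /* spin until the flag appears set */
    return data;      /* may still be 0 on ARM; an acquire load paired
                         with a release store (or explicit barriers)
                         would rule that out */
}
```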

arigo commented 7 years ago

For reference, no, it can't happen on x86 or x86-64. I've found that this simplified model actually works for reasoning about ordering at the hardware level: think about each core as a cache-less machine that always *reads* from the central RAM, but that has a delay for writes -- i.e. writes issued by a core are queued internally, and actually sent to the central RAM at some unspecified later time, but in order. In that model, thread A's store to Handlers[i].tripped reaches RAM before its store to is_tripped, so thread B can never observe is_tripped == 1 while Handlers[i].tripped is still 0.

(Of course that model fails on other machines like ARM.)

njsmith commented 7 years ago

@arigo: Technically we also need that the writes to memory are observed to happen-before the write() call on the wakeup fd, which is not something that Intel is going to make any guarantees about. But *probably* this is also safe because the kernel has to use some kind of cross-CPU synchronization to wake up the read()ing thread, and that imposes a memory barrier.
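Sketched out, the ordering being relied on looks something like this (again illustrative, reusing the names from the stdatomic sketch above; CPython's actual signal_handler and wakeup-fd plumbing differ):

```c
#include <stdatomic.h>
#include <unistd.h>

extern atomic_int sig_tripped[];  /* per-signal flags, as sketched above */
extern atomic_int any_tripped;

static void handler_with_wakeup(int sig_num, int wakeup_fd)
{
    /* 1. Set the flags first, publishing them with a release store. */
    atomic_store_explicit(&sig_tripped[sig_num], 1, memory_order_relaxed);
    atomic_store_explicit(&any_tripped, 1, memory_order_release);

    /* 2. Only then poke the wakeup fd. write() itself is not documented
       as a memory barrier, but waking the read()ing thread involves
       kernel-level cross-CPU synchronization, which in practice orders
       the stores above before read() returns on the other side. */
    unsigned char byte = (unsigned char)sig_num;
    (void)write(wakeup_fd, &byte, 1);
}
```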

pitrou commented 7 years ago

I've recently merged https://github.com/python/cpython/pull/2417 which might address this. If you know of a way of reproducing the (theoretical) issue outlined here, it would be nice to know if it's fixed on git master.

vstinner commented 7 years ago

Since it's described as a bug, should the change be backported to 3.6? I don't think that 2.7 has the required tools to backport this change (C99, atomic types).

pitrou commented 7 years ago

Backporting would be reasonable IMHO.