Open njsmith opened 7 years ago
Sometimes, CPython's signal handler (signalmodule.c:signal_handler) runs in a different thread from the main thread. On Unix this is rare but it does happen (esp. for SIGCHLD), and on Windows it happens 100% of the time. It turns out that there is a subtle race condition that means such signals can be lost.
To notify the main thread that a signal has been received, and which signal it is, CPython's signal handler sets two global variables:
Handlers[sig_num].tripped = 1; is_tripped = 1
And then PyErr_CheckSignals does:
if (is_tripped) {
is_tripped = 0;
for (i = 1; i < NSIG; i++) {
if (Handlers[i].is_tripped) {
... run signal handler ...
}
}
}
As some comments in the source note, this logic depends strongly on the assumption that when the signal handler sets the two flags, they are immediately visible to PyErr_CheckSignals, and that the two assignments happen in the given order. For example, suppose that is_tripped were set before Handlers[i].is_tripped. Then we could have the following sequence:
In this case the signal is lost until another signal arrives, which may not happen until an arbitrary amount of time has passed.
Also, the fix for bpo-30038 (making sure that we set the flags before writing to the wakeup fd) assumes that setting the flags before writing to the wakeup fd means that, well, the flags will be written before the fd.
However, when you have code running in separate threads, none of this is actually guaranteed. In general, the compiler is free to re-order writes, and more perniciously, the hardware memory hierarchy is also free to arbitrarily delay when writes to RAM made by one CPU become visible to another CPU, which can also create arbitrary reorderings.
In an attempt to avoid these issues, signalmodule.c declares the signal flags as 'volatile sig_atomic_t'. However, the best one can hope for from this is that it'll stop the *compiler from reordering writes. It does nothing to stop the *hardware from reordering writes. Using volatile like this might be sufficient if the signal handler runs in the main thread, but not if it runs in another thread.
The correct way to handle this is to use the primitives in pyatomic.h to get/set the tripped flags.
----
I noticed this because of the same test case that was failing due to bpo-30038... this thing has been frustrating me for months because it's intermittent and never seems to happen on my local VM. But specifically what I'm seeing at this point is:
Thread A (the thread where the signal handler runs): A1. takes a lock. A2. with the lock held, calls the C function raise(SIGINT), which directly calls the CPython C-level signal handler. (Verified by consulting the Windows C runtime source.) A3. writes to the wakeup fd & sets both tripped flags, in whatever order A4. releases the lock
Simultaneously, in thread B (the main thread): B1. observes that the wakeup fd has been written to B2. takes the lock B3. releases the lock B4. calls repr(), which unconditionally calls PyObject_Repr, which unconditionally calls PyErr_CheckSignals
I expect the call in B4 to run the Python-level signal handler, but this does not happen. Instead it runs some time later, causing an unexpected KeyboardInterrupt.
My reasoning is: the lock guarantees that no matter what the internal details of the C-level signal handler actually are, they have to all be complete before my code checks for signals. In excruciating detail:
Yet this does not happen, so *something* really weird is going on. The hypothesis that it's the lack of memory barriers both explains how this could happen -- specifically I think thread B may be running PyErr_CheckSignals as part of the wakeup-fd reading logic, and it's clearing is_tripped before Handlers[i].tripped becomes visible, like in my code above. And it also explains why it happens in like 1/50 runs in Appveyor, but never on my local Windows 10 VM that only has 1 virtual CPU.
But regardless of whether this is the cause of my particular weird bug, it clearly *is* a bug and should be fixed.
On further investigation (= a few hours staring at the ceiling last night), it looks like there's another explanation for my particular bug... which is good, because on further investigation (= a few hours squinting at google results) it looks like this probably can't happen on x86/x86-64 (I think? maybe?).
It's still true though that you can't just throw in a 'volatile' and expect cross-thread synchronization to work -- that's not C's semantics, and it's not true on popular architectures like ARM.
For reference, no, it can't happen on x86 or x86-64. I've found that this simplified model actually works for reasoning about ordering at the hardware level: think about each core as a cache-less machine that always *reads* from the central RAM, but that has a delay for writes---i.e. writes issued by a core are queued internally, and actually sent to the central RAM at some unspecified later time, but in order.
(Of course that model fails on other machines like ARM.)
@arigo: Technically we also need that the writes to memory are observed to happen-before the write() call on the wakeup fd, which is not something that Intel is going to make any guarantees about. But *probably* this is also safe because the kernel has to use some kind of cross-CPU synchronization to wake up the read()ing thread, and that imposes a memory barrier.
I've recently merged https://github.com/python/cpython/pull/2417 which might address this. If you know of a way of reproducing the (theoretical) issue outlined here, it would be nice to know if it's fixed on git master.
Since it's described as a bug, should the change be backported to 3.6? I don't think that 2.7 has required tools to backport this change (C99, atomic types).
Backporting would be reasonable IMHO.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at = None created_at =
labels = ['type-bug', '3.7']
title = 'Signal tripped flags need memory barriers'
updated_at =
user = 'https://github.com/njsmith'
```
bugs.python.org fields:
```python
activity =
actor = 'pitrou'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation =
creator = 'njs'
dependencies = []
files = []
hgrepos = []
issue_num = 31119
keywords = []
message_count = 7.0
messages = ['299736', '299756', '299760', '299761', '299795', '299796', '299802']
nosy_count = 4.0
nosy_names = ['arigo', 'pitrou', 'vstinner', 'njs']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue31119'
versions = ['Python 3.7']
```