python / cpython

The Python programming language
https://www.python.org
Other
62.41k stars 29.96k forks source link

signals not always delivered to main thread, since other threads have the signal unmasked #46267

Closed a91f06bd-743d-4cae-b831-7277996fa63e closed 12 years ago

a91f06bd-743d-4cae-b831-7277996fa63e commented 16 years ago
BPO 1975
Nosy @birkenfeld, @gpshead, @pitrou, @vstinner, @nh2
Files
  • pthread_sig.diff
  • thread_test.py: Test case for this issue
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['interpreter-core', 'type-bug', 'docs'] title = 'signals not always delivered to main thread, since other threads have the signal unmasked' updated_at = user = 'https://bugs.python.org/bamby' ``` bugs.python.org fields: ```python activity = actor = 'neologix' assignee = 'none' closed = True closed_date = closer = 'neologix' components = ['Documentation', 'Interpreter Core'] creation = creator = 'bamby' dependencies = [] files = ['9333', '20047'] hgrepos = [] issue_num = 1975 keywords = ['patch'] message_count = 43.0 messages = ['61870', '61960', '62038', '62044', '79780', '79783', '79801', '82868', '82878', '82914', '82918', '83107', '91070', '92328', '96377', '96379', '96380', '96386', '96388', '96390', '96391', '96393', '96395', '96396', '96397', '96406', '96415', '96428', '96429', '96433', '96444', '96445', '96446', '96447', '116940', '121863', '121927', '122137', '124007', '149903', '149904', '149909', '152693'] nosy_count = 16.0 nosy_names = ['georg.brandl', 'gregory.p.smith', 'exarkun', 'Rhamphoryncus', 'pitrou', 'vstinner', 'movement', 'flub', 'ross', 'bamby', 'laca', 'duncf', 'mstepnicki', 'nh2', 'neologix', 'BreamoreBoy'] pr_nums = [] priority = 'normal' resolution = 'rejected' stage = 'resolved' status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue1975' versions = ['Python 3.3'] ```

    a91f06bd-743d-4cae-b831-7277996fa63e commented 16 years ago

    Hello,

    This issue is actually a follow up of bpo-960406. The patch applied there brought in a problem at least under FreeBSD. The problem is extremely annoying so I made an effort to uncover the source of it.

    Here is an example code that shows the problem:

        some_time = 6000000 # seconds
    
        class MyThread(Thread):
            def run(self):
                while (True):
                    time.sleep(some_time)
    
        t = MyThread()
        t.start()
        while(True):
            select.select(None, None, None, some_time)

    Start this script, then try to interrupt it by hitting Control-C. If you run this code under Linux you won't see any problem. But under FreeBSD the script won't stop until the timeout in main thread occurs or some activity takes place on descriptors passed to the select().

    My investigation showed that the source of the problem is in that how signals are processed. FreeBSD processes signals in opposite order than Linux. For example suppose we have a program that starts one user thread and allows both main and user threads to receive signals. Under Linux the signal handler always fires up in context of the main thread, but under FreeBSD the signal handler runs in context of the user thread. POSIX doesn't state which behavior is correct so both behaviors should be assumed to be correct and Python should be aware of them both. Before the patch from 960406 the Python made effort to deny signal handling in user threads but the patch dropped this code and all threads are allowed to handle signals.

    Let's return to the script. When running the script under Linux the select() call is the one that gets interrupted by the signal and this allows the script to shutdown quickly. But under FreeBSD the sleep() call is interrupted by the signal leaving the main thread to wait on select() until timeout.

    The description of bpo-960406 states:

    "This is a patch which will correct the issues some people have with python's handling of signal handling in threads. It allows any thread to initially catch the signal mark it as triggered, allowing the main thread to later process it."

    And yes it behaves exactly as described but this behavior is inconsistent between different OSes.

    To make things predictable I've restored the code that ensures that signal handler will run in context of main thread only:

    long PyThread_start_new_thread(void (func)(void *), void \arg) { ... + sigset_t set, oset; ... + sigfillset(&set); + SET_THREAD_SIGMASK(SIG_BLOCK, &set, &oset); pthread_create(...) + SET_THREAD_SIGMASK(SIG_SETMASK, &oset, NULL); ...

    and this works perfectly for me under FreeBSD and Linux at least for my needs. It doesn't bring any visible changes to readline behavior either. I'm using the 2.5.1 version of Python. In attach you can find this patch against the trunk.

    I'm not Python guru but let me try to display my vision of the situation.

    As I understand, my change does nothing for applications written in pure Python and running under Linux (without user modules written in C and using special thread and signal handling). Signals under Linux have absolutely no chance to be caught from within user threads as Python doesn't provide any way to alter the signal mask and with the default signal mask the signals always arrive to the main thread. So explicit prohibition to handle signals from within user thread doesn't change anything. On the other hand this change ensures that under FreeBSD things go exactly like under Linux.

    Of course this change can possibly break some C-written module that relies on signal handling in context of user thread (as the signal mask of the user thread must be modified explicitly now). But anyway this is how things are meant to work in order to be portable. So I'm considering this possibility as highly unlikely.

    I suppose the Mac OS X is affected also as it's based on FreeBSD.

    gvanrossum commented 16 years ago

    Actually I see the same behavior under Linux and OSX: the first ^C interrupts the select() call, after that ^C is ignored.

    a91f06bd-743d-4cae-b831-7277996fa63e commented 16 years ago

    I'm sorry I've forgotten to add one important thing to the script - the t.setDaemon(True) call, as without it the main thread will wait for the user thread to stop explicitly. So the correct script is:

        some_time = 6000000 # seconds
    
        class MyThread(Thread):
            def run(self):
                while (True):
                    time.sleep(some_time)
    
        t = MyThread()
        t.setDaemon(True)
        t.start()
        while(True):
            select.select(None, None, None, some_time)
    gvanrossum commented 16 years ago

    Well okay than I can confirm that OSX is *not* affected by this OS bugginess.

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 15 years ago

    This issue also affects Solaris (and in particular xend is broken). Is there a reason bamby's fix isn't yet applied?

    pitrou commented 15 years ago

    Is there any reason not to simply catch KeyboardInterrupt in the user thread, and then notify the main thread?

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 15 years ago

    Yes, Python guarantees the behaviour under discussion:

    http://docs.python.org/library/signal.html

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 15 years ago

    The readline API just sucks. It's not at all designed to be used simultaneously from multiple threads, so we shouldn't even try. Ban using it in non-main threads, restore the blocking of signals, and go on with our merry lives.

    gvanrossum commented 15 years ago

    Agreed. Multiple threads trying to read interactive input from a keyboard sounds like a bad idea anyway.

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 15 years ago

    Surely readline is irrelevant anyway. The Python spec guarantees behaviour, and that guarantee is currently broken.

    gvanrossum commented 15 years ago

    Hm, I'm not sure why Adam brought up readline. The behavior is certainly guaranteed (I put that guarantee in myself long ago :-) and it should be fixed. I have no opinion about the proposed patch, since I cannot test this and have long lost sufficient understanding of this part of CPython to understand all the ramifications, sorry.

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 15 years ago

    bpo-960406 broke this as part of a fix for readline. I believe that was motivated by fixing ctrl-C in the main thread, but non-main threads were thrown in as a "why not" measure.

    msg 46078 is the mention of this. You can go into readlingsigs7.patch and search for SET_THREAD_SIGMASK.

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 15 years ago

    Any progress on this regression? A patch is available... thanks.

    284fb00b-bb6d-44f0-9212-8a0900514d44 commented 15 years ago

    I have just got bitten by this bug - I usually run my software under Linux/Windows, this was the first time that my customer requested specifically FreeBSD platform and I was *really* surprised. Not to mention the fact that bug in Python came as the last thing to my mind - I was blaming my code of course :-).

    Anyway, I can confirm the patch works for me and I'd like to see it included in future versions. Can I do something to make it happen?

    Regards, Marcin

    pitrou commented 14 years ago

    I'm not sure there's any real issue here. The signal *does* get propagated to the main thread, it only takes some time to do so. If you want the program to be interruptible more quickly, just lower the timeout you give to select().

    8726d1eb-a365-45b6-b81d-c75988975e5a commented 14 years ago

    I don't like the suggestion to lower the timeout to select(). Lots of the rest of the world is pushing towards removing this kind of periodic polling (generally with the goal of improving power consumption). Python should be going this way too (the recent introduction of signal.set_wakeup_fd suggests that at least some Python developers are convinced of this).

    pitrou commented 14 years ago

    I don't like the suggestion to lower the timeout to select(). Lots of the rest of the world is pushing towards removing this kind of periodic polling (generally with the goal of improving power consumption).

    Yes, I'm aware of this. I was only suggesting the easiest solution to the problem at hand :-)

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 14 years ago

    The real, OS signal does not get propagated to the main thread. Only the python-level signal handler runs from the main thread.

    Correctly written programs are supposed to let select block indefinitely. This allows them to have exactly 0 CPU usage, especially important on laptops and other limited power devices.

    pitrou commented 14 years ago

    The real, OS signal does not get propagated to the main thread. Only the python-level signal handler runs from the main thread.

    Well, the signals /are/ delivered as far as Python code is concerned. I don't think Python makes any promise as to the delivery of signals at the C level. (actually, the best promise we may make is not to block signal delivery at all, so that third-party libs or extensions relying on threaded signal delivery don't break)

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 14 years ago

    You forget that the original report is about ctrl-C. Should we abandon support of it for threaded programs? Close as won't-fix?

    We could also just block SIGINT, but why? That means we don't support python signal handlers in threaded programs (signals sent to the process, not ones sent direct to threads), and IMO threads expecting a specific signal should explicitly unblock it anyway.

    pitrou commented 14 years ago

    You forget that the original report is about ctrl-C. Should we abandon support of it for threaded programs?

    We haven't abandoned support, have we? Where is the spec that is currently broken?

    Besides, as Jean-Paul pointed out, the user can now setup a file descriptor on which a byte will be written out as soon as a signal gets caught.

    Close as won't-fix?

    It is one possibility indeed. We could also add an API (or an optional argument to the existing APIs) to block signals in threads created by Python.

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 14 years ago

    The spec broken is here:

    http://docs.python.org/library/signal.html

    Namely:

    # Some care must be taken if both signals and threads are used in the same program. The fundamental thing to remember in using signals and threads simultaneously is: always perform signal() operations in the main thread of execution. Any thread can perform an alarm(), getsignal(), pause(), setitimer() or getitimer(); only the main thread can set a new signal handler, and the main thread will be the only one to receive signals (this is enforced by the Python signal module, even if the underlying thread implementation supports sending signals to individual threads). This means that signals can’t be used as a means of inter-thread communication. Use locks instead.

    pitrou commented 14 years ago

    The spec broken is here:

    http://docs.python.org/library/signal.html

    I would argue it is not broken. This documentation page is about a module of the standard library, it doesn't specify the underlying C implementation. That "the main thread will be the only one to receive signals" is true if you consider it from the Python code's point of view: signal handlers are always called in the main thread, even if the OS-level signal was delivered to (and caught by) another thread.

    I don't have any strong view over whether the interpreter should, theoretically, block signals in non-main threads. But, practically, blocking signals apparently produced issues with readline (and possibly other libs relying on signals), which is why they are not blocked today.

    284fb00b-bb6d-44f0-9212-8a0900514d44 commented 14 years ago

    I don't have any strong view over whether the interpreter should, theoretically, block signals in non-main threads. But, practically, blocking signals apparently produced issues with readline (and possibly other libs relying on signals), which is why they are not blocked today.

    I see your point of view, but the problem is that current behaviour is inconsistent between different operating system. As there are many people who brought up this issue, I think it should be at least documented somewhere.

    Regards, Marcin

    8726d1eb-a365-45b6-b81d-c75988975e5a commented 14 years ago

    > http://docs.python.org/library/signal.html

    I would argue it is not broken.

    If it's not broken, then the docs are at least confusing. They should make clear whether they are talking about the underlying signal or the Python signal handler. This makes a difference for many applications which deal with signals. I would even say that due to the very tricky nature of signals, the documentation *should* be discussing the way it is implemented. Without that information, it's very difficult to handle some situations correctly. This wouldn't necessarily mean that the implementation would have to stay the same, either - just that the implementation be documented for each version (of course, keeping it the same would be preferable, for all the normal reasons).

    pitrou commented 14 years ago

    As I said, a flexible solution would be for thread creation functions to take an optional argument specifying whether to block signals or not. (I don't mind the default value of the argument :-))

    8d8a8db7-faf5-4c09-a2a3-2697dbaf0735 commented 14 years ago

    A better solution would be to block all signals by default, then unblock specific ones you expect. This avoids races (as undeliverable signals are simply deferred.)

    Note that readline is not threadsafe anyway, so it doesn't necessarily need to allow calls from the non-main thread. Maybe somebody is using that way, dunno.

    a91f06bd-743d-4cae-b831-7277996fa63e commented 14 years ago

    Let me add my 2 cents. I understood the considerations about differences between Python code level interrupt handling and OS level interrupts.

    What I cannot get is why to preserve the handling of signals in the user threads on OSes like FreeBSD and Solaris. This functionality isn't used on Linux and Windows at all, as the interrupts on them are always delivered to the main thread. The patch simply assures the same behavior on the FreeBSD and Solaris, so why to keep things unpredictable when there is a way to solve the problem? Can anyone state what exactly purpose of not to make OS signal handling in Python predictable?

    This bug report was created mainly because there is no easy Python code solution for this problem. The Python documentation clearly states that there is no user accessible Python functions that can modify per-thread signal mask, so it is currently impossible to solve the problem with just Python code. Modification of timeouts isn't vital solution in far too many real life situations.

    BTW this patch is officially in the FreeBSD ports tree since Feb 27 2009 and there is no complains on this patch since then.

    pitrou commented 14 years ago

    Well, the history on this looks a bit complicated and I don't really know the details, but witness the first sentences of the initial message in bpo-960406:

    “This is a patch which will correct the issues some people have with python's handling of signal handling in threads. It allows any thread to initially catch the signal mark it as triggered, allowing the main thread to later process it. (This is actually just restoring access to the functionality that was in Python 2.1)”

    Apparently Python has been hesitating between both behaviours.

    The Python documentation clearly states that there is no user accessible Python functions that can modify per-thread signal mask, so it is currently impossible to solve the problem with just Python code.

    Well as I already said we could introduce this missing feature. Ideas and patches welcome.

    a91f06bd-743d-4cae-b831-7277996fa63e commented 14 years ago

    Well as I already said we could introduce this missing feature. Ideas and patches welcome.

    Well, this would be definitely a working solution.

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 14 years ago

    I still do not understand the objection you have to the simple patch which restores old behaviour, works the same across all OSes, and doesn't require new APIs. What is the objection?

    pitrou commented 14 years ago

    The objection is that the "old behaviour" was changed to solve another problem. We don't gain anything by switching back and forth between two different behaviours.

    3b1c5871-4fff-449a-b2cf-9991c772fecb commented 14 years ago

    To quote Andriy in the first comment:

    "It doesn't bring any visible changes to readline behavior either."

    Are you saying this is not the case?

    pitrou commented 14 years ago

    I'm just saying that I don't know, and I don't think an observation from one user is enough since these issues are notoriously platform-specific.

    If you want to revert the change made in bpo-960406, you should IMO demonstrate that somehow this change wasn't needed. But I don't know how this would be better than a more flexible API, except of course that the patch for that API doesn't exist (but wouldn't be difficult to produce by someone motivated).

    83d2e70e-e599-4a04-b820-3814bbdb9bef commented 14 years ago

    Is this still an issue? If yes can a *NIX type person action it. If no can we close it?

    c1cd71cc-34b3-40e2-a35b-9a6cbed4f6ea commented 13 years ago

    I think this is still an issue.

    If I register signal.signal(signal.SIGINT, handler) in the main thread, I get a race between

    Note that I cannot even declare a sigintlock = threading.Lock() and use with sigintlock in the handler - if I print threading.current_thread() and sys.exit() directly afterwards in the locked area, I sometimes get two print outputs which show MainThread and Thread-1 being inside the locked area at the same time!

    Is the synchronization broken or am I doing something wrong?

    Is synchronization over handlers really the way to cope with this? Disabling signals for non-main threads currently seems more sensible to me.

    Can someone please verify the race I get?

    pitrou commented 13 years ago

    nh2, your issue looks slightly different. In any case, can you tell us what your system is, and post a simple script to reproduce the issue?

    c1cd71cc-34b3-40e2-a35b-9a6cbed4f6ea commented 13 years ago

    My problem was actually related to the subprocess.Popen I use inside my threads in combination with signals. As Popen does not spawn a new thread, but a new process, it completely ignores my locks.

    However, it seems impossible to safely unregister signals for subprocesses. I tried using SIG_IGN in preexec_fn for Popen as well as a ignore/Popen/unignore wrapper, but both result in race conditions (for the first: if the signal arrives between fork and exec and for the second if it arrives within the wrapper).

    I think ignoring signals by default for everything but the main thread or process would solve this problem and make parallel programming with events in Python a lot easier.

    Explicit is better than implicit.

    80ecb1b9-0667-42ca-b060-2ce0190f0e4a commented 13 years ago

    This is definitely still an issue.

    With the "pthread_sig" patch attached to this bug, both on FreeBSD and on linux, any processes spawned from a thread seem to have their signals blocked, so they can not be killed.

    Without it, on FreeBSD, the behavior described by bamby is still a problem.

    I've attached a test case that adapts bamby's example code into a test, and shows the "unkillable subprocess" problem I described above.

    On FreeBSD without the patch, test_signal fails, and with the patch test_thr fails.

    On Linux and OS X, without the patch, all tests pass. With the patch, test_thr fails.

    I hope somebody can come up with a better fix.

    80ecb1b9-0667-42ca-b060-2ce0190f0e4a commented 12 years ago

    I've been digging into this quite a bit, and I've been able to dig up a little more info.

    I think I've found a suitable solution, that should resolve all of the issues:

    I will put together a patch, though I would like to see some consensus around this approach before I spend too much (more) time on this.

    Thanks.

    pitrou commented 12 years ago
    1. On FreeBSD, we must assume that every blocking system call, in *every thread*, can be interrupted, and we need to catch EINTR.

    2. On FreeBSD, we cannot block indefinitely in the main thread and expect to handle signals. This means that indefinite selects are not possible if we want to handle signals, and, perhaps more perversely, signal.pause() cannot be reliably used in the main thread.

    Well, I agree it makes matters more complicated, but if FreeBSD decides this behaviour is desireable, why would Python try to work around it? To solve the select() problem you can have the signal handler write on a pipe (using signal.set_wakeup_fd (*)) and the select() call wait on that pipe. This also should allow to emulate signal.pause() (basically a select() with only the signal pipe).

    IMHO, the general idea of Unix signals is a low-level kludge and any attempt to make it sane at the Python level is probably doomed to failure. Other synchronization methods should always be preferred, if possible.

    (*) Linux has signalfd, but we don't expose it yet

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 12 years ago
    1. On FreeBSD, we must assume that every blocking system call, in *every thread*, can be interrupted, and we need to catch EINTR.

    That's true for every Unix. Every blocking syscall can return EINTR, and there are may non restartable syscalls. Writting code which depends on a specific OS behavior, or whether the code is run from the main thread, is broken.

    1. On FreeBSD, we cannot block indefinitely in the main thread and expect to handle signals. This means that indefinite selects are not possible if we want to handle signals, and, perhaps more perversely, signal.pause() cannot be reliably used in the main thread.

    The proper way to do so, in Python as in C, is to use pthread_sigmask (http://docs.python.org/dev/library/signal.html#signal.pthread_sigmask)

    • Block all *asynchronous* signals in user threads.

    What if some code expects to receive a signal from a thread other than the main thread?

    • Unblock all signals after a fork() in a thread, since the thread is now the main thread.

    What if a signal is delivered between fork() and unblock?

    Really, signals and threads don't mix well (kinda like fork() and threads), so I don't think we can expect to fix this in Python. There's no free lunch, every design will chose will break existing code, or won't work as expected on a platform. The best we can do is to keep a sensitive default behavior (i.e. the one chosen by the OS designers), and offer APIs allowing the user to fine-tune the behavior according to its needs.

    We've fixed many bugs pertaining to signals and threads during the last months, and we've gained a couple useful APIs to deal with signals and threads (pthread_sigmask(), sigwait(), etc).

    I'd suggest to close this.

    79528080-9d85-4d18-8a2a-8b1f07640dd7 commented 12 years ago

    Closing (see http://bugs.python.org/msg149904 and http://bugs.python.org/msg149909).