sagemath / cysignals

cysignals: interrupt and signal handling for Cython. Source repository for https://pypi.org/project/cysignals/
GNU Lesser General Public License v3.0
43 stars 23 forks source link

possible bug in verify_exc_value #126

Closed embray closed 3 years ago

embray commented 3 years ago

This is to track an issue I found while working on this ticket. So far I have only been able to reproduce the bug on the branch for that ticket, though I don't think it's directly related to anything else in that ticket (not outside the realm of possibility). Nevertheless I think there is a bug.

From what I can tell so far at a high level this is what's happening. The test where the bug is occurring looks like:

sage: _ = singular._expect.sendline('1+')  # unfinished input ## line 503 ##
sage: try:
    alarm(0.5)
    singular._expect_expr('>')  # interrupt this
except KeyboardInterrupt:
    pass ## line 504 ##
Control-C pressed. Interrupting Singular. Please wait a few seconds...

though it could in principle be any code using alarm(). Then what happens is:

  1. The alarm triggers a SIGINTR which gets handled by cysignals.
  2. We are not in a sig_on() block, so PyErr_SetInterrupt() is called to call the Python interrupt handler.
  3. This correctly sets the AlarmInterrupt exception and also sets cysigs.exc_value to the exception value.
  4. Here I'm not sure, but it seems the signal was received during garbage collection or something, because when the Python interpreter resumes it tries deallocating an Integer, and hits the sig_occurred() check which is performed before putting a possibly broken mpz_t on the free list. So this happens before Python has raised the AlarmInterrupt exception.
  5. I found that the first time this happens, cysigs.exc_value.ob_refcnt == 3.
  6. This results in running the garbage collector again here.
  7. This results in more integers deallocated which again goes into sig_occurred(). This happens a few more times until...
  8. At some point cysigs.exc_value.ob_refcnt is reduced to 1 after the call to collect() (I don't know why the exception is being derefed and that needs to be looked into more). This results in cysigs.exc_value set to NULL.
  9. We return to the other collect() call (remember, we are in nested garbage collect calls; I have no idea if this is even safe to do...)
  10. After the outer collect() call, verify_exc_value again checks the value of cysigs.exc_value.ob_refcnt and segfaults since cysigs.exc_value is NULL.

In summary there are at least two issues here:

  1. Why is the exception being derefed?
  2. verify_exc_value should make sure cysigs.exc_value is not set to NULL after garbage collection, but before checking its ref count.
  3. (Question) Is it even OK to call gc.collect() recursively?
embray commented 3 years ago

On question 3, it seems gc.collect() does prevent recursion using a global flag--if already in the GC it won't re-enter. Which means that in part 4. above it's not during GC (perhaps an int is just being immediately deallocated since its refcount reached 0).

Nevertheless, upon entering that first gc.collect() call, more integers get deallocated, resulting in recursive calls into verify_exc_value(). Eventually (question 1) the exception's refcount goes to 1, and then when returning from the gc.collect() call the segfault occurs.

embray commented 3 years ago

On 4) I have some more clarity I think:

Here I'm not sure, but it seems the signal was received during garbage collection or something, because when the Python interpreter resumes it tries deallocating an Integer, and hits the sig_occurred() check which is performed before putting a possibly broken mpz_t on the free list. So this happens before Python has raised the AlarmInterrupt exception.

That's not what's happening at all. It has nothing to do with garbage collection at this step. In fact the AlarmInterrupt is raised and handled just fine, and the test passes. At this point cysigs.exc_value is still assigned the AlarmInterrupt. On the next test, some integer gets its tp_dealloc called on it, which results in sig_occurred() being called. Here, verify_exc_value() works almost as intended: it correctly determines that this exception is no longer being handled, and should be cleared from cysignals.

The rest of what happens is pretty much as I described: calling gc.collect() results in a recursive call into sig_occurred(), which results ultimately in this segfault.

embray commented 3 years ago

At least naïvely, this seems to be all that's needed:

diff --git a/src/cysignals/signals.pyx b/src/cysignals/signals.pyx
index ee43c7f..2c658c3 100644
--- a/src/cysignals/signals.pyx
+++ b/src/cysignals/signals.pyx
@@ -372,6 +372,9 @@ cdef void verify_exc_value():
         # is not functional anymore.
         pass

-    if cysigs.exc_value.ob_refcnt == 1:
+    # Make sure we still have cysigs.exc_value at all; if this function was
+    # called again during garbage collection it might have already been set
+    # to NULL; see https://github.com/sagemath/cysignals/issues/126
+    if cysigs.exc_value != NULL and cysigs.exc_value.ob_refcnt == 1:
         Py_XDECREF(cysigs.exc_value)
         cysigs.exc_value = NULL

Now that I've better understood when and why this bug is being triggered, I think verify_exc_value() is otherwise working as intended, and just not taking care of the possibility that it will be called again from within gc.collect().