Closed embray closed 3 years ago
On question 3, it seems gc.collect() does prevent recursion using a global flag--if already in the GC it won't re-enter. Which means that in part 4. above it's not during GC (perhaps an int is just being immediately deallocated since its refcount reached 0).
Nevertheless, upon entering that first gc.collect()
call, more integers get deallocated, resulting in recursive calls into verify_exc_value()
. Eventually (question 1) the exception's refcount goes to 1, and then when returning from the gc.collect()
call the segfault occurs.
On 4) I have some more clarity I think:
Here I'm not sure, but it seems the signal was received during garbage collection or something, because when the Python interpreter resumes it tries deallocating an Integer, and hits the sig_occurred() check which is performed before putting a possibly broken mpz_t on the free list. So this happens before Python has raised the AlarmInterrupt exception.
That's not what's happening at all. It has nothing to do with garbage collection at this step. In fact the AlarmInterrupt
is raised and handled just fine, and the test passes. At this point cysigs.exc_value
is still assigned the AlarmInterrupt
. On the next test, some integer gets its tp_dealloc
called on it, which results in sig_occurred()
being called. Here, verify_exc_value()
works almost as intended: it correctly determines that this exception is no longer being handled, and should be cleared from cysignals.
The rest of what happens is pretty much as I described: calling gc.collect()
results in a recursive call into sig_occurred()
, which results ultimately in this segfault.
At least naïvely, this seems to be all that's needed:
diff --git a/src/cysignals/signals.pyx b/src/cysignals/signals.pyx
index ee43c7f..2c658c3 100644
--- a/src/cysignals/signals.pyx
+++ b/src/cysignals/signals.pyx
@@ -372,6 +372,9 @@ cdef void verify_exc_value():
# is not functional anymore.
pass
- if cysigs.exc_value.ob_refcnt == 1:
+ # Make sure we still have cysigs.exc_value at all; if this function was
+ # called again during garbage collection it might have already been set
+ # to NULL; see https://github.com/sagemath/cysignals/issues/126
+ if cysigs.exc_value != NULL and cysigs.exc_value.ob_refcnt == 1:
Py_XDECREF(cysigs.exc_value)
cysigs.exc_value = NULL
Now that I've better understood when and why this bug is being triggered, I think verify_exc_value()
is otherwise working as intended, and just not taking care of the possibility that it will be called again from within gc.collect()
.
This is to track an issue I found while working on this ticket. So far I have only been able to reproduce the bug on the branch for that ticket, though I don't think it's directly related to anything else in that ticket (not outside the realm of possibility). Nevertheless I think there is a bug.
From what I can tell so far at a high level this is what's happening. The test where the bug is occurring looks like:
though it could in principle be any code using
alarm()
. Then what happens is:SIGINTR
which gets handled by cysignals.sig_on()
block, soPyErr_SetInterrupt()
is called to call the Python interrupt handler.AlarmInterrupt
exception and also setscysigs.exc_value
to the exception value.sig_occurred()
check which is performed before putting a possibly brokenmpz_t
on the free list. So this happens before Python has raised theAlarmInterrupt
exception.cysigs.exc_value.ob_refcnt == 3
.sig_occurred()
. This happens a few more times until...cysigs.exc_value.ob_refcnt
is reduced to 1 after the call tocollect()
(I don't know why the exception is being derefed and that needs to be looked into more). This results incysigs.exc_value
set toNULL
.collect()
call (remember, we are in nested garbage collect calls; I have no idea if this is even safe to do...)collect()
call,verify_exc_value
again checks the value ofcysigs.exc_value.ob_refcnt
and segfaults sincecysigs.exc_value
isNULL
.In summary there are at least two issues here:
verify_exc_value
should make surecysigs.exc_value
is not set toNULL
after garbage collection, but before checking its ref count.gc.collect()
recursively?