jdranczewski closed this pull request 8 months ago.
@ccordoba12 did you want to look at this as well?
Yes, thanks for the ping @blink1073! I'll run our test suite against this PR and report back any issues in a couple of days.
@blink1073, just so you know, @jdranczewski is helping us to try to solve an additional issue in Spyder (referenced above), so this could take a bit longer than expected.
Understood, thank you both!
Hi @blink1073, while investigating https://github.com/spyder-ide/spyder/issues/21299 with @ccordoba12 I found that it's not caused by https://github.com/ipython/ipykernel/issues/1202 but likely a slightly different issue with how ipykernel schedules the next eventloop run.
`schedule_next` in `kernelbase.py` puts `advance_eventloop` on an `io_loop.call_later`, which means it will be called soon, but usually after the `msg_queue` is dealt with. I've found that if the ZMQ socket event for a `dispatch_shell` arrives before we enter the eventloop, there is sometimes a race condition between the `advance_eventloop` call and `process_one` running the call to `dispatch_shell`.
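As an illustration of the intended ordering (a minimal asyncio sketch, not ipykernel's actual code): a zero-delay `call_later` callback, like the one `schedule_next` uses for `advance_eventloop`, normally runs only after work that is already on the loop's ready queue:

```python
import asyncio

async def main():
    order = []
    loop = asyncio.get_running_loop()

    # Mimics schedule_next putting advance_eventloop on the
    # io_loop via call_later: "soon", but after already-queued work.
    loop.call_later(0, lambda: order.append("advance_eventloop"))

    # Mimics a dispatch that is already scheduled on the loop's
    # ready queue (the common, race-free case).
    loop.call_soon(lambda: order.append("dispatch_shell"))

    await asyncio.sleep(0.01)  # give both callbacks a chance to run
    return order

print(asyncio.run(main()))
```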
In a situation like this, `schedule_dispatch` is called and the `dispatch_shell` call is put on the `msg_queue`. Normally `advance_eventloop` is supposed to guard against entering the eventloop if there are any unprocessed messages: https://github.com/ipython/ipykernel/blob/v6.29.2/ipykernel/kernelbase.py#L480-L484

Unfortunately, since `process_one` has already registered a getter on the queue, the event is immediately consumed, the queue size is reported as zero, and the eventloop is entered before `dispatch_shell` can fire. This leaves the kernel apparently hanging: it has entered the eventloop and won't process the shell request until the next ZMQ event is received.
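The misleading queue size can be demonstrated with a plain `asyncio.Queue` (an illustrative stand-in, not ipykernel code): once the consumer's `get()` has taken the item off the queue, `qsize()` reports zero even though the item has not actually been processed yet:

```python
import asyncio

async def main():
    q = asyncio.Queue()
    log = []

    async def process_one_stand_in():
        # Takes the message off the queue immediately...
        item = await q.get()
        # ...then yields before actually handling it, like
        # process_one awaiting the dispatch_shell coroutine.
        await asyncio.sleep(0)
        log.append(f"processed {item}")

    task = asyncio.create_task(process_one_stand_in())
    await asyncio.sleep(0)          # let the consumer block on get()

    q.put_nowait("shell request")   # the ZMQ event arrives
    await asyncio.sleep(0)          # the registered getter pops it

    # The guard in advance_eventloop sees an empty queue even
    # though the request has not been processed yet.
    log.append(f"qsize at guard time: {q.qsize()}")
    if q.qsize() == 0:
        log.append("entering eventloop too early")

    await task
    return log

print(asyncio.run(main()))
```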
I have not been able to reliably reproduce this in Jupyter, so this race condition may be fairly rare, but it can be reproduced in Spyder by spamming the run button until the request arrives at just the right time to trigger this situation, and for some users it apparently arises naturally.

A fix I could suggest is putting the call to advance the eventloop on the message queue itself, so it is guaranteed to run only after any pending dispatch events are processed: https://github.com/jdranczewski/ipykernel/commit/30da44ab4d355ce82750743ef3cd89bed7912f21

What do you think of doing it this way? Do you think it should be a separate PR (as it solves a different problem than `stop_on_error_timeout` not working right), or would it fall under the umbrella of this one as 'eventloop scheduling'?
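The idea behind the suggested fix can be sketched with the same queue stand-in: if the "advance the eventloop" action is enqueued on the same FIFO queue as the dispatches, instead of going through `call_later`, it can only run after every pending dispatch has been handled:

```python
import asyncio

async def main():
    q = asyncio.Queue()
    order = []

    # A pending shell dispatch is already on the queue...
    q.put_nowait("dispatch_shell")
    # ...so enqueueing the eventloop advance *behind* it
    # guarantees it runs only once the dispatch is processed.
    q.put_nowait("advance_eventloop")
    q.put_nowait(None)  # sentinel to stop the consumer

    # Single consumer draining the queue in FIFO order,
    # mimicking the kernel's dispatch loop.
    while True:
        item = await q.get()
        if item is None:
            break
        order.append(item)
    return order

print(asyncio.run(main()))
```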
I'd say if it is easier for you to combine them in this PR, that is fine.
Ok, in that case I think it will be easier to just continue with this PR.
@ccordoba12 I've pushed the fix to this branch if you would like to run the Spyder test suite against it.
I'm not sure why the linting test started failing, I have not touched the line it's now failing at...
The lint failure is unrelated, from a typings change in ipython.
Thanks for the update @jdranczewski! I opened https://github.com/spyder-ide/spyder/pull/21834 to test your work on our side.
@blink1073, good news! Our tests are passing without issues, so this should be ready to be merged.
Also, we'd appreciate it if you could release a new ipykernel version with this fix so we can depend on it in our next Spyder version. Thanks!
Excellent! I'll make a bug release tomorrow.
Fixes #1202 by introducing a timed exit to the Tk (on Linux) and Qt (on Windows and Linux) eventloops, so that `schedule_stop_aborting` can fire after `stop_on_error_timeout` in the kernel's main `io_loop`.

This only affects the eventloops specified, as the others periodically hand control back to the kernel, but Tk (on Linux) and Qt normally wait for a ZMQ socket event to go back to the kernel. The `stop_on_error_timeout` does not fire a socket event, so the timer does not expire and future events are aborted when they should not be.

The proposed solution introduces an optional `_schedule_exit` method to eventloops that need it, and makes sure it is called if an eventloop exit is needed for the `stop_on_error_timeout` to fire.
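A rough sketch of the timed-exit idea, using asyncio as a stand-in for the blocking GUI eventloop (the names in comments refer to the real ipykernel pieces; the code itself is only illustrative): if the blocking loop is guaranteed to return control after a delay, a timer scheduled on the kernel's `io_loop` can finally fire:

```python
import asyncio

async def main():
    events = []
    loop = asyncio.get_running_loop()

    # Stand-in for the io_loop timer that should run
    # schedule_stop_aborting once stop_on_error_timeout elapses.
    loop.call_later(0.01, lambda: events.append("stop_aborting fired"))

    # A GUI eventloop that only returns on ZMQ socket events would
    # block here indefinitely with no traffic; a timed exit in the
    # spirit of _schedule_exit hands control back to the io_loop
    # so the timer above gets a chance to run.
    await asyncio.sleep(0.05)  # stand-in for the scheduled exit
    events.append("eventloop exited")
    return events

print(asyncio.run(main()))
```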