jdranczewski closed this pull request 8 months ago.
@ccordoba12 did you want to look at this as well?
Yes, thanks for the ping @blink1073! I'll run our test suite against this PR and report back any issues in a couple of days.
@blink1073, just so you know, @jdranczewski is helping us to try to solve an additional issue in Spyder (referenced above), so this could take a bit longer than expected.
Understood, thank you both!
Hi @blink1073, while investigating https://github.com/spyder-ide/spyder/issues/21299 with @ccordoba12 I found that it's not caused by https://github.com/ipython/ipykernel/issues/1202 but likely a slightly different issue with how ipykernel schedules the next eventloop run.
`schedule_next` in `kernelbase.py` puts `advance_eventloop` on an `io_loop.call_later`, which means it will be called soon, but usually after the `msg_queue` is dealt with. I've found that if the ZMQ socket event for a `dispatch_shell` arrives before we enter the eventloop, there is sometimes a race condition between the `advance_eventloop` call and `process_one` running the call to `dispatch_shell`.
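As an illustration of the intended ordering (a minimal asyncio sketch, not ipykernel's actual code): a zero-delay `call_later` callback, like the one `schedule_next` uses for `advance_eventloop`, normally runs only after work that is already on the loop's ready queue:

```python
import asyncio

async def main():
    order = []
    loop = asyncio.get_running_loop()

    # Mimics schedule_next putting advance_eventloop on the
    # io_loop via call_later: "soon", but after already-queued work.
    loop.call_later(0, lambda: order.append("advance_eventloop"))

    # Mimics a dispatch that is already scheduled on the loop's
    # ready queue (the common, race-free case).
    loop.call_soon(lambda: order.append("dispatch_shell"))

    await asyncio.sleep(0.01)  # give both callbacks a chance to run
    return order

print(asyncio.run(main()))
```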
In a situation like this, `schedule_dispatch` is called and the `dispatch_shell` call is put on the `msg_queue`. Normally `advance_eventloop` is supposed to guard against entering the eventloop if there are any unprocessed messages: https://github.com/ipython/ipykernel/blob/v6.29.2/ipykernel/kernelbase.py#L480-L484

Unfortunately, since `process_one` has already registered a getter on the queue, the event is immediately consumed, the queue size is reported as zero, and the eventloop is entered before `dispatch_shell` can fire. This leaves the kernel apparently hanging: it has entered the eventloop and won't process the shell request until the next ZMQ event is received.
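The misleading queue size can be demonstrated with a plain `asyncio.Queue` (an illustrative stand-in, not ipykernel code): once the consumer's `get()` has taken the item off the queue, `qsize()` reports zero even though the item has not actually been processed yet:

```python
import asyncio

async def main():
    q = asyncio.Queue()
    log = []

    async def process_one_stand_in():
        # Takes the message off the queue immediately...
        item = await q.get()
        # ...then yields before actually handling it, like
        # process_one awaiting the dispatch_shell coroutine.
        await asyncio.sleep(0)
        log.append(f"processed {item}")

    task = asyncio.create_task(process_one_stand_in())
    await asyncio.sleep(0)          # let the consumer block on get()

    q.put_nowait("shell request")   # the ZMQ event arrives
    await asyncio.sleep(0)          # the registered getter pops it

    # The guard in advance_eventloop sees an empty queue even
    # though the request has not been processed yet.
    log.append(f"qsize at guard time: {q.qsize()}")
    if q.qsize() == 0:
        log.append("entering eventloop too early")

    await task
    return log

print(asyncio.run(main()))
```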
I have not been able to reliably reproduce this in Jupyter, so this race condition may be fairly rare, but it can be reproduced in Spyder by spamming the run button until the request arrives at just the right time to trigger this situation, and for some users it apparently arises naturally.

A fix I could suggest is putting the call to advance the eventloop on the message queue itself, so it is guaranteed to run only after any pending dispatch events are processed: https://github.com/jdranczewski/ipykernel/commit/30da44ab4d355ce82750743ef3cd89bed7912f21

What do you think of doing it this way? Do you think it should be a separate PR (as it solves a different problem than `stop_on_error_timeout` not working right), or would it fall under the umbrella of this one as 'eventloop scheduling'?
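The idea behind the suggested fix can be sketched with the same queue stand-in: if the "advance the eventloop" action is enqueued on the same FIFO queue as the dispatches, instead of going through `call_later`, it can only run after every pending dispatch has been handled:

```python
import asyncio

async def main():
    q = asyncio.Queue()
    order = []

    # A pending shell dispatch is already on the queue...
    q.put_nowait("dispatch_shell")
    # ...so enqueueing the eventloop advance *behind* it
    # guarantees it runs only once the dispatch is processed.
    q.put_nowait("advance_eventloop")
    q.put_nowait(None)  # sentinel to stop the consumer

    # Single consumer draining the queue in FIFO order,
    # mimicking the kernel's dispatch loop.
    while True:
        item = await q.get()
        if item is None:
            break
        order.append(item)
    return order

print(asyncio.run(main()))
```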
I'd say if it is easier for you to combine them in this PR, that is fine.
Ok, in that case I think it will be easier to just continue with this PR.
@ccordoba12 I've pushed the fix to this branch if you would like to run the Spyder test suite against it.
I'm not sure why the linting test started failing, I have not touched the line it's now failing at...
The lint failure is unrelated, from a typings change in ipython.
Thanks for the update @jdranczewski! I opened https://github.com/spyder-ide/spyder/pull/21834 to test your work on our side.
@blink1073, good news! Our tests are passing without issues, so this should be ready to be merged.
Also, we'd appreciate it if you could release a new ipykernel version with this fix so we can depend on it in our next Spyder version. Thanks!
Excellent! I'll make a bug release tomorrow.
Fixes #1202 by introducing a timed exit to the Tk (on Linux) and Qt (on Windows and Linux) eventloops, so that `schedule_stop_aborting` can fire after `stop_on_error_timeout` in the kernel's main `io_loop`.

This only affects the eventloops specified, as the others periodically hand control back to the kernel, but Tk (on Linux) and Qt normally wait for a ZMQ socket event to go back to the kernel. The `stop_on_error_timeout` does not fire a socket event, so the timer does not expire and future events are aborted when they should not be.

The proposed solution introduces an optional `_schedule_exit` method to eventloops that need it, and makes sure it is called if an eventloop exit is needed for the `stop_on_error_timeout` to fire.
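A rough sketch of the timed-exit idea, using asyncio as a stand-in for the blocking GUI eventloop (the names in comments refer to the real ipykernel pieces; the code itself is only illustrative): if the blocking loop is guaranteed to return control after a delay, a timer scheduled on the kernel's `io_loop` can finally fire:

```python
import asyncio

async def main():
    events = []
    loop = asyncio.get_running_loop()

    # Stand-in for the io_loop timer that should run
    # schedule_stop_aborting once stop_on_error_timeout elapses.
    loop.call_later(0.01, lambda: events.append("stop_aborting fired"))

    # A GUI eventloop that only returns on ZMQ socket events would
    # block here indefinitely with no traffic; a timed exit in the
    # spirit of _schedule_exit hands control back to the io_loop
    # so the timer above gets a chance to run.
    await asyncio.sleep(0.05)  # stand-in for the scheduled exit
    events.append("eventloop exited")
    return events

print(asyncio.run(main()))
```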