lumberbot-app[bot] opened 1 year ago

@blink1073 commented:
Hi @BrenBarn,
The logic to handle aborts is in ipykernel here. And the Qt loop is defined here.
I'm personally not familiar with the Qt machinery.
@ccordoba12 @tacaswell @haperilio do you have any ideas what might be happening here?
I think this requires a minimal code example to manually check what's happening.
Here is some sample code showing the problem. I am running this in a conda environment where I have conda-installed matplotlib and ipykernel and "jupyter_client<7". (I am using 6.x Jupyter Client because of the issue described here with async in JC 7.x. But if the issue is with ipykernel then this shouldn't really matter.)
import jupyter_client as jc


class JCTest:
    def __init__(self):
        self.manager = jc.KernelManager()
        self.manager.start_kernel()
        self.client = self.manager.blocking_client()
        self.client.start_channels()
        info = self.client.kernel_info(reply=True)
        self.client.wait_for_ready()
        print(info['content']['banner'])

    def shutdown(self):
        self.manager.shutdown_kernel()

    def execution_loop(self, code):
        print(f"Executing: {code}")
        msgid = self.client.execute(code)
        done = False
        # loop while waiting for results
        while not done:
            # get any messages on io channel
            io_msgs = self.client.iopub_channel.get_msgs()
            #shell = self.kernel.shell_channel.get_msgs()
            # this is where we would update the GUI
            for msg in io_msgs:
                if msg['parent_header'].get('msg_id') != msgid:
                    print(">>> Received stale IO message")
                    print(msg)
                    print("<<<")
                    continue
                print(">>> Received IO message reply")
                print(msg)
                print("<<<")
                if msg['msg_type'] == 'status' and msg['content']['execution_state'] == 'idle':
                    print("\nWe think the kernel is done!\n")
                    done = True
                    break
            # go through shell messages in case we got some kind of magic reply
            msgs = self.client.shell_channel.get_msgs()
            for msg in msgs:
                if msg['parent_header'].get('msg_id') != msgid:
                    print(">>> Received shell message")
                    print(msg)
                    print("<<<")
                    continue
                if msg['msg_type'] == 'execute_reply':
                    print(">>> Received execute reply")
                    print(msg)
                    print("<<<")
                else:
                    print(">>> Received some other shell reply")
                    print(msg)
                    print("<<<")
        print("\nDone executing\n")

    def test_normal(self):
        self.execution_loop("print('This is a test')\n2+2")
        self.execution_loop("print('This causes an error')\n1/0")
        self.execution_loop("print('This should work again')\n2+3")

    def test_problem(self):
        self.execution_loop("print('This is a test')\n2+2")
        self.execution_loop("%matplotlib tk")
        self.execution_loop("print('This error should not affect the next execution request')\n1/0")
        self.execution_loop("print('This is never seen')")


def main():
    tester = JCTest()
    #print("Normal:")
    #tester.test_normal()
    print("\n*** Problem\n")
    tester.test_problem()
    tester.shutdown()


if __name__ == "__main__":
    main()
This code just submits some execution requests and shows the messages it gets back in response. The problem can be seen at the end of the output. After %matplotlib tk is run, the next line raises an exception and the error message is correctly returned in the client reply. But then, when a new execution request is sent to print the "This is never seen" message, it aborts.

This is a blocking client, and the code only sends a single execution request at a time, looping until it receives the corresponding execution-state-idle reply before submitting another request. As I understand it, this should mean that by the time that reply is received, the kernel is done with that execution request and should be as ready for a new request as it was before the exception-raising code was run. What I don't understand is why it then aborts a later execution request that was sent separately, after the first one was (supposed to be) resolved. It is behaving as if I had queued up multiple execution requests, but that's not what the code does.

The problem is somehow related to activating the matplotlib GUI backend. I commented out the line that runs the "normal" version, but if you uncomment it you will see that, in that case, after the exception-raising code is executed, the next print proceeds normally and returns output. (The problem also occurs with %matplotlib qt; I didn't try other backends.)
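(Side note: the aborted case is easy to tell apart from an ordinary error by the status field in the execute_reply content. Here is a small hypothetical helper, not part of the sample above, that labels a shell reply the way I interpret it:)

def classify_execute_reply(msg):
    # Hypothetical helper, not part of the sample above: label an
    # execute_reply by its content['status'] field.
    status = msg['content'].get('status')
    if status == 'aborted':
        # The kernel skipped this request entirely; the code never ran.
        return 'aborted (request was never executed)'
    if status == 'error':
        return 'error ({})'.format(msg['content'].get('ename'))
    return 'ok'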
@ccordoba12 Any thoughts?
Sorry, I missed this one. I'll take a look at it next week.
Sorry for the delay. You said
But then, when a new execution request is sent to print the "This is never seen" message, it aborts.
How does it abort? Does execution_loop never end?
No, an abort message is sent that looks like this:
{'header': {'msg_id': '3292d5aa-94a7ff5fa27337f927c9c7a4_27',
            'msg_type': 'execute_reply',
            'username': 'brenbarn',
            'session': '3292d5aa-94a7ff5fa27337f927c9c7a4',
            'date': datetime.datetime(2023, 2, 13, 21, 4, 6, 440994, tzinfo=tzutc()),
            'version': '5.3'},
 'msg_id': '3292d5aa-94a7ff5fa27337f927c9c7a4_27',
 'msg_type': 'execute_reply',
 'parent_header': {'msg_id': 'aea1c6c4-c7171258182582782de0eb75_6',
                   'msg_type': 'execute_request',
                   'username': 'brenbarn',
                   'session': 'aea1c6c4-c7171258182582782de0eb75',
                   'date': datetime.datetime(2023, 2, 13, 21, 4, 6, 387759, tzinfo=tzutc()),
                   'version': '5.3'},
 'metadata': {'started': '2023-02-13T21:04:06.440986Z',
              'dependencies_met': True,
              'engine': '5ea9973d-9fb3-4688-b624-15534fad930f',
              'status': 'aborted'},
 'content': {'status': 'aborted'},
 'buffers': []}
And I guess any additional execution request aborts in the same way as well?
No, a subsequent execution request will succeed normally. The exception seems to swallow just one execute request.
I don't really understand the workings of the kernel and the event loop (which is why I raised the issue), but I attempted to trace the logic as best I could. I see that in kernelbase.py there is a method _abort_queues which appears to be called when an execution request results in an error in the kernel. Based on the comments there, this method sets an _aborting flag and then schedules another "stop aborting" call to clear that flag; while the flag is set, all requests are aborted. Apparently this is done to skip pending execution requests on error (e.g., if the user ran multiple cells in a notebook). The idea, I guess, is that the kernel goes into an "abort everything until I tell you to stop" mode and adds an "okay, stop aborting" message to the end of the queue, so that any pending requests are aborted and, once they have all been handled, it hits the stop-aborting one and goes back to a ready state.

This code does a flush on the shell stream, and I see that the code in the event loops also does one. My hunch is that there is some kind of misfire here that causes the "stop aborting" call to be skipped (or left in the queue and not yet reached), so that the kernel still thinks it is in "abort mode" when it shouldn't be. Perhaps the GUI event loop code needs to watch for a stop-aborting situation and remember to stop aborting in that case? But I could be totally wrong in my understanding of what's going on.
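To make my reading of that mechanism concrete, here is a simplified sketch of the flow I have in mind. Only the names _aborting and _abort_queues come from kernelbase.py; the class and the call_later scheduling are purely illustrative, not the actual ipykernel implementation:

import asyncio


class KernelSketch:
    # Illustrative sketch of the abort-flag behaviour described above;
    # not the real ipykernel code.

    def __init__(self):
        self._aborting = False

    async def execute_request(self, code):
        if self._aborting:
            # Anything that arrives while the flag is set is skipped.
            return {'status': 'aborted'}
        try:
            exec(code)  # stand-in for real execution
            return {'status': 'ok'}
        except Exception:
            # On error, start aborting and defer clearing the flag.
            self._abort_queues()
            return {'status': 'error'}

    def _abort_queues(self):
        self._aborting = True
        # The real method also flushes the shell stream; the key point is
        # that the flag is cleared by a deferred callback. If a GUI event
        # loop delays that callback, a request sent in the meantime still
        # sees _aborting == True and is aborted.
        asyncio.get_running_loop().call_later(0.05, self._stop_aborting)

    def _stop_aborting(self):
        self._aborting = False

In those terms, my question is whether the Tk/Qt integration can delay the stop-aborting step until after my next execute request has already arrived.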
What version of Matplotlib? If you remove the qt bindings does it make tk work as expected?
Originally opened as jupyter/jupyter_client#904 by @BrenBarn, migration requested by @blink1073:

I have an app that uses jupyter_client to interact with an ipykernel. In that app I have an execution loop that sends an execute request and then waits for a reply on the iopub channel that has "execution_state": "idle". I assume that when this message is received the kernel is done executing and ready to execute something else, so I never send a second execute request until I get the idle-state reply.

This works until I use the %matplotlib magic to load a GUI backend (e.g., %matplotlib qt). After that, things begin to behave strangely. If I send code to execute and that code raises an exception, I get the error reply normally, but whatever code I send next is aborted without ever being executed. I can only assume this is due to some interaction with the GUI event loop that the matplotlib magic creates, but I can't figure out exactly where the problem lies. It seems that the "stop aborting" event is somehow not processed until after the next execution request I send (even if the app just sits idle between the exception-raising code and my next execution request). In other words, things seem to happen out of order: I send a request, it raises an exception, the kernel "aborts the queue" (even though there are no other requests in the queue), I then send a new request, and the kernel aborts that one (even though it wasn't in the queue when the kernel decided to abort).

I'm not sure if this is a problem with ipykernel or with jupyter_client, but I'm posting here to see if there's something else I should be doing on the jupyter_client end. In particular, I want to know what message I need to wait for after submitting an execution request that tells me "the kernel is done processing; the next execution request you send will be processed and will not be aborted because of some lingering earlier error." I thought that was the idle-state message, but apparently not, because when I receive it the kernel might still be in a state where it will abort whatever I send next (because it thinks it was all part of one execution queue). Right now I have fixed it by inserting a wait_for_ready call after each execution, as sketched below, but I'm not sure whether that is overkill or could have other unexpected effects.