jupyter / nb2kg

Other
73 stars 31 forks source link

NB2KG cannot cancel future websocket creation resulting in dangling websocket connection #14

Closed charlieeeeeee closed 5 years ago

charlieeeeeee commented 6 years ago

This issue can only be reproduced using scripts to invoke Jupyter REST API and using websocket for job submission. And immediately exit the script after job is submitted (Client side websocket closes)

Steps to reproduce: 1) Run script i) Create notebook session using /sessions/ ii) Open websocket to /channels/ and send a frame containing code for execution 2) Script exits immediately

The issue occurs intermittently, and the reason is as follows, -First, the script(client) creates a websocket and submits the job to Jupyter notebook -Second, NB2KG receives the on_open signal and tries to create a websocket for communication -Third, the script(client) exits after the job is submitted -Lastly, NB2KG (at client websocket close time) did not yet finish creating the websocket. It will run the following code.

def _disconnect(self):
        if self.ws is not None:
            # Close connection
            self.ws.close()
        elif not self.ws_future.done():
            # Cancel pending connection
            self.ws_future.cancel()

Since self.ws is not yet created. It will attempt to cancel the future task for creating the self.ws in the first place.

However, the future.cancel() is not supported by Tornado.

As noted in their source code,

def cancel(self):
        """Cancel the operation, if possible.

        Tornado ``Futures`` do not support cancellation, so this method always
        returns False.
        """
        return False

Therefore, this websocket creation task will eventually carry-out. As a result, Jupyter notebook will have dangling websocket connections leftover. One potential issue is, culling will be broken. because culling will not kick in with CULL_CONNECTED=FALSE

I will append some logs afterwards.

kevin-bates commented 6 years ago

Thanks @charlieeeeeee.

I've spent the better part of two days looking into this and have arrived at the conclusion that your test driver is not a practical use case. Although its a fire and forget application, in the context of kernels, the application should ensure that the kernel has received the job submission before closing the connection. Otherwise, the connection is closed between the client and Notebook before it can be established between the Notebook and Gateway Server. At a minimum, what should happen is that the job be submitted, then the code go into a short receive messages loop looking for the message type of execute-input with a state of busy. Once that is encountered (or the idle state is encountered), then the application can disconnect.

Here's some sample code:

        ws.send_frame(frame)
        self.get_response(ws, expected_msg_id)

    def get_response(self, ws, expected_msg_id):
        self.logger.info("Receiving response...")
        msg_state = ""
        cmd_received = False
        while not cmd_received:  # msg_id != orig_msg_id and not is_busy:
            response_message = json_decode(utf8(ws.recv()))
            msg_id = response_message['parent_header']['msg_id']
            msg_type = response_message['msg_type']
            if msg_type == 'status':
                msg_state = response_message['content']['execution_state']

            self.logger.debug("response id: {}, type: {}, state: {}, expected id: {}".
                              format(msg_id, msg_type, msg_state, expected_msg_id))
            cmd_received = Client.command_received(expected_msg_id, msg_id, msg_type, msg_state)

    @staticmethod
    def command_received(expected_msg_id, msg_id, msg_type, msg_state):
        received = False
        if msg_id == expected_msg_id:
            if msg_type == "execute_input" and msg_state == "busy":
                received = True
            if msg_type == "status" and msg_state == "idle":
                received = True
        return received

As noted in #15, I will be delivering a PR that performs appropriate error handling, but there really isn't anything that can be done if the application exits immediately after sending the payload.

Please note that another option is to simply add a delay of 1-2 seconds following the submission - which provides ample time for connection establishment prior to tear down.

kevin-bates commented 5 years ago

Closing issue - please reopen if the pull request was not sufficient.