Closed charlieeeeeee closed 5 years ago
Thanks @charlieeeeeee.
I've spent the better part of two days looking into this and have arrived at the conclusion that your test driver is not a practical use case. Although its a fire and forget application, in the context of kernels, the application should ensure that the kernel has received the job submission before closing the connection. Otherwise, the connection is closed between the client and Notebook before it can be established between the Notebook and Gateway Server. At a minimum, what should happen is that the job be submitted, then the code go into a short receive messages loop looking for the message type of execute-input
with a state of busy
. Once that is encountered (or the idle
state is encountered), then the application can disconnect.
Here's some sample code:
ws.send_frame(frame)
self.get_response(ws, expected_msg_id)
def get_response(self, ws, expected_msg_id):
self.logger.info("Receiving response...")
msg_state = ""
cmd_received = False
while not cmd_received: # msg_id != orig_msg_id and not is_busy:
response_message = json_decode(utf8(ws.recv()))
msg_id = response_message['parent_header']['msg_id']
msg_type = response_message['msg_type']
if msg_type == 'status':
msg_state = response_message['content']['execution_state']
self.logger.debug("response id: {}, type: {}, state: {}, expected id: {}".
format(msg_id, msg_type, msg_state, expected_msg_id))
cmd_received = Client.command_received(expected_msg_id, msg_id, msg_type, msg_state)
@staticmethod
def command_received(expected_msg_id, msg_id, msg_type, msg_state):
received = False
if msg_id == expected_msg_id:
if msg_type == "execute_input" and msg_state == "busy":
received = True
if msg_type == "status" and msg_state == "idle":
received = True
return received
As noted in #15, I will be delivering a PR that performs appropriate error handling, but there really isn't anything that can be done if the application exits immediately after sending the payload.
Please note that another option is to simply add a delay of 1-2 seconds following the submission - which provides ample time for connection establishment prior to tear down.
Closing issue - please reopen if the pull request was not sufficient.
This issue can only be reproduced using scripts to invoke Jupyter REST API and using websocket for job submission. And immediately exit the script after job is submitted (Client side websocket closes)
Steps to reproduce: 1) Run script i) Create notebook session using /sessions/ ii) Open websocket to /channels/ and send a frame containing code for execution 2) Script exits immediately
The issue occurs intermittently, and the reason is as follows, -First, the script(client) creates a websocket and submits the job to Jupyter notebook -Second, NB2KG receives the on_open signal and tries to create a websocket for communication -Third, the script(client) exits after the job is submitted -Lastly, NB2KG (at client websocket close time) did not yet finish creating the websocket. It will run the following code.
Since self.ws is not yet created. It will attempt to cancel the
future
task for creating the self.ws in the first place.However, the
future.cancel()
is not supported by Tornado.As noted in their source code,
Therefore, this websocket creation task will eventually carry-out. As a result, Jupyter notebook will have dangling websocket connections leftover. One potential issue is, culling will be broken. because culling will not kick in with CULL_CONNECTED=FALSE
I will append some logs afterwards.