jupyter / nb2kg

Attempt to re-establish websocket connection to KG #42

Closed · esevan closed this 5 years ago

esevan commented 5 years ago

When nb2kg lost its connection to KG, it did not reconnect to KG even though the websocket connection from the client was still alive.

This change re-establishes the connection to KG to prevent the anomaly described above.
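For context, here is a minimal sketch of the idea, not the actual patch: keep reading from the KG-side websocket and, when read_message() returns None, open a new connection instead of leaving the channel dead. kg_ws_url and forward_to_client are illustrative placeholders for nb2kg internals.

from tornado import gen
from tornado.websocket import websocket_connect

@gen.coroutine
def proxy_kernel_channel(kg_ws_url, forward_to_client):
    # Sketch only: the general shape of the reconnect behaviour, using
    # placeholder names rather than nb2kg's real classes.
    while True:
        conn = yield websocket_connect(kg_ws_url)
        while True:
            msg = yield conn.read_message()
            if msg is None:
                # The KG side closed the connection; break out and
                # reconnect instead of leaving the channel dead while
                # the browser-facing websocket stays open.
                break
            forward_to_client(msg)
        yield gen.sleep(1)  # brief back-off before reconnecting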

Signed-off-by: Eunsoo Park esevan.park@gmail.com

esevan commented 5 years ago

This may be related to #39

Sorry, I have no idea how to write an automated test for this, so I am attaching a manual test result instead.

Test Env

  - Cluster: Kubernetes
  - Gateway: EG on Kubernetes
  - KG_REQUEST_TIMEOUT = KG_CONNECT_TIMEOUT = 10

Test Scenario

  1. Connect a Python 3 kernel.
  2. Run the following code in a notebook cell:

    from time import sleep
    a = 0
    for i in range(100):
        a = i
        print(a)
        sleep(1)

  3. Shut down EG while the above code is running.

Test Result

The session is recovered in around 1 minute, and the kernel session resumes writing to the output area after recovery (see the screenshot: the output jumps from 34 to 95).

[screenshot]

meeseeksmachine commented 5 years ago

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/why-does-nb2kg-connect-to-the-kernel-via-eg-rather-than-directly/1619/18

IMAM9AIS commented 5 years ago

@esevan @kevin-bates I am sorry I could not get back on this quickly. We have been trying to debug various points in nb2kg's code as well as on EG's end. Although we have not been able to figure out why the connection is lost after some time, we have realised a few things:

  1. When the connection is lost between NB2KG and EG, no on_connection_close() event is triggered by the close (probably no close event is generated at all), so the state of the connection remains unknown to NB2KG.

https://github.com/tornadoweb/tornado/blob/c92b883f806d99b35af52464980722c9c9a36a13/tornado/iostream.py#L536

This check still reports the stream as writable, so a write is attempted, but since the stream is no longer available the write never goes through.

This happens on the first shell message you try to execute after the connection has been lost.

  2. Luckily, after this happens the state of the stream gets updated, and the next time you try to execute a cell (triggering a shell message), the code finally hits a StreamClosedError, which is caught here:

https://github.com/jupyter/nb2kg/blob/ddf6b7c3d119445f2bb4a03b8d8ea5a26a876bdc/nb2kg/handlers.py#L247

But since the exception is handled silently, nothing happens and the client is still unaware of the loss (see the sketch below).
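To make that failure mode concrete, here is an illustrative reduction of the pattern (placeholder names, not nb2kg's actual code): the write fails once the stream is gone, the exception is caught, and the browser-side websocket is never told.

from tornado import gen
from tornado.iostream import StreamClosedError
from tornado.websocket import WebSocketClosedError

@gen.coroutine
def forward_shell_message(kg_connection, message):
    # Illustrative pattern only: swallowing the closed-stream error keeps
    # the proxy running, but nothing signals the browser-side websocket
    # that the KG connection is gone.
    try:
        yield kg_connection.write_message(message)
    except (StreamClosedError, WebSocketClosedError):
        pass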

IMAM9AIS commented 5 years ago

@esevan I will try the changes you have made, but I am not sure that the on_close() event here: https://github.com/jupyter/nb2kg/blob/ddf6b7c3d119445f2bb4a03b8d8ea5a26a876bdc/nb2kg/handlers.py#L249 will be called at all.

Also, from the changes you have made (as far as I can understand), even if the on_close() event is generated, it still won't reconnect the sockets but will only log that it should be done, which might not be user-friendly.

IMAM9AIS commented 5 years ago

Rather than doing this, I have another suggestion:

What if, while creating the connection between NB2KG and JEG, we also start a ping mechanism in the IOLoop that pings, say, every minute (see the sketch at the end of this comment)? In the scenario where we somehow lose the connection, the first ping will obviously do nothing by itself, but it will help the websocket realise that the stream is closed.

The second ping will raise an exception which will propagate back here:

https://github.com/jupyter/nb2kg/blob/ddf6b7c3d119445f2bb4a03b8d8ea5a26a876bdc/nb2kg/handlers.py#L247

But rather than handling this exception, if we re-raise it, it actually triggers a websocket close event back to JupyterLab.

If we look at JupyterLab's websocket close handling here:

https://github.com/jupyterlab/jupyterlab/blob/c3d6801290fd5cdb527d1a9a198f97c19f890a1f/packages/services/src/kernel/default.ts#L1269

Rather than immediately closing the websocket, it attempts to reconnect with increasing timeouts, which will restore the connection right away if the network state allows it.
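As a sketch of this ping idea (assuming Tornado 5.1 or later, where websocket_connect() accepts ping_interval and ping_timeout; kg_ws_url is a placeholder), the keep-alive could be delegated to Tornado itself rather than hand-rolled in the IOLoop:

from tornado import gen
from tornado.websocket import websocket_connect

@gen.coroutine
def connect_with_keepalive(kg_ws_url):
    # Tornado pings the peer every 60 seconds; if no pong arrives within
    # 30 seconds it closes the connection, so read_message() returns None
    # and the caller can surface the failure (or reconnect) instead of
    # sitting on a silently broken stream.
    conn = yield websocket_connect(kg_ws_url, ping_interval=60, ping_timeout=30)
    raise gen.Return(conn)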

esevan commented 5 years ago

@kevin-bates Thanks for your review, Kevin!

@IMAM9AIS The websocket connection-closed event is also observed by read_message: it returns None when the connection between nb2kg and EG is closed. See https://www.tornadoweb.org/en/stable/websocket.html#client-side-support for details:

The connection supports two styles of operation. In the coroutine style, the application typically calls read_message in a loop:

conn = yield websocket_connect(url)
while True:
    msg = yield conn.read_message()
    if msg is None: break
    # Do something with msg

In both styles, a message of None indicates that the connection has been closed.

One more thing, for what it's worth: the on_close callback is a server-side websocket callback, i.e. the callback for the browser-nb2kg websocket connection.
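To illustrate the distinction (class and function names here are placeholders, not nb2kg's actual code): on_close() belongs to the server-side handler for the browser connection, while the client connection to KG only signals closure through read_message() returning None.

from tornado import gen
from tornado.websocket import WebSocketHandler, websocket_connect

class BrowserFacingHandler(WebSocketHandler):
    """Server side (browser <-> nb2kg): Tornado calls on_close() here."""

    def on_close(self):
        # Fires when the *browser* closes its websocket, not when the
        # nb2kg <-> KG client connection drops.
        pass

@gen.coroutine
def watch_kg_connection(kg_ws_url):
    """Client side (nb2kg <-> KG): closure shows up as read_message() returning None."""
    conn = yield websocket_connect(kg_ws_url)
    while True:
        msg = yield conn.read_message()
        if msg is None:
            break  # the KG-side connection was closed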

IMAM9AIS commented 5 years ago

@esevan Thanks a lot for clearing this up. I will try out the changes and let you know.