jupyter / nb2kg

Jupyter Lab Stuck on executing code with JEG--NB2KG setup. #39

Closed IMAM9AIS closed 5 years ago

IMAM9AIS commented 5 years ago

Hi,

Brief background:-

We are running JupyterLab against Jupyter Enterprise Gateway, with NB2KG overriding the required classes. We noticed that the spawned kernels were closing their websockets after a timeout, so we opened a pull request (https://github.com/jupyter/enterprise_gateway/pull/698) to keep the kernels from closing the websocket connection, and it seems to work fine (at least the server-side logs reflect this).

Problem:-

We launch a kernel remotely, and when there are intermittent client disconnections the websocket connection stays alive, which is good. But in cases where we close the laptop for, let's say, 45 minutes and then come back and execute any cell, the execution gets stuck showing the * symbol.

I tried debugging under many conditions. It seems that the websocket message from the client is conveyed properly to nb2kg, which then tries to send it to the Kernel Gateway here: https://github.com/jupyter/nb2kg/blob/ddf6b7c3d119445f2bb4a03b8d8ea5a26a876bdc/nb2kg/handlers.py#L230

But somewhere the final call in the websocket library gets stuck indefinitely (probably an internal stream has been closed, or something similar), so the write never actually completes for this websocket client.

Solution :- Any ideas why this could be happening? From my understanding, no websocket client close event is being called, because I have monitored the logs multiple times. So the ws object is always alive, but whenever we try to send a message through it, the send fails.

kevin-bates commented 5 years ago

As a data point, have you tried reproducing this same scenario with a local kernel using only JupyterLab (sans NB2KG/EG)? Please ensure debug logging is enabled. I'm wondering if this is related to the 'buffered message' stuff. If so, the log should reflect that it is happening, especially if the kernel is doing work during the laptop's 45-minute closure. Thanks.

IMAM9AIS commented 5 years ago

@kevin-bates I tried reproducing this with a local kernel, without the NB2KG/EG setup, and things seem to run fine even after the laptop sleeps for 40-50 minutes (I am able to execute cells). I have had debug logging on since the start of this issue but could not see anything useful, except that if a cell goes into busy mode and I try to re-run it multiple times, after some tries I see "Exception writing message to websocket: error".

kevin-bates commented 5 years ago

Thanks for the update. This topic is beyond my knowledge level. Hopefully others can help here.

cc: @rolweber, @esevan - any ideas?

esevan commented 5 years ago

Sorry, I have no experience of this issue.

@IMAM9AIS Could you share the full error log for "Exception writing message to websocket: error"?

IMAM9AIS commented 5 years ago

@esevan this is the only message I receive, and it is logged from this part of the code: https://github.com/jupyter/nb2kg/blob/ddf6b7c3d119445f2bb4a03b8d8ea5a26a876bdc/nb2kg/handlers.py#L247 But it only shows up after I try executing the cell multiple times while it is still not producing results.

kevin-bates commented 5 years ago

@IMAM9AIS - sorry for the lack of help here. Just for grins (and another data point), can you try using the embedded nb2kg in Notebook 6.0 (now that it's released)? Instead of installing the extension, enabling it, and configuring the class overrides, you simply start Notebook with --gateway-url <gateway url>.

I suspect the results will be the same, but there are some changes with respect to handlers (at the HTTP level, though, not in the websocket handling).

esevan commented 5 years ago

I've checked, and neither nb2kg nor notebook/gateway reconnects the websocket to the gateway when the connection between nb2kg and EG is closed. As a result, the browser-notebook connection is alive but the notebook-EG connection is closed, so communication between the browser and the kernel is broken.

From the tornado.websocket documentation: a message of None indicates that the connection has been closed.

https://github.com/jupyter/nb2kg/blob/ddf6b7c3d119445f2bb4a03b8d8ea5a26a876bdc/nb2kg/handlers.py#L215-L216

@IMAM9AIS If you see 'connections': 0 in your log, this is what is happening. The log line looks something like: Kernel retrieved: {'id': 'd61a9037-1211-422f-86a0-ef5ed4b01789', 'name': 'python3', 'last_activity': '2019-07-18T07:08:55.641346Z', 'execution_state': 'starting', 'connections': 0}
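
To make the 'connections': 0 check above easier to run over a log file, here is a minimal sketch that pulls the connection count out of a "Kernel retrieved" line. The function name gateway_connections is hypothetical and not part of nb2kg or EG, and the sketch assumes the kernel model is logged as a Python dict repr, as in the sample line.

```python
import ast
import re


def gateway_connections(log_line):
    """Return the 'connections' count from a 'Kernel retrieved: {...}' log line.

    Hypothetical helper for log inspection; returns None if the line does not
    contain a kernel model.
    """
    match = re.search(r"Kernel retrieved:\s*(\{.*\})", log_line)
    if match is None:
        return None
    # Assumption: the model is printed as a Python dict repr (single quotes),
    # so literal_eval can parse it where json.loads would not.
    model = ast.literal_eval(match.group(1))
    return model.get("connections")


line = (
    "Kernel retrieved: {'id': 'd61a9037-1211-422f-86a0-ef5ed4b01789', "
    "'name': 'python3', 'last_activity': '2019-07-18T07:08:55.641346Z', "
    "'execution_state': 'starting', 'connections': 0}"
)

if gateway_connections(line) == 0:
    print("no websocket is attached to this kernel on the gateway side")
```

A count of 0 while the browser still shows the kernel as connected would match the broken notebook-EG leg described above.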

@kevin-bates Could you comment on this case? Do you agree with the idea of recovering the connection between nb2kg and EG when it's closed? In my case, this also keeps JupyterLab from reconnecting the session when the session is recovered in EG.
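
The recovery idea can be sketched roughly as follows. This is not the code from the PR, only an illustration of the pattern under stated assumptions: FakeGatewaySocket is a hypothetical stand-in for the websocket client connection to EG (it mimics Tornado's behaviour of returning None from read_message() once the connection is closed), and relay_with_reconnect, max_reconnects, and delay are made-up names for this example.

```python
import asyncio


class FakeGatewaySocket:
    """Hypothetical stand-in for the websocket client connection to the gateway."""

    def __init__(self, messages):
        self._messages = list(messages)

    async def read_message(self):
        # Like Tornado's client connection, yield None once the peer has closed.
        return self._messages.pop(0) if self._messages else None


async def relay_with_reconnect(connect, on_message, max_reconnects=3, delay=0.01):
    """Read gateway messages; on a None message, reconnect instead of giving up.

    Returns the number of reconnect attempts made before stopping.
    """
    reconnects = 0
    ws = await connect()
    while True:
        msg = await ws.read_message()
        if msg is not None:
            on_message(msg)  # forward to the browser-side websocket
            continue
        # None means the notebook-gateway connection was closed.
        if reconnects >= max_reconnects:
            return reconnects
        reconnects += 1
        await asyncio.sleep(delay * reconnects)  # simple linear backoff
        ws = await connect()


def demo():
    # Each connect() call hands out the next scripted connection;
    # the later ones are already closed and deliver nothing.
    scripted = [["a", "b"], ["c"], [], [], []]
    received = []

    async def connect():
        return FakeGatewaySocket(scripted.pop(0) if scripted else [])

    reconnects = asyncio.run(relay_with_reconnect(connect, received.append))
    return received, reconnects
```

The point of the pattern is that the browser-facing websocket never needs to know the gateway leg dropped: messages keep flowing once a new gateway connection is established. A real implementation would also need to decide when to reset the retry counter and what to do with messages sent while disconnected.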

kevin-bates commented 5 years ago

@esevan - yeah, I agree. Your PR looks promising. Actually glad both embedded and nb2kg behave the same. Let's focus on nb2kg for now since that's what is being used, and we'll port accordingly. Thanks!

@IMAM9AIS - as I mentioned in the PR, it would be great if you could take #42 for a spin prior to merge.

IMAM9AIS commented 5 years ago

@kevin-bates @esevan have some comments here:- https://github.com/jupyter/nb2kg/pull/42

IMAM9AIS commented 5 years ago

@esevan @kevin-bates We tested the PR changes and everything looks good. We haven't had a connection loss since.

esevan commented 5 years ago

@IMAM9AIS I'm so glad to hear that! Thank you for reporting this and testing it in your environment ;D

kevin-bates commented 5 years ago

Fantastic news! @esevan - I've merged the PR. Could you please apply the applicable changes to the gateway subsystem in Notebook?

kevin-bates commented 5 years ago

Closing via #42.

esevan commented 5 years ago

@kevin-bates Sure. I'll upload this patch for jupyter/notebook :D

esevan commented 5 years ago

@kevin-bates https://github.com/jupyter/notebook/pull/4777 Requested :)