jupyter / nb2kg

Other
73 stars 31 forks source link

[Question] Does nb2kg support buffer_offline_messages #19

Closed ktong closed 5 years ago

ktong commented 5 years ago

MappingKernelManager.buffer_offline_messages : Bool Default: True

Whether messages from kernels whose frontends have disconnected should be buffered in-memory.

When True (default), messages are buffered and replayed on reconnect, avoiding lost messages due to interrupted connectivity.

Disable if long-running kernels will produce too much output while no frontends are connected.

kevin-bates commented 5 years ago

Since NB2KG proxies kernel management to the corresponding gateway, this configuration option is controlled by the gateway server application. Because its enabled by default, then, yes, kernels created via NB2KG support the buffering of offline messages. That said, setting this configuration option on the client (NB2KG) side will be ignored.

If we find that too much memory is consumed from buffering, we may want to look at making this a per-kernel option and have Enterprise Gateway disable it by default.

I hope this answers your question.

ktong commented 5 years ago

Below is my steps:

  1. open notebook in JupyterLab connected to enterprise gateway
  2. run all cells (usually takes 1 hour to complete)
  3. wait to get first batch output from kernel
  4. close macbook pro (goes to hibernate)
  5. wait for 5 minutes
  6. open macbook pro (wake up)
  7. wait 2 minutes to expect output in last 5 minutes shows but not.
  8. reconnect to kernel to check kernel is still running
  9. new output shows

Any step or configuration do I miss?

kevin-bates commented 5 years ago

I'm not familiar with how the buffering works other than the grep I performed and seeing items logged in the trace log, so I just tried it with this cell content, then closing the browser tab...

import time

for i in range(500):
    time.sleep(1.0)
    print(i)

When closing the tab, I see the debug entries on the EG server transition to indicate that buffering is taking place...

[D 2018-10-09 06:59:41.377 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:42.381 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:43.381 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:44.382 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:44.694 EnterpriseGatewayApp] Websocket closed 4724614d-810f-4f94-9559-564aae170e12:5f7fafe3-2d42fde46eae0361dab31902
[I 2018-10-09 06:59:44.694 EnterpriseGatewayApp] Starting buffering for 4724614d-810f-4f94-9559-564aae170e12:5f7fafe3-2d42fde46eae0361dab31902
[D 2018-10-09 06:59:44.694 EnterpriseGatewayApp] Clearing buffer for 4724614d-810f-4f94-9559-564aae170e12
[I 181009 06:59:44 web:2162] 200 GET /api/kernels/4724614d-810f-4f94-9559-564aae170e12 (9.108.92.236) 1.28ms
[D 2018-10-09 06:59:45.380 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
[D 2018-10-09 06:59:45.383 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:46.381 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
[D 2018-10-09 06:59:46.383 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:47.382 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
[D 2018-10-09 06:59:47.385 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:48.383 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub

I then get back into the notebook and see the EG log transition back to normal behavior...

[D 2018-10-09 06:59:57.246 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/4724614d-810f-4f94-9559-564aae170e12/channels
[W 2018-10-09 06:59:57.248 EnterpriseGatewayApp] No session ID specified
[I 2018-10-09 06:59:57.249 EnterpriseGatewayApp] Adapting to protocol v5.1 for kernel 4724614d-810f-4f94-9559-564aae170e12
[I 181009 06:59:57 web:2162] 101 GET /api/kernels/4724614d-810f-4f94-9559-564aae170e12/channels (9.108.92.236) 4.63ms
[D 2018-10-09 06:59:57.250 EnterpriseGatewayApp] Opening websocket /api/kernels/4724614d-810f-4f94-9559-564aae170e12/channels
[D 2018-10-09 06:59:57.250 EnterpriseGatewayApp] Getting buffer for 4724614d-810f-4f94-9559-564aae170e12
[D 2018-10-09 06:59:57.251 EnterpriseGatewayApp] Clearing buffer for 4724614d-810f-4f94-9559-564aae170e12
[I 2018-10-09 06:59:57.251 EnterpriseGatewayApp] Discarding 12 buffered messages for 4724614d-810f-4f94-9559-564aae170e12:5f7fafe3-2d42fde46eae0361dab31902
[D 2018-10-09 06:59:57.251 EnterpriseGatewayApp] Connecting to: tcp://172.16.172.6:33243
[D 2018-10-09 06:59:57.252 EnterpriseGatewayApp] Connecting to: tcp://172.16.172.6:47165
[D 2018-10-09 06:59:57.252 EnterpriseGatewayApp] Connecting to: tcp://172.16.172.6:48693
[D 2018-10-09 06:59:57.400 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:58.401 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:59.399 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:00.400 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:01.405 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream

However, nothing is getting posted to the Notebook, yet the notebook is appearing as busy and I can interrupt the kernel - at which time I see the EG log indicate that appropriately...

[D 2018-10-09 07:00:15.418 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:16.419 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:17.420 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:18.327 EnterpriseGatewayApp] RemoteKernelManager.signal_kernel(2)
[D 2018-10-09 07:00:18.328 EnterpriseGatewayApp] Signal (2) sent via gateway communication port.
[I 2018-10-09 07:00:18.328 EnterpriseGatewayApp] Kernel interrupted: 4724614d-810f-4f94-9559-564aae170e12
[I 181009 07:00:18 web:2162] 204 POST /api/kernels/4724614d-810f-4f94-9559-564aae170e12/interrupt (9.108.92.236) 1.91ms
[D 2018-10-09 07:00:18.383 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:18.388 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: error
[D 2018-10-09 07:00:18.393 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (busy)
[D 2018-10-09 07:00:18.398 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (idle)
[D 2018-10-09 07:00:18.403 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (busy)
[D 2018-10-09 07:00:18.408 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (idle)
[D 2018-10-09 07:00:18.445 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (idle)

In the previous log snippet I see that is 'discarded' the buffered messages - so there must be something that "asks" for those messages - or something like that.

I figured that traditional Notebook (i.e., one that doesn't proxy its kernel management to a gateway server) would work, so I tried it using a similar cell. Much to my surprise I see precisely the same behavior! The closed tab is detected and the re-opened notebook is detected, BUT that re-opened notebook does not display the current results AND I see the same 'discarded' message logged in the notebook log.

[I 07:20:11.132 NotebookApp] Discarding 4 buffered messages for 1b27cea3-4858-41b3-834b-b8b63326b7c3:2801368a62b045339effc95362daed64

What experiences do you have with this feature? I'm wondering if this is something that JupyterLab supports over Notebook. I haven't tried lab - kinda wanted to hear your experiences first.

kevin-bates commented 5 years ago

In looking into this a bit more, I probably need to revisit the traditional notebook scenario because "play back" is supposed to work. However, the key used to index into the buffered set of messages is a function of the kernel_id and session key and I think the issue is that NB2KG doesn't proxy session management (only kernel management). As a result, reconnection creates a new session key even though the kernel_id is preserved, but because the composed index doesn't match, the buffered messages get discarded (instead of replayed).

I'd still be curious to know your experiences and understanding. I believe this is something we should try to make work.

kevin-bates commented 5 years ago

OK - so apparently there are differences upon how the disconnection works. If you merely stop the network (leaving the notebook intact), then replay works on traditional notebook. Interestingly enough though, you don't see any of the buffering messages in the notebook server log, yet after restoring the network connection, you get the intermediate results. After finding it works the same way with EG, I suspect this is because the "buffering" is being performed by zmq, then replayed when the connection is restored.

However, when closing the tab (which is probably the same as hibernating), the browser loses context and a new session key is created - resulting in the buffered messages being discarded (same as for EG).

These issues are discussed in Notebook issue 1150 and JupyterLab issue 4237. The replay on network connection loss was implemented via Notebook PR 2871 and PR 2916 added the option that you mentioned above.

At this point, I'm of the impression that EG (and Kernel Gateway) behave the same way as traditional Notebook.

I suspect the code that was implemented for PR 2871 does not come into play relative to network connection drops. :smile:

kevin-bates commented 5 years ago

All tests are showing that NB2KG-enabled notebooks are behaving the same way as traditional notebook configurations - both do not trigger replays of buffered messages. I believe that is a function of two things.

  1. How the kernel management determines which buffered messages are to be replayed. Currently, this is based on a composed key of kernel_id and session, where the session value is based on the current connection. As a result, replay doesn't occur.
  2. If replay were to be enabled (which I attempted by ignoring the session component), how the messages are 're-connected' on the notebook client (in front of NB2KG), which, again, I believe to be a function of the notebook code - and is still connection oriented today (i.e., the replayed and new messages still do not appear).

There has been some revived discussion on https://github.com/jupyter/notebook/pull/2871 that I will be monitoring.

Where issues may arise relative to NB2KG and Gateway servers is if the session key portion becomes a function of the session class from the notebook side of things. In that case, we'd need to ensure that that value get communicated when creating the kernel.

At a minimum, I believe we'll want to push the buffer index/key determination into a method on kernel manager that the gateway servers can override to deal with their 'detached' design.

@ktong - I'm inclined to close this issue unless you feel it should remain open. I will be tracking message buffering and make sure replays (once properly implemented) also work relative to the gateway servers.

ktong commented 5 years ago

It’s pretty clear answer. Thanks, @kevin-bates. Please feel free to close it.