Closed ktong closed 5 years ago
Since NB2KG proxies kernel management to the corresponding gateway, this configuration option is controlled by the gateway server application. Because its enabled by default, then, yes, kernels created via NB2KG support the buffering of offline messages. That said, setting this configuration option on the client (NB2KG) side will be ignored.
If we find that too much memory is consumed from buffering, we may want to look at making this a per-kernel option and have Enterprise Gateway disable it by default.
I hope this answers your question.
Below is my steps:
Any step or configuration do I miss?
I'm not familiar with how the buffering works other than the grep I performed and seeing items logged in the trace log, so I just tried it with this cell content, then closing the browser tab...
import time
for i in range(500):
time.sleep(1.0)
print(i)
When closing the tab, I see the debug entries on the EG server transition to indicate that buffering is taking place...
[D 2018-10-09 06:59:41.377 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:42.381 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:43.381 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:44.382 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:44.694 EnterpriseGatewayApp] Websocket closed 4724614d-810f-4f94-9559-564aae170e12:5f7fafe3-2d42fde46eae0361dab31902
[I 2018-10-09 06:59:44.694 EnterpriseGatewayApp] Starting buffering for 4724614d-810f-4f94-9559-564aae170e12:5f7fafe3-2d42fde46eae0361dab31902
[D 2018-10-09 06:59:44.694 EnterpriseGatewayApp] Clearing buffer for 4724614d-810f-4f94-9559-564aae170e12
[I 181009 06:59:44 web:2162] 200 GET /api/kernels/4724614d-810f-4f94-9559-564aae170e12 (9.108.92.236) 1.28ms
[D 2018-10-09 06:59:45.380 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
[D 2018-10-09 06:59:45.383 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:46.381 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
[D 2018-10-09 06:59:46.383 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:47.382 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
[D 2018-10-09 06:59:47.385 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:48.383 EnterpriseGatewayApp] Buffering msg on 4724614d-810f-4f94-9559-564aae170e12:iopub
I then get back into the notebook and see the EG log transition back to normal behavior...
[D 2018-10-09 06:59:57.246 EnterpriseGatewayApp] Initializing websocket connection /api/kernels/4724614d-810f-4f94-9559-564aae170e12/channels
[W 2018-10-09 06:59:57.248 EnterpriseGatewayApp] No session ID specified
[I 2018-10-09 06:59:57.249 EnterpriseGatewayApp] Adapting to protocol v5.1 for kernel 4724614d-810f-4f94-9559-564aae170e12
[I 181009 06:59:57 web:2162] 101 GET /api/kernels/4724614d-810f-4f94-9559-564aae170e12/channels (9.108.92.236) 4.63ms
[D 2018-10-09 06:59:57.250 EnterpriseGatewayApp] Opening websocket /api/kernels/4724614d-810f-4f94-9559-564aae170e12/channels
[D 2018-10-09 06:59:57.250 EnterpriseGatewayApp] Getting buffer for 4724614d-810f-4f94-9559-564aae170e12
[D 2018-10-09 06:59:57.251 EnterpriseGatewayApp] Clearing buffer for 4724614d-810f-4f94-9559-564aae170e12
[I 2018-10-09 06:59:57.251 EnterpriseGatewayApp] Discarding 12 buffered messages for 4724614d-810f-4f94-9559-564aae170e12:5f7fafe3-2d42fde46eae0361dab31902
[D 2018-10-09 06:59:57.251 EnterpriseGatewayApp] Connecting to: tcp://172.16.172.6:33243
[D 2018-10-09 06:59:57.252 EnterpriseGatewayApp] Connecting to: tcp://172.16.172.6:47165
[D 2018-10-09 06:59:57.252 EnterpriseGatewayApp] Connecting to: tcp://172.16.172.6:48693
[D 2018-10-09 06:59:57.400 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:58.401 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 06:59:59.399 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:00.400 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:01.405 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
However, nothing is getting posted to the Notebook, yet the notebook is appearing as busy and I can interrupt the kernel - at which time I see the EG log indicate that appropriately...
[D 2018-10-09 07:00:15.418 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:16.419 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:17.420 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:18.327 EnterpriseGatewayApp] RemoteKernelManager.signal_kernel(2)
[D 2018-10-09 07:00:18.328 EnterpriseGatewayApp] Signal (2) sent via gateway communication port.
[I 2018-10-09 07:00:18.328 EnterpriseGatewayApp] Kernel interrupted: 4724614d-810f-4f94-9559-564aae170e12
[I 181009 07:00:18 web:2162] 204 POST /api/kernels/4724614d-810f-4f94-9559-564aae170e12/interrupt (9.108.92.236) 1.91ms
[D 2018-10-09 07:00:18.383 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: stream
[D 2018-10-09 07:00:18.388 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: error
[D 2018-10-09 07:00:18.393 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (busy)
[D 2018-10-09 07:00:18.398 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (idle)
[D 2018-10-09 07:00:18.403 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (busy)
[D 2018-10-09 07:00:18.408 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (idle)
[D 2018-10-09 07:00:18.445 EnterpriseGatewayApp] activity on 4724614d-810f-4f94-9559-564aae170e12: status (idle)
In the previous log snippet I see that is 'discarded' the buffered messages - so there must be something that "asks" for those messages - or something like that.
I figured that traditional Notebook (i.e., one that doesn't proxy its kernel management to a gateway server) would work, so I tried it using a similar cell. Much to my surprise I see precisely the same behavior! The closed tab is detected and the re-opened notebook is detected, BUT that re-opened notebook does not display the current results AND I see the same 'discarded' message logged in the notebook log.
[I 07:20:11.132 NotebookApp] Discarding 4 buffered messages for 1b27cea3-4858-41b3-834b-b8b63326b7c3:2801368a62b045339effc95362daed64
What experiences do you have with this feature? I'm wondering if this is something that JupyterLab supports over Notebook. I haven't tried lab - kinda wanted to hear your experiences first.
In looking into this a bit more, I probably need to revisit the traditional notebook scenario because "play back" is supposed to work. However, the key used to index into the buffered set of messages is a function of the kernel_id and session key and I think the issue is that NB2KG doesn't proxy session management (only kernel management). As a result, reconnection creates a new session key even though the kernel_id is preserved, but because the composed index doesn't match, the buffered messages get discarded (instead of replayed).
I'd still be curious to know your experiences and understanding. I believe this is something we should try to make work.
OK - so apparently there are differences upon how the disconnection works. If you merely stop the network (leaving the notebook intact), then replay works on traditional notebook. Interestingly enough though, you don't see any of the buffering messages in the notebook server log, yet after restoring the network connection, you get the intermediate results. After finding it works the same way with EG, I suspect this is because the "buffering" is being performed by zmq, then replayed when the connection is restored.
However, when closing the tab (which is probably the same as hibernating), the browser loses context and a new session key is created - resulting in the buffered messages being discarded (same as for EG).
These issues are discussed in Notebook issue 1150 and JupyterLab issue 4237. The replay on network connection loss was implemented via Notebook PR 2871 and PR 2916 added the option that you mentioned above.
At this point, I'm of the impression that EG (and Kernel Gateway) behave the same way as traditional Notebook.
I suspect the code that was implemented for PR 2871 does not come into play relative to network connection drops. :smile:
All tests are showing that NB2KG-enabled notebooks are behaving the same way as traditional notebook configurations - both do not trigger replays of buffered messages. I believe that is a function of two things.
There has been some revived discussion on https://github.com/jupyter/notebook/pull/2871 that I will be monitoring.
Where issues may arise relative to NB2KG and Gateway servers is if the session key portion becomes a function of the session class from the notebook side of things. In that case, we'd need to ensure that that value get communicated when creating the kernel.
At a minimum, I believe we'll want to push the buffer index/key determination into a method on kernel manager that the gateway servers can override to deal with their 'detached' design.
@ktong - I'm inclined to close this issue unless you feel it should remain open. I will be tracking message buffering and make sure replays (once properly implemented) also work relative to the gateway servers.
It’s pretty clear answer. Thanks, @kevin-bates. Please feel free to close it.
MappingKernelManager.buffer_offline_messages : Bool Default: True
Whether messages from kernels whose frontends have disconnected should be buffered in-memory.
When True (default), messages are buffered and replayed on reconnect, avoiding lost messages due to interrupted connectivity.
Disable if long-running kernels will produce too much output while no frontends are connected.