Zsailer opened this issue 2 years ago

We discussed this briefly in the Jupyter Server meeting today.

Today, messages on the control channel trigger IOPub status messages. This poses some pretty significant issues for JupyterLab and Jupyter Server.

In JupyterLab/Jupyter Server, the control channel is being used more and more (e.g. debug requests, nudging with kernel_info requests on websocket reconnect, etc.). Most (all?) frontends use the IOPub stream to tell the user the execution state of the kernel's shell channel, i.e. "idle" or "busy". Because the control channel also triggers IOPub status messages, a busy kernel will sometimes show up as "idle" in the client.

There are a couple of ways to reproduce this bug; in each of them, JupyterLab will say the kernel is idle when it's actually busy.

In the meeting, we discussed that perhaps the control channel shouldn't trigger status messages. I think this makes the most sense, since we expect all control requests/replies to be fast/responsive.

This would be a messaging specification change (though I think a low-risk one) and will likely require a JEP. Before going that route, I want to make sure this makes sense. Thoughts?
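For concreteness, a minimal sketch of this kind of reproduction using jupyter_client (assuming a recent ipykernel where the control channel is serviced in its own thread; not the exact steps discussed in the meeting):

```python
# Minimal sketch, assuming jupyter_client and an installed python3 kernel.
from queue import Empty

from jupyter_client.manager import start_new_kernel

km, kc = start_new_kernel(kernel_name="python3")
try:
    # Occupy the shell channel with a long-running execution.
    kc.execute("import time; time.sleep(30)")

    # Nudge on the *control* channel, as Jupyter Server does on
    # websocket reconnect.
    msg = kc.session.msg("kernel_info_request")
    kc.control_channel.send(msg)

    # Watch IOPub: the control-side handling broadcasts its own
    # busy/idle pair, so a frontend tracking only execution_state
    # sees "idle" while the sleep is still running on shell.
    for _ in range(10):
        try:
            m = kc.get_iopub_msg(timeout=5)
        except Empty:
            break
        if m["header"]["msg_type"] == "status":
            print(m["content"]["execution_state"])
finally:
    kc.stop_channels()
    km.shutdown_kernel(now=True)
```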
I agree, and for me the reason is not necessarily that control requests should be handled quickly, but rather that the control channel is for talking to "the thing that controls the interpreter", as opposed to the shell channel, which is for talking to "the interpreter". Statuses refer to the "interpreter", not to the "controller". Evidence that the two are not related is that we recommend running the "controller" in a separate thread.
cc @JohanMabille
Refreshing JupyterLab while a long-running cell is executing should not indicate that the kernel is idle. The state is unknown until a kernel_info request is made, and the purpose of the kernel_info request is to get a state through a busy/idle cycle. But it's a pretty safe bet that when a client arrives and issues a kernel_info request, it should report the kernel as busy until it hears back from the kernel otherwise. So that sounds like a deficiency or wrong assumption in JupyterLab, not the protocol. A new client doesn't have to wait for its kernel_info request to be serviced to get some status messages on IOPub: if a client arrives while a bunch of long-running cells it doesn't know about are already executing, it issues its kernel_info request and switches its indicator to busy; when the current long-running execution completes, the client can flicker from busy to idle and back to busy when the next queued execution starts running.
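A sketch of the client-side logic described above (the `client` and `indicator` objects here are hypothetical stand-ins, not a real frontend API):

```python
# Hypothetical sketch: assume "busy" on connect, nudge with a
# kernel_info request, and let IOPub status messages drive the
# indicator from then on.

def on_connect(client, indicator):
    indicator.set("busy")              # safe default until the kernel answers
    client.send_kernel_info_request()  # nudge; any reply proves liveness

def on_iopub_status(msg, indicator):
    # During a backlog of queued executions the indicator may flicker
    # busy -> idle -> busy as one request finishes and the next starts.
    indicator.set(msg["content"]["execution_state"])
```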
I think this proposal needs to be refined further with a statement of the problem it is trying to address. Is the intent here to make kernel_info_request not trigger IOPub busy/idle messages, and to have kernel_info_requests be serviced on the control channel only? I can see the advantage of that: the state could be reported immediately even when the kernel is busy, for kernels which, as David mentions above, run the controller in a separate thread.
However, we must still have messages on the control channel trigger kernel status messages on IOPub in the case of interrupts, because only the client issuing the request receives an acknowledgement of an interrupt request. How would all the other clients know that an interrupt occurred if we say that control channel messages should not trigger status messages?
> However, we must still have messages on the control channel trigger kernel status messages on IOPub in the case of interrupts
This should trigger a reply to the execute request on the shell channel (with e.g. a `KeyboardInterrupt` error), and with it an idle status broadcast on IOPub, right? In other words, control is indirectly triggering a status message through shell.
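Written out as a hypothetical message sequence (illustrative only, not a captured trace):

```python
# Hypothetical flow for interrupting a running execution: the interrupt
# itself is acknowledged privately on control, but the aborted execute
# request still produces a broadcast on IOPub via shell.
interrupt_flow = [
    ("control", "interrupt_request", "sent by one client"),
    ("control", "interrupt_reply",   "seen only by the requesting client"),
    ("shell",   "execute_reply",     "error reply, e.g. KeyboardInterrupt"),
    ("iopub",   "status",            "execution_state: idle, parent: the execute_request"),
]
```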
The definition and handling of the control channel has changed without really being reflected in changes to the spec, which makes this tricky. Rather than a higher-priority queue for shell messages (the original definition), it has also come to be used (sometimes) for concurrent message handling. Having concurrent message handling means the single busy/idle state indicator doesn't actually work, and clients need to track the parent header associated with the status change, because busy/idle only indicates completed processing of a single request, not a global state. This would work well if the busy/idle indicator were on cells instead of on the notebook as a whole.
I think if we propose losing busy/idle entirely on the control channel (a major version bump to the protocol), we also need to make changes like "control channel messages shouldn't produce any side effect messages on IOPub", because without the busy/idle boundaries, there are problems with having any IOPub messages. If that's a problem, then I think dropping busy/idle from control isn't the right change.
I think adding the channel to status messages, as in #839, is a simpler and more robust solution, if we preserve the notion that message handling on each channel is serialized. If, for instance, the control channel (or even the shell channel) can handle its own requests concurrently, then clients tracking parent requests is really the only possible solution. I'm not certain how you handle multiple clients in this case, though.
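For illustration, here is how a client might consume status messages if they carried a channel field as proposed in #839 (the field name and default are assumptions, not necessarily the proposal's final shape):

```python
# Sketch: per-channel busy/idle tracking, assuming status messages gain
# a "channel" field along the lines of #839 (field name assumed).
class ChannelStateTracker:
    def __init__(self):
        self.state = {}  # channel name -> "busy" | "idle"

    def on_status(self, msg):
        channel = msg["content"].get("channel", "shell")  # assumed field
        self.state[channel] = msg["content"]["execution_state"]

    def indicator(self):
        # Drive the UI from the shell channel only, so control traffic
        # can no longer mask a long-running execution.
        return self.state.get("shell", "busy")
```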
> A new client doesn't have to wait for its kernel_info request to be serviced to get some status messages on IOPub: if a client arrives while a bunch of long-running cells it doesn't know about are already executing, it issues its kernel_info request and switches its indicator to busy; when the current long-running execution completes, the client can flicker from busy to idle and back to busy when the next queued execution starts running.
@ivanov, you're right, and I think the issue is with Jupyter Server. When a kernel websocket is opened in Jupyter Server, it sends two kernel_info requests: one on shell and one on control. The shell channel's kernel_info gets queued, and JupyterLab's indicator does show "busy" for a brief moment, but the control channel responds right away and sends an "idle" status immediately after.
As @minrk said, I think the solution is to track parent messages so we know which channel sent the kernel_info request, and to have the frontend listen only to the shell channel for the busy/idle indicator. #839 would help with this tracking, since we could watch messaging on a specific channel.
I'll work on a fix in Jupyter Server that allows us to track parent messages and resolve state more accurately in the server.
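A sketch of that parent-message bookkeeping under today's protocol, where each status message carries the parent header of the request that triggered it (names here are illustrative):

```python
# Sketch: remember which channel each request went out on, then route
# status messages by their parent msg_id.
class ParentTracker:
    def __init__(self):
        self.channel_of = {}  # msg_id -> channel the request was sent on

    def record_send(self, msg, channel):
        self.channel_of[msg["header"]["msg_id"]] = channel

    def status_is_from_shell(self, status_msg):
        parent_id = status_msg.get("parent_header", {}).get("msg_id")
        # Unknown parents (e.g. another client's requests) are treated
        # as shell traffic, which errs on the side of showing "busy".
        return self.channel_of.get(parent_id, "shell") == "shell"
```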
> I'm not certain how you handle multiple clients in this case, though.
@minrk, for frontends that talk to kernels through Jupyter Server, can we solve this by making the KernelManager track and store the kernel's execution state? I proposed something along these lines here: https://github.com/jupyter-server/jupyter_server/issues/990
Today, frontends have to resolve this by watching the IOPub stream directly. I'm proposing that the KernelManager become the source of truth for kernel state. Clients could fetch the current state from the kernel manager at any time, rather than sending ZMQ messages themselves to try to resolve the state.
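Jupyter Server's kernels REST API already exposes an execution_state field on the kernel model, so under this proposal a client could poll the server instead of the kernel; a rough sketch (base URL and token handling simplified):

```python
# Minimal sketch: ask the server, not the kernel, for its cached view of
# the kernel's execution state. base_url and token are placeholders.
import requests

def fetch_execution_state(base_url, kernel_id, token):
    resp = requests.get(
        f"{base_url}/api/kernels/{kernel_id}",
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    # The kernel model includes an "execution_state" field; the proposal
    # is to keep this value authoritative in the KernelManager.
    return resp.json().get("execution_state", "unknown")
```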
> can we solve this by making the KernelManager track and store the kernel's execution state?
In practice, ~all clients connect through the kernel manager, but not in principle (I can connect a jupyter-console to a kernel started by jupyter-server). So the only true source of this info is the kernel itself. But tracking it in the server could improve the situation for most cases in the interim.
@MSeal @rgbkrk fyi
I think that now that we have the `xpub` socket and the welcome message in the protocol, we could use it to indicate the state of the kernel upon new client connections. This has many advantages:

- Currently, clients send two `kernel_info_request` messages to the kernel: one on control to get the info about the kernel, and one on shell only to get the busy/idle status. We can simply drop the latter, and we can state in the protocol specification that `kernel_info_request` can be sent on control only.
- Adding the channel to status messages has the advantage of being very flexible: if we had another channel in the future (there are discussions about subshells, for instance), there would be no need to change the protocol again.
Regarding asynchronous kernels, a possible improvement would be to add a `busy_level` or `busy_factor` field in the status message, indicating how many tasks are being executed concurrently. In this scheme, the `idle` message would disappear, replaced with a `busy` message whose `busy_level` is set to 0.
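For illustration, the proposed scheme might look something like this (the field name and message shape are hypothetical; nothing here is in the current spec):

```python
# Hypothetical status messages under the proposed busy_level scheme.
status_two_tasks = {
    "header": {"msg_type": "status"},
    "content": {"execution_state": "busy", "busy_level": 2},
}
status_quiescent = {
    "header": {"msg_type": "status"},
    "content": {"execution_state": "busy", "busy_level": 0},  # replaces "idle"
}

def is_idle(status):
    # A client treats busy_level == 0 as what "idle" used to mean.
    return status["content"]["busy_level"] == 0
```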
> we can state in the protocol specification that `kernel_info_request` can be sent on control only.
Let's definitely not make that breaking change. I don't think there's a reason to stop accepting kernel_info on the standard channel.