Pre-Proposal: Add Restart In Place API Support

timg512372 commented 1 year ago

Problem

Restart-in-place is a new type of restart requested by the Jupyter community. Normally, when a remote kernel is restarted, both the kernel process and the tunnel process are ended. On resource-managed clusters, ending the tunnel process will end the task that the kernel is scheduled on, forcing the restarted kernel to wait in the back of the queue for another task. This is a major inconvenience for users on distributed clusters who need to restart their kernels often.

With restart-in-place, the kernel process gets terminated but the tunnel process stays alive. Then, the new kernel is launched directly on the existing tunnel process. This may speed up restart times significantly.

There are previously proposed implementations of restart-in-place. All of them were held back by the lack of an officially supported API for restart-in-place, and had to trigger their restarts through non-intuitive methods such as magics. For example, inplace_restarter started a second (nanny) process alongside the kernel and used a magic to tell the nanny process to restart the kernel. However, inplace_restarter could be tricky to set up and was less straightforward to use than a native JupyterLab button, and there were use cases where a nanny process is undesirable.

Spec Changes

We propose adding native API support for restart-in-place. We will add functionality to jupyter-client's kernel management infrastructure to distinguish between standard restart and restart-in-place requests. We will also add a new server endpoint on jupyter-server, /api/kernels/{kernel_id}/restart-in-place, to handle restart-in-place requests. The new endpoint lets users specify their restart preference and lets kernel implementations handle the restart-in-place.
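As a rough illustration of the endpoint described above, the sketch below builds the URL a client would POST to for either kind of restart. The helper name `restart_url` and the example base URL are made up for illustration, and `restart-in-place` is the route proposed here, not one that jupyter-server ships today.

```python
from urllib.parse import quote, urljoin


def restart_url(base_url: str, kernel_id: str, in_place: bool = False) -> str:
    """Build the URL a client would POST to in order to restart a kernel.

    "restart" is the existing action; "restart-in-place" is the action
    proposed in this thread (an assumption, not an existing route).
    """
    action = "restart-in-place" if in_place else "restart"
    # Make sure the base ends with "/" so urljoin keeps any path prefix.
    base = base_url if base_url.endswith("/") else base_url + "/"
    return urljoin(base, f"api/kernels/{quote(kernel_id, safe='')}/{action}")


print(restart_url("http://localhost:8888", "abc123"))
print(restart_url("http://localhost:8888", "abc123", in_place=True))
```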

We also propose adding a new command in jupyterlab to call the new restart-in-place API. Then, we will add front-end elements for calling the command, such as toolbar buttons and menu bar items.

We prefer this approach to writing a new extension because several classes in jupyterlab that manage the kernel lifecycle have private variables. This means that we can't execute all of the restart logic without editing some classes in jupyterlab.

Reference Implementation

We will also add a reference implementation for restart in place using DistributedProvisioner in the package gateway_provisioners, since we assume that many users of remote kernels on resource managed clusters either use Enterprise Gateway or Gateway Provisioners. Because a kernel provisioner is responsible for managing the kernel and tunnel process, and a gateway provisioner is just an extension of a kernel provisioner for remote services, we can edit the DistributedProvisioner to not terminate the tunnel process upon a restart in place.

FAQ

Why is this a JEP?

We decided to write a JEP for restart in place because it requires changes to multiple Project Jupyter packages and edits the spec by which jupyterlab communicates with jupyter_server. Also, while we are pushing one implementation of restart in place for a specific resource manager, we leave open future implementations for other resource managers such as Kubernetes, Spark, Hadoop, etc.

Will this work with ____ kernel / implementation?

The proposal would support restart-in-place with many different processes. We leave the implementation of restart-in-place up to the specific kernel specification, but regardless of the implementation, we need a new API endpoint to distinguish whether the user wants a normal restart or a restart in place.

What if a kernel doesn't support restart-in-place?

If restart-in-place is not supported, the boolean argument that indicates whether a restart is in-place or not is ignored and the kernel restarts as normal.

Will the frontend be able to tell when restart in place is enabled and activate features accordingly?

Yes, we would like to standardize a variable in the kernel specification that will enable restart-in-place functionality. Then, jupyterlab would read in the kernel specification to tell whether restart-in-place is enabled or not.
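A minimal sketch of how such a check could look, assuming the flag lives under the kernel spec's existing `metadata` field. The key name `restart_in_place` and the example kernel.json contents are hypothetical; the proposal does not fix a name.

```python
import json

# Hypothetical kernel.json. "argv", "display_name", "language" and "metadata"
# are standard kernel spec fields; the "restart_in_place" key is an assumption.
KERNEL_JSON = """
{
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "Python 3 (remote)",
  "language": "python",
  "metadata": {"restart_in_place": true}
}
"""


def supports_restart_in_place(spec: dict) -> bool:
    """Return True only if the spec explicitly opts in to restart-in-place."""
    metadata = spec.get("metadata") or {}
    return metadata.get("restart_in_place") is True


spec = json.loads(KERNEL_JSON)
print(supports_restart_in_place(spec))
```

A frontend could use a check like this to decide whether to show a restart-in-place button at all; specs without the flag are treated as not supporting it.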

What are the implementation details of the new API endpoint?

We propose adding an optional boolean keyword argument to the kernel restart methods in MultiKernelManager and KernelManager in jupyter_client. If set to true, they will allow subclasses of KernelManager to execute a restart-in-place, if possible.

In jupyter_server, we will edit the KernelActionHandler class to take an additional action "restart-in-place". This will call the restart function as normal, except with the restart in place keyword argument set to true.
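The plumbing described above can be sketched with toy classes. Nothing below is the real jupyter_client or jupyter_server code: `ToyKernelManager`, `ToyRemoteKernelManager`, `handle_action`, and the `in_place` keyword are stand-ins chosen for illustration. The sketch also shows the fallback from the FAQ, where a manager that does not support restart-in-place ignores the flag.

```python
class ToyKernelManager:
    """Stand-in for a kernel manager; not the real jupyter_client class."""

    supports_in_place = False

    def __init__(self):
        self.log = []  # records what a restart would do

    def restart_kernel(self, in_place: bool = False) -> None:
        if in_place and self.supports_in_place:
            self._restart_in_place()
        else:
            # Flag ignored when unsupported: restart as normal.
            self._full_restart()

    def _full_restart(self) -> None:
        self.log += ["stop kernel", "stop tunnel", "start tunnel", "start kernel"]

    def _restart_in_place(self) -> None:
        raise NotImplementedError


class ToyRemoteKernelManager(ToyKernelManager):
    """A subclass that opts in and keeps the tunnel process alive."""

    supports_in_place = True

    def _restart_in_place(self) -> None:
        self.log += ["stop kernel", "start kernel"]


def handle_action(manager: ToyKernelManager, action: str) -> None:
    """Toy dispatch mirroring the proposed "restart-in-place" action."""
    if action == "restart":
        manager.restart_kernel()
    elif action == "restart-in-place":
        manager.restart_kernel(in_place=True)
    else:
        raise ValueError(f"unknown action: {action}")


local, remote = ToyKernelManager(), ToyRemoteKernelManager()
handle_action(local, "restart-in-place")
handle_action(remote, "restart-in-place")
print(local.log)
print(remote.log)
```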

kevin-bates commented 1 year ago

Hi @timg512372. After looking at the initial PRs and re-reading the proposal, I'm wondering if this functionality couldn't be rethought as a configurable option/attribute on the Kernel Provisioner. This would alleviate three layers of plumbing and UI changes for a configurable option that resides in a config file or the kernel specification.

This approach would assume that all users of a given provisioner configured for in-place restart do indeed want in-place restart as their behavior - which seems reasonable. It also seems less confusing relative to the LocalProvisioner, which I don't think could support this w/o altering the kernel or introducing a nanny-like wrapper - which is what most remote implementations use. From a UI perspective, users won't "see" any difference between regular restarts and in-place restarts except that for remote kernels, the behaviors they expect based on their experiences with local kernels are indeed exhibited (now) in their remote kernels. Finally, it would allow kernel provisioner implementations to choose to adopt this functionality and leave the decision as to when it is enabled to the operators configuring their system.

It would be great to discuss this in Thursday's Server/Kernel Meeting if you're available, otherwise, we can discuss further here.

davidbrochart commented 1 year ago

@kevin-bates If this functionality goes through kernel provisioners, then I guess this invalidates a JEP since provisioners are not a Jupyter specification, right? For instance, jupyverse doesn't use jupyter-client and has no notion of kernel provisioner.

davidbrochart commented 1 year ago

Also, let me link to this issue about the specification of "shutdown and restart", which is not clear enough IMO. At some point, akernel (when used with jupyverse) supported something similar to restart-in-place, that consisted of just clearing the kernel namespace and not shutting down the kernel process, resulting in instant restart. This could be an interesting approach to restart-in-place.

minrk commented 1 year ago

I think this behavior definitely makes sense, and I agree that it's currently not defined in the spec in a way to prohibit existing kernel provisioners from doing so, either by default or behind config. An argument in support of this approach is that I think it's probably what most people want and expect from a kernel restart, not necessarily a start-from-scratch fresh kernel. stop & start is already a separate sequence available in the API, distinct from restart, and it's appropriate for the two to behave differently where there's a meaningful distinction. The strongest case for a new API is if folks clearly want both behaviors at the same time in the same session, and distinct stop & start vs restart doesn't satisfy that case.

I think it's valid to add to the existing spec for restart the suggestion that restart ought to re-use resources where appropriate/relevant (i.e. nothing to do for most local kernels).

Slightly tangential, I don't think that clearing the namespace ought to count as a restart (In IPython terminology, this is a %reset, not a restart). This is in large part because you can't (at least in Python) unload DLLs, so restart is the only way to get a fresh import of a compiled module.
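A small sketch of why clearing a namespace differs from a restart. It uses plain Python import caching with a dict as a stand-in for the user namespace, and is not how IPython implements %reset. After the namespace is cleared, the module object is still cached in `sys.modules`, so a later import gets the same object rather than a fresh load.

```python
import sys

# A dict standing in for the kernel's user namespace.
user_ns = {}
exec("import json\nvalue = json.dumps({'a': 1})", user_ns)
module_before = sys.modules["json"]

# Roughly what a namespace reset does: user variables disappear...
user_ns.clear()
print("value" in user_ns)

# ...but the interpreter still holds the loaded module, so re-importing
# returns the cached object instead of loading it again.
exec("import json", user_ns)
print(user_ns["json"] is module_before)
```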

mlucool commented 1 year ago

I agree with what @Zsailer said here in https://github.com/jupyter/jupyter_client/issues/736#issuecomment-1102859188:

In this context, there are multiple types of "restart" actions:

1. Tear down the kernel container (includes the kernel process), then start a new container and kernel process.
2. Leave the container running but restart the kernel process.
3. Leave the container running and clear the kernel namespace.

(1) can be a really expensive, slow operation, so having (2) and (3) as options is really useful.

It would be helpful to have semantics defined/available for each one of these scenarios.

I find being able to do 1 vs. 2 very important as a user, while it is less clear that I'd use 3. I don't like the idea of conflating restart with restart in place in all cases, because my process may alter state outside of the process (e.g. write a file) and I need some way to toss it and start over from scratch (even if it is expensive).

minrk commented 1 year ago

I find being able to do 1 vs. 2 very important as a user

That makes sense, but we don't have it now, though, right? Or do some provisioners already implement restart as restart in place?

I think an implementation of kernels (the manager/provisioner side, I do not think this is relevant to the kernel side) could comply with the spec as-is and implement 2. as restart and 1. as stop followed by start. It would then be a UI task to expose the two actions (stop & restart is pretty tedious now, I think), but they already exist in the API and protocol. If anything, I think users would be reasonable to expect that this is the behavior we already have, based on the terms we use, and we can help resolve this by encouraging it in defining what 'restart' should mean.

I'm not sure what problem isn't solved by this approach, but I could be missing something.
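The two sequences described above can be sketched as the REST calls a frontend would issue against the existing kernels API. The function names and the `(method, path, body)` tuple shape are illustrative only. Note that the stop-and-start path ends with a new kernel, and therefore a new kernel id assigned by the server.

```python
from typing import Optional

# (HTTP method, path, JSON body or None)
Request = tuple[str, str, Optional[dict]]


def restart_requests(kernel_id: str) -> list[Request]:
    """Option 2: restart, which keeps the kernel id."""
    return [("POST", f"/api/kernels/{kernel_id}/restart", None)]


def stop_and_start_requests(kernel_id: str, kernel_name: str) -> list[Request]:
    """Option 1: stop the kernel, then ask the server for a new one."""
    return [
        ("DELETE", f"/api/kernels/{kernel_id}", None),
        ("POST", "/api/kernels", {"name": kernel_name}),
    ]


for request in restart_requests("abc123") + stop_and_start_requests("abc123", "python3"):
    print(request)
```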

I don't like the idea of conflating restart with restart in place

To me, current implementations are conflating 'restart' with 'give me a new kernel', when restart is explicitly a distinct action from 'stop this kernel and give me a new one'. This is a lack of clear separation now mainly because local kernel processes don't have a particularly meaningful distinction between the two aside from reusing some ports. I think it is a reasonable expectation that restart for a kernel in a container would preserve the container, as everything is defined today.

We don't have a message for 3., but it is available in a language-specific way (e.g. IPython's %reset magic). IPython Parallel extends the Jupyter protocol with a clear_request (which maps to IPython's %reset), which is 3. exactly. If folks think that's sufficiently useful that it should be implemented as a standard message instead of something kernels implement via their own APIs, that seems like a sensible proposal, and orthogonal to this discussion of restart-in-place.

kevin-bates commented 1 year ago

Hi @davidbrochart.

@kevin-bates If this functionality goes through kernel provisioners, then I guess this invalidates a JEP since provisioners are not a Jupyter specification, right?

This shouldn't require a JEP due to its isolation and not involving multiple projects and user behaviors. However, does not being part of a "Jupyter specification" always preclude the need for a JEP? That doesn't sound right.

For instance, jupyverse doesn't use jupyter-client and has no notion of kernel provisioner.

And that's unfortunate it doesn't use provisioners. IMO, there should be some form of abstraction around the kernel process management to accommodate innovation.

@minrk - thank you for your responses. I think we're essentially saying the same thing, although your response is much more elegant. When building remote support into EG via process-proxies, I used the existing "local" case as the concept to follow and, thus the heavier weight shutdown/startup restart behavior. With Gateway Provisioners, the only difference between today's restart and shutdown/startup is that the kernel-id is preserved. However, the benefits of having an in-place restart far outweigh (IMO) preservation of the kernel-id (wrt those that wish to use today's approach).

Given there's existing behavior in place, I think it makes sense to let each provisioner implementation determine how restarts should behave and (optionally) introduce a config option to let users choose the behavior they want. Because this would be an individual provisioner's decision, and given this really doesn't apply to kernels provisioned via the LocalProvisioner, there is no need to add any attribute to the KernelProvisionerBase class.
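A toy model of the provisioner-level option, under the assumption that an operator sets a single boolean on the provisioner. `ToyRemoteProvisioner` and its `restart_in_place` field are hypothetical and do not correspond to an existing class or setting in gateway_provisioners. The counters only track how many times each process has been started.

```python
from dataclasses import dataclass


@dataclass
class ToyRemoteProvisioner:
    """Toy provisioner; each counter is bumped when that process is replaced."""

    restart_in_place: bool = False  # hypothetical operator-set option
    tunnel_generation: int = 1
    kernel_generation: int = 1

    def restart(self) -> None:
        if not self.restart_in_place:
            # Heavier path: the tunnel (and its scheduled task) is replaced too.
            self.tunnel_generation += 1
        # The kernel process is always replaced on restart.
        self.kernel_generation += 1


default = ToyRemoteProvisioner()
in_place = ToyRemoteProvisioner(restart_in_place=True)
default.restart()
in_place.restart()
print(default)
print(in_place)
```

With this shape, users see a single restart action and the operator's configuration decides which behavior it maps to.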

timg512372 commented 1 year ago

Hi all, really appreciate the comments! I will be at Thursday's Jupyter Server meeting to discuss the proposal more. It seems that the question is how we want to structure the different types of restarts to the user. It is completely possible to simplify the implementation by leaving it to each individual kernel provisioner, and currently kernel provisioners would have to implement their own restart in place implementations anyways. However, if we want to give users the option to fully restart and to restart in place, it probably makes more sense to add a new spec.

minrk commented 1 year ago

Yes, I think it's fairly clear that folks want a lighter restart, the question remaining is mainly how/where to communicate that via APIs and/or configuration, and I agree with @kevin-bates that it could mainly be at the level of kernel provisioners other than the base class.

However, if we want to give users the option to fully restart and to restart in place, it probably makes more sense to add a new spec.

Yes, if stop & start new is determined to be insufficient. There might be a satisfactory UI-level solution.

It is perhaps worth investigating exactly how disruptive stop & start at the frontend level might be to various use cases (main effect: changed kernel id). For a single view on a single notebook, it really ought to have no impact. But it would probably mean detaching secondary frontends to the same kernel (what about a JupyterLab Console on the same kernel, would it detach? I think so. Should it? I'm not sure! If not, how hard is it for Lab to connect to the new kernel?). Then we'll have a clearer understanding of the trade-offs between:

- UI complexity of stop & start for "hard restart"
- disruption of stop & start
- implementation/compatibility/spec cost of two restart methods

krassowski commented 1 year ago

This pre-proposal was discussed during today's jupyter server meeting https://github.com/jupyter-server/team-compass/issues/45#issuecomment-1682582186, here are the notes for visibility:

currently restart is "just preservation of kernel ID"
- relevant when multiple consoles/notebooks are attached to the same kernel
- but there is more complexity in UX actions associated (debugger, restart & re-run)
relevance to local kernels
- proposed in-place restart is the same as current local restart
  - in the local case there is no distinction
  - in remote case there should be a choice
- but two buttons not relevant here
why would users prefer to have a choice?
- ability to switch to a different host
should it be just a configuration option at provisioner level?
- limiting UX options (pros & cons)
discussion: could it be a server extension implementing hard restart (with an API endpoint) for remote scenario, while the default implementation would be specified to be in-place restart?
- that means "fixing" EG/provisioners (possibly with an option)
- complexity/maintenance concerns
- extension implementing hard restart could come later if there is a demand from users
- this could still benefit from a JEP documenting the new default/tightening specification

If a relevant part of the discussion was not captured, please add in a comment.

Zsailer commented 1 year ago

Thanks @krassowski. Just to follow-up from today's server meeting discussion.

Today, by default for local kernels, "restart" means shut down the kernel (sub)process, start a new kernel (sub)process and keep the same kernel ID. This is, effectively, "restart-in-place".

In remote kernel contexts, e.g. enterprise gateway and gateway provisioners, "restart" is doing something slightly different: they shut down the kernel (sub)process and the kernel's surrounding environment (e.g. container), start a new kernel environment and subprocess, and keep the same kernel ID. In the meeting, we were calling this "hard restart".

Near the end of the discussion, it was proposed that we should just define what "restart" means unambiguously. Specifically, it should stay consistent with the local case today and "shut down the kernel (sub)process, start a new kernel (sub)process and keep the same kernel ID", i.e. "restart-in-place". This would mean that EG and gateway provisioners would likely want to change their behavior someday to be consistent with this definition of restart.

As an extension of the kernels API for remote kernels, we could add a "hard restart" plugin: shut down the kernel (sub)process and the kernel's surrounding environment (e.g. container), start a new kernel environment and subprocess, and keep the same kernel ID. This would be a server extension with a new endpoint, say POST api/kernels/<kernel-id>/hard-restart, and a jupyterlab plugin with an additional button in the UI to enable a "hard restart". This extension wouldn't make sense in a local kernel case; hence, we make it an extension that remote kernel contexts can enable.
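To make the two definitions concrete, here is a toy model of what each operation preserves. `ToyRemoteKernel`, `restart`, and `hard_restart` are illustrative names, and the integer counters stand in for real environment and process identities. Both operations keep the kernel ID; only the hard restart replaces the surrounding environment.

```python
import itertools
from dataclasses import dataclass

_fresh = itertools.count(1)  # source of new environment/process identities


@dataclass(frozen=True)
class ToyRemoteKernel:
    kernel_id: str
    environment: int  # e.g. the container or scheduled task
    process: int      # the kernel (sub)process


def restart(kernel: ToyRemoteKernel) -> ToyRemoteKernel:
    """Restart as defined above: new process, same environment, same ID."""
    return ToyRemoteKernel(kernel.kernel_id, kernel.environment, next(_fresh))


def hard_restart(kernel: ToyRemoteKernel) -> ToyRemoteKernel:
    """Hard restart: new environment and new process, same ID."""
    return ToyRemoteKernel(kernel.kernel_id, next(_fresh), next(_fresh))


original = ToyRemoteKernel("k1", environment=0, process=0)
print(restart(original))
print(hard_restart(original))
```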

In essence, this proposal no longer proposes a new "restart-in-place" API; rather, it codifies the current "restart" to always mean "restart-in-place" and proposes an extension to handle the "hard restart" option for remote kernel scenarios.
