Hi @jrj-d - thank you for your interest in EG.
You raise a good point about the ability to override `KERNEL_LAUNCH_TIMEOUT` values. I have a few points that might help with this discussion, followed by some questions of my own.
- `KERNEL_LAUNCH_TIMEOUT` is purposely not added to the `env` stanza of the kernelspec, since it is not applicable to the kernel. It's solely used by EG to know when to 'give up' waiting for a given kernel to be started.
- `EG_KERNEL_LAUNCH_TIMEOUT` exists for the cases when the client doesn't include a value and acts as a default. However, the Notebook 6/gateway package and nb2kg have been updated to always set the value (default 40s), so it would only come into play for pure REST clients (which is still applicable).
- The client-side connection-related timeouts should be at least as large as `KERNEL_LAUNCH_TIMEOUT`, in order to prevent issues where the connection times out due to slow kernel startup times, but EG continues to think the kernel (eventually) starts because it met the requirement of starting within the (larger) `KERNEL_LAUNCH_TIMEOUT`. So there's a strong relationship there. The default for those timeout values is 60s.

One approach that may be possible here is to use the larger of the two launch timeout values, `KERNEL_LAUNCH_TIMEOUT` and `EG_KERNEL_LAUNCH_TIMEOUT`. However, we would still want the pair of connection-related timeouts to be greater, and those are only client-side parameters.
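For concreteness, here is a rough sketch of the knobs discussed above. The host, the values, and the deployment/namespace names are examples; the `GatewayClient` traitlets come from Notebook 6's gateway support:

```bash
# Client side: raise the launch timeout and keep the connection-related
# timeouts at least as large (all values here are examples).
export KERNEL_LAUNCH_TIMEOUT=120
jupyter notebook \
  --gateway-url="http://<eg-host>:8888" \
  --GatewayClient.connect_timeout=120 \
  --GatewayClient.request_timeout=120

# EG side: EG_KERNEL_LAUNCH_TIMEOUT is only used when a client sends no value.
# Deployment and namespace names assume the Helm chart defaults.
kubectl set env deployment/enterprise-gateway -n enterprise-gateway \
  EG_KERNEL_LAUNCH_TIMEOUT=120
```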
If you are in a JupyterHub environment, then the issue is simple: you just update the respective client-side timeouts in the configuration that launches the Notebook server pod. But I suspect that's not the case here.
Have you experimented with increasing `KERNEL_LAUNCH_TIMEOUT` such that the kernel starts successfully? Just want to be sure you're not chasing an issue that no timeout can solve.
I'm curious about your environment, where kernel start can take longer than 40 seconds. Does this happen when there are lots of other kernels already running (i.e., a resource scheduling delay)? We find that most kernel startups in k8s take on the order of 5-10 seconds.
Thanks.
Thanks for the quick answer!
From your points on how the kernel launch timeout depends on the connect and request timeout values, I understand that I cannot set the kernel launch timeout to an arbitrary value on the EG side.
(I've set `KERNEL_LAUNCH_TIMEOUT` to larger values so that the kernel starts successfully.)
Here's more about my context:
I'm building a platform for running Spark jobs that is supported by EG.
The requirement is that any EG-compatible client with minimal knowledge about the platform can connect and run jobs.
I'm trying to separate concerns between notebook management (responsibility of the client) and kernel management (EG's responsibility).
In this regard, I thought it made sense for `KERNEL_LAUNCH_TIMEOUT` to be handled by EG or on a per-kernel basis, as the time kernels take to launch is an EG/cluster-related thing.
Now I understand better that I cannot play with this value independently of the client. So I'll try and make my kernels launch faster, or give instructions to clients.
The timeout only happens when there is not enough room on the k8s cluster. In this case:

- the auto-scaler has to spawn a new node, and
- the new node has to pull the kernel's Docker image.

The combination of those two steps can take more than 40s. I think I can improve on both fronts:

- pre-pulling the kernel images on every node with a kernel image puller (KIP) daemonset (a sketch of the idea is below), and
- tuning the auto-scaler so that capacity is added before it is actually needed.
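A minimal sketch of the pre-puller idea (image names and tags are examples; every kernel image used on the platform would get its own init container):

```bash
# Sketch of a pre-puller: an init container pulls the kernel image and exits,
# while a tiny pause container keeps the daemonset pod alive on every node.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-image-prepuller
spec:
  selector:
    matchLabels: {app: kernel-image-prepuller}
  template:
    metadata:
      labels: {app: kernel-image-prepuller}
    spec:
      initContainers:
      - name: pull-spark-kernel              # one init container per kernel image
        image: elyra/kernel-spark-py:2.0.0   # example image/tag
        command: ["sh", "-c", "exit 0"]      # pulling the image is the point
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
EOF
```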
Do you think those two leads make sense? Thanks!
This is great information - thank you! Yes - those leads make perfect sense.
Based on your comments it seems like the KIP daemonset may only be useful if the scaling is manual or somehow happens early enough - so the node can be added (and images pulled) before the kernel startup request.
I don't have any experience with auto-scaling, but if there are ways to configure the auto-scaler such that it can anticipate overflow soon enough, that would be ideal, and KIP may be sufficient. FWIW, Hub uses placeholder pods that are then pre-empted by actual pods; the pre-empted placeholder is restarted, eventually triggering overflow, where a daemonset can pre-pull the required images. Of course, the trick is to balance placeholders small enough against what the actual request requires, and when the actual request requires pre-emption of multiple placeholders, things can get tricky. Here's a great Hub issue where these kinds of things are discussed and might prove helpful: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1414
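To make the placeholder idea concrete, here is a hypothetical sketch (the priority value, replica count, and resource requests are examples; pod priority/pre-emption must be enabled, and `scheduling.k8s.io/v1beta1` is the API version for clusters older than Kubernetes 1.14):

```bash
# Low-priority placeholder pods reserve capacity; real kernel pods pre-empt
# them, and the rescheduled placeholders trigger the auto-scaler early.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: kernel-placeholder
value: -10            # below the default (0), so kernel pods win
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kernel-placeholder
spec:
  replicas: 2
  selector:
    matchLabels: {app: kernel-placeholder}
  template:
    metadata:
      labels: {app: kernel-placeholder}
    spec:
      priorityClassName: kernel-placeholder
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
        resources:
          requests: {cpu: "1", memory: 2Gi}   # roughly one kernel's footprint
EOF
```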
It would also be nice if the server (in this case, the notebook server on which EG is built) did not require the kernel to start within the initial request, but instead handed back a handle (kernel_id would work fine) that the client could then use to ask "is my kernel ready for requests?", decoupling the kernel launch from the connection timeouts.
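Purely as an illustration of that idea (this flow does not exist today - the kernel-start POST currently blocks until the kernel is running), the decoupled version might look like:

```bash
# Hypothetical flow, not an existing API: start returns a kernel_id right away,
# and the client polls for readiness instead of holding the connection open.
kernel_id=$(curl -s -X POST "http://<eg-host>:8888/api/kernels" \
  -H "Content-Type: application/json" \
  -d '{"name": "spark_python_kubernetes"}' | jq -r '.id')

# Poll until the kernel reports a ready execution state.
curl -s "http://<eg-host>:8888/api/kernels/${kernel_id}"
```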
It would be awesome if you could spend some time in this area - thank you!
I'll certainly have to solve this problem one way or another, so I should be able to propose something at some point! But I'm quite new to tinkering with Jupyter and Kubernetes, so it'll take some time :)
Thanks for the references, it looks like it is a rather general issue on Kubernetes!
Closing the issue for now, as my initial question about `KERNEL_LAUNCH_TIMEOUT` is answered. Thanks!
Description
Hi all, thanks for the work on this project. I'm using Jupyter Enterprise Gateway on Kubernetes, and sometimes the default `KERNEL_LAUNCH_TIMEOUT` is not large enough for some kernels. I'd like to change this default value for the whole enterprise gateway or on a per-kernel basis, so that any client that connects to the enterprise gateway benefits from a larger default value adapted to the kernel and context.

The default value is actually 40s and comes from the option `GatewayClient.KERNEL_LAUNCH_TIMEOUT` of the `jupyter notebook` client (here's the line).

As far as I understand, the order of precedence is:

1. the `KERNEL_LAUNCH_TIMEOUT` value sent by the client,
2. the `KERNEL_LAUNCH_TIMEOUT` value set in the kernelspec's `env` stanza,
3. the `EG_KERNEL_LAUNCH_TIMEOUT` default on the EG side.
So I think it is not possible to change the default value either on the EG side or in the kernelspecs, as there is always a value coming from the `jupyter notebook` client. More explicitly, the only way to change the kernel launch timeout value is to change it on the client side.

Am I right? Or is there a solution that I have overlooked? Should I ask that this default value in the client be removed? Although I think there must be a good reason why it was put there.
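To illustrate what that client-side default means for a raw REST request (a sketch; the host and kernel name are examples), the gateway-enabled notebook effectively sends:

```bash
# What the notebook's gateway client effectively sends on kernel start: the
# env stanza always carries KERNEL_LAUNCH_TIMEOUT (default 40).
curl -X POST "http://<eg-host>:8888/api/kernels" \
  -H "Content-Type: application/json" \
  -d '{"name": "spark_python_kubernetes", "env": {"KERNEL_LAUNCH_TIMEOUT": "40"}}'
# Only a pure REST client that omits env.KERNEL_LAUNCH_TIMEOUT falls back to
# EG_KERNEL_LAUNCH_TIMEOUT on the gateway side.
```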
Thanks!
Environment
- Enterprise Gateway: 2.0, deployed via Helm chart
- Kubernetes: v1.13.7-gke.24
- Notebook: 6.0.1