Hi @jrj-d - thank you for your interest in EG.
You raise a good point about the ability to override `KERNEL_LAUNCH_TIMEOUT` values. I have a few points that might help with this discussion, followed by some questions of my own.
- `KERNEL_LAUNCH_TIMEOUT` is purposely not added to the `env` stanza of the kernelspec, since it is not applicable to the kernel. It's solely used by EG to know when to 'give up' waiting for a given kernel to be started.
- `EG_KERNEL_LAUNCH_TIMEOUT` exists for the cases when the client doesn't include a value and acts as a default. However, the Notebook 6/gateway package and nb2kg have been updated to always set the value (default 40s), so it would only come into play for pure REST clients (which is still applicable).
- The client-side connection-related timeouts should be at least as large as `KERNEL_LAUNCH_TIMEOUT`, in order to prevent issues where the connection times out due to slow kernel startup times, but EG continues to think the kernel (eventually) starts because it met the requirement of starting within the (larger) `KERNEL_LAUNCH_TIMEOUT`. So there's a strong relationship there. The default for those timeout values is 60s.

One approach that may be possible here is to use the larger of the two launch timeout values, `KERNEL_LAUNCH_TIMEOUT` and `EG_KERNEL_LAUNCH_TIMEOUT`. However, we would still want the pair of connection-related timeouts to be greater, and those are only client-side parameters.
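For concreteness, here is a rough sketch of the knobs discussed above. The host, the values, and the deployment/namespace names are examples; the `GatewayClient` traitlets come from Notebook 6's gateway support:

```bash
# Client side: raise the launch timeout and keep the connection-related
# timeouts at least as large (all values here are examples).
export KERNEL_LAUNCH_TIMEOUT=120
jupyter notebook \
  --gateway-url="http://<eg-host>:8888" \
  --GatewayClient.connect_timeout=120 \
  --GatewayClient.request_timeout=120

# EG side: EG_KERNEL_LAUNCH_TIMEOUT is only used when a client sends no value.
# Deployment and namespace names assume the Helm chart defaults.
kubectl set env deployment/enterprise-gateway -n enterprise-gateway \
  EG_KERNEL_LAUNCH_TIMEOUT=120
```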
If you are in a JupyterHub environment, then the issue is simple: you just update the respective client-side timeouts in the configuration that launches the Notebook server pod. But I suspect that's not the case here.
Have you experimented with increasing `KERNEL_LAUNCH_TIMEOUT` such that the kernel starts successfully? Just want to be sure you're not chasing an issue that no timeout can solve.
I'm curious about your environment, where kernel start can take longer than 40 seconds. Does this happen when there are lots of other kernels already running (i.e., a resource scheduling delay)? We find that most kernel startups in k8s take on the order of 5-10 seconds.
Thanks.
Thanks for the quick answer!
From your points on how the kernel launch timeout depends on the connect and request timeout values, I understand that I cannot set the kernel launch timeout to an arbitrary value on the EG side.
(I've set `KERNEL_LAUNCH_TIMEOUT` to larger values so that the kernel starts successfully.)
Here's more about my context:
I'm building a platform for running Spark jobs that is supported by EG.
The requirement is that any EG-compatible client with minimal knowledge about the platform can connect and run jobs.
I'm trying to separate concerns between notebook management (responsibility of the client) and kernel management (EG's responsibility).
In this regard, I thought it made sense for `KERNEL_LAUNCH_TIMEOUT` to be handled by EG or on a per-kernel basis, as the time kernels take to launch is an EG/cluster-related thing.
Now I understand better that I cannot play with this value independently of the client. So I'll try and make my kernels launch faster, or give instructions to clients.
The timeout only happens when there is not enough room on the k8s cluster. In this case:

- the auto-scaler has to spawn a new node, and
- the new node has to pull the kernel's Docker image.

The combination of those two steps can take more than 40s. I think I can improve on both fronts:

- pre-pulling the kernel images on every node with a kernel image puller (KIP) daemonset (a sketch of the idea is below), and
- tuning the auto-scaler so that capacity is added before it is actually needed.
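A minimal sketch of the pre-puller idea (image names and tags are examples; every kernel image used on the platform would get its own init container):

```bash
# Sketch of a pre-puller: an init container pulls the kernel image and exits,
# while a tiny pause container keeps the daemonset pod alive on every node.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kernel-image-prepuller
spec:
  selector:
    matchLabels: {app: kernel-image-prepuller}
  template:
    metadata:
      labels: {app: kernel-image-prepuller}
    spec:
      initContainers:
      - name: pull-spark-kernel              # one init container per kernel image
        image: elyra/kernel-spark-py:2.0.0   # example image/tag
        command: ["sh", "-c", "exit 0"]      # pulling the image is the point
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
EOF
```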
Do you think those two leads make sense? Thanks!
This is great information - thank you! Yes - those leads make perfect sense.
Based on your comments it seems like the KIP daemonset may only be useful if the scaling is manual or somehow happens early enough - so the node can be added (and images pulled) before the kernel startup request.
I don't have any experience with auto-scaling, but if there are ways to configure the auto-scaler such that it can anticipate overflow soon enough, that would be ideal, and KIP may be sufficient. FWIW, Hub uses placeholder pods that are then pre-empted by actual pods; the pre-empted placeholder is restarted, eventually triggering overflow, where a daemonset can pre-pull the required images. Of course, the trick is to balance placeholders small enough against what the actual request requires, and when the actual request requires pre-emption of multiple placeholders, things can get tricky. Here's a great Hub issue where these kinds of things are discussed and might prove helpful: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1414
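To make the placeholder idea concrete, here is a hypothetical sketch (the priority value, replica count, and resource requests are examples; pod priority/pre-emption must be enabled, and `scheduling.k8s.io/v1beta1` is the API version for clusters older than Kubernetes 1.14):

```bash
# Low-priority placeholder pods reserve capacity; real kernel pods pre-empt
# them, and the rescheduled placeholders trigger the auto-scaler early.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: kernel-placeholder
value: -10            # below the default (0), so kernel pods win
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kernel-placeholder
spec:
  replicas: 2
  selector:
    matchLabels: {app: kernel-placeholder}
  template:
    metadata:
      labels: {app: kernel-placeholder}
    spec:
      priorityClassName: kernel-placeholder
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
        resources:
          requests: {cpu: "1", memory: 2Gi}   # roughly one kernel's footprint
EOF
```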
It would also be nice if the server (in this case, the notebook server on which EG is built) did not require the kernel to start within the initial request, but instead handed back a handle (kernel_id would work fine) that the client could then use to ask "is my kernel ready for requests?", decoupling the kernel launch from the connection timeouts.
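Purely as an illustration of that idea (this flow does not exist today - the kernel-start POST currently blocks until the kernel is running), the decoupled version might look like:

```bash
# Hypothetical flow, not an existing API: start returns a kernel_id right away,
# and the client polls for readiness instead of holding the connection open.
kernel_id=$(curl -s -X POST "http://<eg-host>:8888/api/kernels" \
  -H "Content-Type: application/json" \
  -d '{"name": "spark_python_kubernetes"}' | jq -r '.id')

# Poll until the kernel reports a ready execution state.
curl -s "http://<eg-host>:8888/api/kernels/${kernel_id}"
```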
It would be awesome if you could spend some time in this area - thank you!
I'll certainly have to solve this problem one way or another, so I should be able to propose something at some point! But I'm quite new to tinkering with Jupyter and Kubernetes, so it'll take some time :)
Thanks for the references, it looks like it is a rather general issue on Kubernetes!
Closing the issue for now, as my initial question about `KERNEL_LAUNCH_TIMEOUT` is answered. Thanks!
Description
Hi all, thanks for the work on this project. I'm using Jupyter Enterprise Gateway on Kubernetes, and sometimes the default `KERNEL_LAUNCH_TIMEOUT` is not large enough for some kernels. I'd like to change this default value for the whole enterprise gateway or on a per-kernel basis, so that any client that connects to the enterprise gateway benefits from a larger default value adapted to the kernel and context.

The default value is actually 40s and comes from the option `GatewayClient.KERNEL_LAUNCH_TIMEOUT` of the `jupyter notebook` client (here's the line).

As far as I understand, the order of precedence is:

1. the `KERNEL_LAUNCH_TIMEOUT` value sent by the client,
2. the `KERNEL_LAUNCH_TIMEOUT` value set in the kernelspec's `env` stanza,
3. the `EG_KERNEL_LAUNCH_TIMEOUT` default on the EG side.
So I think it is not possible to change the default value either on the EG side or in the kernelspecs, as there is always a value coming from the `jupyter notebook` client. More explicitly, the only way to change the kernel launch timeout value is to change it on the client side.

Am I right? Or is there a solution that I have overlooked? Should I ask that this default value in the client be removed? Although I think there must be a good reason why it was put there.
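To illustrate what that client-side default means for a raw REST request (a sketch; the host and kernel name are examples), the gateway-enabled notebook effectively sends:

```bash
# What the notebook's gateway client effectively sends on kernel start: the
# env stanza always carries KERNEL_LAUNCH_TIMEOUT (default 40).
curl -X POST "http://<eg-host>:8888/api/kernels" \
  -H "Content-Type: application/json" \
  -d '{"name": "spark_python_kubernetes", "env": {"KERNEL_LAUNCH_TIMEOUT": "40"}}'
# Only a pure REST client that omits env.KERNEL_LAUNCH_TIMEOUT falls back to
# EG_KERNEL_LAUNCH_TIMEOUT on the gateway side.
```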
Thanks!
Environment
- Enterprise Gateway: 2.0, deployed via Helm chart
- Kubernetes: v1.13.7-gke.24
- Notebook: 6.0.1