chiawchen opened this issue 2 years ago
Hi @chiawchen - yeah, the HA/DR machinery has not been fully resolved. It is primarily intended for hard failures, behaving more like `SIGKILL` than `SIGTERM`, where remote kernels are orphaned.
It makes sense to make the automatic kernel shutdown sensitive to failover configuration, although I wonder if it should be an explicit option (so that we don't always orphan remote kernels), at least for now. Perhaps something like `terminate_kernels_on_shutdown` that defaults to `True` and must be explicitly set to `False`. Operators in configurations that need to perform periodic upgrades would then want to set this. If we find the machinery to be solid, we could then tie this option to the HA modes.
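A minimal sketch of how that option might look, assuming EG's existing traitlets-based configuration; the trait name, the mixin class, and the shutdown hook below are illustrative, not the actual implementation:

```python
from traitlets import Bool
from traitlets.config import LoggingConfigurable


class TerminateKernelsMixin(LoggingConfigurable):
    # Hypothetical trait; defaults to True to preserve today's behavior.
    terminate_kernels_on_shutdown = Bool(
        True,
        config=True,
        help="""Shut down all managed kernels when the gateway terminates.
        Operators using kernel session persistence (HA configurations,
        rolling upgrades) can set this to False so that remote kernels
        survive and are later hydrated by another EG instance.""",
    )

    async def shutdown_all_kernels(self):
        if not self.terminate_kernels_on_shutdown:
            self.log.info("terminate_kernels_on_shutdown is False - leaving remote kernels running.")
            return
        # ...the existing per-kernel shutdown logic would run here...
```

Operators planning a rolling upgrade would then flip the option to `False` (via config file or command line) before cycling the EG pods.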
Also note that we now support `terminationGracePeriodSeconds` in the helm chart.
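For reference, a values excerpt of the kind an operator might use; the exact key path in the chart's `values.yaml` is an assumption here:

```yaml
# Hypothetical values.yaml excerpt - give EG time to react to SIGTERM
# before Kubernetes escalates to SIGKILL.
terminationGracePeriodSeconds: 60
```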
> avoiding orphan remote kernels

Makes sense for the general use case. To prevent this, I think the operator side needs some auto-GC enabled as the final guard (e.g., delete any remote kernel pod older than one week).
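A sketch of such a guard using the official `kubernetes` Python client; the namespace and label selector are assumptions based on EG's convention of labeling kernel pods, so adjust them to your deployment:

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

MAX_AGE = timedelta(weeks=1)  # GC threshold from the suggestion above


def gc_old_kernel_pods(namespace: str = "enterprise-gateway") -> None:
    """Delete kernel pods older than MAX_AGE as a last-resort guard."""
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(
        namespace,
        label_selector="app=enterprise-gateway,component=kernel",  # assumed labels
    )
    now = datetime.now(timezone.utc)
    for pod in pods.items:
        if pod.status.start_time and now - pod.status.start_time > MAX_AGE:
            v1.delete_namespaced_pod(pod.metadata.name, namespace)
```

In practice this would run as a Kubernetes CronJob, ideally cross-checking the kernel persistence store before deleting anything.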
Later last night I realized that, so long as there's another EG instance running at the time the first is shut down (or even some time later), and that other instance shares the same kernel persistence store (which is assumed in HA configs), the only kernel pods to be orphaned would be those with which a user never interacts following the stopped EG's shutdown. That is, any kernel pod a user does interact with should become active again by virtue of the "hydration" that occurs when the user interacts with their kernel via interrupt, reconnect, etc.
But, yes, we've talked about introducing some admin-related endpoints - one of which could interrogate the kernel persistence store, compare that with the set of managed kernels (somehow checking with each EG instance), and present a list of currently unmanaged kernels. On Kubernetes, this application could present some of the labels, envs, etc. that reside on the kernel pod to help operators better understand whether those kernels should be hydrated or terminated.
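A speculative sketch of what such an endpoint could look like as a `jupyter_server` API handler; the route, the `kernel_session_manager` settings key, and its `list_sessions()` method are all assumptions:

```python
import json

from jupyter_server.base.handlers import APIHandler
from tornado import web


class UnmanagedKernelsHandler(APIHandler):
    """Hypothetical GET /api/admin/unmanaged-kernels endpoint."""

    @web.authenticated
    async def get(self):
        # Kernel ids recorded in the shared persistence store (assumed accessor).
        persisted = set(self.settings["kernel_session_manager"].list_sessions())
        # Kernel ids this instance is actively managing.
        managed = set(self.kernel_manager.list_kernel_ids())
        self.finish(json.dumps({"unmanaged_kernels": sorted(persisted - managed)}))
```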
This leads me to wonder if kernel provisioners (and perhaps the older, soon-to-be-obsoleted process proxies) should expose a method allowing users to access their "metadata" given a `kernel_id` (or whatever else is necessary to locate the kernel).
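One possible shape for that method, building on `jupyter_client`'s provisioner base class; no such method exists today, so everything below is speculative:

```python
from typing import Any, Dict

from jupyter_client.provisioning import KernelProvisionerBase


class MetadataCapableProvisioner(KernelProvisionerBase):
    """Hypothetical extension exposing locate-and-describe metadata."""

    @classmethod
    async def load_metadata(cls, kernel_id: str) -> Dict[str, Any]:
        """Return details (labels, envs, node/pod name, etc.) sufficient
        for an operator to decide whether the kernel identified by
        kernel_id should be hydrated or terminated."""
        raise NotImplementedError
```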
Description
Whenever K8s tries to terminate a pod, the application receives a SIGTERM signal [reference] and should ideally shut down gracefully; however, I found this line in JEG:
https://github.com/jupyter-server/enterprise_gateway/blob/7a9a6469a1f0153ae6f425c19526aeef11fae9e3/enterprise_gateway/enterprisegatewayapp.py#L343
It triggers a shutdown of all existing kernels, so existing kernel information is eliminated even when external webhook kernel session persistence is configured [reference on JEG doc]. Did I miss anything about handling a restart on the server side? Restarts can happen quite frequently, e.g., when upgrading a sidecar, changing some JEG configuration, or even simply updating a hardcoded kernelspec.
Reproduce
`kubectl delete pod <pod_name>`
Expected behavior
JEG shouldn't shut down remote kernels - only local kernels running on the JEG pod itself (since a local process can't be recovered once the pod is gone).
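A rough sketch of the requested behavior; the `process_proxy` attribute check used to distinguish remote from local kernels is an assumption about EG's kernel managers, not its actual API:

```python
import signal

from jupyter_client.multikernelmanager import MultiKernelManager

mkm = MultiKernelManager()


def shutdown_local_kernels_only(kernel_manager: MultiKernelManager) -> None:
    """On SIGTERM, shut down only kernels whose process lives in this pod;
    remote kernels stay running and remain in the persistence store for
    another EG instance to hydrate later."""
    for kernel_id in list(kernel_manager.list_kernel_ids()):
        km = kernel_manager.get_kernel(kernel_id)
        # Assumed marker: remote kernels are managed through a process proxy
        # (or provisioner); local kernels are plain subprocesses.
        if getattr(km, "process_proxy", None) is None:
            kernel_manager.shutdown_kernel(kernel_id)


# Wire it up so `kubectl delete pod` (SIGTERM) triggers the selective shutdown.
signal.signal(signal.SIGTERM, lambda signum, frame: shutdown_local_kernels_only(mkm))
```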