jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Question about High Availability for JEG on k8s #1156


chiawchen commented 2 years ago

Description

Whenever K8s tries to terminate a pod, the application receives a SIGTERM signal [reference] and should ideally shut down gracefully; however, I found this line in JEG,

https://github.com/jupyter-server/enterprise_gateway/blob/7a9a6469a1f0153ae6f425c19526aeef11fae9e3/enterprise_gateway/enterprisegatewayapp.py#L343

it triggers a shutdown of all existing kernels, so the existing kernel information is eliminated even if external webhook kernel session persistence is enabled [reference on JEG doc]. Did I miss anything about how restarts on the server side are handled? These can happen quite frequently, depending on things like upgrading a sidecar, changing some JEG configuration, or even simply updating the hardcoded kernelspec.
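
Roughly, the behavior I'm describing looks like the sketch below (illustrative only, not the exact code at the linked line): when the application stops, it cascades a shutdown to every kernel it manages, including remote ones, so the persisted session records no longer point at anything reconnectable.

```python
# Illustrative sketch only -- not the exact code at the linked line.
# When the gateway application stops, every managed kernel is shut down,
# including remote kernel pods, so persisted session info becomes useless.
async def shutdown_all_kernels(kernel_manager):
    for kernel_id in list(kernel_manager.list_kernel_ids()):
        # Remote kernel pods are terminated here too; the externally persisted
        # session can no longer be used to reconnect afterwards.
        await kernel_manager.shutdown_kernel(kernel_id, now=True)
```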

Reproduce

  1. Deploy JEG as a k8s service with replication availability & Webhook Kernel Session Persistence
  2. Connect to it through JupyterLab and create an arbitrary remote kernel
  3. Delete one of the JEG replicas via kubectl delete pod <pod_name>
  4. Observe that the remote kernel is deleted instead of being preserved for later re-connection

Expected behavior

JEG shouldn't shut down the remote kernels, only the local kernels running on the JEG pod (since those local processes can't be recovered anyway).


kevin-bates commented 2 years ago

Hi @chiawchen - yeah, the HA/DR machinery has not been fully resolved. It is primarily intended for hard failures, behaving more like SIGKILL than SIGTERM, where remote kernels are orphaned.

It makes sense to make the automatic kernel shutdown sensitive to failover configuration, although I wonder if it should be an explicit option (so that we don't always orphan remote kernels), at least for now. Perhaps something like terminate_kernels_on_shutdown that defaults to True and must be explicitly set to False. Operators in configurations that need to perform periodic upgrades would then want to set this. If we find the machinery to be solid, we could then tie this option to the HA modes.
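
For what it's worth, a minimal sketch of how such an option could look, using the traitlets-based configuration EG already relies on (the option name is the one proposed above and does not exist today; the mixin and method names here are purely illustrative):

```python
from traitlets import Bool
from traitlets.config import Configurable


class ShutdownOptionsMixin(Configurable):
    # Proposed option from this discussion -- not an existing EG setting.
    terminate_kernels_on_shutdown = Bool(
        True,
        config=True,
        help="Shut down all managed kernels when the gateway itself stops. "
        "Set to False in HA deployments using kernel session persistence so "
        "that remote kernels survive a rolling restart of the gateway pods.",
    )

    async def stop_kernels_if_configured(self, kernel_manager):
        # Only cascade the shutdown when the operator has not opted out;
        # otherwise leave remote kernels running for later re-hydration.
        if self.terminate_kernels_on_shutdown:
            await kernel_manager.shutdown_all(now=True)
```

Operators performing rolling upgrades would then opt out via the usual config mechanisms (once such an option exists), and, as noted, we could later tie the default to the HA mode.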

Also note that we now support terminationGracePeriodSeconds in the helm chart.

chiawchen commented 2 years ago

avoiding orphan remote kernels

Makes sense for the general use case. To prevent this, I think the operator side needs to have some auto-GC enabled as a final guard (e.g., delete all remote kernel pods after 1 week).

kevin-bates commented 2 years ago

avoiding orphan remote kernels

Makes sense for the general use case. To prevent this, I think the operator side needs to have some auto-GC enabled as a final guard (e.g., delete all remote kernel pods after 1 week).

Later last night I realized that, so long as there's another EG instance running at the time the first gets shut down (or even some time later), and that "other instance" shares the same kernel persistence store (which is assumed in HA configs), then the only kernel pods to be orphaned would be those with which a user never interacts following the stopped EG's shutdown. That is, even the stopped EG's kernel pods should become active again by virtue of the "hydration" that occurs when a user interacts with their kernel via interrupt, reconnect, etc.
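
To illustrate what I mean by "hydration" (a hypothetical sketch; the session-store and restore helper names below are not EG's actual API): when a request arrives for a kernel_id the current instance doesn't manage, it can consult the shared persistence store and re-attach to the still-running kernel pod instead of failing.

```python
# Hypothetical sketch of "hydration" -- the session_store/restore_kernel helpers
# are illustrative names, not Enterprise Gateway's actual API.
async def hydrate_if_unmanaged(kernel_id, kernel_manager, session_store):
    if kernel_id in kernel_manager.list_kernel_ids():
        return  # already managed by this instance
    session = session_store.load(kernel_id)  # shared (e.g. webhook) persistence store
    if session is None:
        raise KeyError(f"No persisted session for kernel {kernel_id}")
    # Rebuild in-memory state (connection info, pod name, etc.) from the persisted
    # session rather than launching a new kernel.
    await kernel_manager.restore_kernel(kernel_id, session)
```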

But, yes, we've talked about introducing some admin-related endpoints - one of which could interrogate the kernel persistence store, compare that with the set of managed kernels (somehow checking with each EG instance), and present a list of currently unmanaged kernels. On Kubernetes, this application could present some of the labels, envs, etc. that reside on the kernel pod to help operators better understand whether those kernels should be hydrated or terminated.
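
Something along these lines, perhaps (a rough sketch assuming a Tornado handler in the style EG already uses; the session_store helpers are assumed, and the aggregation across EG instances is hand-waved):

```python
import json

from tornado import web


class UnmanagedKernelsHandler(web.RequestHandler):
    """Hypothetical admin endpoint listing persisted kernels no EG instance manages."""

    def initialize(self, kernel_manager, session_store):
        self.kernel_manager = kernel_manager
        self.session_store = session_store  # shared persistence store (assumed API)

    def get(self):
        persisted = set(self.session_store.list_kernel_ids())  # assumed helper
        # A complete version would aggregate managed kernels across *all* EG
        # instances; here we only check the local instance for brevity.
        managed = set(self.kernel_manager.list_kernel_ids())
        unmanaged = sorted(persisted - managed)
        # On Kubernetes, pod labels/envs could be attached per kernel to help
        # operators decide whether to hydrate or terminate each one.
        self.finish(json.dumps({"unmanaged_kernels": unmanaged}))
```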

This leads me to wonder whether kernel provisioners (and perhaps the older, soon-to-be-obsolete process proxies) should expose a method allowing users to access their "metadata" given a kernel_id (or whatever else is necessary to locate the kernel).
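
For example, something like the following on the provisioner side (purely illustrative; KernelProvisionerBase is jupyter_client's real base class, but this classmethod does not exist today):

```python
from typing import Any, Dict, Optional

from jupyter_client.provisioning import KernelProvisionerBase


class MetadataAwareProvisioner(KernelProvisionerBase):
    """Illustrative only: a provisioner exposing per-kernel metadata lookup."""

    @classmethod
    async def metadata_for(cls, kernel_id: str) -> Optional[Dict[str, Any]]:
        # Hypothetical method: locate the kernel's runtime resources (e.g., its
        # pod on Kubernetes) and return labels, envs, and other descriptive info
        # so tools can decide whether the kernel should be hydrated or terminated.
        raise NotImplementedError
```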