jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/
Other
617 stars 222 forks source link

Address delayed shutdown behavior in k8s kernels #1146

Closed kevin-bates closed 1 year ago

kevin-bates commented 1 year ago

This pull request attempts to address the delay seen when shutting down k8s-managed kernels. I believe the issue is because the pod (and namespace if not BYO namespace) is not terminated when the kernel process has stopped. Previously, the pod/namespace is not deleted until the cleanup phase, but these changes move the container resource termination to when the listener is shut down. This allows the polling to check that the "kernel" has terminated to leverage the container status, which, previously, because it had yet to occur, was causing the fallback termination behavior to occur.

I suspect this was not a problem in older versions of kubernetes because (just a hunch) when the container's process terminated the container terminated. However, I suspect k8s has "grown" enough where interaction with the resource manager is necessary and has introduced some delay.

There are several areas of race condition with starting/stopping pods from other applications, so I think this improves on those, but may not necessarily be the final solution.

kevin-bates commented 1 year ago

Good point @kiersten-stokes. I'm also going to remove the call to get_container_status() within the namespace deletion block because it really reflects the state of the pod and not the namespace and was unconditionally invoked previously. As a result, we should hold off on this merge until I test this and push the commit.