jupyterhub / kubespawner

Kubernetes spawner for JupyterHub
https://jupyterhub-kubespawner.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Kill k8s pods earlier when specific errors appear #795

Open CarlosDominguezBecerril opened 9 months ago

CarlosDominguezBecerril commented 9 months ago

Proposed change

In our current setup, when we create a server, we can encounter any of the following errors: "probe failed", "ErrImagePull", "ImagePullBackOff".

Example: 2023-09-21T14:50:07Z [Warning] Readiness probe failed: Get "http://{my_ip}:8658/health/ready": dial tcp {my_ip}:8658: connect: connection refused

When this occurs, I have to wait for kubespawner to time out. In my case, the timeout is set to 20 minutes because occasionally I need to pull a large docker image.
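
For reference, the timeout being hit here is presumably the spawner's start_timeout (measured in seconds), e.g. in jupyterhub_config.py:

```python
# Assumed configuration matching the 20-minute timeout mentioned above
c.KubeSpawner.start_timeout = 20 * 60  # seconds
```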

It would be great to have a feature that automatically deletes the pods when certain Kubernetes errors (error messages provided by the users) are detected to avoid unnecessary waiting.
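
For illustration, if this were exposed as configuration, usage might look something like the following (kill_messages is the name used in the patch below; it is not an existing kubespawner option):

```python
# Hypothetical option, not currently part of kubespawner: substrings of
# Kubernetes event messages that should abort the spawn early.
c.KubeSpawner.kill_messages = [
    "probe failed",
    "ErrImagePull",
    "ImagePullBackOff",
]
```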

Alternative options

Who would use this feature?

Anyone who wants to kill their pods earlier when certain error messages appear.

(Optional): Suggest a solution

Below is my simple patch for version 4.3.0 (I can't update to a newer version due to our Kubernetes version). The code explains it best, but the idea is to have a variable like self.kill_messages = ["probe failed", "ErrImagePull", "ImagePullBackOff"] holding the error messages that should kill the pod when found. If any of these messages appear, exponential_backoff raises an error (the code can definitely be improved).

--- kubespawner/spawner.py
+++ kubespawner/spawner.py
@@ -44,6 +44,9 @@ from .objects import (
 )
 from .reflector import ResourceReflector

+import random
+from jupyterhub.utils import maybe_future
+from tornado import ioloop

 class PodReflector(ResourceReflector):
     """
@@ -199,6 +202,20 @@ class KubeSpawner(Spawner):
         if self.events_enabled:
             self._start_watching_events()

+        self.need_to_kill = False
+        # Error messages in this list cause the pod to be killed early when seen in events
+        self.kill_messages = ["probe failed", "ErrImagePull", "ImagePullBackOff"]
+
+    # Logging wrapper that scans messages for kill_messages so the pod can be terminated early
+    def log_kill(self, msg, *args, **kwargs):
+        # Send original log
+        self.log.debug(msg, *args, **kwargs)
+        # Check whether the message requires killing the pod
+        for kill_message in self.kill_messages:
+            if kill_message.lower() in msg.lower():
+                self.need_to_kill = True
+                self.log.debug(f"Kill pod message found: {kill_message}")
+
     def _await_pod_reflector(method):
         """Decorator to wait for pod reflector to load

@@ -2358,6 +2375,9 @@ class KubeSpawner(Spawner):
                     # 30 50 63 72 78 82 84 86 87 88 88 89
                     progress += (90 - progress) / 3

+                    # Run the event messages shown on the spawn progress page through the kill check
+                    self.log_kill(f"{event['message']}")
+
                     yield {
                         'progress': int(progress),
                         'raw_event': event,
@@ -2638,6 +2658,51 @@ class KubeSpawner(Spawner):
         else:
             return True

+    # Same as exponential_backoff, but with an added check of self.need_to_kill that raises an error.
+    # Duplicated here to avoid conflicts with the original function.
+    async def exponential_backoff_kill(
+        self,
+        pass_func,
+        fail_message,
+        start_wait=0.2,
+        scale_factor=2,
+        max_wait=5,
+        timeout=10,
+        timeout_tolerance=0.1,
+        *args,
+        **kwargs,
+    ):
+        loop = ioloop.IOLoop.current()
+        deadline = loop.time() + timeout
+        # add jitter to the deadline itself to prevent re-align of a bunch of
+        # timing out calls once the deadline is reached.
+        if timeout_tolerance:
+            tol = timeout_tolerance * timeout
+            deadline = random.uniform(deadline - tol, deadline + tol)
+        scale = 1
+        while True:
+            # Extra check: abort immediately if a kill message was seen
+            if self.need_to_kill:
+                raise Exception(
+                    "Oops, we found an error while creating your server. Please create a new one.")
+            ret = await maybe_future(pass_func(*args, **kwargs))
+            # Truthy!
+            if ret:
+                return ret
+            remaining = deadline - loop.time()
+            if remaining < 0:
+                # timeout exceeded
+                break
+            # add some random jitter to improve performance
+            # this prevents overloading any single tornado loop iteration with
+            # too many things
+            limit = min(max_wait, start_wait * scale)
+            if limit < max_wait:
+                scale *= scale_factor
+            dt = min(remaining, random.uniform(0, limit))
+            await asyncio.sleep(dt)
+        raise asyncio.TimeoutError(fail_message)
+
     async def _start(self):
         """Start the user's pod"""

@@ -2662,7 +2727,7 @@ class KubeSpawner(Spawner):
             pvc = self.get_pvc_manifest()

             # If there's a timeout, just let it propagate
-            await exponential_backoff(
+            await self.exponential_backoff_kill(
                 partial(
                     self._make_create_pvc_request, pvc, self.k8s_api_request_timeout
                 ),
@@ -2681,7 +2746,7 @@ class KubeSpawner(Spawner):

         ref_key = f"{self.namespace}/{self.pod_name}"
         # If there's a timeout, just let it propagate
-        await exponential_backoff(
+        await self.exponential_backoff_kill(
             partial(self._make_create_pod_request, pod, self.k8s_api_request_timeout),
             f'Could not create pod {ref_key}',
             timeout=self.k8s_api_request_retry_timeout,
@@ -2691,7 +2756,7 @@ class KubeSpawner(Spawner):
             try:
                 # wait for pod to have uid,
                 # required for creating owner reference
-                await exponential_backoff(
+                await self.exponential_backoff_kill(
                     lambda: self.pod_has_uid(
                         self.pod_reflector.pods.get(ref_key, None)
                     ),
@@ -2706,7 +2771,7 @@ class KubeSpawner(Spawner):
                 if self.internal_ssl:
                     # internal ssl, create secret object
                     secret_manifest = self.get_secret_manifest(owner_reference)
-                    await exponential_backoff(
+                    await self.exponential_backoff_kill(
                         partial(
                             self._ensure_not_exists,
                             "secret",
@@ -2714,7 +2779,7 @@ class KubeSpawner(Spawner):
                         ),
                         f"Failed to delete secret {secret_manifest.metadata.name}",
                     )
-                    await exponential_backoff(
+                    await self.exponential_backoff_kill(
                         partial(
                             self._make_create_resource_request,
                             "secret",
@@ -2725,7 +2790,7 @@ class KubeSpawner(Spawner):

                 if self.internal_ssl or self.services_enabled:
                     service_manifest = self.get_service_manifest(owner_reference)
-                    await exponential_backoff(
+                    await self.exponential_backoff_kill(
                         partial(
                             self._ensure_not_exists,
                             "service",
@@ -2733,7 +2798,7 @@ class KubeSpawner(Spawner):
                         ),
                         f"Failed to delete service {service_manifest.metadata.name}",
                     )
-                    await exponential_backoff(
+                    await self.exponential_backoff_kill(
                         partial(
                             self._make_create_resource_request,
                             "service",
@@ -2757,7 +2822,7 @@ class KubeSpawner(Spawner):
         # because the handler will have stopped waiting after
         # start_timeout, starting from a slightly earlier point.
         try:
-            await exponential_backoff(
+            await self.exponential_backoff_kill(
                 lambda: self.is_pod_running(self.pod_reflector.pods.get(ref_key, None)),
                 f'pod {ref_key} did not start in {self.start_timeout} seconds!',
                 timeout=self.start_timeout,
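
For comparison, here is a minimal sketch of the same idea that wraps the check function instead of copying exponential_backoff. It reuses the need_to_kill flag from the patch above; SpawnAbortedError and abort_aware are hypothetical names, not part of kubespawner:

```python
from jupyterhub.utils import exponential_backoff, maybe_future


class SpawnAbortedError(Exception):
    """Raised when a fatal message was seen while starting the server."""


def abort_aware(spawner, check):
    """Wrap a pass_func so it raises instead of retrying once need_to_kill is set."""

    async def _check():
        if spawner.need_to_kill:
            raise SpawnAbortedError(
                "We found an error while creating your server; please start a new one."
            )
        # maybe_future lets check() be either a plain function or a coroutine
        return await maybe_future(check())

    return _check


# Usage inside _start(), keeping the original exponential_backoff helper:
# await exponential_backoff(
#     abort_aware(self, lambda: self.is_pod_running(self.pod_reflector.pods.get(ref_key))),
#     f'pod {ref_key} did not start in {self.start_timeout} seconds!',
#     timeout=self.start_timeout,
# )
```
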
welcome[bot] commented 9 months ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

manics commented 9 months ago

Kubernetes is designed to work asynchronously so I don't think we should kill the pods straight away. For example, if some other resource takes time to update, or if there's a temporary glitch with the container registry, retrying is the correct behaviour.

CarlosDominguezBecerril commented 9 months ago

I agree that ideally we shouldn't kill the pods straight away. However, in rare cases where retrying is not effective regardless of the underlying issue, I believe we could kill the pod directly in order to save time and resources. I'm probably noticing this more because my timeout is 20 minutes.

aychang95 commented 8 months ago

I'll +1 this because even with a 5-minute timeout, it gets a little cumbersome if you accidentally start a server with an incorrect configuration, e.g. a misspelled image name, a wrong config value, etc.

Maybe another solution is to introduce an alternative "max-retries" instead of "timeout"?
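
Something like this rough sketch, for example (the helper name, signature, and defaults are made up for illustration and are not an existing jupyterhub or kubespawner API):

```python
import asyncio
import random


async def backoff_with_max_retries(
    pass_func,
    fail_message,
    *,
    max_retries=10,
    start_wait=0.2,
    scale_factor=2,
    max_wait=5,
):
    """Retry pass_func up to max_retries times with jittered exponential backoff."""
    wait = start_wait
    for _ in range(max_retries):
        ret = await pass_func()
        if ret:
            return ret
        # random jitter so many spawns don't retry in lockstep
        await asyncio.sleep(random.uniform(0, wait))
        wait = min(max_wait, wait * scale_factor)
    raise asyncio.TimeoutError(fail_message)
```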