high k8s etcd/apiserver load due to pod listings

juliantaylor commented 1 year ago

Bug description

In clusters with many pods the kubespawner creates significant load on the etcd/apiserver due to its frequent pod listings when using enable_user_namespaces which does listings of all pods in the whole cluster. full pod listings even with a label selector are very expensive as the pods in the etcd are only indexed by the namespaces, the label index is only performed in the apiserver caches. When not setting resource_version (equivalent to setting it to "") the apiserver will always retrieve all data from etcd instead of its caches and thus the listing becomes very expensive [0]. Additionally your watches have a default timeout of 10s causing it to relist the pods frequently. In a cluster with 10k pods the listing can take several seconds in the etcd.

To improve the situation I would suggest to set resource_version=0 in the initial list of the _list_and_update and on watch timeouts pass it the last received resource version to it on subsequent calls. This has the effect that the request can be fulfilled by the apiserver caches without hitting the etcd if possible (which makes the label selector effective). This also means you do not have the strong consistency guarantee of the etcd on the listing.

If this is inappropriate I would suggest to put an additional warning into the enable_user_namespaces documentation to note that the watch timeout should be increased from the default 10s in large clusters.

[0] https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions

welcome[bot] commented 1 year ago

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively. welcome You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

juliantaylor commented 1 year ago

something along the line of this should be more efficient (not yet tested):


--- a/kubespawner/reflector.py
+++ b/kubespawner/reflector.py
@@ -208,7 +208,7 @@ class ResourceReflector(LoggingConfigurable):

         self.watch_task = None

-    async def _list_and_update(self):
+    async def _list_and_update(self, resource_version=None):
         """
         Update current list of resources by doing a full fetch.

@@ -218,9 +218,12 @@ class ResourceReflector(LoggingConfigurable):
         kwargs = dict(
             label_selector=self.label_selector,
             field_selector=self.field_selector,
+            resource_version=resource_version,
             _request_timeout=self.request_timeout,
             _preload_content=False,
         )
+        if resource_version:
+            kwargs["resource_version_match"] = "NotOlderThan"
         if not self.omit_namespace:
             kwargs["namespace"] = self.namespace

@@ -264,6 +267,8 @@ class ResourceReflector(LoggingConfigurable):
             selectors.append("field selector=%r" % self.field_selector)
         log_selector = ', '.join(selectors)

+        resource_version = ""
+
         cur_delay = 0.1

         if self.omit_namespace:
@@ -282,7 +287,7 @@ class ResourceReflector(LoggingConfigurable):
             start = time.monotonic()
             w = watch.Watch()
             try:
-                resource_version = await self._list_and_update()
+                resource_version = await self._list_and_update(resource_version)
                 watch_args = {
                     "label_selector": self.label_selector,
                     "field_selector": self.field_selector,
@@ -325,6 +330,7 @@ class ResourceReflector(LoggingConfigurable):
                         else:
                             # This is an atomic operation on the dictionary!
                             self.resources[ref_key] = resource
+                            resource_version = resource["metadata"]["resourceVersion"]
                         if self._stopping:
                             self.log.info("%s watcher stopped: inner", self.kind)
                             break

yuvipanda commented 1 year ago

Thanks for opening this, @juliantaylor! I think we already do something like this when handling just a single namespace (https://github.com/jupyterhub/kubespawner/blob/8acc3751be9627cd61660f818e070f6c8bdc37d6/kubespawner/reflector.py#L239). A PR to do this for the multi-namespace usecase would be great!

jupyterhub / kubespawner

high k8s etcd/apiserver load due to pod listings #753

Bug description