projectcapsule / capsule

Multi-tenancy and policy-based framework for Kubernetes.
https://capsule.clastix.io
Apache License 2.0

Kubernetes cluster goes down if Capsule is down #1135

Closed pratik705 closed 2 months ago

pratik705 commented 2 months ago

Bug description

I have deployed Capsule with a single replica and noticed that if that single Capsule replica goes down[1], it brings the Kubernetes cluster down.

After reviewing existing issues, it seems the nodes.capsule.clastix.io webhook causes the problem when Capsule is unreachable. As per this comment, I set the failurePolicy to Ignore. Subsequently, the worker nodes recovered[2], but the master nodes moved to Ready,SchedulingDisabled status. From the logs[3], I observed that the issue persisted because Capsule was down. To fix this, I had to set failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook as well, and uncordon the master nodes[4].
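For reference, the change I applied looks roughly like the excerpt below. This is a sketch, not the exact manifest: the MutatingWebhookConfiguration object name depends on how Capsule was installed (the name shown is what the Helm chart typically generates), and only the failurePolicy field is changed.

```yaml
# Excerpt of the Capsule MutatingWebhookConfiguration (object name is
# install-dependent; shown here as the usual Helm-chart default).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: capsule-mutating-webhook-configuration
webhooks:
  - name: owner.namespace.capsule.clastix.io
    # "Ignore" lets API requests proceed when the Capsule webhook service
    # is unreachable, instead of rejecting them ("Fail" is the default).
    failurePolicy: Ignore
    # (clientConfig, rules, etc. left unchanged)
```

The trade-off is that while Capsule is down, requests this webhook would normally mutate pass through unmodified.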

Can anyone help me understand whether the behavior I encountered is expected when Capsule goes down in the environment? If so, how can it be avoided? Also, what functionality of Capsule will be impacted by setting failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook?

Thanks in advance.


Additional context

[1]

{"L":"ERROR","T":"2024-07-16T05:55:47.125Z","C":"kubeutils/kube_utils.go:330","M":"failed to update node with newly added labels [failed try 1] [retrying in 20 seconds] : Internal error occurred: failed calling webhook \"nodes.capsule.clastix.io\": failed to call webhook: Post \"https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s\": dial tcp 10.11.113.114:443: connect: connection refused"}

# kubectl get nodes
NAME           STATUS                     ROLES    AGE   VERSION
10.239.0.121   Ready                      master   9d    v1.25.15
10.239.0.122   Ready                      master   9d    v1.25.15
10.239.0.123   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.124   NotReady                   worker   9d    v1.25.15
10.239.0.125   NotReady                   worker   9d    v1.25.15

[2]

# kubectl get nodes -w
NAME           STATUS                     ROLES    AGE   VERSION
10.239.0.121   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.122   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.123   Ready,SchedulingDisabled   master   9d    v1.25.15
10.239.0.124   Ready                      worker   9d    v1.25.15 <==
10.239.0.125   Ready                      worker   9d    v1.25.15 <==

[3]

[2024-07-17 16:22:53] Name: \"kubernetes-dashboard\", Namespace: \"\" [2024-07-17 16:22:53] for: \"STDIN\": error when patching \"STDIN\": Internal error occurred: failed calling webhook \"owner.namespace.capsule.clastix.io\": failed to call webhook: Post \"https://capsule-webhook-service.capsule-system.svc:443/namespace-owner-reference?timeout=30s\": dial tcp 10.11.179.188:443: connect: connection refused],}"}

[4]

# kubectl  get nodes -w
NAME           STATUS   ROLES    AGE   VERSION
10.239.0.121   Ready    master   9d    v1.25.15
10.239.0.122   Ready    master   9d    v1.25.15
10.239.0.123   Ready    master   9d    v1.25.15
10.239.0.124   Ready    worker   9d    v1.25.15
10.239.0.125   Ready    worker   9d    v1.25.15
pratik705 commented 2 months ago

I noticed that even after updating the webhooks, some master nodes moved to the SchedulingDisabled state; the kube-apiserver logs show why[1]. I had to bring Capsule back up to fix the issue. So it seems that if Capsule is installed in the cluster, it must always be up and running to avoid downtime.
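One way to reduce this exposure is to run the Capsule controller with more than one replica, spread across nodes, so a single node or pod failure does not take the webhook service down. A minimal Helm values sketch, assuming the chart exposes the usual replicaCount and pod anti-affinity knobs (key names may differ in the actual Capsule chart):

```yaml
# Hypothetical Helm values sketch for a more resilient Capsule install.
replicaCount: 3
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              # Illustrative label; match whatever labels the chart
              # actually puts on the controller pods.
              app.kubernetes.io/name: capsule
          topologyKey: kubernetes.io/hostname
```

Note that this helps with ordinary pod/node failures but does not change the fundamental dependency: the CRD conversion webhook errors above cannot be worked around with failurePolicy at all.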

[1]

E0718 08:21:17.821485       1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
E0718 08:21:18.180759       1 dispatcher.go:190] failed calling webhook "nodes.capsule.clastix.io": failed to call webhook: Post "https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused
E0718 08:21:19.217729       1 dispatcher.go:190] failed calling webhook "nodes.capsule.clastix.io": failed to call webhook: Post "https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused
E0718 08:21:19.845023       1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
E0718 08:21:21.863769       1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
E0718 08:21:23.882288       1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
prometherion commented 2 months ago

This is not a bug per se; rather, it's a question.

Capsule is a framework for building a multi-tenant Kubernetes environment. One of the features we provide is BYOH; there are some old tutorials on the previous website, and it was also described in the KubeCon EU 2021 talk we presented covering the WarGaming use case.

We strongly suggest getting a better understanding of Capsule's capabilities, such as the BYOH feature. By default we enable all the features: this is done to offer a smooth, full out-of-the-box evaluation at first sight, but it comes with the downside of being impacted if you don't know the risks.

If BYOH is something you're not interested in, disable the webhook. Also, my suggestion is to approach multi-tenant clusters with state-of-the-art strategies, such as having infrastructure nodes (separate from the control-plane ones) with taints, where all the system components can run without getting impacted.
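The infrastructure-node approach can be sketched as follows. Taint the dedicated nodes, then make system components (Capsule included) tolerate and select them. The label and taint keys here are illustrative, not a Capsule convention:

```yaml
# First taint and label the infra node, e.g.:
#   kubectl taint nodes infra-1 node-role.kubernetes.io/infra=:NoSchedule
#   kubectl label nodes infra-1 node-role.kubernetes.io/infra=""
# Then, in the system component's pod template:
spec:
  nodeSelector:
    node-role.kubernetes.io/infra: ""
  tolerations:
    - key: node-role.kubernetes.io/infra
      operator: Exists
      effect: NoSchedule
```

This keeps tenant workloads off the nodes running cluster-critical components, so tenant-induced pressure cannot evict or starve them.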

You should be able to recover your cluster by restarting the control plane and removing the taints the API server assigned, but that's out of the scope of the current discussion.

Furthermore, I'm moving this to the Discussions section since it's not a bug per se; as you already pointed out, we already have an issue explaining this behavior.