I noticed that even after updating the webhooks, some master nodes moved to the SchedulingDisabled state. From the kube-apiserver logs I can see the errors below [1]. I had to bring Capsule back up to fix the issue. So it seems that if Capsule is installed in the cluster, it must always be up and running to avoid downtime.
[1]
E0718 08:21:17.821485 1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
E0718 08:21:18.180759 1 dispatcher.go:190] failed calling webhook "nodes.capsule.clastix.io": failed to call webhook: Post "https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused
E0718 08:21:19.217729 1 dispatcher.go:190] failed calling webhook "nodes.capsule.clastix.io": failed to call webhook: Post "https://capsule-webhook-service.capsule-system.svc:443/nodes?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused
E0718 08:21:19.845023 1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
E0718 08:21:21.863769 1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
E0718 08:21:23.882288 1 cacher.go:440] cacher (*unstructured.Unstructured): unexpected ListAndWatch error: failed to list capsule.clastix.io/v1alpha1, Kind=CapsuleConfiguration: conversion webhook for capsule.clastix.io/v1beta2, Kind=CapsuleConfiguration failed: Post "https://capsule-webhook-service.capsule-system.svc:443/convert?timeout=30s": dial tcp 10.11.179.188:443: connect: connection refused; reinitializing...
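For reference, one way to see which Capsule webhooks are registered and which failurePolicy they use (and therefore which API requests the kube-apiserver will reject while Capsule is down) is a quick kubectl query; this is only a generic sketch, and the configuration names depend on how Capsule was installed:

# List admission webhook configurations with their webhooks and failure policies;
# the Capsule entries usually contain "capsule" in the name.
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,WEBHOOKS:.webhooks[*].name,FAILURE_POLICY:.webhooks[*].failurePolicy'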
This is not a bug per se but rather a question.
Capsule is a framework for building a multi-tenant Kubernetes environment: one of the features we provide is BYOH. We have some old tutorials on the previous website, and the feature is also described in the KubeCon EU 2021 talk we presented around the WarGaming use case.
We strongly suggest getting a better understanding of Capsule's capabilities, such as the BYOH feature. By default we enable all the features, which comes with the downside of being impacted if you don't know the risks: this is done to offer a smooth, out-of-the-box evaluation at first sight.
If BYOH is something you're not interested in, disable the webhook. My suggestion is also to approach multi-tenant clusters with state-of-the-art strategies, such as having infrastructure nodes (separate from the control plane ones) with taints, where all the system components can run without being impacted.
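As a rough sketch of that approach (the node name infra-1 and the label/taint key are placeholders, not something Capsule requires):

# Dedicate a hypothetical node "infra-1" to system components.
kubectl label node infra-1 node-role.kubernetes.io/infra=
kubectl taint node infra-1 node-role.kubernetes.io/infra:NoSchedule
# Then run Capsule (and the other system add-ons) there with a matching
# nodeSelector and toleration, e.g. via the chart's values if it exposes them.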
You should be able to recover your cluster by restarting the control plane and removing the taints the API server assigned, but that is out of the scope of the current discussion.
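A minimal recovery sketch, assuming the control-plane nodes were only cordoned (<control-plane-node> is a placeholder; repeat for every affected node):

kubectl uncordon <control-plane-node>
# Verify no leftover NoSchedule taints remain on the node:
kubectl describe node <control-plane-node> | grep -i taint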
Furthermore, I'm moving this to the Discussions section since, as you already pointed out, it's not a bug per se, and we already have an issue describing this behavior.
Bug description
I have deployed Capsule with a single replica and noticed that if that single Capsule replica goes down[1], it brings the Kubernetes cluster down.

After reviewing existing issues, it seems the nodes.capsule.clastix.io webhook causes the issue if Capsule is unreachable. As per this comment, I set the failurePolicy to Ignore. Subsequently, the worker nodes recovered[2], but the master nodes moved to the Ready,SchedulingDisabled status. From the logs[3], I observed that the issue persisted because Capsule was down. To fix this, I had to set the failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook as well and uncordon the master nodes[4].

Can anyone help me understand whether the behavior I encountered is expected when Capsule goes down in the environment? If so, how can it be avoided? Also, what Capsule functionality will be impacted by setting the failurePolicy to Ignore for the owner.namespace.capsule.clastix.io mutating webhook?

Thanks in advance.
Steps to reproduce:
Workaround:
Set the failurePolicy of the nodes.capsule.clastix.io and owner.namespace.capsule.clastix.io webhooks to Ignore, for example as sketched below.
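A hedged sketch of that workaround, assuming the Helm-default configuration name capsule-mutating-webhook-configuration (verify the real names, and which configuration lists each webhook, with kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations first):

# A strategic merge patch targets a single webhook entry by name and leaves the rest untouched.
kubectl patch mutatingwebhookconfiguration capsule-mutating-webhook-configuration \
  --type strategic \
  -p '{"webhooks":[{"name":"owner.namespace.capsule.clastix.io","failurePolicy":"Ignore"}]}'
# Repeat for nodes.capsule.clastix.io in whichever configuration contains it.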
Expected behavior
Additional context
[1]
[2]
[3]
[4]