tedli opened this issue 11 months ago
First, Karmada currently uses `/readyz` to detect cluster health, and backs off to `/healthz` when `/readyz` gives no response.

I recall that `/readyz` runs several checks. I'll investigate it and try to list all the checks later.
So we'd better add the ability to configure how to check a member cluster, since clusters may come from different distributions (Azure, AWS, Huawei, etc.).
Do you mean the `/readyz` or `/healthz` implementation differs between these distributions?
Some users I've met also asked about this: `/readyz` and `/healthz` are not enough to detect cluster status. For example, the network from Karmada to a member cluster may be broken while the cluster itself works fine on the cloud provider, so failover may not be necessary. They may need some custom way to define whether their clusters are healthy.
UPDATE: the following turned out to be incorrect; kept for reference only.
Hi all, thanks for the replies.
Hi @RainbowMango, I found how the apiserver handles `/readyz` and `/healthz`:
https://github.com/kubernetes/apiserver/blob/6e796557fa8a46ec48109166f6db572bbda87267/pkg/server/config.go#L363-L403
It seems both readyz and healthz act the same and nearly always return ok as long as connectivity to the apiserver is not broken. (I didn't set a breakpoint and debug through it to make sure this exactly matches my understanding; I may have got it wrong.)
And I didn't express myself well: I didn't mean that the readyz/healthz implementation varies between distributions. What I wanted to point out is that, if I'm not mistaken, the apiserver handles readyz/healthz just as I noted above, so healthz and readyz alone are not enough to tell whether a cluster is healthy or ready. Extra checks are necessary, but how to check varies: AWS EKS, for example, needs various complex roles or role bindings, so a common in-tree (Karmada built-in) check is unlikely to cover all distributions. So how to check a cluster's health belongs on the user side: the user provides a `/healthz`-like probe, sets it on the `Cluster` CR, and the cluster controller calls that probe if it is set. This is configurable and extensible, and moreover Karmada doesn't need extra RBAC roles.
Sorry for the slow response, I just got time to dig into this issue.
> It seems both readyz and healthz act the same, and nearly always returns ok, if the connectivity to apiserver is not broken.
Given that `/healthz` has been deprecated since Kubernetes v1.16, my investigation focuses only on `/readyz`.

`/readyz` not only checks the connection but runs other checks as well; you can list all available checks with the `verbose` parameter, like:
```shell
# curl -k https://172.18.0.3:6443/readyz?verbose
[+]ping ok
[+]log ok
[+]etcd ok
[+]etcd-readiness ok
[+]informer-sync ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]shutdown ok
readyz check passed
```
From the output above, you can see there are several checks.
By the way, the ReadyzChecks are the default checks, initialized with PingHealthz and LogHealthz; more checks are appended during kube-apiserver startup. For example, here the `etcd-readiness` check is appended.
The reason I insist on investigating exactly what checks are done by `/readyz` is that I want to figure out the gap between what we have and user expectations. That will guide us in working out new proposals (solutions).
After going through the discussion on #3616, I lean toward the idea mentioned by @Poor12 on https://github.com/karmada-io/karmada/pull/3616#issuecomment-1590400173 that we can have a field on the Cluster object to clearly note the extension. But I think we need more clarification on that to address questions like:

- What is the relationship between `livenessProbe` (take the name as a placeholder) and the current cluster status reconcile logic? I mean, should the liveness probe bypass the current logic, or is it just a supplement?
- How should `Push` and `Pull` mode clusters react to the liveness probe?

We may have more questions in addition to the issues mentioned above. I believe this is not a simple feature that can be finished in just one PR. We need to conduct a detailed analysis, and I suggest creating a proposal to address the thoughts on these issues.
@tedli @Poor12 @jwcesign @chaunceyjiang any comments? I think this feature might be an important extension capability for Karmada; it could be used to enhance the cluster failover functionality.
I'd like to share my opinion about the above questions:

- `livenessProbe` is an enhancement of the original health check. It's a chain of calls: after passing the original health checker, the cluster may also need to pass the custom health checker.
- `readyz` is a comprehensive check for native Kubernetes; we may not need to extend that exact checker.
- Introducing a new `livenessProbe`-like field is good. (I used annotations because I'm sure it's a common scenario.)
I have a crude implementation that has already been running for a week. It checks that at least 1 node is ready and at least 1 non-host-network pod is ready to decide whether a member cluster is ready.

I think basic checks like these could be implemented in tree, and users could simply configure n instead of 1 nodes/pods.

Without such in-tree basic checks, configuring a `livenessProbe` is far more complex, which may make some first-time users give up on Karmada.
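The decision logic of that crude implementation can be sketched as below. The structs are simplified stand-ins for `corev1.Node`/`corev1.Pod` listed via client-go, and `clusterReady` is an invented helper; making the thresholds parameters covers the "configure n, not 1" suggestion:

```go
package main

import "fmt"

// Node is a simplified stand-in for a corev1.Node with its Ready condition.
type Node struct {
	Name  string
	Ready bool
}

// Pod is a simplified stand-in for a corev1.Pod.
type Pod struct {
	Name        string
	Ready       bool
	HostNetwork bool
}

// clusterReady reports whether a member cluster counts as ready: at least
// minNodes nodes are Ready and at least minPods non-host-network pods are
// Ready. Host-network pods can look Ready even when the CNI is broken, so
// only non-host-network pods count as evidence that workloads can run.
func clusterReady(nodes []Node, pods []Pod, minNodes, minPods int) bool {
	readyNodes := 0
	for _, n := range nodes {
		if n.Ready {
			readyNodes++
		}
	}
	readyPods := 0
	for _, p := range pods {
		if p.Ready && !p.HostNetwork {
			readyPods++
		}
	}
	return readyNodes >= minNodes && readyPods >= minPods
}

func main() {
	nodes := []Node{{"node-1", true}, {"node-2", false}}
	pods := []Pod{{"kube-proxy", true, true}, {"coredns", true, false}}
	fmt.Println(clusterReady(nodes, pods, 1, 1)) // true
}
```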
I'm trying to figure out which PRs/issues should be included in the coming v1.7 release, which is planned for the end of this month. I guess we don't have enough time for this feature, so I'm moving it to v1.8.
What would you like to be added:
A configurable cluster health check plugin, to check whether a member cluster can run workloads or not.
Why is this needed:
The current implementation only checks `/healthz` of the apiserver, which nearly always returns healthy:
https://github.com/kubernetes/apiserver/blob/6e796557fa8a46ec48109166f6db572bbda87267/pkg/server/config.go#L363-L364
https://github.com/karmada-io/karmada/blob/dff7a5b7881ea9f101803ffab04f6f7d0e93bd29/pkg/controllers/status/cluster_status_controller.go#L383-L388
But a healthy apiserver doesn't mean the cluster is healthy: all nodes may be not ready, etc., and pods will never run.
So we'd better add the ability to configure how to check a member cluster, since clusters may come from different distributions (Azure, AWS, Huawei, etc.). At the very least we need to check the node status of the cluster, to catch the case where all nodes are not ready (due to CNI failure, etc.).