karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

external cluster health probe #3616

Open tedli opened 10 months ago

tedli commented 10 months ago

What type of PR is this?

/kind feature

What this PR does / why we need it:

The current cluster health checking logic only checks the /healthz endpoint of the member cluster's apiserver, which almost always returns OK. In the meantime, however, all nodes in the member cluster may be NotReady, and the karmada controller will keep dispatching works to it.

Which issue(s) this PR fixes: Fixes #3612

Special notes for your reviewer:

Not ready to merge, just for reviewing the idea. If this is a common need, I'm glad to finish it.

Does this PR introduce a user-facing change?:

NONE

By defining an annotation value that tells karmada how to check the health of a member cluster, the owner of the member cluster provides a probe and implements the checking logic, because clusters vary. For example:

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member-dev
  annotations:
    cluster.karmada.io/health-checker: |-
      {"service":{"namespace":"some-system","name":"member-dev-health","port":80,"path":"/healthz"}}
  finalizers:
  - karmada.io/cluster-controller
spec:
  apiEndpoint: https://10.224.144.185:31212
  id: bdcb8c60-b85c-407a-aeb3-6d7623b6b6c5
  impersonatorSecretRef:
    name: member-dev-impersonator
    namespace: karmada-cluster
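
A minimal sketch of how the cluster status controller could consume this annotation, assuming the probe service's in-cluster DNS name is resolvable from where the controller runs; the type and function names below are illustrative, not the PR's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// healthChecker mirrors the JSON value of the
// cluster.karmada.io/health-checker annotation shown above.
type healthChecker struct {
	Service struct {
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
		Port      int    `json:"port"`
		Path      string `json:"path"`
	} `json:"service"`
}

// probeCluster reports whether the external probe answers with a 2xx status.
func probeCluster(annotationValue string) (bool, error) {
	var hc healthChecker
	if err := json.Unmarshal([]byte(annotationValue), &hc); err != nil {
		return false, fmt.Errorf("invalid health-checker annotation: %w", err)
	}

	// Assumes the probe service's in-cluster DNS name is resolvable here.
	url := fmt.Sprintf("http://%s.%s.svc.cluster.local:%d%s",
		hc.Service.Name, hc.Service.Namespace, hc.Service.Port, hc.Service.Path)

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}

func main() {
	healthy, err := probeCluster(`{"service":{"namespace":"some-system","name":"member-dev-health","port":80,"path":"/healthz"}}`)
	fmt.Println(healthy, err)
}
```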

P.S. Would karmada be willing to create an awesome-karmada, contrib-like repo under the top-level karmada project? Then people could add their unofficial out-of-tree controllers to it. These third-party controllers and components may solve some cases but are not yet a commonly needed feature, or just not stable enough to merge in-tree.

karmada-bot commented 10 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please assign kevin-wangzefeng after the PR has been reviewed. You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/karmada-io/karmada/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

karmada-bot commented 10 months ago

@tedli: Adding label do-not-merge/contains-merge-commits because PR contains merge commits, which are not allowed in this repository. Use git rebase to reapply your commits on top of the target branch. Detailed instructions for doing so can be found here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
tedli commented 10 months ago

I've written a health check probe implementation; feel free to give it a try, and feedback is welcome.

https://github.com/tedli/generic-cluster-healthz

chaunceyjiang commented 10 months ago

I think it's a good idea, but is setting the health-checker configuration in annotations a good choice?

RainbowMango commented 10 months ago

I've written a health check probe implementation; feel free to give it a try, and feedback is welcome.

Just want to let you know, it's in my queue. I'll take a look next week.

I think it's a good idea, but is setting the health-checker configuration in annotations a good choice?

Good point! I tend to go with the approach of introducing a field in the Cluster API.

codecov-commenter commented 10 months ago

Codecov Report

Merging #3616 (22ca831) into master (8045da4) will increase coverage by 1.21%. The diff coverage is 13.95%.

@@            Coverage Diff             @@
##           master    #3616      +/-   ##
==========================================
+ Coverage   55.24%   56.45%   +1.21%     
==========================================
  Files         221      222       +1     
  Lines       20838    20924      +86     
==========================================
+ Hits        11511    11812     +301     
+ Misses       8717     8487     -230     
- Partials      610      625      +15     

| Flag | Coverage Δ |
| --- | --- |
| unittests | 56.45% <13.95%> (+1.21%) :arrow_up: |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files | Coverage Δ |
| --- | --- |
| pkg/controllers/status/cluster_health_check.go | 11.25% <11.25%> (ø) |
| ...kg/controllers/status/cluster_status_controller.go | 68.21% <50.00%> (-0.28%) :arrow_down: |

... and 8 files with indirect coverage changes

Poor12 commented 10 months ago

This is quite similar to livenessProbe in Kubernetes. I think we can learn from its API design and add a new field to the Cluster. Based on the PR description, it might look like:

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: member-dev
  finalizers:
  - karmada.io/cluster-controller
spec:
  apiEndpoint: https://10.224.144.185:31212
  id: bdcb8c60-b85c-407a-aeb3-6d7623b6b6c5
  impersonatorSecretRef:
    name: member-dev-impersonator
    namespace: karmada-cluster
  livenessProbe:
    httpGet:
      path: /healthz
      port: 80
      host: member-dev-health.some-system.svc.cluster.local
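
A rough sketch of how such a field could be declared in Go, reusing core/v1's Probe type so the httpGet semantics match the Kubernetes livenessProbe; this only illustrates the idea and is not the Karmada Cluster API as it exists:

```go
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Cluster is trimmed down to the fields relevant to this discussion.
type Cluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ClusterSpec `json:"spec,omitempty"`
}

// ClusterSpec shows only the hypothetical addition.
type ClusterSpec struct {
	// APIEndpoint is the member cluster's API server address.
	APIEndpoint string `json:"apiEndpoint,omitempty"`

	// LivenessProbe, when set, lets the cluster status controller query an
	// external endpoint instead of relying only on the apiserver /healthz.
	// +optional
	LivenessProbe *corev1.Probe `json:"livenessProbe,omitempty"`
}
```
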
chaunceyjiang commented 10 months ago

This is quite similar to livenessProbe in Kubernetes.

Amazing!!!!!!!!

jwcesign commented 10 months ago

Hi @tedli, just curious:

  1. Where did you deploy the component corresponding to this service?
  2. Different clusters will have different health checking services?
  3. Is it possible that all clusters use one service to check the status, but with different query parameters?
tedli commented 10 months ago

Hi @jwcesign,

In my case:

  1. I deployed the probe component in the host cluster of the karmada control plane.
  2. Each cluster has its own probe component.
  3. Of course, it's possible.

Reason:

  1. Different clusters use different kubeconfigs.
  2. To check different clusters with one probe, the probe would have to be configured with multiple kubeconfigs, so that it can use the kubeconfig of the cluster specified in the query to check that cluster (a rough sketch is below).
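
Roughly, a single shared probe could route on a query parameter like this (a simplified sketch, not the actual generic-cluster-healthz code; handler and helper names are illustrative, and building the per-cluster clients from kubeconfig files is omitted):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// healthHandler answers /healthz?cluster=<name> by probing the matching
// member cluster with a client built from that cluster's kubeconfig.
type healthHandler struct {
	clients map[string]kubernetes.Interface
}

// checkCluster stands in for the owner-specific health logic; here it only
// verifies the apiserver answers a node list.
func checkCluster(ctx context.Context, client kubernetes.Interface) error {
	_, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	return err
}

func (h *healthHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("cluster")
	client, ok := h.clients[name]
	if !ok {
		http.Error(w, fmt.Sprintf("unknown cluster %q", name), http.StatusNotFound)
		return
	}
	if err := checkCluster(r.Context(), client); err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Building the per-cluster clients from kubeconfig files is omitted here.
	handler := &healthHandler{clients: map[string]kubernetes.Interface{}}
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```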

What I got WRONG:

The kubeconfig of the member cluster that is provided to the karmada control plane during join has to be something like admin or cluster-admin, which has nearly every role.

So the karmada controller could do some basic checks in tree, like checking that n nodes are ready and n non-host-network pods are ready (n could be configurable), with no need for an extra probe.

Only if basic checks like those are not enough would an external probe be needed. I may have taken this too seriously; doing some basic checks in tree might well be considered enough.
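
For concreteness, a minimal sketch of such a basic in-tree check, assuming the admin-level kubeconfig from join is available; the package name, function name, and thresholds are illustrative, not Karmada's existing controller code:

```go
// healthcheck is a sketch package name; Karmada's real controller code lives
// under pkg/controllers/status.
package healthcheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// basicClusterHealthy reports whether at least minReadyNodes nodes are Ready
// and at least minReadyPods non-host-network pods are Running in the member
// cluster reachable through client.
func basicClusterHealthy(ctx context.Context, client kubernetes.Interface, minReadyNodes, minReadyPods int) (bool, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	readyNodes := 0
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				readyNodes++
				break
			}
		}
	}

	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	runningPods := 0
	for _, pod := range pods.Items {
		// Host-network pods can look healthy even when the pod network is
		// broken, so only count pods that rely on the pod network.
		if pod.Spec.HostNetwork {
			continue
		}
		if pod.Status.Phase == corev1.PodRunning {
			runningPods++
		}
	}

	return readyNodes >= minReadyNodes && runningPods >= minReadyPods, nil
}
```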

CC: @RainbowMango