kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

integrate kubelet with the systemd watchdog #127460

Closed SergeyKanzhelev closed 2 weeks ago

SergeyKanzhelev commented 1 month ago

Implement systemd watchdog support in kubelet. Similar to https://github.com/containerd/containerd/issues/10329, we want a lightweight way to health check kubelet, instead of requiring a separate health check process that curls the /healthz endpoint.

The implementation will require integrating with the watchdog API, hooking it up to the same checks that /healthz performs, and implementing some e2e_node tests demonstrating this behavior. Documentation should be updated as well once this is implemented.

/sig node /area kubelet /kind feature

/good-first-issue /help

SergeyKanzhelev commented 1 month ago

/priority backlog /triage accepted

RocooHash commented 1 month ago

I can give this a shot

/assign

yunzck8s commented 1 month ago

I want to try it too

/assign

ajit97singh commented 1 month ago

I want to try it too

/assign

SkySibe commented 1 month ago

/assign

abhibongale commented 1 month ago

/assign abhibongale

zhifei92 commented 1 month ago

I would like to work on this, but if anyone has already started working on it, I can stop.

/assign

DevEmilio96 commented 1 month ago

/assign

Karthi-vel commented 1 month ago

/assign @Karthi-vel

SergeyKanzhelev commented 1 month ago

For everybody who signed up to try it, this PR is going in the right direction: https://github.com/kubernetes/kubernetes/pull/127566 but it has some shortcomings. I commented on the PR and am listing them here:

  1. There should be no change of behavior when the watchdog is not enabled (a single info message saying it is not enabled is OK, but reporting an error is NOT).
  2. There should be no change of behavior on Windows. Do not even report that it is not enabled.
  3. When the watchdog is enabled but kubelet fails to configure it, kubelet should not start (e.g. when the time is not configured correctly).
  4. The loop checks must be as lightweight as you can make them: minimal allocations and no locks.
  5. Try to avoid false positives as much as possible.

The PR should include code and unit tests. If you have a good idea for how to test this e2e by emulating daemon.Foo behavior, that is an extra bonus.

pravin-rgb commented 3 weeks ago

Hi @SergeyKanzhelev, I am new to the community. I am a new contributor and I want to try it. If it's fixed, that's fine. If there are still some beginner-friendly or good-first issues for me to try and explore, that would be appreciated. Let me know. Thank you.

/assign

zhifei92 commented 3 weeks ago

I am a new contributor and I want to try it. If it's fixed, that's fine.

@pravin-rgb Thank you for your interest in this issue. I am working on this PR: https://github.com/kubernetes/kubernetes/pull/127566#issuecomment-2416311736

SergeyKanzhelev commented 3 weeks ago

/remove-help /remove-good-first-issue

since @zhifei92 is working on it already

pravin-rgb commented 3 weeks ago

@zhifei92 Thanks for notifying me. Have a great day