kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.

Health checker doesn't restart kubelet on EKS #861

Open chotiwat opened 7 months ago

chotiwat commented 7 months ago

Hi,

The health-checker-kubelet monitor runs systemctl kill against the kubelet service, but kubelet doesn't get restarted because the exit code would be 0. EKS containerd nodes only restart kubelet on failure or SIGPIPE: https://github.com/awslabs/amazon-eks-ami/blob/master/files/kubelet-containerd.service.
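
For context, the behavior I'm describing is roughly the following (a simplified sketch, not the actual node-problem-detector source):

```go
// Simplified sketch of the repair step being discussed; not the real
// node-problem-detector code. systemctl kill sends SIGTERM by default, which
// systemd treats as a clean stop, so a unit configured with
// Restart=on-failure (as on EKS) is not brought back up afterwards.
package main

import (
	"fmt"
	"os/exec"
)

func repairKubelet() error {
	return exec.Command("systemctl", "kill", "kubelet.service").Run()
}

func main() {
	if err := repairKubelet(); err != nil {
		fmt.Println("repair failed:", err)
	}
}
```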

Am I misunderstanding the purpose of the repair function? Is it meant to kill the kubelet so that another system can take care of it, e.g. rotating out the node?

Is there a reason systemctl kill is used instead of systemctl restart? I was testing this functionality by stopping the kubelet manually, and the health checker failed to kill the service with the error message "Failed to kill unit kubelet.service: No main process to kill" (slightly related: #860).

wangzhen127 commented 6 months ago

Here is how kubelet is configured in k/k: https://github.com/kubernetes/kubernetes/blob/f8930f980d2986f9e486b04c14c3e93e57bdbe12/cluster/gce/gci/configure-helper.sh#L1652

systemctl restart is often used by administrators to reconfigure system components. For example https://github.com/GoogleCloudPlatform/k8s-node-tools/blob/47ae266dcfa0af4edf80b87b4d6e2011bf8318d9/kubelet-log-config/kubelet-log-config.yaml#L89.

Its kernel log pattern is different from that of systemctl kill. We recently started to depend on this difference for containerd. See https://github.com/kubernetes/node-problem-detector/issues/847.

chotiwat commented 4 months ago

@wangzhen127 Thanks for the explanation, and I see that you added the comment to the code as well 🫡.

Unfortunately, the kubelet service in the EKS AMI is still using Restart=on-failure. Wouldn't it be more robust to be explicit about the behavior of the repair function rather than depending on the system configuration? In this case, I'm assuming the end goal of the repair function is to let the unhealthy service start up again, which makes systemctl restart seem more appropriate.

Regarding #847, isn't there a better way to differentiate between unhealthy restarts and normal restarts? I also find it weird that systemctl kill doesn't result in the Stopping ... log. Does that mean it's not graceful (although I don't see --signal=SIGTERM or equivalent in the code)?

I think it's best if we don't have to rely on the difference between systemctl restart and systemctl kill for #847. If that's really not possible, should the repair function be generalized more to support EKS and possibly other cloud providers, perhaps by making the repair command configurable (e.g. custom command or predefined termination modes)?
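
To make that last point concrete, here is a rough sketch of what a configurable repair could look like. The repairCommand option and its format string are made up for illustration; nothing like it exists in node-problem-detector today:

```go
// Hypothetical sketch only: repairCommand is an invented option, not an
// existing node-problem-detector flag. The idea is that the repair action
// could be chosen per deployment instead of being hard-coded to
// "systemctl kill <unit>".
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

type healthCheckerOptions struct {
	systemdService string
	// repairCommand is a template; "%s" is replaced with the unit name.
	// e.g. "systemctl restart %s" on EKS, or "systemctl kill %s" where
	// Restart=always is guaranteed to bring the service back.
	repairCommand string
}

func repair(opts healthCheckerOptions) error {
	args := strings.Fields(fmt.Sprintf(opts.repairCommand, opts.systemdService))
	return exec.Command(args[0], args[1:]...).Run()
}

func main() {
	opts := healthCheckerOptions{
		systemdService: "kubelet.service",
		repairCommand:  "systemctl restart %s",
	}
	if err := repair(opts); err != nil {
		fmt.Println("repair failed:", err)
	}
}
```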

wangzhen127 commented 4 months ago

I agree that it is desirable to find a better way to differentiate between unhealthy restarts and planned restarts. Do you have any suggestions?

chotiwat commented 4 months ago

I don't have any suggestions at the moment either. I will update this issue if I get time to play around with it and think of something.

chotiwat commented 4 months ago

I looked into it a bit more. The journald logs from the systemd source itself are very limited. We could look at the service's own logs, which should show different termination messages for healthy exits versus unhealthy ones, but I'm assuming that's undesirable because we would need to study the log patterns of every systemd service we want to monitor.

Even if we keep relying on the current hack, we could still make the repair function more robust by checking the unit's Restart property with systemctl show and running systemctl start if Restart isn't always or on-success (see this table).
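
As a sketch of that first option (assuming we keep systemctl kill as the primary repair action):

```go
// Sketch of the first option: keep the existing systemctl kill, then check the
// unit's Restart= setting and only start it explicitly when systemd will not
// restart it on its own after a clean stop.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func repairUnit(unit string) error {
	if err := exec.Command("systemctl", "kill", unit).Run(); err != nil {
		return err
	}
	out, err := exec.Command("systemctl", "show", "-p", "Restart", "--value", unit).Output()
	if err != nil {
		return err
	}
	restart := strings.TrimSpace(string(out))
	// SIGTERM counts as a clean stop, so only always/on-success auto-restart.
	if restart != "always" && restart != "on-success" {
		return exec.Command("systemctl", "start", unit).Run()
	}
	return nil
}

func main() {
	if err := repairUnit("kubelet.service"); err != nil {
		fmt.Println("repair failed:", err)
	}
}
```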

We could also simply always call systemctl start after the kill, since it's a no-op if the service is already running. This might actually be safer, since other systemd service options can alter the restart behavior, e.g. RestartPreventExitStatus.
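
The second option is even simpler, assuming systemctl start really is a no-op for an already-running unit:

```go
// Sketch of the second option: always follow the kill with an explicit start,
// so the outcome no longer depends on Restart=, RestartPreventExitStatus=, or
// any other unit-level restart settings.
package main

import (
	"fmt"
	"os/exec"
)

func repairUnit(unit string) error {
	if err := exec.Command("systemctl", "kill", unit).Run(); err != nil {
		return err
	}
	// systemctl start is a no-op if systemd already restarted the unit.
	return exec.Command("systemctl", "start", unit).Run()
}

func main() {
	if err := repairUnit("kubelet.service"); err != nil {
		fmt.Println("repair failed:", err)
	}
}
```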

What do you think?

wangzhen127 commented 3 months ago

I think the reason the systemd services for kubelet and the container runtime use Restart=always is that we expect them to always run on the nodes in a cluster. Is there any reason why EKS uses Restart=on-failure? In which case would a user stop the kubelet or container runtime?

chotiwat commented 3 months ago

I cannot answer for the EKS maintainers, but I don't think it was a wrong choice. In their healthy state, these services are expected to run forever, never exiting with code 0 on their own, so Restart=on-failure is perfectly reasonable to me. It is actually the choice recommended by the systemd documentation: https://www.man7.org/linux/man-pages/man5/systemd.service.5.html#OPTIONS:~:text=Setting%20this%20to,an%20alternative%20choice.

In which case would a user stop kubelet or container runtime?

Like you said, I assume the most common case would be admins reconfiguring the system. That wouldn't make always a better or worse choice than on-failure, though, since admins would normally use systemctl stop for that and the Restart= option wouldn't matter. On that note, I'm also curious why always was picked over on-failure for Google Cloud.

I think it's even more unnatural to restart a service by relying on Restart=always behavior, especially when we can be explicit and run systemctl restart. Who knows how many cloud-based or hand-rolled, bare-metal Kubernetes clusters out there don't use Restart=always for the kubelet service. Node problem detector should aim to be as cloud-agnostic as possible.

Since we cannot use systemctl restart due to the dependence on the log pattern difference, the two solutions I suggested earlier should at least work with any systemd unit configuration. I'm leaning toward the second approach (always running systemctl start after the kill). If you approve, I would be happy to open a PR as well.

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale