kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.

docker healthcheck - systemctl unable to get information #482

Closed azman0101 closed 3 years ago

azman0101 commented 3 years ago

When enabling the health-checker for docker, the Unhealthy condition is detected even though docker is in fact healthy.

I1026 18:25:35.879338       1 plugin.go:239] Start logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:DockerUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=docker --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc000284ed0 Timeout:3m0s}
 I1026 18:25:35.876378      46 health_checker.go:166] command &{docker [docker ps] []  <nil>  0xc0000a02d0 [] <nil> <nil> <nil> 0xc00009c300 0xc0000a83c0 false [] [] [] [] <nil> <nil>} failed: exec: "docker": executable file not found in $PATH, []
I1026 18:25:35.878837      46 health_checker.go:166] command &{/bin/systemctl [systemctl show docker --property=InactiveExitTimestamp] []  <nil>  0xc0000a03c0 [] <nil> 0xc000094c90 exit status 1 0xc00009c420 <nil> true [0xc00009a058 0xc00009a070 0xc00009a088] [0xc00009a058 0xc00009a070 0xc00009a088] [0xc00009a068 0xc00009a080] [0x61a150 0x61a150] 0xc00009c6c0 0xc000082180} failed: exit status 1, []
I1026 18:25:35.878866      46 health_checker.go:147] error in getting uptime for docker: exit status 1
I1026 18:25:35.878875      46 health_checker.go:149] docker is unhealthy, component uptime: 0s
I1026 18:25:35.879398       1 plugin.go:240] End logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:DockerUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=docker --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc000284ed0 Timeout:3m0s}
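
From these logs, the health-checker attempts two probes and both fail inside the container: docker ps (the docker binary is not in the container's $PATH) and systemctl show docker --property=InactiveExitTimestamp (exit status 1), so the checker cannot determine the component uptime and reports docker as unhealthy. As a sanity check, the same two commands can be run directly on the node to confirm docker itself is fine; a minimal sketch, assuming shell access to the node:

# On the node itself (not inside the NPD pod):
docker ps                                               # succeeds when dockerd is running
systemctl show docker --property=InactiveExitTimestamp  # prints the timestamp the checker derives uptime from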

I tested the command systemctl show docker --property=InactiveExitTimestamp inside an NPD container:

kubectl exec -it -n kube-system node-problem-detector-8z9lj -- /bin/sh

# systemctl show docker --property=InactiveExitTimestamp
System has not been booted with systemd as init system (PID 1). Can't operate.

To dig into this issue, I've added hostPID to the pod spec.

kubectl exec -it -n kube-system node-problem-detector-28x7t -- /bin/sh
# systemctl show docker --property=InactiveExitTimestamp
Running in chroot, ignoring request: show

Also, the pod spec has .securityContext.privileged=true.

I don't see how this can work.
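
For reference, the hostPID + privileged combination described above can be set with a JSON patch; this is only a sketch, assuming the DaemonSet is named node-problem-detector in kube-system and the NPD container is the first one in the pod:

kubectl -n kube-system patch ds node-problem-detector --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/hostPID", "value": true},
  {"op": "add", "path": "/spec/template/spec/containers/0/securityContext", "value": {"privileged": true}}
]'

Even with hostPID and privileged set, systemctl run from the container's own mount namespace still behaves as if it were in a chroot (the "Running in chroot, ignoring request" message above). One possible workaround, assuming nsenter from util-linux is available in the image, is to enter the host's namespaces through PID 1 before calling systemctl:

# Inside the privileged, hostPID pod:
nsenter -t 1 -m -u -i -n -p -- systemctl show docker --property=InactiveExitTimestamp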

dntosas commented 3 years ago

Having the same problem, and the kubelet condition is always returning Unknown status ^^

Normal  KubeletIsHealthy         6m8s                   health-checker  Node condition KubeletUnhealthy is now: Unknown, reason: KubeletIsHealthy
KubeletUnhealthy              Unknown     KubeletIsHealthy                kubelet on the node is functioning properly
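
For anyone hitting the same thing, the condition NPD writes to the node object can be inspected directly; a sketch with <node-name> as a placeholder:

kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="KubeletUnhealthy")]}'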
fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

Iamleos commented 3 years ago

I have applied node-problem-detector, but it showed errors like "command systemctl show kubelet --property=InactiveExitTimestamp failed: exec: "systemctl": executable file not found in $PATH". We have the same images, so why do I get this problem?
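
In my case systemctl is not on the container's $PATH at all, unlike the original report where it exits with status 1. A quick way to check what the container actually sees, with <npd-pod-name> as a placeholder:

kubectl exec -n kube-system <npd-pod-name> -- sh -c 'command -v systemctl || echo "systemctl not found"; echo "PATH=$PATH"'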

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/482#issuecomment-867638909):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.