node-problem-detector cannot run in non-privileged mode

ialidzhikov commented 2 years ago

/kind bug

What happened?

Running containers in privileged mode is not recommended, because privileged containers run with all Linux capabilities enabled and can access the host's resources. Running containers in privileged mode opens up a number of security threats, such as a breakout to the underlying host OS.

Currently the node-problem-detector DaemonSet runs in privileged mode.

https://github.com/kubernetes/node-problem-detector/blob/d8b2940b3cac1d99c9072dd644c7dfb372672114/deployment/node-problem-detector.yaml#L41-L42

When node-problem-detector runs in non-privileged mode (even with all capabilities added), one of its monitors fails with:

E0808 06:25:33.740326       1 problem_detector.go:55] Failed to start problem daemon &{/config/kernel-monitor.json 0xc00035b7a0 0xc000443100 {{kmsg map[] /dev/kmsg 5m } 10 kernel-monitor [{KernelDeadlock  {0 0 <nil>} KernelHasNoDeadlock kernel has no deadlock} {ReadonlyFilesystem  {0 0 <nil>} FilesystemIsNotReadOnly Filesystem is not read-only}] [{temporary  OOMKilling Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*} {temporary  TaskHung task [\S ]+:\w+ blocked for more than \w+ seconds\.} {temporary  UnregisterNetDevice unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {temporary  KernelOops BUG: unable to handle kernel NULL pointer dereference at .*} {temporary  KernelOops divide error: 0000 \[#\d+\] SMP} {temporary  Ext4Error EXT4-fs error .*} {temporary  Ext4Warning EXT4-fs warning .*} {temporary  IOError Buffer I/O error .*} {temporary  MemoryReadError CE memory read error .*} {permanent KernelDeadlock DockerHung task docker:\w+ blocked for more than \w+ seconds\.} {permanent ReadonlyFilesystem FilesystemIsReadOnly Remounting filesystem read-only}] 0xc00043d21e} [] <nil> 0xc00045aea0 0xc00044bb80}: failed to create kmsg parser: open /dev/kmsg: operation not permitted

I don't fully understand what is required to read kernel logs from /dev/kmsg.
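
For anyone debugging this, it may help to note that Go prints EPERM as "operation not permitted" and EACCES as "permission denied", so the log above points at EPERM. The sketch below is only a diagnostic aid written in Python (NPD itself is written in Go); `probe_device` and `classify_errno` are hypothetical helper names, not part of NPD. It tries a read-only open of /dev/kmsg and maps the errno to a hint. The hints are assumptions about common causes, not a confirmed diagnosis of this issue.

```python
import errno
import os


def classify_errno(code):
    """Map an errno from open(2) to a hint about where the denial may come from."""
    if code == errno.EPERM:
        # Assumption: EPERM usually comes from a policy check rather than file mode,
        # for example a capability such as CAP_SYSLOG or a device cgroup rule.
        return "EPERM: denied by a policy check, not by file mode"
    if code == errno.EACCES:
        return "EACCES: file mode or ownership denies read access"
    if code == errno.ENOENT:
        return "ENOENT: the device node is not present in this container"
    return "other: " + errno.errorcode.get(code, str(code))


def probe_device(path="/dev/kmsg"):
    """Try a non-blocking, read-only open and return a short diagnosis string."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    except OSError as exc:
        return classify_errno(exc.errno)
    os.close(fd)
    return "ok: open succeeded"


if __name__ == "__main__":
    print(probe_device())
```

Running this inside the same container spec as the failing DaemonSet (same securityContext and volume mounts) should show whether the open fails with the same errno that NPD reports.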

What did you expect to happen?

I would expect to be able to run node-problem-detector in non-privileged mode.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

/lifecycle rotten

ialidzhikov commented 1 year ago

/remove-lifecycle rotten

k8s-triage-robot commented 1 year ago

/lifecycle stale

ialidzhikov commented 1 year ago

/remove-lifecycle stale

balu-ce commented 1 year ago

Any update on this?

btiernay commented 1 year ago

Duplicate of https://github.com/kubernetes/node-problem-detector/issues/625

AlexzSouz commented 11 months ago

> Duplicate of #625

Neither issue has a solution for the problem that @ialidzhikov mentioned and that I'm currently experiencing too. The "duplicate" issue you (@btiernay) shared only contains comments from @k8s-triage-robot. No solution is provided 🤷

Any solution so far?

alazyer commented 11 months ago

How about trying the journald plugin instead? It works fine for me for detecting "NodeOOM" and "PodOOM" with the patterns ".Out of memory." and ".Memory cgroup out of memory."
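
As a rough illustration of how two such rules could separate node-level from cgroup-level OOM messages, here is a small Python sketch. It is not NPD configuration and not the commenter's actual setup: the `RULES` table and `classify` function are hypothetical, the patterns are assumed to be plain phrase matches (the patterns as rendered in the comment may have lost characters), and the sample log lines are made up for the example.

```python
import re

# Hypothetical rule table. The reason names come from the comment above;
# the patterns are assumed phrase matches, not a copy of a real NPD config.
RULES = [
    ("PodOOM", re.compile(r"Memory cgroup out of memory")),
    ("NodeOOM", re.compile(r"Out of memory")),
]


def classify(line):
    """Return the reason of the first rule whose pattern occurs in the line, else None."""
    for reason, pattern in RULES:
        if pattern.search(line):
            return reason
    return None


if __name__ == "__main__":
    # Made-up sample lines in the general shape of kernel OOM messages.
    for sample in (
        "Memory cgroup out of memory: Killed process 4321 (stress)",
        "Out of memory: Killed process 1234 (java)",
        "EXT4-fs error (device sda1): something unrelated",
    ):
        print(classify(sample), "<-", sample)
```

The match is case-sensitive, so the lowercase "out of memory" in the cgroup message does not trigger the "Out of memory" rule.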

k8s-triage-robot commented 8 months ago

/lifecycle stale

wangzhen127 commented 7 months ago

NPD's goal is to detect infrastructure-layer issues, so it needs to read logs from places that non-privileged containers do not have permission to access. Additionally, we use the health checker in production to repair kubelet and containerd by killing them. Both of these need privileges.

Depending on how you would like to use NPD, you may be able to tune your DaemonSet YAML so that it does not need privileged access. @hakman, does kops run NPD in non-privileged mode?

wangzhen127 commented 7 months ago

/remove-kind bug

wangzhen127 commented 7 months ago

/remove-lifecycle stale

haardm commented 5 months ago

Hello, I am also facing a similar issue when NPD reads from /dev/kmsg in a container that is not running in privileged mode. Is there any workaround? We only need read access; there are no mutating actions on our side.

k8s-triage-robot commented 2 months ago

/lifecycle stale

k8s-triage-robot commented 1 month ago

/lifecycle rotten

k8s-triage-robot commented 3 weeks ago

/close not-planned

k8s-ci-robot commented 3 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/698#issuecomment-2456099498).
