kubernetes / node-problem-detector

This is a place for various problem detectors running on the Kubernetes nodes.

v0.8.13 restart count is 1 in gce-master-scale-performance #866

Closed pacoxu closed 6 months ago

pacoxu commented 7 months ago

v0.8.13 may have another problem (I may open another issue if it is not related).

Another case is that https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance flakes with an error like:

[measurement call TestMetrics - TestMetrics error: [action gather failed for SystemPodMetrics measurement: restart counts violation: RestartCount(node-problem-detector-gsg92, node-problem-detector)=1, want <= 0]]

It is tracked in https://github.com/kubernetes/kubernetes/issues/123328.

pacoxu commented 7 months ago


The original memory limit for the NPD pod is 100Mi. It hit the OOM limit.

It may indicate larger memory usage with v0.8.13.

I0225 17:35:11.161300   20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a4a62c4a7d8  default  d60b06ff-11d1-4cb8-be1f-7264e084769b 85768 0 2024-02-25 17:11:22 +0000 UTC <nil> <nil> map[] map[] [] []  [{node-problem-detector Update v1 2024-02-25 17:11:22 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:100100kB, file-rss:33900kB, shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:22 +0000 UTC,LastTimestamp:2024-02-25 17:11:22 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0225 17:35:11.161519   20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a4a642cab59  default  7626405d-a166-40f9-b4fc-7fe01ddb27de 85835 0 2024-02-25 17:11:22 +0000 UTC <nil> <nil> map[] map[] [] []  [{node-problem-detector Update v1 2024-02-25 17:11:22 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:101048kB, file-rss:34028kB, shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:22 +0000 UTC,LastTimestamp:2024-02-25 17:11:22 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0225 17:35:11.161574   20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a50237b9584  default  2045b6c3-0c52-4aaa-8cef-bec741102430 155576 0 2024-02-25 17:11:47 +0000 UTC <nil> <nil> map[] map[] [] []  [{node-problem-detector Update v1 2024-02-25 17:11:47 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 5059 (metrics-server) total-vm:829304kB, anon-rss:102212kB, file-rss:32828kB, shmem-rss:0kB, UID:65534 pgtables:380kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:47 +0000 UTC,LastTimestamp:2024-02-25 17:11:47 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0225 17:35:11.161639   20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a5024e5e2b7  default  f318a878-4b13-4e0b-a19d-a58bdbaaa3a8 155636 0 2024-02-25 17:11:47 +0000 UTC <nil> <nil> map[] map[] [] []  [{node-problem-detector Update v1 2024-02-25 17:11:47 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 5072 (metrics-server) total-vm:829304kB, anon-rss:103180kB, file-rss:33136kB, shmem-rss:0kB, UID:65534 pgtables:380kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:47 +0000 UTC,LastTimestamp:2024-02-25 17:11:47 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}

My local NPD only uses about 10Mi of memory.
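
For reference, a quick way to compare the configured limit with actual usage is sketched below; the kube-system namespace, the DaemonSet name, and the app=node-problem-detector label are assumptions about a typical deployment:

```
# Configured memory limit of the NPD container in the DaemonSet
kubectl -n kube-system get ds node-problem-detector \
  -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'

# Actual per-container usage (requires metrics-server to be running)
kubectl -n kube-system top pod -l app=node-problem-detector --containers
```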

The problem is not NPD but metrics-server. The OOM-killed process here is metrics-server; I misread the log.

EDITED: sorry for misreading.

pacoxu commented 7 months ago

Do we have a benchmark for NPD memory usage?
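
One rough way to collect such numbers, assuming access to the cluster, is to scrape the kubelet's cAdvisor metrics for the NPD container's working set (the node name below is a placeholder):

```
# Working-set memory of the node-problem-detector container, as reported by cAdvisor
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" \
  | grep 'container_memory_working_set_bytes.*node-problem-detector'
```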

pacoxu commented 7 months ago

/close

The log indicated that metrics-server was restarted and NPD just detected it.

It is not an NPD issue.
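
For completeness, restart counts can be confirmed per pod directly; the namespace and label below are assumptions:

```
# Restart count of the first container in each node-problem-detector pod
kubectl -n kube-system get pods -l app=node-problem-detector \
  -o custom-columns=POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```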

Edited: I may have misunderstood the log.

k8s-ci-robot commented 7 months ago

@pacoxu: Closing this issue.

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/866#issuecomment-1963580448):

> /close
>
> The log indicated that `metrics-server` was restarted and NPD just detected it.
>
> It is not an NPD issue.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
pacoxu commented 7 months ago

/reopen

The CI still shows that NPD restarted.

https://kubernetes.slack.com/archives/C0BP8PW9G/p1708936782848279?thread_ts=1708933477.475099&cid=C0BP8PW9G

![image](https://github.com/kubernetes/node-problem-detector/assets/2010320/e1647f79-914c-4476-901e-b4c088f4e56f)

https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=SystemPodMetrics&metricname=Load_SystemPodMetrics&RestartCount=RestartCount
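
For anyone with access to the test cluster, the reason for the restart should be visible in the container's last terminated state, for example (the pod name is a placeholder):

```
# Why the NPD container was last restarted (e.g. OOMKilled vs. Error)
kubectl -n kube-system get pod <npd-pod-name> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
```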

k8s-ci-robot commented 7 months ago

@pacoxu: Reopened this issue.

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/866#issuecomment-1963724417):

> /reopen
>
> The CI still shows that NPD restarted.
>
> https://kubernetes.slack.com/archives/C0BP8PW9G/p1708936782848279?thread_ts=1708933477.475099&cid=C0BP8PW9G
> ![image](https://github.com/kubernetes/node-problem-detector/assets/2010320/e1647f79-914c-4476-901e-b4c088f4e56f)
>
> https://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=SystemPodMetrics&metricname=Load_SystemPodMetrics&RestartCount=RestartCount

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
wangzhen127 commented 6 months ago

/close

Looks like https://github.com/kubernetes/kubernetes/issues/123328 has been resolved already. Feel free to reopen if the issue persists.

k8s-ci-robot commented 6 months ago

@wangzhen127: Closing this issue.

In response to [this](https://github.com/kubernetes/node-problem-detector/issues/866#issuecomment-2040268742):

> /close
>
> Looks like https://github.com/kubernetes/kubernetes/issues/123328 has been resolved already. Feel free to reopen if the issue persists.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.