pacoxu closed this issue 6 months ago
The original memory limit for the NPD pod is 100Mi, and it hit the OOM.
This may indicate larger memory usage with v0.8.13.
I0225 17:35:11.161300 20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a4a62c4a7d8 default d60b06ff-11d1-4cb8-be1f-7264e084769b 85768 0 2024-02-25 17:11:22 +0000 UTC <nil> <nil> map[] map[] [] [] [{node-problem-detector Update v1 2024-02-25 17:11:22 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:100100kB, file-rss:33900kB, shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:22 +0000 UTC,LastTimestamp:2024-02-25 17:11:22 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0225 17:35:11.161519 20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a4a642cab59 default 7626405d-a166-40f9-b4fc-7fe01ddb27de 85835 0 2024-02-25 17:11:22 +0000 UTC <nil> <nil> map[] map[] [] [] [{node-problem-detector Update v1 2024-02-25 17:11:22 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:101048kB, file-rss:34028kB, shmem-rss:0kB, UID:65534 pgtables:400kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:22 +0000 UTC,LastTimestamp:2024-02-25 17:11:22 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0225 17:35:11.161574 20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a50237b9584 default 2045b6c3-0c52-4aaa-8cef-bec741102430 155576 0 2024-02-25 17:11:47 +0000 UTC <nil> <nil> map[] map[] [] [] [{node-problem-detector Update v1 2024-02-25 17:11:47 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 5059 (metrics-server) total-vm:829304kB, anon-rss:102212kB, file-rss:32828kB, shmem-rss:0kB, UID:65534 pgtables:380kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:47 +0000 UTC,LastTimestamp:2024-02-25 17:11:47 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I0225 17:35:11.161639 20814 ooms_tracker.go:284] OOM detected: &Event{ObjectMeta:{gce-scale-cluster-minion-heapster.17b72a5024e5e2b7 default f318a878-4b13-4e0b-a19d-a58bdbaaa3a8 155636 0 2024-02-25 17:11:47 +0000 UTC <nil> <nil> map[] map[] [] [] [{node-problem-detector Update v1 2024-02-25 17:11:47 +0000 UTC FieldsV1 {"f:count":{},"f:firstTimestamp":{},"f:involvedObject":{},"f:lastTimestamp":{},"f:message":{},"f:reason":{},"f:source":{"f:component":{},"f:host":{}},"f:type":{}} }]},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:gce-scale-cluster-minion-heapster,UID:gce-scale-cluster-minion-heapster,APIVersion:,ResourceVersion:,FieldPath:,},Reason:OOMKilling,Message:Memory cgroup out of memory: Killed process 5072 (metrics-server) total-vm:829304kB, anon-rss:103180kB, file-rss:33136kB, shmem-rss:0kB, UID:65534 pgtables:380kB oom_score_adj:999,Source:EventSource{Component:kernel-monitor,Host:gce-scale-cluster-minion-heapster,},FirstTimestamp:2024-02-25 17:11:47 +0000 UTC,LastTimestamp:2024-02-25 17:11:47 +0000 UTC,Count:1,Type:Warning,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
My local NPD only uses about 10Mi of memory.
The problem is not NPD but metrics-server: the OOM-killed process here is metrics-server. I misread the log.
EDITED: sorry for misreading.
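For what it's worth, the killed process can be read straight out of the event message. A minimal sketch of extracting it; the regexp is my own approximation of the message format in the logs above, not NPD's actual kernel-monitor rule:

```go
// oomwho.go: pull the killed process out of a kernel OOM-kill message,
// to avoid misreading which container actually hit its limit.
package main

import (
	"fmt"
	"regexp"
)

// Approximation of the OOM-kill message format seen in the logs above.
var oomRe = regexp.MustCompile(`Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB`)

func main() {
	msg := "Memory cgroup out of memory: Killed process 4550 (metrics-server) total-vm:834776kB, anon-rss:100100kB, file-rss:33900kB, shmem-rss:0kB"
	m := oomRe.FindStringSubmatch(msg)
	if m == nil {
		fmt.Println("no OOM-kill match")
		return
	}
	// For the first event above this prints:
	//   pid=4550 comm=metrics-server anon-rss=100100kB
	fmt.Printf("pid=%s comm=%s anon-rss=%skB\n", m[1], m[2], m[4])
}
```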
Do we have a benchmark for NPD memory usage?
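Lacking one, a rough way to eyeball it is to sample the NPD process's resident memory on a node over time. A hypothetical helper (Linux only, reading VmRSS from /proc/<pid>/status), not an existing NPD tool:

```go
// rss_sample.go: periodically print a process's resident memory (VmRSS).
// Usage: rss_sample <pid>, e.g. the node-problem-detector pid on the node.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// vmRSS returns the VmRSS line value (e.g. "10240 kB") for the given pid.
func vmRSS(pid string) (string, error) {
	f, err := os.Open("/proc/" + pid + "/status")
	if err != nil {
		return "", err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(sc.Text(), "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found for pid %s", pid)
}

func main() {
	pid := os.Args[1]
	for {
		rss, err := vmRSS(pid)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Println(time.Now().Format(time.RFC3339), rss)
		time.Sleep(10 * time.Second)
	}
}
```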
/close
The log indicates that metrics-server was restarted and NPD just detected it.
It is not an NPD issue.
Edited: I may have misread the log.
@pacoxu: Closing this issue.
/reopen
The CI still shows that NPD is getting restarted.
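To double-check what the CI dashboards show, a quick client-go sketch that prints restart counts for the NPD pods; the namespace and label selector are guesses, adjust for the test cluster:

```go
// npd_restarts.go: list node-problem-detector pods and print per-container
// restart counts, to confirm the restarts the CI reports.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Namespace and label selector are assumptions about the deployment.
	pods, err := cs.CoreV1().Pods("kube-system").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=node-problem-detector"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, st := range p.Status.ContainerStatuses {
			fmt.Printf("%s/%s restarts=%d\n", p.Name, st.Name, st.RestartCount)
		}
	}
}
```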
@pacoxu: Reopened this issue.
/close
Looks like https://github.com/kubernetes/kubernetes/issues/123328 has already been resolved. Feel free to reopen if the issue persists.
@wangzhen127: Closing this issue.
v0.8.13 may have another problem (I may open another issue if it is not related).
Another case is the https://testgrid.k8s.io/sig-release-master-informing#gce-master-scale-performance flakes, with errors like:
It is tracked in https://github.com/kubernetes/kubernetes/issues/123328.