darkl0rd opened 2 years ago
Just tested this with 0.44.0 - issue persists.
I believe the problem is that we are updating the OOMEvent count on the container itself on this line.
To my understanding, when an OOM event occurs the container is destroyed, effectively removing it from the metric data.
An example from my testing in AWS:

```
I0525 14:24:39.848012 1 manager.go:1223] Created an OOM event in container "/ecs/ID/CONTAINER_ID" at 2022-05-25 14:24:40.574306117 +0000 UTC m=+135.585319105
I0525 14:24:39.889468 1 manager.go:1044] Destroyed container: "/ecs/ID/CONTAINER_ID" (aliases: [alias], namespace: "docker")
```
So we increment the OOM metric and then deregister it :(
Is my understanding of your implementation correct, @kragniz?
If the expectation is for the container to be restarted after OOM, this makes the metric unusable in environments where containers are always replaced rather than restarted (such as ECS).
I have the same issue.
One of my containers is running out of memory; I can see the OOM event in syslog and kmsg, but `container_oom_events_total` is always 0. Any clues on how to get it working, or another way to detect OOM in containers?
compose.yml:

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.46.0
  container_name: cadvisor
  network_mode: host
  user: root
  privileged: true
  healthcheck:
    disable: true
  restart: unless-stopped
  ports:
    - '8080:8080'
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:rw
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
  command:
    - --url_base_prefix=/bRs47jH13fdsBFMQ93/cadvisor
    - --housekeeping_interval=15s
    - --docker_only=true
    - --store_container_labels=false
    - --enable_metrics=disk,diskIO,cpu,cpuLoad,process,memory,network,oom_event
  devices:
    - /dev/kmsg:/dev/kmsg
```
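For reference, a Prometheus alerting rule roughly like the following (group/alert names and the 5m window are just placeholders I picked) is what I would expect to catch these OOM kills; with the behaviour described above it never fires, because the counter is gone by the time the container has been replaced:

```yaml
# Sketch of an alert on the cAdvisor OOM counter; names and window are arbitrary.
groups:
  - name: cadvisor-oom
    rules:
      - alert: ContainerOOMEvent
        # Fires when the per-container OOM counter increases; stays silent
        # if the container (and its series) disappears right after the OOM.
        expr: increase(container_oom_events_total[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "OOM event in container {{ $labels.name }}"
```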
Having the same issue here.
The event is not showing up in Kubernetes either.
Last State: Terminated
Reason: Error
Exit Code: 137
Any idea how to solve this?
Thanks!
Also having the same issue with cAdvisor 0.46.0
Has there been any progress on this bug? I am encountering the same issue on k8s.
describe go-demo pod:
In k8s, new containers are always created to replace OOMKilled containers, so the `container_oom_events_total` metric will always be 0. When I kept the containers that had been deleted due to an OOMKill, I was able to query the metric.
In addition, if the cluster is running on minikube + docker, the container name obtained from `/dev/kmsg` has an extra `/docker/{{id}}` prefix, which does not match the container name being watched, so the metric is again always 0.
Hitting this issue in kubernetes as well. Commenting for visibility.
I have done various tests of OOMKills under Kubernetes. I have - so far - seen only one use-case where I have observed `container_oom_events_total > 0` (specifically `container_oom_events_total == 1`).
An OOMKill does not in all cases mean deletion of the container (which would deregister its `container_oom_events_total`). The main process of the container (aka `pid 1`) may fork one or more Linux processes (actually fork, not using the `exec` command). If one of these other processes gets OOMKilled and this does not cause `pid 1` to exit as well (at least not until the next cAdvisor scrape), the container will continue to live and you'll see `container_oom_events_total == 1`.
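To make that scenario concrete, a pod roughly like the following (a sketch; the name, image and memory limit are arbitrary, and it assumes cgroups v1 or a k8s version without cgroup-wide OOM group kill) keeps `pid 1` alive while a forked child gets OOMKilled, which is the one case where I saw the counter reach 1:

```yaml
# Rough repro of the "hidden OOMKill" case; all names/values are arbitrary.
apiVersion: v1
kind: Pod
metadata:
  name: hidden-oomkill-demo
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      resources:
        limits:
          memory: 64Mi
      # pid 1 is the outer shell. The inner shell/tail is a forked child that
      # allocates memory until the kernel OOM-kills it; pid 1 survives and
      # keeps sleeping, so the container lives on and the metric can reach 1.
      command: ["sh", "-c", "sh -c 'tail /dev/zero' || true; sleep 3600"]
```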
@tsipo track the referenced PR #3278 or help pushing it over the line.
See https://github.com/google/cadvisor/pull/3278#issuecomment-2083587616
We encountered this problem just now too.
@Creatone @bobbypage could I gently drag you to this very issue here about bugs with the OOM metrics? Since there is a PR potentially fixing this issue, please see the recent comments there: https://github.com/google/cadvisor/pull/3278#issuecomment-2292982675
Is there anything to be done to get this problem addressed / the PR reviewed?
An update on my previous comment: k8s `1.28` has enabled cgroup grouping (assuming `cgroups v2`) - see here. That means that the use-case I mentioned of "hidden OOMKills" (OOMKill of a container process which is not `pid 1`) should not happen anymore for k8s `>= 1.28` with `cgroups v2`.
Regardless of the k8s version or whether `cgroups v2` is used, as long as an OOM causes the container to be rebuilt, the OOM metric is lost. I think the kubelet or container manager will still monitor the system OOM events, then kill the container and rebuild it, and the OOM metric will be lost.
This is exactly my point: the only use-case where I have seen the OOM metric not being lost was removed in k8s 1.28.
Whether or not cadvisor should provide the OOM metric is a separate discussion. It is only relevant if the container is not deleted after being OOMKilled, which doesn't make a lot of sense for any managed container environment, to be honest.
BTW, OOM kills can be monitored using the `kube-state-metrics` metrics `kube_pod_container_status_last_terminated_exitcode` (a value of `137`, i.e. 128 + SIGKILL, indicates an OOMKill) and the recently-added `kube_pod_container_status_last_terminated_timestamp`. These do not go away, as the pod is not lost after the container is deleted and rebuilt.
On the node level, `node-exporter` provides `node_vmstat_oom_kill`, which is a counter of all processes - not containers - that were OOM killed.
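For example, a rule sketch combining the two kube-state-metrics series (assuming both are enabled in your deployment; exit code 137 is only a proxy for an OOMKill, as noted above, and the names and 5-minute window are placeholders):

```yaml
# Sketch only; group/alert names and the 300s window are arbitrary.
groups:
  - name: kube-oom-kills
    rules:
      - alert: ContainerRecentlyOOMKilled
        # Exit code 137 (128 + SIGKILL) as a proxy for an OOMKill, joined with
        # the last-terminated timestamp to only consider recent terminations.
        expr: |
          (kube_pod_container_status_last_terminated_exitcode == 137)
          and on (namespace, pod, container)
          (time() - kube_pod_container_status_last_terminated_timestamp < 300)
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) was recently OOMKilled"
```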
On my side, upgrading conmon solved the issue (minimum Debian 12).
Running Docker (Swarm): when OOM events occur, the counter never increases. For reference, the node-exporter metric (`node_vmstat_oom_kill`) does increase.
Running cAdvisor v0.43.0.