google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

cAdvisor crashed due to OOM #2856

Open MonicaMagoniCom opened 3 years ago

MonicaMagoniCom commented 3 years ago

We have deployed cAdvisor v0.39.0 as a DaemonSet in our Kubernetes cluster, where the nodes run version 1.14.10-gke.42. Even though we have disabled many metrics, the cAdvisor instances keep getting OOM-killed.

Here is our configuration:

resources:
  limits:
    cpu: 2500m
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 200Mi

As you can see in the attached image, memory usage suddenly peaks (for no apparent reason) and then the container crashes, since the memory limit is 700Mi.

[image: memory-cadvisor]

iwankgb commented 3 years ago

How many containers per node are you running?

iwankgb commented 3 years ago

I guess that #2840 might be related. Looks like we may need to bisect 0.36 and 0.37.

MonicaMagoniCom commented 3 years ago

> How many containers per node are you running?

We have just one container per node.

MonicaMagoniCom commented 3 years ago

> I guess that #2840 might be related. Looks like we may need to bisect 0.36 and 0.37.

Why does it seem to be related? I'm running 0.39

MonicaMagoniCom commented 3 years ago

I added these flags:

As you can see in the attached image, there is no longer a sudden increase in memory, but memory usage is still high and there are still restarts due to OOM. The memory increase is linear on the nodes with fewer resources (which means less load), but it is critical on the bigger nodes.

The biggest nodes of the cluster have the following values:

| | Capacity | Allocatable | Total requested |
|---|---|---|---|
| CPU | 8 CPU | 7.91 CPU | 6.15 CPU |
| Memory | 31.62 GB | 27.86 GB | 20.23 GB |

The smallest one:

| | Capacity | Allocatable | Total requested |
|---|---|---|---|
| CPU | 4 CPU | 3.92 CPU | 3.03 CPU |
| Memory | 16.8 GB | 13.94 GB | 6.77 GB |

[image: Schermata da 2021-05-12 14-26-21]

jlange-koch commented 3 years ago

We are experiencing similar behaviour on our GKE cluster.

Curiously, it only happens on nodes that use containerd as the runtime (Container-Optimized OS with Containerd (cos_containerd)). When cAdvisor runs on nodes that use the Docker runtime (Container-Optimized OS with Docker (cos) (default)), it behaves fine.

cAdvisor image: latest
Kubernetes version: 1.18.20-gke.501

EDIT: This seems to be fixed when using version v0.40.0
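
The report above (problem seen with the `latest` image, apparently fixed with v0.40.0) suggests pinning the image tag explicitly rather than relying on `latest`. A minimal sketch of the relevant DaemonSet fragment follows; the registry path `gcr.io/cadvisor/cadvisor` and the container name are assumptions, not taken from this thread:

```yaml
# Fragment of a DaemonSet pod template (spec.template.spec).
# Registry path and container name are illustrative assumptions.
containers:
  - name: cadvisor
    # Pin the version reported as working instead of using "latest".
    image: gcr.io/cadvisor/cadvisor:v0.40.0
```

Pinning the tag also makes it easier to tell which cAdvisor version a given OOM report refers to.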

wrathchild14 commented 5 months ago

For me, the error was caused by the flag --storage_duration=0s, which made cAdvisor store the metrics data indefinitely. I set it to 5 seconds and the OOM error disappeared.
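
For reference, the change described above could look roughly like this in a DaemonSet container spec. This is only a sketch: the container name is an assumption, and the 5s value is simply the one reported in the comment, not a recommended setting:

```yaml
# Fragment of a DaemonSet pod template (spec.template.spec).
# Container name is an illustrative assumption.
containers:
  - name: cadvisor
    args:
      # Keep in-memory samples for 5 seconds, replacing the
      # --storage_duration=0s setting reported to cause the OOM.
      - --storage_duration=5s
```

Whether 5s is enough depends on how often the metrics are scraped, so the value may need tuning per setup.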