google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Missing metrics in `cgroup` v2 #3062

Open cyrus-mc opened 2 years ago

cyrus-mc commented 2 years ago

This might be related to https://github.com/google/cadvisor/issues/3026 (which I am not sure has been released yet).

On nodes running cgroup v1, metrics such as `container_cpu_cfs_throttled_*` are returned. If cAdvisor is run on nodes with cgroup v2 enabled, those metrics are not returned.

There could be others, but these are the ones I noticed while attempting to troubleshoot an issue. To verify, I ran the latest of the 0.39.x and 0.43.x releases, and both exhibited the same behavior.

cyrus-mc commented 2 years ago

@ysksuzuki pinging you here to see if this is related to #3026 that was recently closed, and I think the fix is awaiting release.

chrstphfrtz commented 1 year ago

Is there any update on this? We have the same problem running cAdvisor v0.46.0 on our cluster. Changing back to cgroup v1 also restores other metrics like `container_memory_max_usage_bytes`.

sli720 commented 1 year ago

Is there any update? We are also missing these metrics when using cgroup v2.

mindw commented 1 year ago

Spent some time looking into the code. It seems the underlying runc libcontainer library doesn't populate `MemoryStats.Usage.MaxUsage` when cgroup v2 is used, so the code at https://github.com/google/cadvisor/blob/fdd3d9182bea6f7f11e4f934631c4abef3aa0584/container/libcontainer/handler.go#L801 is insufficient. A possible solution would be to set `ret.Memory.MaxUsage = s.MemoryStats.Stats["peak"]` inside the `if cgroups.IsCgroup2UnifiedMode()` block. I don't have a v2 system handy right now to test with :(
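The suggestion above can be sketched in isolation. The types here are simplified stand-ins for libcontainer's real structs, so this is only an illustration of the proposed branch, not the actual cAdvisor code:

```go
package main

import "fmt"

// MemoryStats is a simplified stand-in for the libcontainer type
// (the real one lives in opencontainers/runc's cgroups package).
type MemoryStats struct {
	MaxUsage uint64            // populated by the cgroup v1 code path
	Stats    map[string]uint64 // raw per-key values, incl. memory.peak
}

// maxUsage sketches the proposed fix: on cgroup v2, read the peak
// from the "peak" key (backed by the memory.peak file, kernel >= 5.19)
// instead of the v1-only MaxUsage field, which v2 never fills in.
func maxUsage(s MemoryStats, cgroup2 bool) uint64 {
	if cgroup2 {
		return s.Stats["peak"] // zero if the kernel lacks memory.peak
	}
	return s.MaxUsage
}

func main() {
	v2 := MemoryStats{Stats: map[string]uint64{"peak": 123456}}
	fmt.Println(maxUsage(v2, true)) // 123456
}
```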

mindw commented 1 year ago

Went down the rabbit hole a bit further. It seems `peak` is only available since kernel 5.19 (specifically commit https://github.com/torvalds/linux/commit/8e20d4b332660a32e842e20c34cfc3b3456bc6dc), so the next step would be to test on a 6.1 kernel.

haircommander commented 11 months ago

I'm beginning to fix this in https://github.com/opencontainers/runc/pull/4038

micpjwi commented 4 months ago

> I'm beginning to fix this in opencontainers/runc#4038

I think this fix made its way into runc 1.11.0. Since then, I've seen that COS-113 mentions runc 1.12.0, and so does cAdvisor 0.49.0.

But I guess the fix suggested here would still need to be implemented before the metric would be exposed by cAdvisor. Would anyone know the status of this?

msannikov commented 1 month ago

This seems to be fixed. I've seen both `container_cpu_cfs_throttled_*` and `container_memory_max_usage_bytes` present and non-zero on a node running kubelet v1.30 (which includes cAdvisor 0.49.0) with cgroup v2.
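For anyone wanting to spot-check this without a full Prometheus setup, here is a small illustrative helper (not part of cAdvisor; the function name is my own) that counts non-zero samples of a metric in the text exposition format served by cAdvisor's `/metrics` endpoint:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nonZeroSamples counts samples of the named metric whose value is
// non-zero in a Prometheus text-format payload.
func nonZeroSamples(metrics, name string) int {
	n := 0
	for _, line := range strings.Split(metrics, "\n") {
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := line[len(name):]
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue // a different metric sharing this prefix
		}
		// Skip past the label set so label values containing
		// spaces do not confuse the field split.
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest) // value [timestamp]
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil && v != 0 {
			n++
		}
	}
	return n
}

func main() {
	sample := `container_memory_max_usage_bytes{container="app"} 1.2582912e+07 1700000000000
container_memory_max_usage_bytes{container="idle"} 0 1700000000000`
	fmt.Println(nonZeroSamples(sample, "container_memory_max_usage_bytes")) // 1
}
```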

wallrj commented 1 month ago

> Went into the rabbit hole a bit further. It seems peak is only available since kernel 5.19 (specifically commit torvalds/linux@8e20d4b). So, next would be to test on 6.1 kernel.

@mindw Thanks for digging. I stumbled across your comments while trying to measure the peak memory use of cert-manager components. I was testing on Kind in a WSL2 virtual machine on Windows and observed `container_memory_max_usage_bytes` reporting only zero values.

The default WSL2 kernel is v5.15, but happily a new WSL2 v6 kernel is due to be released soon, so I'll report back when it's available.