google / cadvisor

Analyzes resource usage and performance characteristics of running containers.
Other
16.92k stars 2.31k forks source link

POD Networking metrics missing with crio-1.30 and kubelet-1.30 #3577

Open rkojedzinszky opened 1 month ago

rkojedzinszky commented 1 month ago

It seems that https://github.com/google/cadvisor/commit/eac1257f76a4a55bb2bf836f41a577a2fcb148a4 breaks network metric collection. At least, with crio-1.30 with defaults.

If I apply the simple diff, I can get container metrics again:

--- a/vendor/github.com/google/cadvisor/container/crio/handler.go
+++ b/vendor/github.com/google/cadvisor/container/crio/handler.go
@@ -155,6 +155,7 @@ func newCrioContainerHandler(
        // reported. This stops metrics being reported multiple times for each
        // container in a pod.
        metrics := common.RemoveNetMetrics(includedMetrics, cInfo.Labels["io.kubernetes.container.name"] != "POD")
+       metrics = includedMetrics

        libcontainerHandler := containerlibcontainer.NewHandler(cgroupManager, rootFs, cInfo.Pid, metrics)

The command used to test this:

curl -s -H "Authorization: bearer $TOKEN" -k https://192.168.111.122:10250/metrics/cadvisor | grep "^container_network_transmit_packets_total"

Perhaps, cri-o does not have or keep a POD named container?

rkojedzinszky commented 1 month ago

It seems that, if I set drop_infra_ctr = false in crio.conf, it also solves the problem.

kwilczynski commented 4 weeks ago

Cc @kolyshkin, the author of the mentioned Pull Request.

kolyshkin commented 4 weeks ago

From a cursory look, commit eac1257 does not change the metrics being reported. If you take a look at the (removed) needNet method and its usage in the code before the commit, you will see that the metrics were dropped (after collecting) using the same criterion (checking if "io.kubernetes.container.name" label is "POD").

kolyshkin commented 4 weeks ago

To fix the issue, I guess the criteria used for excluding network metrics (i.e. the second argument of common.RemoveNetMetrics) should be different depending on whether cri-o uses infra container or not. Perhaps @haircommander can shed more light.

sohankunkerkar commented 3 weeks ago

I think this piece of code will not work for CRI-O. If the infra container has empty network metrics, the crio handler uses a running container in the pod to gather the metrics. Therefore, metrics must be collected from all containers to ensure that if there's a running container in the podthe necessary metrics are gathered.

kolyshkin commented 3 weeks ago

I think this piece of code will not work for CRI-O. If the infra container has empty network metrics, the crio handler uses a running container in the pod to gather the metrics. Therefore, metrics must be collected from all containers to ensure that if there's a running container in the podthe necessary metrics are gathered.

This is what I meant in the comment above -- find a way to see if infra container is used, and fix the second argument to common.RemoveNetMetrics accordingly.

haircommander commented 2 weeks ago

cri-o should still be creating an empty cgroup for the infra container so cadvisor is aware of it, and then reporting the PID of the infra container as being one of the other containers in the pod (so network metrics can be collected). It's possible something in that broke, we should make sure cadvisor is sees the infra container cgroup (and that cri-o create it)

haircommander commented 1 week ago

I actually think https://github.com/google/cadvisor/commit/ca820b635076e6d7bfb85b39202836157966cb7b would fix this. @iwankgb do you think it'd be possible to cherry-pick this to a 0.49.1 (that we create after the pick) so we can pull into kubernetes 1.30?

iwankgb commented 5 days ago

@haircommander I would love to help, but I am not able to cut a release. Someone from Google (@bobbypage, is this still you?) needs to do this.