Open rkojedzinszky opened 1 month ago
It seems that setting drop_infra_ctr = false in crio.conf also solves the problem.
Cc @kolyshkin, the author of the mentioned Pull Request.
From a cursory look, commit eac1257 does not change the metrics being reported. If you take a look at the (removed) needNet method and its usage in the code before the commit, you will see that the metrics were dropped (after collecting) using the same criterion (checking if the "io.kubernetes.container.name" label is "POD").
To fix the issue, I guess the criterion used for excluding network metrics (i.e. the second argument of common.RemoveNetMetrics) should be different depending on whether cri-o uses an infra container or not. Perhaps @haircommander can shed more light.
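A minimal sketch of what such a criterion could look like, not the actual cadvisor code. The helper shouldRemoveNetMetrics and the podHasInfraContainer flag are hypothetical names; how cadvisor would learn whether cri-o keeps an infra container is left open here. It also assumes the existing check drops network metrics for every container whose label is not "POD".

```go
package main

import "fmt"

// Label carrying the container name; the value "POD" marks the infra container.
const containerNameLabel = "io.kubernetes.container.name"

// shouldRemoveNetMetrics is a hypothetical helper that computes the boolean
// one might pass as the second argument of common.RemoveNetMetrics.
// podHasInfraContainer is an assumed input, not an existing cadvisor field.
func shouldRemoveNetMetrics(labels map[string]string, podHasInfraContainer bool) bool {
	if !podHasInfraContainer {
		// No infra container exists to carry the pod's network stats,
		// so do not drop them from the regular containers.
		return false
	}
	// Assumed existing behaviour: only the "POD" container keeps network metrics.
	return labels[containerNameLabel] != "POD"
}

func main() {
	app := map[string]string{containerNameLabel: "app"}
	infra := map[string]string{containerNameLabel: "POD"}

	fmt.Println(shouldRemoveNetMetrics(app, true))   // regular container, infra present
	fmt.Println(shouldRemoveNetMetrics(infra, true)) // infra container itself
	fmt.Println(shouldRemoveNetMetrics(app, false))  // regular container, infra dropped
}
```

Note that keeping network metrics on every container of a pod without an infra container would report the same pod-level counters more than once, so a real fix would probably pick a single container per pod.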
I think this piece of code will not work for CRI-O. If the infra container has empty network metrics, the crio handler uses a running container in the pod to gather the metrics. Therefore, metrics must be collected from all containers to ensure that, if there's a running container in the pod, the necessary metrics are gathered.
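The fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not the crio handler's real code: the podContainer type, its fields, and pickNetworkSource are all hypothetical.

```go
package main

import "fmt"

// podContainer is a hypothetical, simplified view of a container in a pod.
type podContainer struct {
	Name        string
	PID         int
	Running     bool
	IsInfra     bool
	HasNetStats bool // false models "the infra container has empty network metrics"
}

// pickNetworkSource returns the container to read pod network stats from:
// the infra container if it has stats, otherwise the first running
// non-infra container. The second return value is false if none qualifies.
func pickNetworkSource(pod []podContainer) (podContainer, bool) {
	for _, c := range pod {
		if c.IsInfra && c.HasNetStats {
			return c, true
		}
	}
	for _, c := range pod {
		if !c.IsInfra && c.Running {
			return c, true
		}
	}
	return podContainer{}, false
}

func main() {
	pod := []podContainer{
		{Name: "POD", IsInfra: true},            // infra container with empty stats
		{Name: "sidecar", PID: 4100},            // not running
		{Name: "app", PID: 4242, Running: true}, // usable fallback
	}
	if src, ok := pickNetworkSource(pod); ok {
		fmt.Printf("read network stats via %s (pid %d)\n", src.Name, src.PID)
	}
}
```

The point of the sketch is that the fallback only works if non-infra containers are still considered for network metrics, which is why dropping them by label alone would break it.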
This is what I meant in the comment above -- find a way to see if an infra container is used, and fix the second argument to common.RemoveNetMetrics accordingly.
cri-o should still be creating an empty cgroup for the infra container so cadvisor is aware of it, and then reporting the PID of the infra container as being one of the other containers in the pod (so network metrics can be collected). It's possible something in that broke; we should make sure cadvisor sees the infra container cgroup (and that cri-o creates it).
I actually think https://github.com/google/cadvisor/commit/ca820b635076e6d7bfb85b39202836157966cb7b would fix this. @iwankgb do you think it'd be possible to cherry-pick this to a 0.49.1 (that we create after the pick) so we can pull into kubernetes 1.30?
@haircommander I would love to help, but I am not able to cut a release. Someone from Google (@bobbypage, is this still you?) needs to do this.
It seems that https://github.com/google/cadvisor/commit/eac1257f76a4a55bb2bf836f41a577a2fcb148a4 breaks network metric collection. At least, with crio-1.30 with defaults.
If I apply the simple diff, I can get container metrics again:
The command used to test this:
Perhaps cri-o does not have or keep a container named POD?