google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

container_spec_cpu_quota is missing #3154

Open jzeng4 opened 1 year ago

jzeng4 commented 1 year ago

We use the cgroup metrics generated by cadvisor. Through our experiments, we found that some metrics are missing for some of our application containers. Those metrics include:

container_spec_cpu_quota
container_cpu_cfs_periods_total
container_cpu_cfs_throttled_periods_total
container_cpu_cfs_throttled_seconds_total

It seems that all of them are related to CPU. I wonder if I need to specify some configuration to enable those metrics per application?

cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="Unknown",kernelVersion="5.4.189.1-rolling-lts-linkedin",osVersion="CentOS Linux 7 (Core)"} 1

jzeng4 commented 1 year ago

We enabled the soft quota for cpu. Not sure if this matters?

jzeng4 commented 1 year ago

I found the reasons why the above metrics are missing:

If "cpu.cfs_quota_us" is "-1" (i.e. the CPU soft quota is enabled), the cadvisor source code doesn't read the value, so "spec.Cpu.Quota" is zero in this case.

For the other missing metrics (container_cpu_cfs_periods_total, container_cpu_cfs_throttled_periods_total, container_cpu_cfs_throttled_seconds_total), since "spec.Cpu.Quota" is zero, those metrics are skipped, e.g. https://github.com/google/cadvisor/blob/master/metrics/prometheus.go#L193
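A minimal Go sketch of that gating pattern (simplified and condensed for illustration; the cgroup path and type names are assumptions, not the verbatim cadvisor source):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// CpuSpec mirrors the relevant part of the container spec.
type CpuSpec struct {
	Quota uint64
}

// readCPUQuota reads cpu.cfs_quota_us from a cgroup v1 directory.
// A value of -1 means "no hard quota", and in that case the quota is
// left at zero, analogous to spec.Cpu.Quota staying unset in cadvisor.
func readCPUQuota(cpuRoot string) CpuSpec {
	var spec CpuSpec
	raw, err := os.ReadFile(filepath.Join(cpuRoot, "cpu.cfs_quota_us"))
	if err != nil {
		return spec
	}
	val := strings.TrimSpace(string(raw))
	if val == "" || val == "-1" {
		return spec // unlimited: Quota stays 0
	}
	if q, err := strconv.ParseUint(val, 10, 64); err == nil {
		spec.Quota = q
	}
	return spec
}

func main() {
	spec := readCPUQuota("/sys/fs/cgroup/cpu")
	// The Prometheus collector only emits the CFS metrics when a quota is set,
	// which is why container_spec_cpu_quota and container_cpu_cfs_* disappear
	// for containers without a CPU limit.
	if spec.Quota != 0 {
		fmt.Println("would export container_spec_cpu_quota and container_cpu_cfs_* metrics")
	} else {
		fmt.Println("no CFS quota: CFS metrics are skipped")
	}
}
```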

davidg-datascene commented 1 year ago

We are hitting the same problem. For reference I've logged an issue with Azure as AKS is our cluster stack - https://github.com/Azure/AKS/issues/3216 but the issue probably falls under this domain.

lusson-luo commented 1 year ago

> I found the reasons why the above metrics are missing:
>
> If "cpu.cfs_quota_us" is "-1" (i.e. the CPU soft quota is enabled), the cadvisor source code doesn't read the value, so "spec.Cpu.Quota" is zero in this case.
>
> For the other missing metrics (container_cpu_cfs_periods_total, container_cpu_cfs_throttled_periods_total, container_cpu_cfs_throttled_seconds_total), since "spec.Cpu.Quota" is zero, those metrics are skipped, e.g. https://github.com/google/cadvisor/blob/master/metrics/prometheus.go#L193

So, how did you resolve this?

davidg-datascene commented 1 year ago

Any update on this? Thanks

lusson-luo commented 1 year ago

@davidg-datascene I use kube_pod_container_resource_requests{resource="cpu"} to replace container_spec_cpu_quota

davidg-datascene commented 1 year ago

@lusson-luo - Hi, thanks, but we still need the throttled metrics, i.e. container_cpu_cfs_throttled_periods_total and container_cpu_cfs_throttled_seconds_total.

lusson-luo commented 1 year ago

Sorry, I don't know about those two metrics.

lusson-luo commented 1 year ago

> @davidg-datascene I use kube_pod_container_resource_requests{resource="cpu"} to replace container_spec_cpu_quota

I wrote that incorrectly: use kube_pod_container_resource_limits{resource="cpu"} to replace container_spec_cpu_quota.

moritzschmitz-oviva commented 1 year ago

For us the metric becomes empty (or not measured) when the container does not have a CPU limit. Does that make sense?

davidg-datascene commented 1 year ago

> For us the metric becomes empty (or not measured) when the container does not have a CPU limit. Does that make sense?

I thought this at first, but the deployment spec contains CPU limits, and I tested changes to the Mi values as well, which made no difference.

spec of deployment:

resources:
  limits:
    cpu: "3"
    memory: 3Gi
  requests:
    cpu: "1"
    memory: 3Gi

The cluster is getting upgraded to AKS 1.24 in the next few weeks, and all the VM worker nodes will need replacing with a newer image/version. I'll check again when the nodes are replaced to see if the issue persists.

davidg-datascene commented 1 year ago

I've found where the issue lies, in our case at least. In our terraform that defines the Azure AKS node pools we set a kubelet_config. For every node pool where this has been set we are missing the CFS quota metrics. If I log into one of the nodes and check the kubelet config, I can see the value is false:

terraform config for the kubelet:

kubelet_config:
  container_log_max_size_mb: 100

On the node:

cat ./etc/default/kubeletconfig.json | grep CFS
    "cpuCFSQuota": false,
cat ./etc/default/kubeletconfig.json | grep -i log
    "containerLogMaxSize": "100M",   <-- we are only changing this value, but cpuCFSQuota ends up false, hence no CFS stats

Where we are not setting a kubelet_config there is no kubeletconfig.json file, so the node accepts the defaults per the Azure documentation (https://learn.microsoft.com/en-us/azure/aks/custom-node-configuration), meaning cpuCFSQuota is true and you get the stats. The fix, not yet tested, is that we should be able to set cpu_cfs_quota_enabled (https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster_node_pool) in the terraform code to enable the CFS stats.
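A small Go sketch of that check, reading the kubeletconfig.json path quoted above (treat the path as an assumption about the AKS node image) and reporting the effective cpuCFSQuota setting:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Path taken from the comment above; adjust for your node image.
	raw, err := os.ReadFile("/etc/default/kubeletconfig.json")
	if err != nil {
		panic(err)
	}
	var cfg struct {
		CPUCFSQuota *bool `json:"cpuCFSQuota"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	switch {
	case cfg.CPUCFSQuota == nil:
		fmt.Println("cpuCFSQuota not set; kubelet default (true) applies")
	case *cfg.CPUCFSQuota:
		fmt.Println("cpuCFSQuota enabled: CFS quota metrics should be available")
	default:
		fmt.Println("cpuCFSQuota disabled: no CFS quota, so no CFS metrics")
	}
}
```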

ervinb commented 8 months ago

I went deep on this, and I was banging my head over why container_cpu_cfs_throttled_seconds_total was the only one missing while the others were there.

Turns out the Helm chart for kube-prometheus-stack drops that metric by default.

Just something to check for folks arriving from Google.

tim-hilt commented 8 months ago

@ervinb same for me! Lesson learned: Always check on the most fundamental level, which in this case was "are the limits actually set?"

Good idea to leave that information here.

If the cadvisor metrics for limits are still not reliable enough, there are alternative ways to get the limits as a Prometheus metric, for example kube_pod_container_resource_limits{resource="cpu"} from kube-state-metrics, as mentioned earlier in this thread.

harche commented 5 months ago

If the limits are not set on the pod, then cpu.cfs_quota_us is set to -1. In that case there isn't anything to read related to quota or throttling, because the quota itself is disabled. That is why cadvisor first checks whether cpu.cfs_quota_us is -1 and, if so, doesn't set spec.Cpu.Quota. As a result of spec.Cpu.Quota not being set, the metrics in question here (such as container_cpu_cfs_periods_total) aren't reported by cadvisor: when the limits aren't set, anything related to quota or throttling has no meaning.
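As a concrete illustration (the numbers are mine, assuming the default 100ms CFS period): a CPU limit of "3", as in the deployment spec earlier in the thread, maps to cpu.cfs_quota_us = 300000, while no limit leaves the file at -1, which is exactly the value cadvisor skips.

```go
package main

import "fmt"

func main() {
	const cfsPeriodUs = 100000 // default CFS period: 100ms

	// With a CPU limit of "3", the runtime writes quota = limit * period.
	limitCores := 3.0
	fmt.Printf("cpu.cfs_quota_us with a limit of 3 CPUs: %d\n", int64(limitCores*cfsPeriodUs)) // 300000

	// With no CPU limit the file contains -1 ("no quota").
	fmt.Println("cpu.cfs_quota_us with no limit: -1")
}
```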

IMHO, cadvisor is behaving as expected and this may not be a bug.

/cc @bobbypage WDYT?

jan-kantert commented 4 months ago

> If the limits are not set on the pod, then cpu.cfs_quota_us is set to -1. In that case there isn't anything to read related to quota or throttling, because the quota itself is disabled. That is why cadvisor first checks whether cpu.cfs_quota_us is -1 and, if so, doesn't set spec.Cpu.Quota. As a result of spec.Cpu.Quota not being set, the metrics in question here (such as container_cpu_cfs_periods_total) aren't reported by cadvisor: when the limits aren't set, anything related to quota or throttling has no meaning.
>
> IMHO, cadvisor is behaving as expected and this may not be a bug.

This seems odd to me. Maybe I am missing something. Imagine the following scenario:

  1. Node has only 1 CPU (for simplicity)
  2. Pod A has CPU requests of 50m and no CPU limit
  3. Pod B has CPU requests of 950m and a limit of 1
  4. Both pods are CPU bound and contend for CPU resources.

What will happen here is that Pod A will get throttled massively. I would expect container_cpu_cfs_throttled_seconds_total to go up as well. This would happen due to CPU starvation at the node level, not due to limits. I guess the metric would only appear if I set a limit of 1 CPU on Pod A.

Am I missing something? If the case above is not covered by cadvisor, how could I monitor burstable (or best-effort) pods getting throttled by the scheduler?

harche commented 4 months ago

Setting a CPU request indicates the minimum CPU required for a pod to run efficiently, while setting a CPU limit caps the maximum CPU usage. When a pod has no CPU limit (cpu.cfs_quota_us = -1), it's not bound by the CFS quota, implying it can utilize CPU resources beyond its request, up to what's available on the node, subject to scheduling decisions.

cAdvisor's approach to not report metrics like container_cpu_cfs_periods_total for pods without a defined CPU limit is logical because these metrics track usage against a specific limit. Since there's no limit for Pod A, the concept of CFS quota-based throttling doesn't apply directly. However, the underlying issue you're hinting at is more about CPU resource contention among pods rather than traditional CFS throttling.

To monitor situations where burstable or best-effort pods face "effective throttling" due to resource contention (as opposed to CFS quota limits), you might need a broader set of metrics. You can analyze container_cpu_usage_seconds_total in relation to CPU requests.
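For instance, a hedged sketch of that comparison, querying Prometheus over its HTTP API with the Go client. The address is a placeholder, the PromQL label matching assumes standard kubelet/cadvisor and kube-state-metrics labels, and this is one possible heuristic rather than cadvisor functionality:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point it at your Prometheus instance.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Ratio of observed CPU usage to requested CPU, per container.
	// Values persistently close to (or above) 1 on a busy node suggest
	// contention-based "effective throttling" even without a CFS limit.
	query := `sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
  / on (namespace, pod, container)
  kube_pod_container_resource_requests{resource="cpu"}`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```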

jan-kantert commented 4 months ago

> Setting a CPU request indicates the minimum CPU required for a pod to run efficiently, while setting a CPU limit caps the maximum CPU usage. When a pod has no CPU limit (cpu.cfs_quota_us = -1), it's not bound by the CFS quota, implying it can utilize CPU resources beyond its request, up to what's available on the node, subject to scheduling decisions.
>
> cAdvisor's approach to not report metrics like container_cpu_cfs_periods_total for pods without a defined CPU limit is logical because these metrics track usage against a specific limit. Since there's no limit for Pod A, the concept of CFS quota-based throttling doesn't apply directly. However, the underlying issue you're hinting at is more about CPU resource contention among pods rather than traditional CFS throttling.

Our current workaround is to simply set the CPU limit equal to the number of CPUs on the node. That provides perfect metrics. I wish there were a way to get them without that hack, especially in mixed clusters where we have different types of nodes, and the limit sometimes prevents pods from bursting.

(Theoretically, we could set limit = 1000, but in practice the cluster scheduler will prevent that.)

> To monitor situations where burstable or best-effort pods face "effective throttling" due to resource contention (as opposed to CFS quota limits), you might need a broader set of metrics. You can analyze container_cpu_usage_seconds_total in relation to CPU requests.

Unfortunately, that did not turn out to be as helpful in practice, simply because you cannot distinguish between:

(a) the container is using little CPU because it does not need more, and
(b) the container would use more CPU but cannot get it due to contention on the node.

This is quite problematic/challenging when combined with vertical pod autoscalers, which would always assume case (a) and then would not increase reservations.

Would you suggest that we write another component to expose the same metric for our use case? We would like to follow the recommendation to not set CPU limits unless we explicitly need them. However, this currently limits our ability to use VPAs without massive overprovisioning, and our ability to monitor for starvation.