google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

container_spec_cpu_quota is missing #3154

Open jzeng4 opened 1 year ago

jzeng4 commented 1 year ago

We use the cgroup metrics generated by cadvisor. Through our experiments, we found that some metrics are missing for some of our application containers. Those metrics include:

container_spec_cpu_quota
container_cpu_cfs_periods_total
container_cpu_cfs_throttled_periods_total
container_cpu_cfs_throttled_seconds_total

It seems that all of them are related to CPU. I wonder if I need to specify some configuration to enable those metrics per application?

cadvisor_version_info{cadvisorRevision="",cadvisorVersion="",dockerVersion="Unknown",kernelVersion="5.4.189.1-rolling-lts-linkedin",osVersion="CentOS Linux 7 (Core)"} 1

jzeng4 commented 1 year ago

We enabled the soft quota for cpu. Not sure if this matters?

jzeng4 commented 1 year ago

I found the reasons why the above metrics are missing:

If "cpu.cfs_quota_us" is "-1" (i.e. the CPU soft quota is enabled), the cadvisor source code doesn't read the value, so "spec.Cpu.Quota" is zero in this case.

For the other missing metrics (container_cpu_cfs_periods_total, container_cpu_cfs_throttled_periods_total, container_cpu_cfs_throttled_seconds_total), since "spec.Cpu.Quota" is zero, those metrics are skipped, e.g. https://github.com/google/cadvisor/blob/master/metrics/prometheus.go#L193
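A minimal Go sketch of that gating pattern (simplified and condensed for illustration; the cgroup path and type names are assumptions, not the verbatim cadvisor source):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// CpuSpec mirrors the relevant part of the container spec.
type CpuSpec struct {
	Quota uint64
}

// readCPUQuota reads cpu.cfs_quota_us from a cgroup v1 directory.
// A value of -1 means "no hard quota", and in that case the quota is
// left at zero, analogous to spec.Cpu.Quota staying unset in cadvisor.
func readCPUQuota(cpuRoot string) CpuSpec {
	var spec CpuSpec
	raw, err := os.ReadFile(filepath.Join(cpuRoot, "cpu.cfs_quota_us"))
	if err != nil {
		return spec
	}
	val := strings.TrimSpace(string(raw))
	if val == "" || val == "-1" {
		return spec // unlimited: Quota stays 0
	}
	if q, err := strconv.ParseUint(val, 10, 64); err == nil {
		spec.Quota = q
	}
	return spec
}

func main() {
	spec := readCPUQuota("/sys/fs/cgroup/cpu")
	// The Prometheus collector only emits the CFS metrics when a quota is set,
	// which is why container_spec_cpu_quota and container_cpu_cfs_* disappear
	// for containers without a CPU limit.
	if spec.Quota != 0 {
		fmt.Println("would export container_spec_cpu_quota and container_cpu_cfs_* metrics")
	} else {
		fmt.Println("no CFS quota: CFS metrics are skipped")
	}
}
```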

davidg-datascene commented 1 year ago

We are hitting the same problem. For reference I've logged an issue with Azure as AKS is our cluster stack - https://github.com/Azure/AKS/issues/3216 but the issue probably falls under this domain.

lusson-luo commented 1 year ago

> I found the reasons why the above metrics are missing:
>
> If "cpu.cfs_quota_us" is "-1" (i.e. the CPU soft quota is enabled), the cadvisor source code doesn't read the value, so "spec.Cpu.Quota" is zero in this case.
>
> For the other missing metrics (container_cpu_cfs_periods_total, container_cpu_cfs_throttled_periods_total, container_cpu_cfs_throttled_seconds_total), since "spec.Cpu.Quota" is zero, those metrics are skipped, e.g. https://github.com/google/cadvisor/blob/master/metrics/prometheus.go#L193

So, how did you resolve this?

davidg-datascene commented 1 year ago

Any update on this? Thanks

lusson-luo commented 1 year ago

@davidg-datascene I use kube_pod_container_resource_requests{resource="cpu"} to replace container_spec_cpu_quota

davidg-datascene commented 1 year ago

@lusson-luo - Hi, thanks, but we still need the throttled metrics, i.e. container_cpu_cfs_throttled_periods_total and container_cpu_cfs_throttled_seconds_total.

lusson-luo commented 1 year ago

Sorry, I don't know about those two metrics.

lusson-luo commented 1 year ago

> @davidg-datascene I use kube_pod_container_resource_requests{resource="cpu"} to replace container_spec_cpu_quota

I wrote that incorrectly: use kube_pod_container_resource_limits{resource="cpu"} to replace container_spec_cpu_quota.

moritzschmitz-oviva commented 1 year ago

For us the metric becomes empty (or not measured) when the container does not have a CPU limit. Does that make sense?

davidg-datascene commented 1 year ago

> For us the metric becomes empty (or not measured) when the container does not have a CPU limit. Does that make sense?

I thought this at first, but the deployment spec contains CPU limits, and I tested changes to the Mi values as well, which made no difference.

spec of deployment:

resources:
  limits:
    cpu: "3"
    memory: 3Gi
  requests:
    cpu: "1"
    memory: 3Gi

The cluster is getting upgraded to AKS 1.24 in the next few weeks, and all the VM worker nodes will need replacing with a newer image/version. I'll check again when the nodes are replaced to see if the issue persists.

davidg-datascene commented 1 year ago

I've found where the issue lies, in our case at least. In our terraform that defines the Azure AKS node pools we set a kubelet_config. For every node pool where this has been set we are missing the CFS quota metrics. If I log into one of the nodes and check the kubelet config, I can see the value is false:

terraform config for the kubelet:

kubelet_config:
  container_log_max_size_mb: 100

On the node:

cat ./etc/default/kubeletconfig.json | grep CFS
    "cpuCFSQuota": false,
cat ./etc/default/kubeletconfig.json | grep -i log
    "containerLogMaxSize": "100M",   <-- we are only changing this value, but cpuCFSQuota ends up false, hence no CFS stats

Where we are not setting a kubelet_config there is no kubeletconfig.json file, so the node accepts the defaults per the Azure documentation (https://learn.microsoft.com/en-us/azure/aks/custom-node-configuration), meaning cpuCFSQuota is true and you get the stats. The fix, not yet tested, is that we should be able to set cpu_cfs_quota_enabled (https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster_node_pool) in the terraform code to enable the CFS stats.
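A small Go sketch of that check, reading the kubeletconfig.json path quoted above (treat the path as an assumption about the AKS node image) and reporting the effective cpuCFSQuota setting:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

func main() {
	// Path taken from the comment above; adjust for your node image.
	raw, err := os.ReadFile("/etc/default/kubeletconfig.json")
	if err != nil {
		panic(err)
	}
	var cfg struct {
		CPUCFSQuota *bool `json:"cpuCFSQuota"`
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	switch {
	case cfg.CPUCFSQuota == nil:
		fmt.Println("cpuCFSQuota not set; kubelet default (true) applies")
	case *cfg.CPUCFSQuota:
		fmt.Println("cpuCFSQuota enabled: CFS quota metrics should be available")
	default:
		fmt.Println("cpuCFSQuota disabled: no CFS quota, so no CFS metrics")
	}
}
```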

ervinb commented 8 months ago

I went deep on this, and I was banging my head over why container_cpu_cfs_throttled_seconds_total was the only one missing while the others were there.

Turns out the Helm chart for kube-prometheus-stack drops that metric by default.

Just something to check for folks arriving from Google.

tim-hilt commented 8 months ago

@ervinb same for me! Lesson learned: Always check on the most fundamental level, which in this case was "are the limits actually set?"

Good idea to leave that information here.

If the cadvisor metrics for limits are still not reliable enough, there are alternative ways to get the limits as a Prometheus metric, for example kube_pod_container_resource_limits{resource="cpu"} from kube-state-metrics, as mentioned earlier in this thread.

harche commented 5 months ago

If the limits are not set on the pod, then cpu.cfs_quota_us is set to -1. In that case there isn't anything to read related to quota or throttling, because the quota itself is disabled. That is why cadvisor first checks whether cpu.cfs_quota_us is -1 and, if so, doesn't set spec.Cpu.Quota. As a result of spec.Cpu.Quota not being set, the metrics in question here (such as container_cpu_cfs_periods_total) aren't reported by cadvisor: when the limits aren't set, anything related to quota or throttling has no meaning.
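As a concrete illustration (the numbers are mine, assuming the default 100ms CFS period): a CPU limit of "3", as in the deployment spec earlier in the thread, maps to cpu.cfs_quota_us = 300000, while no limit leaves the file at -1, which is exactly the value cadvisor skips.

```go
package main

import "fmt"

func main() {
	const cfsPeriodUs = 100000 // default CFS period: 100ms

	// With a CPU limit of "3", the runtime writes quota = limit * period.
	limitCores := 3.0
	fmt.Printf("cpu.cfs_quota_us with a limit of 3 CPUs: %d\n", int64(limitCores*cfsPeriodUs)) // 300000

	// With no CPU limit the file contains -1 ("no quota").
	fmt.Println("cpu.cfs_quota_us with no limit: -1")
}
```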

IMHO, cadvisor is behaving as expected and this may not be a bug.

/cc @bobbypage WDYT?

jan-kantert commented 4 months ago

> If the limits are not set on the pod, then cpu.cfs_quota_us is set to -1. In that case there isn't anything to read related to quota or throttling, because the quota itself is disabled. That is why cadvisor first checks whether cpu.cfs_quota_us is -1 and, if so, doesn't set spec.Cpu.Quota. As a result of spec.Cpu.Quota not being set, the metrics in question here (such as container_cpu_cfs_periods_total) aren't reported by cadvisor: when the limits aren't set, anything related to quota or throttling has no meaning.
>
> IMHO, cadvisor is behaving as expected and this may not be a bug.

This seems odd to me. Maybe I am missing something. Imagine the following scenario:

  1. Node has only 1 CPU (for simplicity)
  2. Pod A has CPU requests of 50m and no CPU limit
  3. Pod B has CPU requests of 950m and a limit of 1
  4. Both pods are CPU bound and contend for CPU resources.

What will happen here is that Pod A will get throttled massively. I would expect container_cpu_cfs_throttled_seconds_total to go up as well. This would happen due to CPU starvation at the node level, not due to limits. I guess the metric would only appear if I set a limit of 1 CPU on Pod A.

Am I missing something? If the case above is not covered by cadvisor, how could I monitor burstable (or best-effort) pods getting throttled by the scheduler?

harche commented 4 months ago

Setting a CPU request indicates the minimum CPU required for a pod to run efficiently, while setting a CPU limit caps the maximum CPU usage. When a pod has no CPU limit (cpu.cfs_quota_us = -1), it's not bound by the CFS quota, implying it can utilize CPU resources beyond its request, up to what's available on the node, subject to scheduling decisions.

cAdvisor's approach to not report metrics like container_cpu_cfs_periods_total for pods without a defined CPU limit is logical because these metrics track usage against a specific limit. Since there's no limit for Pod A, the concept of CFS quota-based throttling doesn't apply directly. However, the underlying issue you're hinting at is more about CPU resource contention among pods rather than traditional CFS throttling.

To monitor situations where burstable or best-effort pods face "effective throttling" due to resource contention (as opposed to CFS quota limits), you might need a broader set of metrics. You can analyze container_cpu_usage_seconds_total in relation to CPU requests.
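For instance, a hedged sketch of that comparison, querying Prometheus over its HTTP API with the Go client. The address is a placeholder, the PromQL label matching assumes standard kubelet/cadvisor and kube-state-metrics labels, and this is one possible heuristic rather than cadvisor functionality:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point it at your Prometheus instance.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Ratio of observed CPU usage to requested CPU, per container.
	// Values persistently close to (or above) 1 on a busy node suggest
	// contention-based "effective throttling" even without a CFS limit.
	query := `sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
  / on (namespace, pod, container)
  kube_pod_container_resource_requests{resource="cpu"}`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```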

jan-kantert commented 4 months ago

> Setting a CPU request indicates the minimum CPU required for a pod to run efficiently, while setting a CPU limit caps the maximum CPU usage. When a pod has no CPU limit (cpu.cfs_quota_us = -1), it's not bound by the CFS quota, implying it can utilize CPU resources beyond its request, up to what's available on the node, subject to scheduling decisions.
>
> cAdvisor's approach to not report metrics like container_cpu_cfs_periods_total for pods without a defined CPU limit is logical because these metrics track usage against a specific limit. Since there's no limit for Pod A, the concept of CFS quota-based throttling doesn't apply directly. However, the underlying issue you're hinting at is more about CPU resource contention among pods rather than traditional CFS throttling.

Our current workaround is to simply set the CPU limit equal to the number of CPUs on the node. That provides perfect metrics. I wish there were a way to get them without that hack, especially in mixed clusters where we have different types of nodes, and the limit sometimes prevents pods from bursting.

(Theoretically, we could set limit = 1000, but in practice the cluster scheduler will prevent that.)

> To monitor situations where burstable or best-effort pods face "effective throttling" due to resource contention (as opposed to CFS quota limits), you might need a broader set of metrics. You can analyze container_cpu_usage_seconds_total in relation to CPU requests.

Unfortunately, that did not turn out to be as helpful in practice, simply because you cannot distinguish between:

(a) the container is using little CPU because it does not need more, and
(b) the container would use more CPU but cannot get it due to contention on the node.

This is quite problematic/challenging when combined with vertical pod autoscalers, which would always assume case (a) and then would not increase reservations.

Would you suggest that we write another component to expose the same metric for our use case? We would like to follow the recommendation to not set CPU limits unless we explicitly need them. However, this currently limits our ability to use VPAs without massive overprovisioning, and our ability to monitor for starvation.