solo-io / gloo


Incorrect prometheus query in Production Deployment docs #6335

Open alexgottscha opened 2 years ago

alexgottscha commented 2 years ago

Version

master

Describe the requested changes

In the latest Production Deployment documentation, this PromQL query to find pods being CPU-throttled appears to use the incorrect metrics:

container_cpu_cfs_throttled_seconds_total / container_cpu_cfs_throttled_periods_total – This is a generic expression that will show whether or not a given container is being throttled for CPU, which will result in performance issues and service degradation.

Per Stack Overflow, it looks like the correct query would be container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 0, which gives the fraction of CFS scheduling periods in which a given container was throttled.

If I'm misunderstanding the metric produced by the query currently in the documentation, it would be helpful to explain exactly what each metric in container_cpu_cfs_throttled_seconds_total / container_cpu_cfs_throttled_periods_total measures, and what comparison to use when generating alerts (e.g. > 0? < 1?)
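
For illustration, here is a minimal sketch of how the proposed ratio might be written as an alert expression. All of these metrics are cumulative counters exported by cAdvisor, so the ratio is usually taken over rate() rather than over the raw totals. The 5m window, the 0.25 threshold, and the namespace/pod/container grouping labels are assumptions chosen for the example, not values from the Gloo documentation.

```promql
# Fraction of CFS periods in which each container was throttled,
# measured over the last 5 minutes (window is an assumed value).
  sum by (namespace, pod, container) (
    rate(container_cpu_cfs_throttled_periods_total[5m])
  )
/
  sum by (namespace, pod, container) (
    rate(container_cpu_cfs_periods_total[5m])
  )
# Assumed alert threshold: throttled in more than 25% of periods.
> 0.25
```

By contrast, container_cpu_cfs_throttled_seconds_total is the total time the container has spent throttled, so dividing it by container_cpu_cfs_throttled_periods_total yields the average throttled time per throttled period. That value does not indicate how often throttling occurs, so it is hard to pick a meaningful alert threshold for it.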

Link to any relevant existing docs

https://docs.solo.io/gloo-edge/latest/operations/production_deployment/

Browser Information

No response

Additional Context

No response

github-actions[bot] commented 1 month ago

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.