paypal / load-watcher

Load watcher is a cluster-wide aggregator of metrics, developed for Trimaran: Real Load Aware Scheduler in Kubernetes.

Prometheus metrics may have done aggregation. #28

Open wangchen615 opened 3 years ago

wangchen615 commented 3 years ago

The current load-watcher Prometheus package uses the instance:node_cpu:ratio metric to calculate node utilization. However, while this value was still below 60%, I found that another metric, instance:node_cpu_utilisation:rate1m, was much higher, at around 90%. Apparently the Prometheus recording rules smooth these metrics over a time window, and the one we use may be smoothed over a window longer than 1m. My guess is 5m.
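
To illustrate how the window length alone could explain a gap like this, here is a minimal sketch in Python. It is not load-watcher code: the sample data is made up (4 minutes at 52.5% busy followed by 1 minute at 90%), the 15s scrape interval is an assumption, and simple_rate is a hypothetical helper that only approximates Prometheus rate() (no extrapolation or counter-reset handling).

```python
def simple_rate(samples, window_s):
    """Average per-second increase of a counter over the trailing window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplified stand-in for Prometheus rate(): no extrapolation, no resets.
    """
    end_t = samples[-1][0]
    in_window = [(t, v) for t, v in samples if t >= end_t - window_s]
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def busy_counter(step_s=15):
    """Synthetic non-idle CPU seconds for one core over 5 minutes.

    Assumed load: 52.5% busy for the first 240s, then 90% for the last 60s.
    """
    samples = [(0, 0.0)]
    busy = 0.0
    for t in range(step_s, 300 + step_s, step_s):
        util = 0.525 if t <= 240 else 0.90
        busy += util * step_s
        samples.append((t, busy))
    return samples


samples = busy_counter()
rate_1m = simple_rate(samples, 60)   # only sees the recent spike
rate_5m = simple_rate(samples, 300)  # averages the spike with the quiet period
print(f"1m window: {rate_1m:.1%}, 5m window: {rate_5m:.1%}")
```

With this synthetic load the 1m window reports about 90% while the 5m window reports about 60%, which is the same shape of discrepancy as the two metrics above.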

We are not sure which Prometheus metric is consistent with the values obtained directly from the metrics server, so more testing is needed.

[Screenshot: Prometheus graph from the OpenShift monitoring console, captured 2021-04-23]

WLBF commented 2 years ago

The instance:node_cpu:ratio recording rule uses a 5m rate window. Its definition in kube-prometheus is:

sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)

https://github.com/prometheus-operator/kube-prometheus/blob/7a3879ba49bfe8df5eec03847fe5bcd2d4094c73/jsonnet/kube-prometheus/components/mixin/rules/node-rules.libsonnet#L20
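
For the testing mentioned earlier in the thread, one option is to run the same expression with different windows and compare the results. The sketch below only builds the query strings and instant-query URLs for the standard Prometheus HTTP API endpoint /api/v1/query. The function names and the http://localhost:9090 base URL are placeholders, not part of load-watcher or kube-prometheus.

```python
from urllib.parse import urlencode


def node_cpu_ratio_query(window="5m"):
    """Return the instance:node_cpu:ratio expression with a configurable rate window."""
    return (
        'sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}'
        f'[{window}])) WITHOUT (cpu, mode) '
        '/ ON(instance) GROUP_LEFT() '
        'count(sum(node_cpu_seconds_total) BY (instance, cpu)) BY (instance)'
    )


def instant_query_url(base_url, window):
    """Build an instant-query URL; base_url is an assumed Prometheus address."""
    return f"{base_url}/api/v1/query?" + urlencode({"query": node_cpu_ratio_query(window)})


for w in ("1m", "5m"):
    print(instant_query_url("http://localhost:9090", w))
```

Note that rate() needs at least two samples inside the window, so a 1m window is only meaningful if the scrape interval is comfortably shorter than 1m.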