kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0
2.11k stars 598 forks source link

The change from rate() to irate() is a breaking change #670

Open rmak-cpi opened 3 years ago

rmak-cpi commented 3 years ago

Submitting an issue so folks encountering the same problem as me can have an easier time finding out what happened. I recently upgraded to kube-prometheus-stack and discovered that CPU data no longer shows on old Grafana Kubernetes / Compute Resources / Workload dashboard that was hosted somewhere else. It turns out that the following change from rate() to irate() was the cause of the issue:

https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/e996e00fa3a0c17a7a9f5d01f6c1a3544731bd33

In particular, the renaming from node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate to node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate can break any dashboard/rules etc referencing the old name. As a (temporary) workaround, I think it's possible for me to just recreate the old node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate recording rule.

Please feel free to add brief comments confirming the observation and close it. Thanks!

tahajahangir commented 3 years ago

Using irate also does not result in better details. It only makes graphs more random and more noisy (using only sample of points) and final graph may completely discard spikes (while rate always show average rate over the interval). https://valyala.medium.com/why-irate-from-prometheus-doesnt-capture-spikes-45f9896d7832

Original bad PR: #619

paulfantom commented 3 years ago

while rate always show average rate over the interval

Average removes spikes and thus removes important data. In a lot of cases you want those spikes to be present on graphs. This is in contrast to alerts where you probably don't want to have spikes and rate (or other statistic methods) is a better choice.

More in https://www.robustperception.io/irate-graphs-are-better-graphs

tahajahangir commented 3 years ago

Average removes spikes and thus removes important data.

rate does not remove spikes, it makes them flat over a time period. In contrast irate may completely remove them (as they are not happened at all).

irate will only use the last two points of data in a time range (ignoring all other points), and will result in more noisy graphs. rate will consider first and last data point.

paulfantom commented 3 years ago

rate does not remove spikes, it makes them flat over a time period

When it comes to graphs "removal" and "making spikes flat" are the same. In both cases, you are losing data from visualization.

bboreham commented 3 years ago

It's unpredictable whether the important spikes are at the end of the window, in which case irate can see them, or earlier in the window, in which case irate will ignore them. I prefer consistent behaviour, and no discarding of points, so rate() is better.

github-actions[bot] commented 4 days ago

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

Thank you for your contributions!

skl commented 3 days ago

Duplicate of #679

bboreham commented 3 days ago

This one is about the name changing, while #679 is about the behaviour.

skl commented 2 days ago

@bboreham ah ok, I was thinking about contributing a rate version of all the existing irate recording rules - to maintain backwards compatibility and give users the choice. I thought that might address both tickets, hence marking as dupe. wdyt?

bboreham commented 2 days ago

Duplicating the recording rules would indeed address the breaking change, relating to other uses of the data. To fix #679 would require changing the dashboards.