Open rmak-cpi opened 3 years ago
Using irate
also does not result in better details. It only makes graphs more random and more noisy (using only sample of points) and final graph may completely discard spikes (while rate
always show average rate over the interval).
https://valyala.medium.com/why-irate-from-prometheus-doesnt-capture-spikes-45f9896d7832
Original bad PR: #619
while rate always show average rate over the interval
Average removes spikes and thus removes important data. In a lot of cases you want those spikes to be present on graphs. This is in contrast to alerts where you probably don't want to have spikes and rate
(or other statistic methods) is a better choice.
More in https://www.robustperception.io/irate-graphs-are-better-graphs
Average removes spikes and thus removes important data.
rate
does not remove spikes, it makes them flat over a time period. In contrast irate
may completely remove them (as they are not happened at all).
irate
will only use the last two points of data in a time range (ignoring all other points), and will result in more noisy graphs. rate
will consider first and last data point.
rate does not remove spikes, it makes them flat over a time period
When it comes to graphs "removal" and "making spikes flat" are the same. In both cases, you are losing data from visualization.
It's unpredictable whether the important spikes are at the end of the window, in which case irate
can see them, or earlier in the window, in which case irate
will ignore them.
I prefer consistent behaviour, and no discarding of points, so rate()
is better.
This issue has not had any activity in the past 30 days, so the
stale
label has been added to it.
stale
label will be removed if there is new activitykeepalive
label to exempt this issue from the stale check actionThank you for your contributions!
Duplicate of #679
This one is about the name changing, while #679 is about the behaviour.
@bboreham ah ok, I was thinking about contributing a rate
version of all the existing irate
recording rules - to maintain backwards compatibility and give users the choice. I thought that might address both tickets, hence marking as dupe. wdyt?
Duplicating the recording rules would indeed address the breaking change, relating to other uses of the data. To fix #679 would require changing the dashboards.
Submitting an issue so folks encountering the same problem as me can have an easier time finding out what happened. I recently upgraded to kube-prometheus-stack and discovered that CPU data no longer shows on old Grafana Kubernetes / Compute Resources / Workload dashboard that was hosted somewhere else. It turns out that the following change from rate() to irate() was the cause of the issue:
https://github.com/kubernetes-monitoring/kubernetes-mixin/commit/e996e00fa3a0c17a7a9f5d01f6c1a3544731bd33
In particular, the renaming from node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate to node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate can break any dashboard/rules etc referencing the old name. As a (temporary) workaround, I think it's possible for me to just recreate the old node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate recording rule.
Please feel free to add brief comments confirming the observation and close it. Thanks!