How is the error budget calculated?

slok / sloth

🦥 Easy and simple Prometheus SLO (service level objectives) generator

https://sloth.dev

Apache License 2.0

How is the error budget calculated? #348

Open jkblume opened 2 years ago

jkblume commented 2 years ago

Hi there, thanks for the work on this project. It helps a lot with transferring knowledge of the SLO topic to the very complex Prometheus queries!

I'm wondering how the error budget is calculated and I can't find any documentation on the project page or the queries.

Does the service have to run for 30 days before the number of requests is taken? I assume that number is needed in some form to show the current consumption of the budget. I'm asking because I can't really make sense of the numbers displayed in the dashboard (see image). But that could also be because I am currently only testing Sloth in a test project that has been running for just a few hours.

slok commented 2 years ago

Hi @jkblume

It's been a while since you started testing Sloth. Do you still have the same doubts as on the day you created the issue? Do your numbers make more sense now after running Sloth for a while?

Numbers Explanation:

Current burning budget 892%: the service is running with a 91.38% SLI against a target of 99% (with that target, a 1% error rate corresponds to 100% of the error budget).
Remaining error budget (month) 99.6%: how much error budget remains since the 1st of the current month.
Remaining error budget (30d window) NaN: how much error budget remains over the 30 days up to now (if Sloth was only just set up, there is not enough data yet).
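The arithmetic behind these panels can be sketched roughly as follows. This is a minimal illustration and not Sloth's actual recording rules: the function names `burn_rate` and `remaining_budget` are hypothetical, and it assumes the error budget is simply `1 - objective` and that the remaining budget is one minus the window's average error ratio divided by that budget.

```python
def burn_rate(sli: float, objective: float) -> float:
    """Ratio of the observed error rate to the allowed error rate.

    1.0 (100%) means errors arrive exactly as fast as the budget allows.
    """
    error_ratio = 1.0 - sli          # observed share of bad events
    error_budget = 1.0 - objective   # allowed share of bad events
    return error_ratio / error_budget


def remaining_budget(avg_error_ratio: float, objective: float) -> float:
    """Share of the window's error budget that is still unspent (assumed model)."""
    return 1.0 - avg_error_ratio / (1.0 - objective)


# A 91.38% SLI against a 99% objective: 8.62% errors versus 1% allowed.
print(f"{burn_rate(sli=0.9138, objective=0.99):.0%}")  # 862%

# An average error ratio of 0.004% uses 0.4% of a 1% budget.
print(f"{remaining_budget(avg_error_ratio=0.00004, objective=0.99):.1%}")  # 99.6%
```

Under these assumptions a 91.38% SLI gives a burn of 862%, while 892% would correspond to an SLI of about 91.08%, so the two figures quoted above may have been read at slightly different moments.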

Best,

susenj commented 1 year ago

Hi @slok ,

Probably digging up an old grave here, but I just stumbled across this and would like to know if you can help me understand your statement:

Remaining error budget (30d window) NaN: since now, 30 days how much error budget is remaining (if just set there is not enough data).

What I understand from this is: if there is any discontinuity in the data, or no data is available at some point, the metric will show NaN. Or there could be a case where Sloth has only just been set up and at least the last 30 days of data are not available, in which case we would also see NaN.

The reason I am asking is that I am struggling to interpret my dashboard, which shows a negative remaining budget percentage as well as some NaNs.
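For reference, here is a minimal sketch of how both symptoms can arise from plain arithmetic. It is not taken from Sloth's rules: `error_ratio` and `remaining_budget` are hypothetical helpers, the remaining budget is assumed to be one minus the error ratio divided by `1 - objective`, and the NaN case models PromQL's float division, where 0/0 evaluates to NaN.

```python
import math


def error_ratio(errors: float, total: float) -> float:
    """Error ratio for a window; mimics PromQL, where 0/0 evaluates to NaN."""
    if total == 0:
        return math.nan  # no events in the window, so the ratio is undefined
    return errors / total


def remaining_budget(ratio: float, objective: float) -> float:
    """Unspent share of the error budget (assumed model); NaN propagates."""
    return 1.0 - ratio / (1.0 - objective)


# No traffic in the window: the ratio is NaN and so is the remaining budget.
quiet = remaining_budget(error_ratio(0, 0), objective=0.99)
print(math.isnan(quiet))  # True

# 8.62% errors against a 1% budget: more than the whole budget is spent.
overspent = remaining_budget(error_ratio(862, 10_000), objective=0.99)
print(f"{overspent:.0%}")  # -762%
```

In this model a negative value simply means the budget for the window has been overspent, and NaN means the ratio could not be computed for that window.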

Thanks a bunch for this awesome project. susenj