slok / sloth

🦥 Easy and simple Prometheus SLO (service level objectives) generator
https://sloth.dev
Apache License 2.0
2.09k stars, 173 forks

Improving Sloth SLOs dashboard #196

Open w-reichert opened 3 years ago

w-reichert commented 3 years ago

Hi Xabier, first of all, many thanks for the Sloth SLOs sample dashboard (https://grafana.com/grafana/dashboards/14348)! We have been using it for a while. :-)

I noticed that the color coding and ranges for Remaining error budget (month) are not correct: the panel is red when there are no values, yellow when there are no errors, and green when the budget is below 40%. Furthermore, I suppose negative values should be cut off, since empty is empty.

My suggestions:

          "description": "This month remaining error budget, starts the 1st of the month and ends  28th-31st (not rolling window)",
          "fieldConfig": {
            "defaults": {
              "color": {
                "mode": "thresholds"
              },
              "mappings": [],
              "max": 1,
              "min": 0,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {
                    "color": "grey",
                    "value": null
                  },
                  {
                    "color": "red",
                    "value": 0
                  },
                  {
                    "color": "orange",
                    "value": 0.01
                  },
                  {
                    "color": "light-yellow",
                    "value": 0.2
                  },
                  {
                    "color": "green",
                    "value": 0.8
                  }
                ]
              },
              "unit": "percentunit"
            },
            "overrides": []
          },
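
To make the effect of these steps easier to check, here is a minimal Python sketch of how absolute thresholds are commonly resolved: the color of the highest step whose value is less than or equal to the field value wins, and the base step (`"value": null`) covers everything else. Treating missing and NaN values as falling back to the base step is an assumption of this sketch, and `threshold_color` and `STEPS` are hypothetical names, not part of Grafana or Sloth.

```python
import math
from typing import List, Optional, Tuple

# Steps copied from the suggested panel config above.
# None marks the base step ("value": null).
STEPS: List[Tuple[Optional[float], str]] = [
    (None, "grey"),
    (0.0, "red"),
    (0.01, "orange"),
    (0.2, "light-yellow"),
    (0.8, "green"),
]


def threshold_color(value: Optional[float],
                    steps: List[Tuple[Optional[float], str]] = STEPS) -> str:
    """Return the color of the highest step whose bound is <= value.

    Assumption: missing (None) and NaN values fall back to the base step.
    """
    color = steps[0][1]  # base step color
    if value is None or math.isnan(value):
        return color
    for bound, step_color in steps[1:]:
        if value >= bound:
            color = step_color
        else:
            break
    return color


if __name__ == "__main__":
    for sample in (None, float("nan"), 0.0, 0.05, 0.5, 0.95):
        print(sample, "->", threshold_color(sample))
```

Under these assumptions a service without data shows grey, an exhausted budget shows red, and a nearly untouched budget shows green. A value below 0 would also fall back to the base step in this sketch.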

and likewise for A rolling window of the total period (30d) error budget remaining.

Furthermore, cut off the negative budget (this also occurs twice):

        "expr": "1-clamp_max(sum_over_time( ... ) , 1)",

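The arithmetic behind that clamp can be sketched in Python. This assumes the elided `sum_over_time( ... )` term evaluates to the fraction of the error budget already consumed; `remaining_budget` and `consumed_ratio` are hypothetical names used only for illustration.

```python
def remaining_budget(consumed_ratio: float) -> float:
    """Mirror of 1 - clamp_max(consumed_ratio, 1).

    clamp_max caps the consumed fraction at 1, so the remaining
    budget never drops below 0.
    """
    return 1.0 - min(consumed_ratio, 1.0)


if __name__ == "__main__":
    # 25% consumed leaves 75%.
    print(remaining_budget(0.25))
    # 130% consumed would be -30% unclamped; the clamp reports 0 instead.
    print(remaining_budget(1.3))
```

The trade-off is that an overspent budget and an exactly exhausted budget become indistinguishable on the panel.
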
Thanks and regards Wolfgang

slok commented 3 years ago

Hey @w-reichert!

Thanks for bringing this up!

I'm planning some changes in Sloth that may affect the dashboards... so when I tackle these, it would be a good time to revisit because I may affect the current dashboards.

Best,

itkq commented 3 years ago

I'm using Sloth SLOs and the Grafana dashboard too. It is pretty easy to use and has been working great so far! I also have a feature request for the dashboard. I usually look at the Month error budget burn chart panel for monitoring, but I can't tell at a glance whether the current burn rate is good. I would suggest showing the graph in different colors or drawing an additional line for a burn rate of 1. I'm trying the latter solution, which looks like this: image

Anyway, thanks for providing this product!
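
One way to read the "burn rate of 1" reference line is as a linear burn-down: the budget starts at 100% on the 1st and reaches 0% exactly at the end of the calendar month. The Python sketch below computes that ideal remaining budget for a given moment. It is an interpretation, not the query from the screenshot, and `ideal_remaining_budget` is a hypothetical helper.

```python
import calendar
from datetime import datetime


def ideal_remaining_budget(now: datetime) -> float:
    """Remaining budget fraction if the budget burns at a constant rate of 1.

    Assumption: the budget window is the calendar month containing `now`.
    """
    days_in_month = calendar.monthrange(now.year, now.month)[1]
    month_start = datetime(now.year, now.month, 1)
    elapsed_seconds = (now - month_start).total_seconds()
    total_seconds = days_in_month * 24 * 60 * 60
    return 1.0 - elapsed_seconds / total_seconds


if __name__ == "__main__":
    # Halfway through a 31-day month, half of the budget should remain.
    print(ideal_remaining_budget(datetime(2023, 1, 16, 12, 0)))
```

If the measured remaining budget sits above this line, the service is burning slower than a rate of 1; below it, faster.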

w-reichert commented 3 years ago

Xabier, thanks for the quick response. When you have a new version of Sloth and/or the dashboard we would love to test it and provide feedback.

Regards, Wolfgang

rellupuru commented 3 years ago

@slok Thank you for your great contributions to the SRE world. I see v0.9.0 has been released; did you include the above request in this release?

slok commented 3 years ago

Not yet, I'll need a bit more time.

slok commented 2 years ago

Hi @w-reichert!

I've reviewed what you said about the colors, and I did that on purpose. Mainly, the error budget you have is meant to be consumed, so the perfect error budget left would be 0%. Below that means you didn't achieve the reliability you were supposed to have, and above that means you didn't consume enough (few experiments, too slow shipping features...).

Anyhow, I would happily change that if people prefer that kind of semaphore coloring as you approach 0% error budget left. Regarding the negative part, you are right: I didn't cut it off so that people are aware of how much they fail.

slok commented 2 years ago

@itkq Check #216

w-reichert commented 2 years ago

Hi Xabier @slok, thanks for looking into my recommendations.

Actually, the issue we saw started with a red NaN value. Obviously this happens if a service has not been running long enough to collect 30-day metrics. Hence my suggestion to begin with "color": "grey" for "value": null. Then "red" may follow for a high negative value.