kubernetes-monitoring / kubernetes-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes.
Apache License 2.0

Summary/SLOs dashboards #326

Open omerlh opened 4 years ago

omerlh commented 4 years ago

It would be nice to have a dashboard that shows the current status of the cluster at a glance and contains only minimal information, like:

All dashboards should be single stats with links to the relevant dashboards for drill down.

Thoughts?

brancz commented 4 years ago

Agreed, this would be super useful! @metalmatze was working on some generated SLO alerts via a library, so we might be able to combine that effort.

metalmatze commented 4 years ago

Yes. We somewhat started the process of creating these dashboards. It would be best to put them into metalmatze/slo-libsonnet as @brancz mentioned.

omerlh commented 4 years ago

I'll file another issue there. Maybe it's worth considering using the SLO operator? It looks very interesting.

brancz commented 4 years ago

I think we want to be a bit less opinionated than that and just deliver text files essentially.

omerlh commented 4 years ago

Something I just thought about: maybe we can use the existing alerts for that? An alert is pretty similar to an SLO: it looks at an SLI and a threshold, and it fires when the threshold is breached. For example, KubeAPIHighLatency defines an SLO for the API server latency: less than 1 second. So we could have a nice SLO widget by using the following calculation: (number of minutes the alert was not firing)/(total number of minutes)
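As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the alert state has already been sampled once per minute into a list of booleans (True means firing); the function name and the sample data are hypothetical and not part of kubernetes-mixin.

```python
from typing import Sequence


def slo_from_alert_samples(firing: Sequence[bool]) -> float:
    """Return the fraction of sampled minutes in which the alert was NOT firing."""
    if not firing:
        raise ValueError("need at least one sample")
    not_firing = sum(1 for is_firing in firing if not is_firing)
    return not_firing / len(firing)


# Hypothetical data: one sample per minute over one hour,
# with the alert firing during the last 3 minutes.
samples = [False] * 57 + [True] * 3
ratio = slo_from_alert_samples(samples)
print(f"{ratio:.2%}")
```

In a real setup the samples would come from Prometheus rather than a hand-written list; the sketch only shows the arithmetic behind the proposed widget.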

What do you think about this approach?

brancz commented 4 years ago

I would expect this type of dashboard to show a 28d window, where we count the number of seconds in the last 28 days in which we met our SLO. Availability is made up of a number of SLIs, which a breakdown should show.
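One way to picture such a 28-day window with a per-SLI breakdown is the Python sketch below. It is only an assumption about how the numbers could be combined: each SLI is represented by the set of second offsets in which it was violated, and a second counts against overall availability if any SLI was violated in it. The SLI names and the data are made up.

```python
from typing import Dict, Set, Tuple

WINDOW_SECONDS = 28 * 24 * 60 * 60  # seconds in a 28-day window


def windowed_availability(
    bad_seconds: Dict[str, Set[int]],
) -> Tuple[float, Dict[str, float]]:
    """Return (overall availability, per-SLI availability) for the window.

    bad_seconds maps an SLI name to the set of second offsets within the
    window during which that SLI was violated.
    """
    # A second is bad overall if at least one SLI was violated in it.
    overall_bad = set().union(*bad_seconds.values())
    overall = 1 - len(overall_bad) / WINDOW_SECONDS
    breakdown = {
        name: 1 - len(seconds) / WINDOW_SECONDS
        for name, seconds in bad_seconds.items()
    }
    return overall, breakdown


# Hypothetical violations: 20 minutes of errors, overlapping 40 minutes of latency.
bad = {
    "errors": set(range(0, 1200)),
    "latency": set(range(600, 3000)),
}
overall, breakdown = windowed_availability(bad)
print(f"overall: {overall:.4%}")
for name, value in breakdown.items():
    print(f"  {name}: {value:.4%}")
```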

omerlh commented 4 years ago

We can easily do that using this formula: (number of minutes the alert was not firing)/(total number of minutes), evaluated over 28 days. The breakdown simply shows the results (when the alert fired) in a table.

brancz commented 4 years ago

I think if you choose that as an indicator that's fine, but I don't think we should be promoting that practice. We should base this on well-established workflows for SLOs/SLAs, just like the alerts that we generate from the slo-libsonnet library.

omerlh commented 4 years ago

I'm just trying to avoid duplication, which is why I thought of using the existing alerts for that, as they already define the SLI/SLO. This is just for display purposes, so you'd have a nice gauge panel showing the current SLO.

omerlh commented 4 years ago

I was able to put something together using the SLO library, something like this:

image

I was thinking of adding it to the API server dashboard.

Thoughts?

I also want to add an error budget panel (see metalmatze/slo-libsonnet#27), but that will probably go in another PR.
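For context, an error budget panel usually shows how much of the allowed unreliability is still unspent. The Python sketch below uses the common definition (budget = 1 - target); it is not taken from slo-libsonnet, and the function name and numbers are hypothetical.

```python
def error_budget_remaining(target: float, observed: float) -> float:
    """Return the fraction of the error budget still unspent.

    target:   SLO target as a ratio, e.g. 0.99 for 99%.
    observed: measured availability over the same window, as a ratio.
    A negative result means the budget is overspent.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    budget = 1 - target   # allowed fraction of bad events
    spent = 1 - observed  # actual fraction of bad events
    return 1 - spent / budget


# Hypothetical numbers: 99% target, 99.5% observed, so half the budget is left.
remaining = error_budget_remaining(target=0.99, observed=0.995)
print(f"{remaining:.1%} of the error budget remaining")
```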

metalmatze commented 4 years ago

I've actually been working on a similar dashboard outside of this project for now, talking to the apiserver people at Red Hat. It will take a bit more time, but we should be able to come up with a proper SLO-based dashboard soonish.

metalmatze commented 4 years ago

Here's a screenshot. Currently I have only added recording rules. Next are alerting rules with multiple burn rates for errors and latency, and after that actually creating the dashboard shown below in Jsonnet.

Screenshot from 2020-03-03 17-18-17
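The multi burn rate alerting mentioned above can be sketched in Python as follows. This is a generic illustration of the multiwindow burn-rate idea, not the rules that ended up in this project: the factors 14.4 and 6 and the window pairs are example values in the style of the Google SRE Workbook, and all names and error ratios are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    long_window: str   # label only, e.g. "1h"
    short_window: str  # label only, e.g. "5m"
    factor: float      # burn-rate threshold for this pair of windows


def burn_rate(error_ratio: float, target: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    return error_ratio / (1 - target)


def should_alert(long_err: float, short_err: float,
                 target: float, window: BurnRateWindow) -> bool:
    """Fire only if BOTH the long and the short window exceed the factor."""
    return (burn_rate(long_err, target) > window.factor
            and burn_rate(short_err, target) > window.factor)


# Example windows and factors (assumptions, see the note above).
fast = BurnRateWindow("1h", "5m", 14.4)
slow = BurnRateWindow("6h", "30m", 6.0)

# Hypothetical error ratios against a 99% target.
ongoing_outage = should_alert(long_err=0.20, short_err=0.25, target=0.99, window=fast)
already_recovered = should_alert(long_err=0.20, short_err=0.01, target=0.99, window=fast)
print(ongoing_outage, already_recovered)
```

Requiring the short window to exceed the threshold as well keeps the alert from continuing to fire long after the problem has stopped.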

omerlh commented 4 years ago

Nice! I think it might be better to use the HTTP methods instead of read/write. Besides that, I'm looking forward to trying it out :)

yashbhutwala commented 4 years ago

hey, how can I try this? are there plans to merge metalmatze/slo-libsonnet into here? just starting out with grafana dashboarding, so please excuse my noob questions 😄

metalmatze commented 4 years ago

This has been merged already and is on master.

yashbhutwala commented 4 years ago

Awesome @metalmatze!! so metalmatze/slo-libsonnet doesn't have anything else, that may be useful? iiuc, this is at the kubernetes apiserver level, is there a similar dashboard at the pod/service level?

specifically, as mentioned in this comment by @omerlh

metalmatze commented 4 years ago

I recreated the bits and pieces from slo-libsonnet specifically for the Kubernetes APIServer, as we can be even more specific. It's not a generic HTTP server to us. We know a bit more about the internals here. Something like this doesn't make sense for Pods or Services, as these are application-specific. That's where slo-libsonnet helps you: with application-specific SLOs.

metalmatze commented 4 years ago

"RED from nginx ingress stats - requests, duration, errors (5XX)"

That is out of scope for the kubernetes-mixin project. There are a lot of clusters out there that most likely do not even run nginx. With that being said, I have an nginx-ingress RED dashboard for myself: https://gist.github.com/metalmatze/f8351a56fa4393853c4a598efa60ab53

github-actions[bot] commented 3 days ago

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

Thank you for your contributions!