Summary/SLOs dashboards

omerlh commented 4 years ago

It would be nice to have a dashboard that shows the current status of the cluster at a glance and contains only minimal information, like:

RED SLO of the API server - requests, duration, errors
RED from nginx ingress stats - requests, duration, errors (5XX)
Cluster USE (memory, CPU, network)

All dashboards should be single stats with links to the relevant dashboards for drill down.

Thoughts?

brancz commented 4 years ago

Agreed, this would be super useful! @metalmatze was working on some generated SLO alerts via a library, so we might be able to combine that effort.

metalmatze commented 4 years ago

Yes. We somewhat started the process of creating these dashboards. It would be best to put them into metalmatze/slo-libsonnet as @brancz mentioned.

omerlh commented 4 years ago

I'll file another issue there. Maybe it's worth considering the SLO operator? It looks very interesting.

brancz commented 4 years ago

I think we want to be a bit less opinionated than that and just deliver text files essentially.

omerlh commented 4 years ago

Something I just thought about: maybe we can use existing alerts for that? An alert is pretty similar to an SLO: it looks at an SLI and a threshold, and fires when the threshold is breached. For example, KubeAPIHighLatency defines an SLO for the API server latency: less than 1 second. So we can have a nice SLO widget by using the following calculation: (number of minutes the alert was not firing)/(total number of minutes)
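A minimal sketch of this calculation in Python. The `firing` list of per-minute booleans is a hypothetical input, standing in for samples of whether an alert such as KubeAPIHighLatency was firing; it is not something the thread or the mixin defines.

```python
def slo_attainment(firing):
    """Return the fraction of minutes in which the alert was NOT firing.

    `firing` is a hypothetical list with one boolean per minute:
    True if the alert was firing in that minute, False otherwise.
    """
    total = len(firing)
    if total == 0:
        raise ValueError("need at least one sample")
    not_firing = sum(1 for f in firing if not f)
    return not_firing / total


# Example: the alert fired for 3 minutes out of 60.
samples = [False] * 57 + [True] * 3
print(f"{slo_attainment(samples):.2%}")  # 95.00%
```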

What do you think about this approach?

brancz commented 4 years ago

I would expect this type of dashboard to show a 28d window, where we count the number of seconds in the last 28 days in which we met our SLO. That availability is made up of a number of SLIs, which a breakdown should show.
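One way to picture the windowed count with a per-SLI breakdown is the sketch below. It assumes a hypothetical input, `bad_seconds_by_sli`, mapping each SLI name to the set of seconds in the window where that SLI missed its target; the SLI names in the example are made up for illustration.

```python
WINDOW_SECONDS = 28 * 24 * 60 * 60  # 28-day window, in seconds


def availability(bad_seconds_by_sli, window_seconds=WINDOW_SECONDS):
    """Return (overall availability, per-SLI availability) for a window.

    `bad_seconds_by_sli` is a hypothetical mapping from SLI name to the
    set of second offsets in the window where that SLI missed its target.
    """
    # A second counts against the overall SLO if any SLI was bad in it.
    bad_overall = set().union(*bad_seconds_by_sli.values())
    overall = 1 - len(bad_overall) / window_seconds
    breakdown = {
        name: 1 - len(bad) / window_seconds
        for name, bad in bad_seconds_by_sli.items()
    }
    return overall, breakdown


# Example: two overlapping incidents (names and ranges are illustrative).
overall, breakdown = availability({
    "errors": set(range(0, 600)),      # 10 minutes of elevated errors
    "latency": set(range(300, 1200)),  # 15 minutes of high latency
})
print(f"overall: {overall:.5%}")
for name, value in breakdown.items():
    print(f"  {name}: {value:.5%}")
```

Overlapping bad seconds are counted once in the overall figure, which is why it is not simply the product or sum of the per-SLI values.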

omerlh commented 4 years ago

We can easily do that by using this formula: (number of minutes the alert was not firing)/(total number of minutes)

Evaluated over 28 days. The breakdown simply shows the results (when the alert fired) in a table.

brancz commented 4 years ago

I think if you choose that as an indicator that’s fine, but I don’t think we should be promoting that practice. We should base this on well-established workflows for SLOs/SLAs, just like the alerts that we generate from the slo-libsonnet library.

omerlh commented 4 years ago

I'm just trying to avoid duplication, which is why I thought of using the existing alert for that, as it already defines the SLI/SLO. This is just for display purposes... So you'll have a nice gauge panel showing the current SLO.

omerlh commented 4 years ago

I was able to build something using the SLO library, something like this:

I was thinking to add it to the API server dashboard.

Thoughts?

I also want to add an error budget panel (see metalmatze/slo-libsonnet#27), but that will probably go in another PR.
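For context, an error budget panel usually shows how much of the allowed failure is still unspent. The sketch below is an illustration under assumed inputs (a request-based SLI with hypothetical `total` and `bad` counts and a hypothetical 99% target); it is not the implementation discussed in slo-libsonnet.

```python
def error_budget_remaining(total, bad, target):
    """Return the fraction of the error budget still unspent.

    total:  number of requests in the window (hypothetical input)
    bad:    number of those requests that violated the SLI
    target: SLO target as a ratio, e.g. 0.99 for 99%
    A negative result means the budget is overspent.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_bad = (1 - target) * total  # failures the SLO tolerates
    return 1 - bad / allowed_bad


# Example: 10,000 requests, 50 failures, 99% target.
# 100 failures are allowed, so half of the budget is left.
print(f"{error_budget_remaining(10_000, 50, 0.99):.1%}")  # 50.0%
```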

metalmatze commented 4 years ago

I've actually been working on a similar dashboard outside of this project for now, talking to the apiserver people at Red Hat. It will take a bit more time, but we should be able to come up with a proper SLO-based dashboard soonish.

metalmatze commented 4 years ago

Here's a screenshot. So far I have only added recording rules. Next are alerting rules with multiple burn rates for errors and latency, and after that actually creating the dashboard below in Jsonnet.
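As background on burn rates: a burn rate is the observed error ratio divided by the error ratio the SLO allows, and a multi-window alert fires only when both a long and a short window exceed a threshold. The sketch below assumes a 99% target and a threshold of 14.4 for a 1h/5m window pair, a commonly cited pairing that is an assumption here and not taken from this thread or from the merged rules.

```python
def burn_rate(error_ratio, target):
    """Return how fast the error budget is being consumed.

    A burn rate of 1 spends exactly the whole budget over the SLO window;
    higher values spend it proportionally faster.
    """
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    return error_ratio / (1 - target)


def should_alert(long_error_ratio, short_error_ratio, target, threshold):
    """Fire only if both the long and the short window burn too fast.

    The short window makes the alert stop soon after errors stop.
    """
    return (burn_rate(long_error_ratio, target) > threshold
            and burn_rate(short_error_ratio, target) > threshold)


# Assumed values for illustration: 99% target, threshold 14.4 (1h / 5m).
TARGET = 0.99
THRESHOLD = 14.4

print(should_alert(0.20, 0.18, TARGET, THRESHOLD))   # True: both windows burn fast
print(should_alert(0.20, 0.001, TARGET, THRESHOLD))  # False: errors already stopped
```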

Screenshot from 2020-03-03 17-18-17

omerlh commented 4 years ago

Nice! I think it might be better to use the HTTP methods instead of read/write. Besides that - I'm looking forward to trying it out :)

yashbhutwala commented 4 years ago

Hey, how can I try this? Are there plans to merge metalmatze/slo-libsonnet into here? I'm just starting out with Grafana dashboarding, so please excuse my noob questions 😄

metalmatze commented 4 years ago

This has been merged already and is on master.

yashbhutwala commented 4 years ago

Awesome @metalmatze!! So metalmatze/slo-libsonnet doesn't have anything else that may be useful? IIUC, this is at the Kubernetes apiserver level. Is there a similar dashboard at the pod/service level?

specifically, as mentioned in this comment by @omerlh

RED from nginx ingress stats - requests, duration, errors (5XX)

metalmatze commented 4 years ago

I recreated the bits and pieces from slo-libsonnet specifically for the Kubernetes APIServer, as we can be even more specific. It's not a generic HTTP server to us; we know a bit more about the internals here. Something like this doesn't make sense for Pods or Services, as these are application-specific. That's where slo-libsonnet helps you, for application-specific SLOs.

metalmatze commented 4 years ago

RED from nginx ingress stats - requests, duration, errors (5XX)

That is out of scope for the kubernetes-mixin project. There are a lot of clusters out there that most likely do not even run nginx. With that being said, I have an nginx-ingress RED dashboard for myself: https://gist.github.com/metalmatze/f8351a56fa4393853c4a598efa60ab53

github-actions[bot] commented 3 days ago

This issue has not had any activity in the past 30 days, so the stale label has been added to it.

The stale label will be removed if there is new activity
The issue will be closed in 7 days if there is no new activity
Add the keepalive label to exempt this issue from the stale check action

Thank you for your contributions!

kubernetes-monitoring / kubernetes-mixin

Summary/SLOs dashboards #326