omerlh opened this issue 4 years ago
Agreed, this would be super useful! @metalmatze was working on some generated SLO alerts via a library; we might be able to combine that effort.
Yes. We somewhat started the process of creating these dashboards. It would be best to put them into metalmatze/slo-libsonnet as @brancz mentioned.
I'll file another issue there. Maybe it's worth considering using the SLO operator? It looks very interesting.
I think we want to be a bit less opinionated than that and just deliver text files essentially.
Something I just thought about: maybe we can use existing alerts for that? An alert is pretty similar to an SLO: it looks at an SLI and a threshold, and when the threshold is breached it fires. For example, KubeAPIHighLatency defines an SLO for the API server latency: less than 1 second. So we could have a nice SLO widget using the following calculation: (number of minutes the alert was not firing) / (total number of minutes)
What do you think about this approach?
I would expect this type of dashboard to show a 28d window, where we count the number of seconds in the last 28 days in which we reached our SLO availability, which is made up of a number of SLIs that a breakdown should show.
We can easily do that by applying the same formula over 28 days: (number of minutes the alert was not firing) / (total number of minutes). The breakdown is then simply showing the results (the times the alert was firing) in a table.
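As a rough sketch of what such a panel query could look like, assuming Prometheus evaluates the alert and therefore exposes the built-in ALERTS series (the 28d window and 1m subquery resolution are illustrative, not a prescribed implementation):

```promql
# Sketch: availability as "minutes the alert was not firing / total minutes"
# over 28 days, using Prometheus' built-in ALERTS series.
# KubeAPIHighLatency is just the example alert mentioned above.
1 -
  (
    sum(count_over_time(ALERTS{alertname="KubeAPIHighLatency", alertstate="firing"}[28d:1m]))
    or vector(0)  # count_over_time returns nothing if the alert never fired
  )
  /
  count_over_time(vector(1)[28d:1m])
```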
I think if you choose that as an indicator that’s fine, but I don’t think we should be promoting that practice. We should base this on well-established workflows for SLOs/SLAs, just like the alerts that we generate from the slo-libsonnet library.
I'm just trying to avoid duplication; that's why I thought of using the existing alert, as it already defines the SLI/SLO. This is just for display purposes, so you'll have a nice gauge panel showing the current SLO.
I was able to do something using the SLO library, something like this:
I was thinking to add it to the API server dashboard.
Thoughts?
I also want to add an error budget panel (see metalmatze/slo-libsonnet#27), but this maybe will go on another PR.
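For what it's worth, a minimal PromQL sketch of what such an error budget panel could compute, assuming a 99% availability target over 28 days and apiserver_request_total as the SLI (the actual panel generated by slo-libsonnet may well look different):

```promql
# Remaining error budget for an assumed 99% availability SLO over 28d:
# 1 - (observed error ratio / allowed error ratio).
1 -
  (
    sum(rate(apiserver_request_total{code=~"5.."}[28d]))
    /
    sum(rate(apiserver_request_total[28d]))
  )
  /
  (1 - 0.99)
```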
I've actually been working on a similar dashboard outside of this project for now, talking to the apiserver people at Red Hat. It will take a bit more time, but we should be able to come up with a proper SLO-based dashboard soonish.
Here's a screenshot. Currently I've only added recording rules. Next are alerting rules and multi-burn rates for errors and latency; after that, actually creating the dashboard shown below in Jsonnet.
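To illustrate the multi-burn-rate part, here is a hedged PromQL sketch of one such alert expression, following the multi-window, multi-burn-rate idea from the SRE workbook. The apiserver_request:burnrate1h / apiserver_request:burnrate5m recording rule names are assumptions (each holding the error ratio over its window), and the 99% target and 14.4 factor are example numbers only:

```promql
# Fast-burn condition: fire when the error ratio over both the 1h and 5m
# windows exceeds 14.4x the allowed error budget rate for a 99% SLO.
sum(apiserver_request:burnrate1h) > (14.4 * (1 - 0.99))
and
sum(apiserver_request:burnrate5m) > (14.4 * (1 - 0.99))
```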
Nice! I think it might be better to use the HTTP methods instead of read/write. Besides that, I'm looking forward to trying it out :)
Hey, how can I try this? Are there plans to merge metalmatze/slo-libsonnet into here? Just starting out with Grafana dashboarding, so please excuse my noob questions 😄
This has been merged already and is on master.
Awesome @metalmatze!! So metalmatze/slo-libsonnet doesn't have anything else that may be useful? IIUC this is at the Kubernetes apiserver level; is there a similar dashboard at the pod/service level?
specifically, as mentioned in this comment by @omerlh
I recreated the bits and pieces from slo-libsonnet specifically for the Kubernetes APIServer, as we can be even more specific. It's not a generic HTTP server to us. We know a bit more about the internals here. Something like this doesn't make sense for Pods or Services as these are application specific. That's something where the slo-libsonnet helps you for application specific SLOs.
RED from nginx ingress stats - requests, duration, errors (5XX)
Is out of scope for the kubernetes-mixin project. There are a lot of clusters out there that most likely do not even run nginx. With that being said, I have a nginx-ingress RED dashboard for myself: https://gist.github.com/metalmatze/f8351a56fa4393853c4a598efa60ab53
This issue has not had any activity in the past 30 days, so the stale label has been added to it. The stale label will be removed if there is new activity. Add the keepalive label to exempt this issue from the stale check action. Thank you for your contributions!
It would be nice to have a dashboard that shows the current status of the cluster at a glance and contains very minimal information, like:
All of these should be single-stat panels with links to the relevant dashboards for drill-down.
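Purely as an illustration of the kind of single-stat query this could be built from, assuming kube-state-metrics is available (one possible example, not a proposed final list):

```promql
# Example single stat: fraction of nodes currently reporting Ready,
# linking to the existing node dashboards for drill-down.
sum(kube_node_status_condition{condition="Ready", status="true"})
/
count(kube_node_status_condition{condition="Ready", status="true"})
```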
Thoughts?