add Keycloak SLO docs fixes

kami619 commented 3 weeks ago

Fixes #579

kami619 commented 3 weeks ago

@ahus1 I have addressed most of the review feedback, but I still have some doubts about how to implement your input on the PromQL queries. I will connect with you offline and work on it.

kami619 commented 3 weeks ago

@ahus1 I added what we thought would be good concerning the structure of the doc. If you can double-check the error-rate-related PromQL queries, that would be great.

ahus1 commented 3 weeks ago

@kami619 - I pushed a reworked version:

It takes the definitions of SLI and SLO from the Google SRE book. We shouldn't try to come up with our own definitions.
As the Google definition first defines the SLI, and then defines the SLO on top of it, the SLI goes first in the table.
As per Google SRE, there is no distinction between an SLI and an SLI metric: the SLI is already the metric. Therefore the columns have changed.
I tried to add some structural headings
The SLOs now each have an interval. I'm quite sure about the availability (one month, although one can discuss 4 weeks); I'm less sure about the interval for the error rate and response time, which are 5 min for now.
In the service definition I tried to be a little bit more specific by referencing "the applications that use Keycloak"
The characteristics column references the names from the 4 golden signals (Errors and Latency)
I found that PromQL allows for multi-line expressions, and for comments using # so I used callouts. This enabled me to do real PromQL and not only pseudo PromQL.
The PromQL examples will work in our SLO dashboards and can be copy-pasted there for testing.
The URLs in the query are now all authentication requests, not only those that we have in our dashboard today, so you see more .* there. I did that to match the service description
Each expression returns only one value for a Keycloak instance, not one per URL or per Pod, as the users wouldn't care about the values per Pod, and therefore the SLO doesn't either. Still, when admins break it down to find a cause, they will keep those labels, but that will IMHO be a different diagram.
The expressions use rate() instead of irate() to create averages. irate() gives us the change between the last two data points, which is great for analysis, but doesn't provide the averages that SLOs ask for.
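
To illustrate the rate() vs. irate() point, here is a minimal Python sketch of the arithmetic. It is a simplification and not how Prometheus implements these functions: it ignores counter resets and the extrapolation that rate() applies at the window boundaries. The function names `simple_rate` and `simple_irate` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# One scraped sample: (timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the whole window (first vs. last sample)."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical request counter, scraped every 60 s over a 5 min window.
# Traffic is steady for four minutes and drops sharply in the last minute.
window: List[Sample] = [
    (0, 0), (60, 600), (120, 1200), (180, 1800), (240, 2400), (300, 2430),
]

print("rate :", simple_rate(window))   # average over the full 5 min window
print("irate:", simple_irate(window))  # reflects only the final 60 s
```

With these made-up samples the window average and the last-interval value differ widely, which is why an SLO defined over an interval is better served by the averaged form.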

Let me know your thoughts on this one. Have a great weekend!

kami619 commented 3 weeks ago

@ahus1

Thanks, Alex, for the detailed review.

The SRE book-styled definitions are a standard approach, and I agree with those changes.
SLO interval definitions also create a good starting point for discussions outside of our team; it's a good idea to do it that way.
Thanks for making the PromQL query section more presentable and practical. I owe you one on this :)
Coming to the 4 golden signals, we are not referencing the other two, Throughput (Traffic) and Saturation. Do we want to mention them, or keep them out of this page for now?

ahus1 commented 3 weeks ago

Hi Kamesh, when looking at SLIs that measure user-facing behaviour, I don't see the need for capturing throughput and saturation separately. If something saturates within Keycloak, this would affect the SLIs we already have (slower responses, more errors). There will be other dashboards for troubleshooting and capacity planning where we will come back to them.

I think we reached the "as simple as possible" here.

If you agree, we can merge it and ask the community for feedback once it has been published on GitHub Pages.

kami619 commented 3 weeks ago

I am very happy with the final result. Thanks again @ahus1. I think we can merge it now.

keycloak / keycloak-benchmark

add Keycloak SLO docs fixes #1020