Closed kami619 closed 3 weeks ago
@ahus1 I have addressed most of the review feedback, and I still have some doubts on how to implement your input on the PromQL queries. I will connect with you offline and work on it.
@ahus1 I added what we thought would be good concerning the structure of the doc. If you can double-check the error-rate-related PromQL queries, that would be great.
@kami619 - I pushed a reworked version:
- [...] `#` [...] so I used callouts. This enabled me to do real PromQL and not only pseudo PromQL.
- [...] `.*` there. I did that to match the service description.
- [...] `rate()` instead of `irate()` to create averages. When using `irate()`, this provides us the changes between datapoints, which is great for analysis, but doesn't provide the averages asked for by SLOs.

Let me know your thoughts on this one. Have a great weekend!
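To make the `rate()` versus `irate()` point above concrete, here is a minimal Python sketch of the difference. It is a simplification under stated assumptions, not how Prometheus actually computes these functions: it ignores counter resets and the extrapolation to window boundaries that `rate()` performs. The function names `simple_rate` and `simple_irate` and the sample values are made up for illustration.

```python
from typing import List, Tuple

# (timestamp in seconds, counter value); hypothetical data, not real Keycloak metrics
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the whole window (first vs. last sample)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples: List[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# A request counter scraped every 15 s: steady traffic, then a burst in the last interval.
samples = [(0, 0), (15, 150), (30, 300), (45, 450), (60, 1200)]

window_avg = simple_rate(samples)   # (1200 - 0) / 60
last_step = simple_irate(samples)   # (1200 - 450) / 15
print(window_avg, last_step)        # prints "20.0 50.0"
```

The window average (20 requests/s) is the kind of figure an SLO over an interval asks for, while the last-step value (50 requests/s) only reflects the most recent burst, which is useful for analysis but not for averages.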
@ahus1
Thanks, Alex, for the detailed review.
The SRE book-styled definitions are a standard approach, and I agree with those changes.
SLO interval definitions also create a good starting point for discussions outside of our team; it's a good idea to do it that way.
Thanks for making the PromQL query section more presentable and practical. I owe you one on this :)
Coming to the four golden signals: we are not referencing the other two, Throughput (Traffic) and Saturation. Do we want to mention them, or keep them out of this page for now?
Hi Kamesh, when looking at SLIs that measure user-facing behaviour, I don't see the need for capturing throughput and saturation separately. If something saturates within Keycloak, this would affect the SLIs we already have (slower responses, more errors). There will be other dashboards for troubleshooting and capacity planning where we will come back to them.
I think we reached the "as simple as possible" here.
If you agree, we can merge it and ask the community for feedback once it has been published on GitHub Pages.
I am very happy with the final result. Thanks again @ahus1. I think we can merge it now.
Fixes #579