Closed dimakis closed 1 year ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign lavlas for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
I'm working on the runbook, so I will update this with the runbook when I have it
/hold
@dimakis there's something weird with the commit history in your PR, did you branch out of main?
Ah, I messed up and accidentally committed the deadman snitch and PD keys, so I had to follow the steps SRE gave me to rewrite the history.
If I understand correctly, this PR checks the status of the MCAD application. Since MCAD is launched via the KAR controller, how does the app know whether it's working or not?
Yes, it just checks availability, really. It's the bare minimum SRE requirement for release. I've got a doc that outlines a more detailed SLI/SLO metric plan to be introduced later, but we're starting at the basic level and then building on top of that.
I added a commit to this PR with additional rules involving SLO alerts for MCAD and CodeFlare Operator. You will find several probe_success:burnrate recording rules and alerting rules. I added 3 alerts for each component based on the error budget burn rate. These are based on the way Data Science Pipelines Operator defines them.
These alerts help ensure high availability of the components by tracking the rate at which they fail to meet their SLOs. Taking the MCAD Controller as an example, the SLO target for probe success is 99.95%. An alert is triggered when the error budget is burning too fast over both a 5-minute and a 1-hour window, and it fires if that condition holds continuously for 2 minutes.
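As a rough sketch of the kind of rule described above (not the actual contents of this PR), a multi-window burn-rate alert could look like the following. The `probe_success:burnrate` prefix, the 99.95% target, the 5-minute and 1-hour windows, and the 2-minute `for` come from the comment above. The `5m`/`1h` suffixes, the `instance` value, the group and alert names, the severity, and the 14.4 burn-rate factor are assumptions for illustration.

```yaml
groups:
  - name: mcad-slo-sketch  # hypothetical group name
    rules:
      # Recording rules: fraction of failed probes over each window.
      # The instance label value is an assumption, not taken from the PR.
      - record: probe_success:burnrate5m
        expr: 1 - avg_over_time(probe_success{instance="mcad-controller"}[5m])
      - record: probe_success:burnrate1h
        expr: 1 - avg_over_time(probe_success{instance="mcad-controller"}[1h])

      # Alerting rule: both windows must exceed the threshold.
      # Error budget for a 99.95% SLO is 1 - 0.9995 = 0.0005.
      # The 14.4 multiplier is an assumed fast-burn factor.
      - alert: MCADControllerProbeSuccessBurnRate  # hypothetical alert name
        expr: |
          probe_success:burnrate5m{instance="mcad-controller"} > (14.4 * (1 - 0.9995))
          and
          probe_success:burnrate1h{instance="mcad-controller"} > (14.4 * (1 - 0.9995))
        for: 2m
        labels:
          severity: critical  # assumed severity
```

Requiring both windows keeps a short blip from paging on its own: the 1-hour window confirms the failure is sustained, while the 5-minute window lets the alert clear quickly once probes recover.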
@astefanutti I'm very interested in your code review on this and/or changes to be made? Thanks in advance!
PR needs rebase.
…ity alerts
Adds the kuberay, codeflare and mcad components as scrape targets for the Prometheus deployment. Adds an up alert for each component pod. The SOPs for each component are in this PR
The MCAD and CodeFlare components need to expose a metrics endpoint before their alerts will pass. With this PR, Prometheus is actively looking for the pods, so the alerts fire.
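For illustration only, an availability alert built on the `up` metric might be written roughly as below. This is a sketch, not the rule file from this PR: the job label value, the group and alert names, the `for` duration, the severity, and the summary text are all assumptions.

```yaml
groups:
  - name: component-availability-sketch  # hypothetical group name
    rules:
      - alert: KubeRayOperatorDown  # hypothetical alert name
        # Prometheus sets `up` to 0 for a target whose last scrape failed,
        # for example when the pod is gone or exposes no metrics endpoint.
        expr: up{job="kuberay-operator"} == 0  # job label value is an assumption
        for: 2m  # assumed duration
        labels:
          severity: critical  # assumed severity
        annotations:
          summary: KubeRay operator scrape target is down
```

A rule of this shape fires for any configured scrape target that cannot be scraped, which is consistent with the MCAD and CodeFlare alerts firing until those components expose metrics.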
DO NOT MERGE UNTIL WORK IS CARRIED OUT IN MCAD AND CODEFLARE
Description
How Has This Been Tested?
Tested on an OSD cluster with the RHODS addon. Kill the kuberay-operator pods and watch the alert fire in the Prometheus UI. (Don't forget to scale down the codeflare and rhods operators first.) Once these changes are in place, the Prometheus deployment must be restarted.
Merge criteria:
[UPSTREAM] has been prepended to the commit message.