red-hat-data-services / odh-deployer

The odh-deployer image creates a custom resource for the operator image in odh-operator-allinone
Apache License 2.0

refactor: addition of codeflare components as scrape targets and availability alerts #365

Closed dimakis closed 1 year ago

dimakis commented 1 year ago

Adds the KubeRay, CodeFlare and MCAD components as scrape targets for the Prometheus deployment, and adds an `up` alert for each component pod.

The SOPs for each component are in this PR

The MCAD and CodeFlare components need to expose a metrics endpoint before the alerts can pass. With this PR, Prometheus actively looks for the pods, so the alerts fire until that endpoint exists.

DO NOT MERGE UNTIL WORK IS CARRIED OUT IN MCAD AND CODEFLARE

Description

How Has This Been Tested?

Tested on an OSD cluster with the RHODS addon. Kill the kuberay-operator pods and watch the alert fire in the Prometheus UI (don't forget to scale down the codeflare and rhods operators). Once these changes are in place, the Prometheus deployment must be restarted.

Merge criteria:

openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign lavlas for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/red-hat-data-services/odh-deployer/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

dimakis commented 1 year ago

I'm working on the runbook, so I'll update this PR with it once I have it.

dimakis commented 1 year ago

/hold

lucferbux commented 1 year ago

@dimakis there's something weird with the commit history in your PR, did you branch out of main?

dimakis commented 1 year ago

> @dimakis there's something weird with the commit history in your PR, did you branch out of main?

Ah, I messed up and accidentally committed the deadman snitch and PD keys, so I had to go through the steps that SRE gave me to rewrite the history.

asm582 commented 1 year ago

If I understand this PR correctly, it will check the status of the MCAD application. MCAD is launched via the KAR controller, so the question is: how does the App know whether it's working or not?

dimakis commented 1 year ago

> If I understand this PR correctly, it will check the status of the MCAD application. MCAD is launched via the KAR controller, so the question is: how does the App know whether it's working or not?

Yes, it'll just check availability, that's all really. It's the bare minimum SRE requirement for release. I've got a doc that outlines a more detailed SLI/SLO metric plan to be introduced later, but we're starting at the basic level and then building on top of that.

ChristianZaccaria commented 1 year ago

I added a commit to this PR with additional rules for SLO alerts for MCAD and the CodeFlare Operator. You will find several `probe_success:burnrate` recording rules and alerting rules. I added 3 alerts for each component, based on the error budget burn rate. These follow the way the Data Science Pipelines Operator defines them.

These alerts matter because they help ensure the high availability of these components by tracking the rate at which they are failing to meet their SLOs. Taking the MCAD Controller as an example, an alert is triggered when probe success falls below the 99.95% target fast enough to burn through the error budget; the alert considers both a 5-minute and a 1-hour burn rate and fires if the condition is met continuously for 2 minutes.
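To make the burn-rate idea concrete, here is a rough sketch of the arithmetic in Python. It is not the rule from this PR: the function names are hypothetical, the 14.4 threshold factor is an assumption borrowed from the common multi-window burn-rate convention for a 1-hour/5-minute pair, and the "continuously for 2 minutes" hold is not modelled. Only the 99.95% target and the two windows come from the description above.

```python
SLO_TARGET = 0.9995             # 99.95% probe-success target (from the comment above)
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of probes that are allowed to fail
BURN_FACTOR = 14.4              # ASSUMPTION: illustrative threshold, not taken from this PR


def error_ratio(probe_results):
    """Fraction of failed probes in a window (1 = success, 0 = failure)."""
    if not probe_results:
        raise ValueError("window has no samples")
    return 1 - sum(probe_results) / len(probe_results)


def burn_rate(probe_results, error_budget=ERROR_BUDGET):
    """How many times faster than 'exactly on budget' the window is failing."""
    return error_ratio(probe_results) / error_budget


def should_alert(window_5m, window_1h, factor=BURN_FACTOR):
    """Alert only when BOTH the short and the long window burn too fast.

    The short window shows the problem is still happening; the long window
    filters out brief blips that would not meaningfully dent the budget.
    """
    return burn_rate(window_5m) > factor and burn_rate(window_1h) > factor


if __name__ == "__main__":
    # A short outage: the last 5 minutes are all failures, the hour is mostly fine.
    print(should_alert([0] * 10, [1] * 239 + [0]))
    # A sustained outage: both windows show heavy failure.
    print(should_alert([0] * 10, [1] * 60 + [0] * 60))
```

Requiring both windows to exceed the threshold is the design choice that keeps a single failed probe from paging anyone while still catching an outage that is ongoing.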

@astefanutti I'd be very interested in your code review of this and any changes you think should be made. Thanks in advance!

openshift-merge-robot commented 1 year ago

PR needs rebase.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

ChristianZaccaria commented 1 year ago

Moved to here: https://github.com/red-hat-data-services/odh-deployer/pull/380