red-hat-data-services / odh-deployer

The odh-deployer image creates a custom resource for the operator image in odh-operator-allinone
Apache License 2.0
5 stars, 42 forks

Add alerts for DSP Operator #319

Closed DharmitD closed 1 year ago

DharmitD commented 1 year ago

Description

Adding alerting rules for the Data Science Pipelines Operator.

How Has This Been Tested?

Merge criteria:

DharmitD commented 1 year ago

This PR is blocked until https://github.com/red-hat-data-services/odh-deployer/pull/314 and https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/merge_requests/82 are merged. Once they're merged, this PR will be rebased and marked ready for review.

jgarciao commented 1 year ago

Remember to update the prometheus init-container (wait-for-deployment) in order to wait until the data science pipeline operator is active, or the alert will fire

DharmitD commented 1 year ago

> Remember to update the prometheus init-container (wait-for-deployment) in order to wait until the data science pipeline operator is active, or the alert will fire

Done, updated the prometheus init container.

jgarciao commented 1 year ago

The same is needed for blackbox-exporter's init container (I just saw it). Also, if you call curl with the -sS parameter in the init container, the logs are cleaner.
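A minimal sketch of the kind of wait loop an init container could run, to illustrate the `-sS` suggestion. The function name `wait_for_url`, its arguments, and the retry defaults are hypothetical and are not taken from this PR; the actual init container script may look different.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it responds successfully or attempts run out.
# Usage: wait_for_url URL [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_url() {
  url=$1
  max_attempts=${2:-60}   # assumed default, not from the PR
  delay=${3:-5}           # assumed default, not from the PR
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -s hides the progress meter, -S still prints the error message on failure,
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -sSf -o /dev/null "$url"; then
      echo "ready: $url"
      return 0
    fi
    echo "attempt $attempt/$max_attempts: $url not ready yet"
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "gave up waiting for $url" >&2
  return 1
}
```

With plain `curl`, each retry writes a progress meter to the container log; with `-sS` only genuine error messages appear, which is the cleaner output the comment refers to.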

DharmitD commented 1 year ago

> The same is needed for blackbox-exporter's init container (I just saw it). Also, if you call curl with the -sS parameter in the init container, the logs are cleaner.

Done, updated to have these changes, and rebased to main.

harshad16 commented 1 year ago

Tested the changes: the alerts fire as info-level alerts.

[Screenshot from 2023-04-14 15-20-23] [Screenshot from 2023-04-14 15-20-11]

Process for testing:

harshad16 commented 1 year ago

Retested with changes:

Waited for all 3 alerts to fire:

[Screenshot from 2023-04-14 15-48-10] [Screenshot from 2023-04-14 15-47-44]

Process for testing:

After applying the rules, I scaled down the service, and the alerts started firing.
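The scale-down step could be scripted roughly as follows. This is only a sketch: the helper name `scale_deployment` is hypothetical, and the thread does not state the namespace or deployment name, so both are left as arguments.

```shell
#!/bin/sh
# Sketch of the manual test step: scale a deployment down (or back up) with oc.
# OC can be overridden (for example OC=echo) to print the command instead of running it.
OC=${OC:-oc}

# Usage: scale_deployment NAMESPACE DEPLOYMENT REPLICAS
scale_deployment() {
  "$OC" scale "deployment/$2" --replicas="$3" -n "$1"
}
```

For example, `scale_deployment <namespace> <deployment> 0` stops the service so the alerts can fire, and `scale_deployment <namespace> <deployment> 1` restores it afterwards.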

/lgtm

HumairAK commented 1 year ago

/approve

HumairAK commented 1 year ago

From @jgarciao :

"If you are confident with PR319 and merge it, it will be included in a pre RC build DevOps will create on Monday (so I'll be able to test that with all the changes)" ~ Jorge

/label qe-approved

openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harshad16, HumairAK

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/red-hat-data-services/odh-deployer/blob/main/OWNERS)~~ [HumairAK]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment