red-hat-data-services / odh-deployer

The odh-deployer image creates a custom resource for the image in operator image in odh-operator-allinone
Apache License 2.0

feat: addition of codeflare stack to monitoring #380

Closed: dimakis closed this 11 months ago

dimakis commented 1 year ago

Closes #373

Description

For reference if needed: https://github.com/red-hat-data-services/odh-deployer/pull/365

The respective SOPs are on these merge requests: https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/merge_requests/95 https://gitlab.cee.redhat.com/dsaridak/managed-tenants-sops/-/merge_requests/1

We want the alerts' runbook/triage to point to their SOPs

How Has This Been Tested?

Merge criteria:

openshift-ci[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign lavlas for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/red-hat-data-services/odh-deployer/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

dimakis commented 1 year ago

@ChristianZaccaria can you verify this and stick up a couple of screenshots please?

ChristianZaccaria commented 1 year ago

> @ChristianZaccaria can you verify this and stick up a couple of screenshots please?

MCAD and KubeRay are not getting scraped at the moment.

For MCAD, this has to do with this PR for MCAD. There are two things to this:

A similar issue could be for KubeRay.


zdtsw commented 1 year ago

Unless I missed something, shouldn't this be added to https://github.com/red-hat-data-services/odh-manifests/blob/master/monitoring/prometheus/prometheus-configs.yaml as well? It won't be in use for v2 for now (since we have not done the cloud-service part). We no longer sync changes from odh-deployer, only mostly from odh-manifests, so it might get lost somewhere in the near future when the common monitoring solution moves into the operator entirely.

ChristianZaccaria commented 1 year ago

> Unless I missed something, shouldn't this be added to https://github.com/red-hat-data-services/odh-manifests/blob/master/monitoring/prometheus/prometheus-configs.yaml as well? It won't be in use for v2 for now (since we have not done the cloud-service part). We no longer sync changes from odh-deployer, only mostly from odh-manifests, so it might get lost somewhere in the near future when the common monitoring solution moves into the operator entirely.

Yes, this should be added to odh-manifests too. Great spot, thanks Wen! I will update this PR ASAP. I'm just having that small issue with scraping KubeRay.

zdtsw commented 1 year ago

> > Unless I missed something, shouldn't this be added to https://github.com/red-hat-data-services/odh-manifests/blob/master/monitoring/prometheus/prometheus-configs.yaml as well? It won't be in use for v2 for now (since we have not done the cloud-service part). We no longer sync changes from odh-deployer, only mostly from odh-manifests, so it might get lost somewhere in the near future when the common monitoring solution moves into the operator entirely.
>
> Yes, this should be added to odh-manifests too. Great spot, thanks Wen! I will update this PR ASAP. I'm just having that small issue with scraping KubeRay.

You don't need to get this into odh-manifests, since that repo will be dead before CF ships with RHODS on managed clusters. As long as the content of this PR is correct, the operator will take care of the config.

zdtsw commented 1 year ago

I suggest closing this PR, since we are not going to use odh-deployer for any future development. For reference, the new config is at https://github.com/red-hat-data-services/rhods-operator/blob/rhods-2.4/config/monitoring/prometheus/apps/prometheus-configs.yaml#L271