opendatahub-io / ai-edge

ODH integration with AI at the Edge use cases
Apache License 2.0
8 stars, 17 forks

RHOAIENG-2435: add opentelemetry operator install and collector specs #232

Closed StevenTobin closed 4 months ago

StevenTobin commented 5 months ago

Description

Deploy an otel collector to gather metrics from the inference containers.

JIRA: RHOAIENG-2435

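How Has This Been Tested?

- Deploy ACM and connect the edge cluster as normal
- Follow the instructions in the [README](https://github.com/StevenTobin/ai-edge/blob/add_opentelemetry_collection_of_metrics/README.md#observability-setup) to deploy the observability pieces
- Confirm that the inference container metrics are available in the grafana instance in the `open-cluster-management-observability` namespace

Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work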

StevenTobin commented 5 months ago

/retest

openshift-ci-robot commented 5 months ago

@StevenTobin: This pull request references RHOAIENG-2435 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/opendatahub-io/ai-edge/pull/232):

> ## Description
>
> Deploy an otel collector to gather metrics from the inference containers.
>
> ## How Has This Been Tested?
>
> - Deploy ACM and connect the edge cluster as normal
> - Follow the instructions in the [README](https://github.com/StevenTobin/ai-edge/blob/add_opentelemetry_collection_of_metrics/README.md#observability-setup) to deploy the observability pieces
> - Confirm that the inference container metrics are available in the grafana instance in the `open-cluster-management-observability` namespace
>
> ## Merge criteria:
>
> - [x] The commits are squashed in a cohesive manner and have meaningful messages.
> - [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
> - [x] The developer has manually tested the changes and verified that the changes work

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=opendatahub-io%2Fai-edge). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci-robot commented 5 months ago

@StevenTobin: This pull request references RHOAIENG-2435 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/opendatahub-io/ai-edge/pull/232):

> ## Description
>
> Deploy an otel collector to gather metrics from the inference containers.
>
> JIRA: [RHOAIENG-2435](https://issues.redhat.com//browse/RHOAIENG-2435)
>
> ## How Has This Been Tested?
>
> - Deploy ACM and connect the edge cluster as normal
> - Follow the instructions in the [README](https://github.com/StevenTobin/ai-edge/blob/add_opentelemetry_collection_of_metrics/README.md#observability-setup) to deploy the observability pieces
> - Confirm that the inference container metrics are available in the grafana instance in the `open-cluster-management-observability` namespace
>
> ## Merge criteria:
>
> - [x] The commits are squashed in a cohesive manner and have meaningful messages.
> - [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
> - [x] The developer has manually tested the changes and verified that the changes work

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=opendatahub-io%2Fai-edge). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

grdryn commented 4 months ago

/lgtm cancel

StevenTobin commented 4 months ago

/retest

StevenTobin commented 4 months ago

> @StevenTobin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> | Test name | Commit | Details | Required | Rerun command |
> | --- | --- | --- | --- | --- |
> | ci/prow/test-ai-edge | 8b04dcb | link | true | /test test-ai-edge |
>
> Full PR test history. Your PR dashboard.

Looks like this is an infrastructure failure: the job cannot list pods in the `clusters-294308c85dab77341a2e` namespace.

StevenTobin commented 4 months ago

/retest

StevenTobin commented 4 months ago

> @StevenTobin: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
>
> | Test name | Commit | Details | Required | Rerun command |
> | --- | --- | --- | --- | --- |
> | ci/prow/test-ai-edge | 8b04dcb | link | true | /test test-ai-edge |
>
> Full PR test history. Your PR dashboard.

This still looks like an infrastructure failure to me.

ERROR   Failed to create cluster    {"error": "failed to create infra: failed to list availability zones: UnauthorizedOperation: You are not authorized to perform this operation. User: <snip> is not authorized to perform: ec2:DescribeAvailabilityZones with an explicit deny in a service control policy\n\tstatus code: 403
grdryn commented 4 months ago

/override ci/prow/test-ai-edge

I've tested this out locally (both the contents of this PR, and I've run the go tests)

openshift-ci[bot] commented 4 months ago

@grdryn: Overrode contexts on behalf of grdryn: ci/prow/test-ai-edge

In response to [this](https://github.com/opendatahub-io/ai-edge/pull/232#issuecomment-2048079765):

> /override ci/prow/test-ai-edge
>
> I've tested this out locally (both the contents of this PR, and I've run the go tests)

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

LaVLaS commented 4 months ago

/approve

LGTM. Adding approval based on the other 2 reviews.

openshift-ci[bot] commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LaVLaS

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/opendatahub-io/ai-edge/blob/main/OWNERS)~~ [LaVLaS]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment