openshift / cluster-monitoring-operator

Manage the OpenShift monitoring stack
Apache License 2.0
247 stars 360 forks source link

OU-450: Migrate Monitoring Plugin Deployment from nginx to Go server #2412

Closed PeterYurkovich closed 3 days ago

PeterYurkovich commented 1 month ago

This PR looks to update CMO to format the monitoring-plugin kubernetes objects such it will work with the new go backend for monitoring-plugin. This PR should not be merged until its associated PR in the monitoring-plugin repo has been merged.

The primary change in this PR is the removal of the nginx configuration which is no longer needed as the go backend will serve the web server.

openshift-ci-robot commented 1 month ago

@PeterYurkovich: This pull request references OU-450 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2412): >This PR looks to update CMO to format the monitoring-plugin kubernetes objects such it will work with the new go backend for monitoring-plugin. This PR should not be merged until its associated [PR](https://github.com/openshift/monitoring-plugin/pull/128) in the monitoring-plugin repo has been merged. > >The primary change in this PR is the removal of the nginx configuration which is no longer needed as the go backend will serve the web server. > >* [ ] I added CHANGELOG entry for this change. >* [x] No user facing changes, so no entry in CHANGELOG was needed. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 1 month ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

PeterYurkovich commented 1 month ago

/test all

PeterYurkovich commented 1 month ago

/retest

PeterYurkovich commented 1 month ago

/test go-fmt

PeterYurkovich commented 1 month ago

/retest

PeterYurkovich commented 1 month ago

/test jsonnet-fmt

PeterYurkovich commented 1 month ago

/test all

PeterYurkovich commented 1 month ago

/hold

PeterYurkovich commented 1 month ago

Adding hold to ensure that the test image commit is removed before merging.

jgbernalp commented 1 month ago

/retest

jgbernalp commented 1 month ago

/retitle OU-450: Update Monitoring Plugin Deployment

jgbernalp commented 1 month ago

/unhold

Tai-RedHat commented 1 month ago

faild to launch with cluster-bot: logs logs2

level=error msg=Cluster operator monitoring Available is False with UpdatingConsolePluginComponentsFailed: UpdatingConsolePluginComponents: reconciling Console Plugin Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/monitoring-plugin: context deadline exceeded: got 2 unavailable replicas
Tai-RedHat commented 1 month ago

re-test with cluster-bot, monitoring-plugin works well with no other regression issues. /label qe-approved

openshift-ci-robot commented 1 month ago

@PeterYurkovich: This pull request references OU-450 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.18.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2412): >This PR looks to update CMO to format the monitoring-plugin kubernetes objects such it will work with the new go backend for monitoring-plugin. This PR should not be merged until its associated [PR](https://github.com/openshift/monitoring-plugin/pull/128) in the monitoring-plugin repo has been merged. > >The primary change in this PR is the removal of the nginx configuration which is no longer needed as the go backend will serve the web server. > >* [ ] I added CHANGELOG entry for this change. >* [x] No user facing changes, so no entry in CHANGELOG was needed. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
Tai-RedHat commented 1 month ago

BTW, can I get a brief explanation of why we deleted the monitor-plugin configmap?

jgbernalp commented 1 month ago

BTW, can I get a brief explanation of why we deleted the monitor-plugin configmap?

We are migrating the backend to Go, so the config map holding the old nginx configuration is not needed anymore.

jgbernalp commented 2 weeks ago

/retest

machine424 commented 1 week ago

Thanks for this.

Great to see that we’re working on unifying the plugin backends. I had raised this with the console folks https://redhat-internal.slack.com/archives/C6A3NV5J9/p1700750927381839 but never pushed it. I assume the next step would be to unify the Go backends now (https://github.com/openshift/monitoring-plugin/tree/main/pkg and https://github.com/openshift/logging-view-plugin/tree/main/pkg/server etc.) into a common repository?

I will run some of the IPv6 and dual-stack payloads here as it took us some time and tweaks to make nginx work there:

/payload-job periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-dualstack periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-ipv6

Also, looking at some of the backend logs, I can see:

time="2024-08-22T21:20:59Z" level=info msg="enabled features: []\n" module=main
time="2024-08-22T21:20:59Z" level=warning msg="cannot read config file, serving plugin with default configuration, tried /etc/plugin/config.yaml" error="open /etc/plugin/config.yaml: no such file or directory" module=server
time="2024-08-22T21:20:59Z" level=error msg="cannot read base manifest file" error="open opt/app-root/web/dist/plugin-manifest.json: no such file or directory" module=manifest
time="2024-08-22T21:20:59Z" level=info msg="listening on https://:9443" module=server

I don’t know if that’s normal. If it is, could we hide or omit the error?

Note that our e2e tests regarding the plugin aren’t complete. Maybe we’re letting something slip. I don’t think we check that the backend is reachable e.g. (I’ll try to add one in CMO, even though the more realistic test should involve the console being able to reach the backend).

Also, adding a readiness probe to the Deployment could be helpful.

openshift-ci[bot] commented 1 week ago

@machine424: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/94996e30-644c-11ef-8921-93ab56f198e1-0

openshift-ci[bot] commented 1 week ago

@machine424: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf4bc240-644c-11ef-91c4-a6d85d676f0b-0

machine424 commented 1 week ago

here is the PR to add that "readiness/reachability" check https://github.com/openshift/cluster-monitoring-operator/pull/2444, please tell me if that's sufficient.

PeterYurkovich commented 1 week ago

here is the PR to add that "readiness/reachability" check https://github.com/openshift/cluster-monitoring-operator/pull/2444, please tell me if that's sufficient.

That readiness check seems sufficient to me. I wouldn't expect or want CMO to check for much other than valid JSON as changes upstream shouldn't impact the downstream distribution tests.

Great to see that we’re working on unifying the plugin backends. I had raised this with the console folks redhat-internal.slack.com/archives/C6A3NV5J9/p1700750927381839 but never pushed it. I assume the next step would be to unify the Go backends now (openshift/monitoring-plugin@main/pkg and openshift/logging-view-plugin@main/pkg/server etc.) into a common repository?

The primary reasons for us making this change are:

1) Standardization across all our plugins (1, 2, 3, 4, 5) 2) Greater control over the server in preparation for changes being made server-side to the package-manifest.json as well as feature flags to support an Alerting UI in ACM

I think that bringing all the backends together in a single repository could be helpful, but it isn't in our short term goals that I'm aware of.

I don’t know if that’s normal. If it is, could we hide or omit the error?

That error seems out of place, as the go server should be able to locate the plugin-manifest.json. I'll take a look at it

PeterYurkovich commented 1 week ago

/retest

PeterYurkovich commented 1 week ago

/retest

PeterYurkovich commented 1 week ago

/retest

PeterYurkovich commented 1 week ago

/retest

machine424 commented 1 week ago

Thanks for the context.

I see the server exposes a /health route https://github.com/openshift/monitoring-plugin/blob/eb2a1bd822dc7a6f490276121bd654c5b000127d/pkg/server.go#L82 what do you think about adding a readinessProbe to the Deployment based on that like we do here https://github.com/openshift/cluster-monitoring-operator/blob/f7e92e869c43fa0455d656dcfc89045b60e5baa1/assets/admission-webhook/deployment.yaml#L60-L64? (no need to set up certs and stuff as kubelet will ignore that).

That will help us ensure real minimal readiness (the server starts up and can serve a route) of the plugin on all the flavors (ipv4, ipv6, dual stack etc.) that origin tests run on + https://github.com/openshift/cluster-monitoring-operator/pull/2444 will help us ensure that /plugin-manifest.json as well is accessible on the flavors that CMO e2e tests run on.

PeterYurkovich commented 1 week ago

/test all

PeterYurkovich commented 1 week ago

/retest

PeterYurkovich commented 5 days ago

The latest commits add a readiness probe as requested and fix the error that was being shown. An invalid default value was fixed upstream (fix) and this PR now also passes in that value.

machine424 commented 4 days ago

Good news, thanks again for this.

(I can see the server is logging every call to /health, I think we can hide those, it's not a blocker for this PR)

I'll re-reun those ipv6 and dual stack payloads just in case:

/payload-job periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-dualstack periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-ipv6

But this /lgtm I'll /hold just in case @simonpasquier has anything to add.

openshift-ci[bot] commented 4 days ago

@machine424: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bf47de70-6a9e-11ef-911b-0b6f05cec670-0

openshift-ci[bot] commented 4 days ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, PeterYurkovich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-monitoring-operator/blob/master/OWNERS)~~ [machine424] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
machine424 commented 3 days ago

going once, going twice... /unhold /retitle OU-450: Migrate Monitoring Plugin Deployment from nginx to Go server

Thanks again @PeterYurkovich for this. (Tell me if I should create a follow up ticket to reduce the verbosity on /health)

PeterYurkovich commented 3 days ago

I'll create the ticket in our board. Most of the backend code is shared currently, so I'll need to be done across all of our plugins, and also make sure that our other deployments have a similar readiness/liveliness probe.

Thanks for working through this

Edit: Issue here - https://issues.redhat.com/browse/OU-525

openshift-ci-robot commented 3 days ago

/retest-required

Remaining retests: 0 against base HEAD 1c9d5ecfcc0ddf6bfeff9a3a11310d39c6ba30b5 and 2 for PR HEAD 40da92314327f62552580d6d9a6d193efea75596 in total

openshift-ci[bot] commented 3 days ago

@PeterYurkovich: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-bot commented 2 days ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-monitoring-operator This PR has been included in build cluster-monitoring-operator-container-v4.18.0-202409052113.p0.g886a4be.assembly.stream.el9. All builds following this will include this PR.