openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0
102 stars 105 forks source link

OCPBUGS-30647: NTO operand (openshift-tuned) fixes #1024

Closed jmencak closed 7 months ago

jmencak commented 7 months ago

Changes:

openshift-ci[bot] commented 7 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/master/OWNERS)~~ [jmencak] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
jmencak commented 7 months ago

/cc @yanirq @ffromani @MarSik This is the cleanup PR I talked about during the bug scrub. Once this merges, I'm no longer aware of any races in the operand code. Next plan it to remove the tuned/rendered resource. This will come in a separate PR.

ffromani commented 7 months ago

thanks, I'll carefully review ASAP

yanirq commented 7 months ago

/retitle OCPBUGS-30647: NTO operand (openshift-tuned) fixes

openshift-ci-robot commented 7 months ago

@jmencak: This pull request references Jira Issue OCPBUGS-30647, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1024): >Changes: > * fix logging in `updateTunedProfile()` and optimize the calls to update node annotations and update Profile.Status > * clean up `tunedStop()` to return only one value > * during `TuneD` process shutdown, handle the fact the `TuneD` process might have already exitted > * the openshift-tuned operand now no longer unnecessarily exits when `TuneD` process exits; when `TuneD` process exits, wait for k8s object changes and only then restart `TuneD` > * do not use buffered channels > * the indication that `TuneD` is reloading is now a status bit potentially reportable back to the operator > * introduce Change type for the `TuneD` event processor to avoid races, where it was previously possible to change `TuneD` configuration during `TuneD` profile reload > * register the fact `TuneD` finished reloading in case the primary `TuneD` profile does not exist > * conditional `TuneD` reload when Cloud Provider changes > * minor logging and comment improvements Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
yanirq commented 7 months ago

/retest-required

jmencak commented 7 months ago

Infra issues /retest

jmencak commented 7 months ago

/jira refresh

openshift-ci-robot commented 7 months ago

@jmencak: This pull request references Jira Issue OCPBUGS-30647, which is valid.

3 validation(s) were run on this bug * bug is open, matching expected state (open) * bug target version (4.16.0) matches configured target version for branch (4.16.0) * bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (liqcui@redhat.com), skipping review request.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1024#issuecomment-2042467434): >/jira refresh > Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
ffromani commented 7 months ago

/test e2e-gcp-pao-workloadhints

jmencak commented 7 months ago

initial review, I'll need more time to process the state machine changes. So far this seems a very nice set of improvements and cleanups, but more than ever I'd have split each change from the bullet list in its own commit to make it easier to review and to enable git bisect and easier troubleshooting later on.

Thank you for the initial review, Francesco. As for the split to several commits/bisect, OpenShift CI has a merge_method squash, so I believe this will not benefit you much unless you rely on github's web interface, right? If you insist on this, I can try to do the split later on once all the reviewers are happy with the code.

ffromani commented 7 months ago

initial review, I'll need more time to process the state machine changes. So far this seems a very nice set of improvements and cleanups, but more than ever I'd have split each change from the bullet list in its own commit to make it easier to review and to enable git bisect and easier troubleshooting later on.

Thank you for the initial review, Francesco. As for the split to several commits/bisect, OpenShift CI has a merge_method squash, so I believe this will not benefit you much unless you rely on github's web interface, right? If you insist on this, I can try to do the split later on once all the reviewers are happy with the code.

If we ever split the commits is better to do it sooner to help the review :) Incidentally, I'm also against the merge method squash because it pretty much prevents any hope of (git) bisect in exchange to (AFAICT) only a marginally easier way to backport PRs. The tradeoff is pretty negative, but perhaps there are reasons I'm missing. Anyway I'm not pushing for this split, especially since it seems I'm the odd one here going against the flow. ("when in Rome...")

ffromani commented 7 months ago

thanks @jmencak for the quick updates. I'll finish the review ASAP - still need to make sure I have a decent grasp of all the flows changed here.

jmencak commented 7 months ago

/retest

ffromani commented 7 months ago

/hold

I'd welcome more reviews from different ppl

jmencak commented 7 months ago

/retest

jmencak commented 7 months ago

/retest

jmencak commented 7 months ago

/hold

I'd welcome more reviews from different ppl

Thank you. Any more things I should change, @yanirq , @MarSik so that we unblock Francesco? Thank you!

yanirq commented 7 months ago

/retest-required

MarSik commented 7 months ago

/lgtm /hold cancel

ffromani commented 7 months ago

/retest

openshift-ci-robot commented 7 months ago

/retest-required

Remaining retests: 0 against base HEAD 226cd8323a8f2beeb521ffd86522811b6e0ebee4 and 2 for PR HEAD 8f0e0639683309a10646a49d3961e3b5677c53b9 in total

openshift-ci[bot] commented 7 months ago

@jmencak: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci-robot commented 7 months ago

@jmencak: Jira Issue OCPBUGS-30647: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-30647 has been moved to the MODIFIED state.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1024): >Changes: > * fix logging in `updateTunedProfile()` and optimize the calls to update node annotations and update Profile.Status > * clean up `tunedStop()` to return only one value > * during `TuneD` process shutdown, handle the fact the `TuneD` process might have already exitted > * the openshift-tuned operand now no longer unnecessarily exits when `TuneD` process exits; when `TuneD` process exits, wait for k8s object changes and only then restart `TuneD` > * do not use buffered channels > * the indication that `TuneD` is reloading is now a status bit potentially reportable back to the operator > * introduce Change type for the `TuneD` event processor to avoid races, where it was previously possible to change `TuneD` configuration during `TuneD` profile reload > * register the fact `TuneD` finished reloading in case the primary `TuneD` profile does not exist > * conditional `TuneD` reload when Cloud Provider changes > * minor logging and comment improvements Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-merge-robot commented 7 months ago

Fix included in accepted release 4.16.0-0.nightly-2024-04-16-015315