openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0

OCPBUGS-37734: Backport fix for OCPBUGS-36355 #1126

Closed jmencak closed 3 months ago

jmencak commented 3 months ago

This is a backport of #1095 which fixed OCPBUGS-36355 in 4.15.

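Summary of changes:

* Change the operand's home directory from TuneD's artifacts directory /var/lib/tuned to /var/lib/ocp-tuned
* Remove bin/run. While this means a little code duplication across Containerfiles, we no longer need to do anything special at run time. This should make things easier for the future.
* Do not inherit --enable-leader-election and --version NTO flags as they are not handled by subcommands anyway (yet)
* Remove openshift-tuned binary and use NTO subcommand instead.
* /var/lib/tuned/profiles-data is no longer used, remove it.
* Remove openshift-tuned PID file code. It is no longer used.
* Clean up after #844
* Remove TuneD timeout code and reload on ERRORs
* Fix logging in updateTunedProfile() and optimize the calls to update node annotations and update Profile.Status
* Clean up tunedStop() to return only one value
* During TuneD process shutdown, handle the fact the TuneD process might have already exited
* The openshift-tuned operand now no longer unnecessarily exits when TuneD process exits; when TuneD process exits, wait for k8s object changes and only then restart TuneD
* Do not use buffered channels
* The indication that TuneD is reloading is now a status bit potentially reportable back to the operator
* Introduce Change type for the TuneD event processor to avoid races, where it was previously possible to change TuneD configuration during TuneD profile reload
* Register the fact TuneD finished reloading in case the primary TuneD profile does not exist
* Conditional TuneD reload when Cloud Provider changes
* Minor logging and comment improvements
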
Resolves: OCPBUGS-37734

openshift-ci-robot commented 3 months ago

@jmencak: This pull request references Jira Issue OCPBUGS-37734, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.14.z) matches configured target version for branch (4.14.z)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
* release note type set to "Release Note Not Required"
* dependent bug [Jira Issue OCPBUGS-36355](https://issues.redhat.com//browse/OCPBUGS-36355) is in the state Closed (Done-Errata), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
* dependent [Jira Issue OCPBUGS-36355](https://issues.redhat.com//browse/OCPBUGS-36355) targets the "4.15.z" version, which is one of the valid target versions: 4.15.0, 4.15.z
* bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (liqcui@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/1126):

> This is a backport of #1095 which fixed OCPBUGS-36355 in 4.15.
>
> Summary of changes:
>
> * Change the operand's home directory from TuneD's artifacts directory /var/lib/tuned to /var/lib/ocp-tuned
> * Remove bin/run. While this means a little code duplication across Containerfiles, we no longer need to do anything special at run time. This should make things easier for the future.
> * Do not inherit --enable-leader-election and --version NTO flags as they are not handled by subcommands anyway (yet)
> * Remove openshift-tuned binary and use NTO subcommand instead.
> * /var/lib/tuned/profiles-data is no longer used, remove it.
> * Remove openshift-tuned PID file code. It is no longer used.
> * Clean up after #844
> * Remove TuneD timeout code and reload on ERRORs
> * Fix logging in updateTunedProfile() and optimize the calls to update node annotations and update Profile.Status
> * Clean up tunedStop() to return only one value
> * During TuneD process shutdown, handle the fact the TuneD process might have already exited
> * The openshift-tuned operand now no longer unnecessarily exits when TuneD process exits; when TuneD process exits, wait for k8s object changes and only then restart TuneD
> * Do not use buffered channels
> * The indication that TuneD is reloading is now a status bit potentially reportable back to the operator
> * Introduce Change type for the TuneD event processor to avoid races, where it was previously possible to change TuneD configuration during TuneD profile reload
> * Register the fact TuneD finished reloading in case the primary TuneD profile does not exist
> * Conditional TuneD reload when Cloud Provider changes
> * Minor logging and comment improvements
>
> Resolves: OCPBUGS-37734

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/release-4.14/OWNERS)~~ [jmencak]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

jmencak commented 3 months ago

e2e-hypershift looks like an infra issue; e2e-gcp-pao-updating-profile is slightly suspicious but could be infra too.

/retest

jmencak commented 3 months ago

Step e2e-hypershift-create-management-cluster failed after 31m8s.

I don't believe this is due to this PR. Let's retest, and if this persists, let's debug in the CI.

/retest

openshift-ci[bot] commented 3 months ago

@jmencak: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
jmencak commented 3 months ago

With this fix, I also tried to bring the operand code as close to 4.15 as possible to make future 4.14 patches simpler. Manual testing passed.

jmencak commented 3 months ago

@ffromani, @swatisehgal, can I have some eyes on this backport? Many thanks!

ffromani commented 3 months ago

/lgtm

I reviewed the changes and they seem OK. The most important part is the verification/testing anyway. The fact CI is passing gives us quite a bit of confidence.

jmencak commented 3 months ago

/label backport-risk-assessed

@liqcui, can we please have the cherry-pick-approved label? Thank you!

liqcui commented 3 months ago

/label cherry-pick-approved

openshift-ci-robot commented 3 months ago

@jmencak: Jira Issue OCPBUGS-37734: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-37734 has been moved to the MODIFIED state.

openshift-bot commented 3 months ago

[ART PR BUILD NOTIFIER]

Distgit: cluster-node-tuning-operator

This PR has been included in build cluster-node-tuning-operator-container-v4.14.0-202408150010.p0.g7d04f1d.assembly.stream.el9. All builds following this will include this PR.

openshift-merge-robot commented 3 months ago

Fix included in accepted release 4.14.0-0.nightly-2024-08-16-000851