openshift / cluster-node-tuning-operator

Manage node-level tuning by orchestrating the tuned daemon.
Apache License 2.0

OCPBUGS-30647: Remove TuneD timeout code and reload on ERRORs #998

Closed jmencak closed 8 months ago

jmencak commented 8 months ago

In the past, there were reports of TuneD getting stuck during profile application (e.g. rhbz#2013940). As a workaround, a timeout with exponential backoff was implemented to restart TuneD whenever it failed to finish applying a profile. However, with some TuneD profiles and hardware configurations, applying a profile can take a very long time. In some environments, such as latency-sensitive deployments, repeated profile application can make matters worse. Therefore, remove the timeout functionality.

Also, remove the reload on TuneD ERRORs. This is an openshift-tuned bug.

openshift-ci-robot commented 8 months ago

@jmencak: This pull request references Jira Issue OCPBUGS-30647, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/998).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-node-tuning-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

jmencak commented 8 months ago

/jira refresh

openshift-ci-robot commented 8 months ago

@jmencak: This pull request references Jira Issue OCPBUGS-30647, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug:
* bug is open, matching expected state (open)
* bug target version (4.16.0) matches configured target version for branch (4.16.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (liqcui@redhat.com), skipping review request.

openshift-ci[bot] commented 8 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmencak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-node-tuning-operator/blob/master/OWNERS)~~ [jmencak]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

jmencak commented 8 months ago

/retest

jmencak commented 8 months ago

/retest

yanirq commented 8 months ago

> In the past, there were reports of TuneD being stuck during profile application (e.g. rhbz#2013940). As a workaround a timeout with exponential backoff was implemented to restart TuneD if it was ever stuck not finishing profile application.

Can this still be the case here? What is the alternative to the old workaround?

jmencak commented 8 months ago

> > In the past, there were reports of TuneD being stuck during profile application (e.g. rhbz#2013940). As a workaround a timeout with exponential backoff was implemented to restart TuneD if it was ever stuck not finishing profile application.
>
> Can this still be the case here? What is the alternative to the old workaround?

We never root-caused the issue. It could have been something kernel-related. The alternative would be to find the root cause and fix it. If we are to block kubelet until TuneD completes in the future, we'll need a timeout mechanism of some sort, but likely a systemd-based one.

jmencak commented 8 months ago

/retest

jmencak commented 8 months ago

/retest

jmencak commented 8 months ago

/retest

jmencak commented 8 months ago

/retest

yanirq commented 8 months ago

/lgtm

openshift-ci[bot] commented 8 months ago

@jmencak: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-ci-robot commented 8 months ago

@jmencak: Jira Issue OCPBUGS-30647: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-30647 has been moved to the MODIFIED state.

openshift-bot commented 8 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202403241518.p0.gdf5d582.assembly.stream.el9 for distgit cluster-node-tuning-operator. All builds following this will include this PR.

openshift-merge-robot commented 8 months ago

Fix included in accepted release 4.16.0-0.nightly-2024-03-25-025514

openshift-merge-robot commented 7 months ago

Fix included in accepted release 4.16.0-0.nightly-2024-04-16-015315

jmencak commented 4 months ago

Cherry-pick to 4.15 will fail (https://github.com/openshift/cluster-node-tuning-operator/pull/970 is also needed), but do it anyway so that the automation creates the necessary Jira bugs.

/cherry-pick release-4.15

openshift-cherrypick-robot commented 4 months ago

@jmencak: #998 failed to apply on top of branch "release-4.15":

Applying: OCPBUGS-30647: Remove TuneD timeout code and reload on ERRORs
Using index info to reconstruct a base tree...
M   pkg/tuned/controller.go
M   pkg/tuned/run.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/tuned/run.go
Auto-merging pkg/tuned/controller.go
CONFLICT (content): Merge conflict in pkg/tuned/controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS-30647: Remove TuneD timeout code and reload on ERRORs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to [this](https://github.com/openshift/cluster-node-tuning-operator/pull/998#issuecomment-2198661660).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.