Closed jmencak closed 8 months ago
@jmencak: This pull request references Jira Issue OCPBUGS-30647, which is invalid:
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@jmencak: This pull request references Jira Issue OCPBUGS-30647, which is valid. The bug has been moved to the POST state.
No GitHub users were found matching the public email listed for the QA contact in Jira (liqcui@redhat.com), skipping review request.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jmencak
/retest
In the past, there were reports of TuneD being stuck during profile application (e.g. rhbz#2013940). As a workaround, a timeout with exponential backoff was implemented to restart TuneD if it ever got stuck and failed to finish profile application.
Can this still be the case here? What is the alternative to the old workaround?
We never root-caused the issue. This could have been something kernel-related. The alternative would be finding the root cause and fixing it. If we are to block kubelet until TuneD completes in the future, we'll need a timeout mechanism of some sort, but likely systemd-based.
/retest
/lgtm
@jmencak: all tests passed!
@jmencak: Jira Issue OCPBUGS-30647: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-30647 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER]
This PR has been included in build cluster-node-tuning-operator-container-v4.16.0-202403241518.p0.gdf5d582.assembly.stream.el9 for distgit cluster-node-tuning-operator. All builds following this will include this PR.
Fix included in accepted release 4.16.0-0.nightly-2024-03-25-025514
Fix included in accepted release 4.16.0-0.nightly-2024-04-16-015315
Cherry-pick to 4.15 will fail (https://github.com/openshift/cluster-node-tuning-operator/pull/970 is also needed), but do it anyway so that the automation creates the necessary Jira bugs.
/cherry-pick release-4.15
@jmencak: #998 failed to apply on top of branch "release-4.15":
Applying: OCPBUGS-30647: Remove TuneD timeout code and reload on ERRORs
Using index info to reconstruct a base tree...
M pkg/tuned/controller.go
M pkg/tuned/run.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/tuned/run.go
Auto-merging pkg/tuned/controller.go
CONFLICT (content): Merge conflict in pkg/tuned/controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 OCPBUGS-30647: Remove TuneD timeout code and reload on ERRORs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
In the past, there were reports of TuneD being stuck during profile application (e.g. rhbz#2013940). As a workaround, a timeout with exponential backoff was implemented to restart TuneD if it ever got stuck and failed to finish profile application. However, with some TuneD profiles and hardware configurations, it can take a very long time for a TuneD profile to be applied. In environments such as latency-sensitive deployments, repeated profile application can make matters worse. Therefore, remove the timeout functionality.
Also, remove the reload on TuneD ERRORs. This is an openshift-tuned bug.