Closed ffromani closed 1 month ago
@ffromani: This pull request references Jira Issue OCPBUGS-28647, which is valid.
Requesting review from QA contact: /cc @shajmakh
The bug has been updated to refer to the pull request using the external bug tracker.
/cc @MarSik @bartwensley
/cc @jmencak
/retest
Thank you for the PR! I'll try to understand the code and the motivation tomorrow. Some initial thoughts.
The key distinction is if the recommended profile changes or not, and there's a desire to defer application of changes only if a profile is update, not the first time it is applied.
Do you mean "only if profile is updated"? Also, which profile? TuneD profile or the k8s CR? I assume the former, so we should be specific to make things clear.
* (in-place) profile update is a change which does NOT trigger the recommended profile, and updates the setting, usually but not exclusively the sysctls.
IIUC, s/NOT trigger the recommended profile/NOT cause a switch to a different TuneD profile/ ?
We change the way the annotation is used. We now require a value, which can be either
* always: every Tuned object annotated this way will have its application deferred
Is this correct? Shouldn't we say something along the lines "if at least 1 Tuned object annotated this way exists, profile applications will be deferred"? Similarly for "update".
Edit: I take this back, I was thinking about the old implementation.
Also, the code uses "never" DeferMode, should we document this in the PR description and the commit?
thanks @jmencak , I will clarify the commit message indeed.
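For readers following the annotation discussion above, here is a minimal sketch of the decision it describes, written in Python purely for illustration (the operator itself is Go). The names `DeferMode`, `parse_defer_mode` and `should_defer` are hypothetical, and treating a missing annotation as "never" is an assumption, not something confirmed in this thread.

```python
from enum import Enum

# Annotation key as it appears in the manifests in this thread.
DEFERRED_ANNOTATION = "tuned.openshift.io/deferred"


class DeferMode(Enum):
    NEVER = "never"    # mode mentioned in review as existing in the code
    ALWAYS = "always"  # defer every application of the profile
    UPDATE = "update"  # defer in-place updates only


def parse_defer_mode(annotations):
    """Map a Tuned object's annotations to a DeferMode.

    Assumption: a missing annotation means "never". An unrecognized
    value raises ValueError through the Enum lookup.
    """
    value = annotations.get(DEFERRED_ANNOTATION)
    if value is None:
        return DeferMode.NEVER
    return DeferMode(value)


def should_defer(mode, first_time_apply):
    """Decide whether applying a profile change should be deferred.

    first_time_apply is True when the recommended TuneD profile changes
    (a switch to a different profile) and False for an in-place update
    such as a modified sysctl.
    """
    if mode is DeferMode.ALWAYS:
        return True
    if mode is DeferMode.UPDATE:
        return not first_time_apply
    return False
```

With this model, `should_defer(DeferMode.UPDATE, first_time_apply=True)` is `False`, while the same call with `first_time_apply=False` is `True`, which is the distinction the PR description is after.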
After some manual testing I see one potential issue with tuned.openshift.io/deferred: "update"
1) Create the following manifest
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-profile
  namespace: openshift-cluster-node-tuning-operator
  annotations:
    tuned.openshift.io/deferred: "update"
spec:
  profile:
  - data: |
      [main]
      summary=Custom OpenShift profile
      include=openshift-node
      [sysctl]
      kernel.shmmni=8192
    name: openshift-profile
  recommend:
  - match:
    - label: profile
    priority: 20
    profile: openshift-profile
2) Profile is applied. Now change the value of kernel.shmmni to something like kernel.shmmni=16384. Profile application is refused on the grounds of the deferred update.
3) Delete the TuneD pod responsible for setting the profile => profile is applied by the new pod.
Questions:
- Do we feel this is an issue? Should we make the information about the deferred state permanent so that TuneD pod restarts do not apply the changes?
yes, it's a bug.
* Should we get this pre-merge tested?
I guess yes
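To make the reported behavior concrete, the following toy model replays steps 1-3 from the manual test. It is only a sketch under stated assumptions: `ToyOperand`, `replay` and the JSON state file are hypothetical names invented for illustration, the model covers only the "update" mode, and it says nothing about how the PR ultimately addressed the bug.

```python
import json
import tempfile
from pathlib import Path


class ToyOperand:
    """Toy stand-in for the per-node pod; NOT the real NTO code."""

    def __init__(self, state_file=None):
        self.state_file = state_file
        self.applied = None   # profile content this pod believes is active
        self.deferred = None  # content held back, waiting to be applied
        if state_file is not None and state_file.exists():
            saved = json.loads(state_file.read_text())
            self.applied = saved["applied"]
            self.deferred = saved["deferred"]

    def sync(self, content):
        """Handle a profile annotated with deferred mode "update"."""
        if content == self.applied:
            return "unchanged"
        if self.applied is None:
            # First application: not deferred in "update" mode.
            self.applied, self.deferred = content, None
            result = "applied"
        else:
            # In-place change: hold it back.
            self.deferred = content
            result = "deferred"
        if self.state_file is not None:
            self.state_file.write_text(
                json.dumps({"applied": self.applied, "deferred": self.deferred})
            )
        return result


def replay(state_file=None):
    """Replay the three manual steps and return what each sync reported."""
    first = ToyOperand(state_file)
    results = [first.sync("kernel.shmmni=8192"),
               first.sync("kernel.shmmni=16384")]
    restarted = ToyOperand(state_file)  # step 3: pod deleted and recreated
    results.append(restarted.sync("kernel.shmmni=16384"))
    return results


if __name__ == "__main__":
    print(replay())  # state kept in memory only
    with tempfile.TemporaryDirectory() as tmp:
        print(replay(Path(tmp) / "state.json"))  # state survives the restart
```

In this model the in-memory run ends with the restarted pod applying the change, matching the observation in step 3, while the run with a state file keeps the change deferred. Whether persisting state is the right design for the operator is exactly the question raised above.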
/retest
Infra issues /retest
Thank you for the fix, Francesco. I can confirm the issue is no longer present. /lgtm but it would be nice to do some basic pre-merge testing to verify the functionality @liqcui . I'm sure Liquan will have questions on how to test this so this could also help to improve the docs/commit message. /hold to give other reviewers a chance to review this.
/retest
/retest
I still think it's (mostly) infra issue
/retest
Failures in basic/sysctl_d_override.go in e2e-aws-operator look suspicious. I haven't seen them in a long time. Prior to deferred updates, I tested about 500 iterations of the entire e2e-aws-operator test suite without any issues. Let's take a break and investigate on Monday. :)
some failures look legit (hugepages), but some others still look like infra issues
/retest
sorry, added my comment before reading this one. I'll abstain from further retests
No problem. At the moment there seem to be 4 test runs with the override test failing. This is most likely a legit issue.
/lgtm cancel based on this.
/hold cancel
ok, I can reproduce locally. Debugging.
interestingly, I can reproduce even on current master, but sporadically (1 in every 4 runs so far)
ok, the problem lies in the fact that previously we didn't recover the recommended profile name, and as a side effect that caused a TuneD reload on NTO operand restart. Now, the question is: should an NTO operand restart always trigger a TuneD restart, or was that accidental?
NTO operand "supervises" the TuneD daemon, so yes, that's the design. NTO operand restart will first stop TuneD and then start TuneD because the TuneD daemon will no longer be running. The NTO operand (golang code) shuts TuneD down as it shuts down.
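The supervision relationship described in the comment above can be pictured with a generic process-supervisor sketch. This is not the operand's Go code: `ToySupervisor` and `SLEEPER` are hypothetical, and the child is a sleeping Python process standing in for the daemon.

```python
import subprocess
import sys


class ToySupervisor:
    """Owns one child process for its whole lifetime (hypothetical model)."""

    def __init__(self, child_cmd):
        self.child_cmd = child_cmd
        self.child = None

    def start(self):
        # Start the supervised daemon if it is not already running.
        if self.child is None or self.child.poll() is not None:
            self.child = subprocess.Popen(self.child_cmd)

    def shutdown(self):
        # The supervisor takes the daemon down as it shuts down itself.
        if self.child is not None and self.child.poll() is None:
            self.child.terminate()
            self.child.wait(timeout=10)


# Placeholder child: a Python process that sleeps, standing in for the daemon.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(60)"]
```

Under this pattern a supervisor restart is a `shutdown()` on the old instance followed by a `start()` on a new one, so the child is always stopped and started again, which is the behavior described as being by design.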
good. This means my previous fix in this PR is wrong, but this also means I need to document that we should NOT recover the recommended profile, because this prevents TuneD from being restarted. I feel it should be more explicit, but this is probably stuff for another separate PR.
I'm afraid you lost me completely, but I agree that thorough documentation is a good start. I believe we lack some use cases for the "always" and "update" annotations so that this can be pre-merge tested by QE. This could also be potentially covered by e2e tests.
my latest push should fix the sysctl_d_override test but will break https://github.com/openshift/cluster-node-tuning-operator/pull/1129#issuecomment-2277436661 again. Working on the latter.
the last upload should fix all the known issues.
good: none of the current failures seems to be related to the deferred updates, whose tests all seem green!
bad: a bunch of new failures to debug before this PR can move forward
OTOH https://github.com/openshift/cluster-node-tuning-operator/pull/1131 is getting a different failure set :\
/test e2e-gcp-pao
/test e2e-aws-ovn-techpreview
/test e2e-gcp-pao-workloadhints
/test e2e-gcp-pao-updating-profile
/retest
e2e-hypershift:
Looks like issues not caused by this PR
FAIL: TestCreateCluster (4091.13s)
/retest
/hold for @MarSik and @yanirq review. (xref: https://github.com/openshift/cluster-node-tuning-operator/pull/1129#pullrequestreview-2234947331)
Thank you for the changes.
/lgtm
Found one typo, "change afeter", but that can be fixed later on if you prefer.
thanks, fixed
/lgtm
@ffromani: all tests passed!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ffromani, MarSik
/unhold
@ffromani: Jira Issue OCPBUGS-28647: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-28647 has been moved to the MODIFIED state.
[ART PR BUILD NOTIFIER]
Distgit: cluster-node-tuning-operator This PR has been included in build cluster-node-tuning-operator-container-v4.18.0-202408140944.p0.g3655f22.assembly.stream.el9. All builds following this will include this PR.
/cherry-pick release-4.17
@yanirq: new pull request created: #1138
To fully support the use case described in OCPBUGS-28647 and fix the issue, we need to further distinguish between first-time profile change and (in-place) profile change. This is required to better support a GitOps flow.
The key distinction is if the recommended profile changes or not, and there's a desire to defer application of changes only when a profile is updated (e.g. sysctl modified), not the first time it is applied.
Thus:
We change the way the annotation is used. We now require a value, which can be either