Closed jlebon closed 3 months ago
@jlebon: This pull request references Jira Issue OCPBUGS-33694, which is invalid:
Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Keeping as draft for now since this is not yet tested. @jbtrystram will go through the testing procedures.
/jira refresh
@jlebon: This pull request references Jira Issue OCPBUGS-33694, which is valid.
Requesting review from QA contact: /cc @mike-nguyen
I did some testing today with this patch, following these steps:
1 - Launch a cluster bot instance with OVN from 4.14.0
2 - Upgrade to the latest 4.14 (4.14.31)
The links are still there:
[core@ip-10-0-96-207 ~]$ ls -l /etc/systemd/system/network-online.target.requires/
lrwxrwxrwx. 1 root root 47 Jun 24 11:22 node-valid-hostname.service -> /etc/systemd/system/node-valid-hostname.service
0 - Cherry-pick this against 4.14: #4424
1 - Create a release image with cluster bot: build openshift/machine-config-operator#4424
2 - Launch a 4.14.0 cluster with OVN
3 - Update the cluster to the release built with this patch: oc adm upgrade --to-image=registry.build03.ci.openshift.org/ci-ln-0ct9g72/release
4 - In a debug pod:
ls -l /etc/systemd/system/network-online.target.requires/
# nothing
ls -l /etc/systemd/system/network-online.target.wants/ovs-configuration.service
# nothing
/lgtm
I did some more testing with this PR:
1 - launched a cluster with this change
2 - created the following machine config:
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-hello
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
        - contents: |
            [Unit]
            Description=A hello world unit!
            After=network-online.target
            Requires=network-online.target
            [Service]
            Type=oneshot
            RemainAfterExit=yes
            ExecStart=/usr/bin/echo "Hello, World!"
            [Install]
            WantedBy=default.target
          enabled: true
          name: hello.service
3 - apply the machine config and wait for deployment
4 - in a debug pod, confirm that the symlink /etc/systemd/system/default.target.wants/hello.service exists.
5 - update the previous machine config with:
...
[Install]
WantedBy=multi-user.target
...
6 - apply the machine config and wait for deployment
7 - in a debug pod, confirm that the symlink /etc/systemd/system/default.target.wants/hello.service does not exist.
8 - in a debug pod, confirm that the symlink /etc/systemd/system/multi-user.target.wants/hello.service exists.
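For anyone repeating these checks, here is a rough sketch of how the expected symlink paths could be derived from a unit's [Install] section instead of being typed by hand. It is only an illustration: the function names install_targets and expected_symlinks are made up for this example, the parser handles just plain WantedBy= and RequiredBy= lines, and it does not reflect how the MCO or systemd itself resolves enablement.

```python
from pathlib import Path

# Maps [Install] keys to the directory suffix systemd uses for the symlinks.
SUFFIX_FOR_KEY = {"WantedBy": "wants", "RequiredBy": "requires"}


def install_targets(unit_text):
    """Collect WantedBy=/RequiredBy= values from the [Install] section.

    Hypothetical helper: a simplified parser, not systemd's own.
    """
    targets = {"wants": [], "requires": []}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "Install" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        suffix = SUFFIX_FOR_KEY.get(key.strip())
        if suffix:
            # A single line may list several space-separated targets.
            targets[suffix].extend(value.split())
    return targets


def expected_symlinks(unit_name, unit_text, root="/etc/systemd/system"):
    """Return the set of symlink paths that enabling the unit should create."""
    paths = set()
    for suffix, names in install_targets(unit_text).items():
        for target in names:
            paths.add(Path(root) / f"{target}.{suffix}" / unit_name)
    return paths
```

Comparing expected_symlinks() for the old and new unit contents gives the paths that should disappear and the paths that should appear after the update, which is what steps 4, 7 and 8 check manually.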
Thanks so much @jbtrystram for testing!
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jlebon, travier, yuqi-zhang
@jlebon: Jira Issue OCPBUGS-33694: All pull requests linked via external trackers have merged:
Jira Issue OCPBUGS-33694 has been moved to the MODIFIED state.
@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/e2e-gcp-op-techpreview | 51c2df2be0f2354a3b67834de7a48de3f7133cbb | link | false | /test e2e-gcp-op-techpreview |
ci/prow/e2e-aws-ovn-upgrade-out-of-change | 51c2df2be0f2354a3b67834de7a48de3f7133cbb | link | false | /test e2e-aws-ovn-upgrade-out-of-change |
[ART PR BUILD NOTIFIER]
This PR has been included in build ose-machine-config-operator-container-v4.17.0-202406260443.p0.g5b05f00.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.
/cherry-pick 4.16
/cherry-pick release-4.16
@jbtrystram: cannot checkout 4.16: error checking out "4.16": exit status 1 error: pathspec '4.16' did not match any file(s) known to git
@jbtrystram: new pull request created: #4436
When overwriting a systemd unit with new content, we need to account for the case where the new unit content has a different [Install] section. If it does, then simply overwriting will leak the previous enablement symlinks, which then become node state. That's OK most of the time, but this can cause real issues as we've seen with the combination of #3967 which does exactly that (changing [Install] sections) and #4213 which assumed that those symlinks were cleaned up. More details on that cocktail in: https://issues.redhat.com/browse/OCPBUGS-33694?focusedId=24917003#comment-24917003

Fix this by always checking if the unit is currently enabled, and if so, running systemctl disable before overwriting its contents. The unit will then be re-enabled (or not) based on the MachineConfig.

Fixes: https://issues.redhat.com/browse/OCPBUGS-33694
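To make the leak concrete, here is a minimal sketch of the "disable before overwrite" idea on a plain directory tree. It is not the code from this PR: the commit runs systemctl disable, whereas this example only imitates the effect by removing symlinks named after the unit under *.wants and *.requires, so it can run without systemd. The function names enablement_symlinks and overwrite_unit are hypothetical.

```python
from pathlib import Path


def enablement_symlinks(unit_name, root="/etc/systemd/system"):
    """List symlinks named after the unit under <root>/*.wants and <root>/*.requires."""
    found = []
    for pattern in ("*.wants", "*.requires"):
        for dep_dir in sorted(Path(root).glob(pattern)):
            link = dep_dir / unit_name
            if link.is_symlink():
                found.append(link)
    return found


def overwrite_unit(unit_name, new_contents, root="/etc/systemd/system"):
    """Remove existing enablement symlinks, then write the new unit contents.

    Stand-in for "systemctl disable, then overwrite"; re-enabling the unit
    according to its new [Install] section is left to the caller.
    """
    removed = enablement_symlinks(unit_name, root)
    for link in removed:
        link.unlink()
    (Path(root) / unit_name).write_text(new_contents)
    return removed
```

Without the cleanup step, a unit whose [Install] section moves from one target to another would keep its old symlink on disk, which is the leaked state described above.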
daemon/update: add workaround for OCPBUGS-33694
Due to the bug detailed in the previous commit, we now have nodes out there that have stale enablement symlinks on-disk. It would be too risky to try to catch them all and clean them up, but at least let's clean up the ones that are known to be problematic.
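As an illustration of what a targeted cleanup can look like, here is a small sketch that removes only an explicit list of stale links and leaves everything else alone. The two paths are the ones checked in the testing comment in this thread; whether the commit targets exactly these is an assumption, and the names KNOWN_STALE_LINKS and remove_known_stale_links are invented for the example.

```python
from pathlib import Path

# Assumption: taken from the paths inspected during testing in this thread,
# not from the actual commit.
KNOWN_STALE_LINKS = (
    "network-online.target.requires/node-valid-hostname.service",
    "network-online.target.wants/ovs-configuration.service",
)


def remove_known_stale_links(root="/etc/systemd/system", links=KNOWN_STALE_LINKS):
    """Remove the listed paths only if they are symlinks; return what was removed."""
    removed = []
    for rel in links:
        link = Path(root) / rel
        # Regular files and missing paths are deliberately left untouched.
        if link.is_symlink():
            link.unlink()
            removed.append(link)
    return removed
```

Restricting the cleanup to an allowlist keeps the risk low: anything not on the list, including symlinks an admin created on purpose, is never touched.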
- What I did
See commit messages above.
- How to verify it
To verify the bug fix (first commit):
To verify the workaround (second commit):
- Description for the changelog
Stop leaking enablement symlinks when writing systemd units.