openshift / machine-config-operator

OCPBUGS-33694: daemon/update: disable systemd unit before overwriting #4421

Closed jlebon closed 3 months ago

jlebon commented 3 months ago

When overwriting a systemd unit with new content, we need to account for the case where the new unit content has a different `[Install]` section. If it does, simply overwriting the unit will leak the previous enablement symlinks, which then become node state. That's OK most of the time, but it can cause real issues, as we've seen with the combination of #3967, which does exactly that (changing `[Install]` sections), and #4213, which assumed that those symlinks were cleaned up. More details on that cocktail in:

https://issues.redhat.com/browse/OCPBUGS-33694?focusedId=24917003#comment-24917003

Fix this by always checking if the unit is currently enabled, and if so, running `systemctl disable` *before* overwriting its contents. The unit will then be re-enabled (or not) based on the MachineConfig.
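
To illustrate the leak and the fix, here is a minimal sketch that models enablement symlinks in a temporary directory. It is a toy model, not the MCO code: `enable`, `disable`, `write_unit` and `run` are hypothetical helpers, only `WantedBy=` and `RequiredBy=` are handled, and real `systemctl` does considerably more (presets, `Alias=`, `Also=`).

```python
import os
import tempfile


def install_targets(unit_text):
    """Yield (dir_suffix, target) pairs declared in the [Install] section."""
    in_install = False
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_install = line == "[Install]"
        elif in_install and "=" in line:
            key, value = line.split("=", 1)
            suffix = {"WantedBy": "wants", "RequiredBy": "requires"}.get(key.strip())
            if suffix:
                for target in value.split():
                    yield suffix, target


def enablement_links(root, name):
    """Return every '<dir>/<name>' symlink under root, sorted by directory."""
    return [f"{d}/{name}" for d in sorted(os.listdir(root))
            if os.path.islink(os.path.join(root, d, name))]


def enable(root, name):
    """Toy `systemctl enable`: add symlinks for the unit currently on disk."""
    unit_path = os.path.join(root, name)
    with open(unit_path) as f:
        text = f.read()
    for suffix, target in install_targets(text):
        directory = os.path.join(root, f"{target}.{suffix}")
        os.makedirs(directory, exist_ok=True)
        link = os.path.join(directory, name)
        if not os.path.islink(link):
            os.symlink(unit_path, link)


def disable(root, name):
    """Toy `systemctl disable`: drop every enablement symlink for the unit."""
    for rel in enablement_links(root, name):
        os.unlink(os.path.join(root, rel))


def write_unit(root, name, text, disable_first):
    """Overwrite a unit and re-enable it, optionally disabling it first."""
    if disable_first and enablement_links(root, name):
        disable(root, name)
    with open(os.path.join(root, name), "w") as f:
        f.write(text)
    enable(root, name)


def run(disable_first):
    """Write a unit, change its [Install] section, and report the symlinks left."""
    with tempfile.TemporaryDirectory() as root:
        write_unit(root, "hello.service",
                   "[Install]\nWantedBy=default.target\n", disable_first)
        write_unit(root, "hello.service",
                   "[Install]\nWantedBy=multi-user.target\n", disable_first)
        return enablement_links(root, "hello.service")


if __name__ == "__main__":
    print("overwrite only:     ", run(disable_first=False))
    print("disable, then write:", run(disable_first=True))
```

Without the disable step, the old `default.target.wants` link survives next to the new one. With it, only the link that matches the current `[Install]` section remains.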

Fixes: https://issues.redhat.com/browse/OCPBUGS-33694


daemon/update: add workaround for OCPBUGS-33694

Due to the bug detailed in the previous commit, we now have nodes out there that have stale enablement symlinks on-disk. It would be too risky to try to catch them all and clean them up, but at least let's clean up the ones that are known to be problematic.
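
A targeted cleanup of this kind could look roughly like the sketch below. The two paths are taken from the test report later in this thread; whether the actual workaround uses exactly this list, or checks further conditions before deleting, is an assumption. `remove_known_stale_links` is a hypothetical name, and `root` lets the sketch run against a scratch directory instead of a real node.

```python
import os
import tempfile

# Assumption: paths borrowed from the test report in this thread; the list
# used by the real workaround may differ. They are relative so that they can
# be joined onto an arbitrary root directory.
KNOWN_STALE_LINKS = [
    "etc/systemd/system/network-online.target.requires/node-valid-hostname.service",
    "etc/systemd/system/network-online.target.wants/ovs-configuration.service",
]


def remove_known_stale_links(root, candidates=KNOWN_STALE_LINKS):
    """Delete only the listed paths, and only when they are symlinks.

    Returns the relative paths that were actually removed.
    """
    removed = []
    for rel in candidates:
        path = os.path.join(root, rel)
        if os.path.islink(path):
            os.unlink(path)
            removed.append(rel)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        stale = os.path.join(root, KNOWN_STALE_LINKS[0])
        os.makedirs(os.path.dirname(stale))
        # A dangling symlink is enough to model a leaked enablement link.
        os.symlink("/etc/systemd/system/node-valid-hostname.service", stale)
        print(remove_known_stale_links(root))
```

Limiting the cleanup to an explicit allowlist mirrors the reasoning above: anything not on the list is left alone, including regular files that happen to sit at a listed path.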


- What I did

See commit messages above.

- How to verify it

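To verify the bug fix (first commit):

- Boot to an OCP version that doesn't have https://github.com/openshift/machine-config-operator/pull/3967.
- Upgrade to an OCP version that has the bug fix (and implicitly also has https://github.com/openshift/machine-config-operator/pull/3967).
- Verify that there are no stale enablement symlinks.

To verify the workaround (second commit):

- Boot to an OCP version that doesn't have https://github.com/openshift/machine-config-operator/pull/3967.
- Upgrade to an OCP version that does have https://github.com/openshift/machine-config-operator/pull/3967.
- Upgrade to an OCP version that has the workaround.
- Verify that the stale enablement symlinks are removed.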

- Description for the changelog

Stop leaking enablement symlinks when writing systemd units.

openshift-ci-robot commented 3 months ago

@jlebon: This pull request references Jira Issue OCPBUGS-33694, which is invalid:

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/machine-config-operator/pull/4421):

> When overwriting a systemd unit with new content, we need to account for the case where the new unit content has a different `[Install]` section. If it does, then simply overwriting will leak the previous enablement symlinks and become node state. That's OK most of the time, but this can cause real issues as we've seen with the combination of #3967 which does exactly that (changing `[Install]` sections) and #4213 which assumed that those symlinks were cleaned up. More details on that cocktail in:
>
> https://issues.redhat.com/browse/OCPBUGS-33694?focusedId=24917003#comment-24917003
>
> Fix this by always checking if the unit is currently enabled, and if so, running `systemctl disable` *before* overwriting its contents. The unit will then be re-enabled (or not) based on the MachineConfig.
>
> Fixes: https://issues.redhat.com/browse/OCPBUGS-33694
>
> ---
>
> daemon/update: add workaround for OCPBUGS-33694
>
> Due to the bug detailed in the previous commit, we now have nodes out there that have stale enablement symlinks on-disk. It would be too risky to try to catch them all and clean them up, but at least let's clean up the ones that are known to be problematic.
>
> ---
>
> **- What I did**
>
> See commit messages above.
>
> **- How to verify it**
>
> To verify the bug fix (first commit):
>
> - Boot to an OCP version that doesn't have https://github.com/openshift/machine-config-operator/pull/3967.
> - Upgrade to an OCP version that has the bug fix (and implicitly also has https://github.com/openshift/machine-config-operator/pull/3967).
> - Verify that the service is disabled and stale enablement symlinks are removed.
>
> To verify the workaround (second commit):
>
> - Boot to an OCP version that doesn't have https://github.com/openshift/machine-config-operator/pull/3967.
> - Upgrade to an OCP version that does have https://github.com/openshift/machine-config-operator/pull/3967.
> - Upgrade to an OCP version that has the workaround.
> - Verify that the stale enablement symlinks are removed.
>
> **- Description for the changelog**
>
> Stop leaking enablement symlinks when writing systemd units.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fmachine-config-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 3 months ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

jlebon commented 3 months ago

Keeping as draft for now since this is not yet tested. @jbtrystram will go through the testing procedures.

jlebon commented 3 months ago

/jira refresh

openshift-ci-robot commented 3 months ago

@jlebon: This pull request references Jira Issue OCPBUGS-33694, which is valid.

3 validation(s) were run on this bug:

* bug is open, matching expected state (open)
* bug target version (4.17.0) matches configured target version for branch (4.17.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @mike-nguyen

In response to [this](https://github.com/openshift/machine-config-operator/pull/4421#issuecomment-2183119826):

> /jira refresh

jbtrystram commented 3 months ago

I did some testing today with this patch, following these steps:

Current state:

1 - launch a cluster-bot instance with OVN from 4.14.0
2 - upgrade to the latest 4.14 (4.14.31). The links are still there:

[core@ip-10-0-96-207 ~]$ ls -l /etc/systemd/system/network-online.target.requires/
lrwxrwxrwx. 1 root root 47 Jun 24 11:22 node-valid-hostname.service -> /etc/systemd/system/node-valid-hostname.service

With this patch:

0 - cherry-pick this against 4.14: #4424
1 - create a release image with cluster bot: `build openshift/machine-config-operator#4424`
2 - launch a 4.14.0 cluster with OVN
3 - update the cluster to the release built with this patch: `oc adm upgrade --to-image=registry.build03.ci.openshift.org/ci-ln-0ct9g72/release`
4 - in a debug pod:

ls -l /etc/systemd/system/network-online.target.requires/
# nothing
ls -l  /etc/systemd/system/network-online.target.wants/ovs-configuration.service
#nothing
travier commented 3 months ago

/lgtm

jbtrystram commented 3 months ago

I did some more testing with this PR:

1 - launched a cluster with this change
2 - created the following machine config:

# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-hello
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
        - contents: |
            [Unit]
            Description=A hello world unit!
            After=network-online.target
            Requires=network-online.target
            [Service]
            Type=oneshot
            RemainAfterExit=yes
            ExecStart=/usr/bin/echo "Hello, World!"
            [Install]
            WantedBy=default.target
          enabled: true
          name: hello.service

3 - apply the machine config and wait for deployment
4 - in a debug pod, confirm that the symlink /etc/systemd/system/default.target.wants/hello.service exists
5 - update the previous machine config with:

...
            [Install]
            WantedBy=multi-users.target
...

6 - apply the machine config and wait for deployment
7 - in a debug pod, confirm that the symlink /etc/systemd/system/default.target.wants/hello.service doesn't exist
8 - in a debug pod, confirm that the symlink /etc/systemd/system/multi-users.target.wants/hello.service exists
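
The manual checks in steps 4, 7 and 8 can be approximated with a small script. This is a sketch under assumptions: `expected_links` and `stale_links` are hypothetical helpers, only `WantedBy=` and `RequiredBy=` are considered, target names are treated as opaque strings, and `system_dir` stands in for /etc/systemd/system so the script can be tried against a scratch directory.

```python
import os
import tempfile


def expected_links(unit_name, unit_text):
    """Return the '<target>.<wants|requires>/<unit>' paths implied by [Install]."""
    links, in_install = [], False
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            in_install = line == "[Install]"
            continue
        if not in_install or "=" not in line or line.startswith(("#", ";")):
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        suffix = {"WantedBy": "wants", "RequiredBy": "requires"}.get(key)
        if suffix:
            links.extend(f"{target}.{suffix}/{unit_name}" for target in value.split())
    return sorted(links)


def stale_links(system_dir, unit_name, unit_text):
    """Return enablement symlinks on disk that the current [Install] does not imply."""
    expected = set(expected_links(unit_name, unit_text))
    stale = []
    for entry in sorted(os.listdir(system_dir)):
        if not entry.endswith((".wants", ".requires")):
            continue
        rel = f"{entry}/{unit_name}"
        if os.path.islink(os.path.join(system_dir, entry, unit_name)) and rel not in expected:
            stale.append(rel)
    return stale


if __name__ == "__main__":
    unit = "[Install]\nWantedBy=multi-users.target\n"
    with tempfile.TemporaryDirectory() as system_dir:
        # Model a node where the old default.target link was leaked.
        for target in ("default.target", "multi-users.target"):
            os.makedirs(os.path.join(system_dir, f"{target}.wants"))
            os.symlink("/etc/systemd/system/hello.service",
                       os.path.join(system_dir, f"{target}.wants", "hello.service"))
        print(stale_links(system_dir, "hello.service", unit))
```

An empty result from `stale_links` corresponds to the outcome confirmed in steps 7 and 8: the old link is gone and only the link for the current target remains.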

jlebon commented 3 months ago

Thanks so much @jbtrystram for testing!

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlebon, travier, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/machine-config-operator/blob/master/OWNERS)~~ [yuqi-zhang]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

openshift-ci-robot commented 3 months ago

@jlebon: Jira Issue OCPBUGS-33694: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33694 has been moved to the MODIFIED state.

openshift-ci[bot] commented 3 months ago

@jlebon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-op-techpreview | 51c2df2be0f2354a3b67834de7a48de3f7133cbb | link | false | `/test e2e-gcp-op-techpreview` |
| ci/prow/e2e-aws-ovn-upgrade-out-of-change | 51c2df2be0f2354a3b67834de7a48de3f7133cbb | link | false | `/test e2e-aws-ovn-upgrade-out-of-change` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
openshift-bot commented 3 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-machine-config-operator-container-v4.17.0-202406260443.p0.g5b05f00.assembly.stream.el9 for distgit ose-machine-config-operator. All builds following this will include this PR.

jbtrystram commented 3 months ago

/cherry-pick 4.16

jbtrystram commented 3 months ago

/cherry-pick release-4.16

openshift-cherrypick-robot commented 3 months ago

@jbtrystram: cannot checkout 4.16: error checking out "4.16": exit status 1 error: pathspec '4.16' did not match any file(s) known to git

In response to [this](https://github.com/openshift/machine-config-operator/pull/4421#issuecomment-2190915173):

> /cherry-pick 4.16

openshift-cherrypick-robot commented 3 months ago

@jbtrystram: new pull request created: #4436

In response to [this](https://github.com/openshift/machine-config-operator/pull/4421#issuecomment-2190916380):

> /cherry-pick release-4.16