harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Upgrade workaround for rancher-system-agent does not get cleaned up #4965

Closed starbops closed 6 months ago

starbops commented 8 months ago

Describe the bug

In #4149, we added a workaround for not restarting the RKE2 server/agent on the nodes during the upgrade. Though we have thought of the cleanup procedure, there are still loose ends...

When an upgrade fails at a later phase (specifically, after the apply manifest phase has completed), the workaround above is not cleaned up. This is because the cleanup code only runs during the node upgrade phase of each node, which a failed upgrade may never reach. As a result, the workaround files, i.e., the override.conf for the rancher-system-agent system service, remain on the nodes and cause trouble for later upgrades.

To Reproduce

  1. Upgrade from v1.1.2 to v1.2.1
  2. Intentionally break the upgrade by removing the Upgrade CR when the upgrade progress goes beyond phase 3 (upgrading system service)
  3. Verify that two files remain on the nodes under /run/systemd/system/rancher-system-agent.service.d/:
    • override.conf
    • 10-harvester-upgrade.env
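
The check in step 3 can be scripted. The following is a minimal bash sketch, not part of Harvester: the function name `leftover_workaround_files` is hypothetical, and only the directory and the two file names come from this issue. It prints each workaround file that still exists and returns 0 if at least one was found.

```shell
# Sketch: report rancher-system-agent workaround files left on a node.
# The function name is hypothetical; the paths are the ones listed in this issue.
leftover_workaround_files() {
  local dir="${1:-/run/systemd/system/rancher-system-agent.service.d}"
  local f
  local status=1   # 1 = nothing found, 0 = at least one leftover file

  for f in override.conf 10-harvester-upgrade.env; do
    if [ -e "$dir/$f" ]; then
      echo "$dir/$f"
      status=0
    fi
  done
  return "$status"
}
```

Run on each node, for example `leftover_workaround_files && echo "workaround still present"`.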

Expected behavior

The files generated by the workaround shouldn't exist after the upgrade, even if the upgrade failed.

Support bundle

Environment

Additional context

The removal of the workaround could be applied more widely, not just restricted to versions before v1.2.0: https://github.com/harvester/harvester/blob/22d9e700e0e05231fd2b610584da43c0a3ce6fd9/package/upgrade/upgrade_node.sh#L279-L283

The workaround is to remove the /run/systemd/system/rancher-system-agent.service.d/ directory, run a systemd daemon reload, and restart rancher-system-agent on each node before starting the upgrade over.
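
The manual cleanup described above could look roughly like the following bash sketch. It is an illustration, not the code in upgrade_node.sh: the function name `cleanup_agent_workaround` and the `SYSTEMCTL` override variable (there only to allow a dry run) are assumptions, while the directory path and the three steps come from this issue.

```shell
# Sketch: remove the leftover rancher-system-agent drop-in directory on one node.
# SYSTEMCTL is a hypothetical hook; set SYSTEMCTL=echo to dry-run the systemd calls.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

cleanup_agent_workaround() {
  local dir="${1:-/run/systemd/system/rancher-system-agent.service.d}"

  if [ ! -d "$dir" ]; then
    echo "nothing to clean up: $dir does not exist"
    return 0
  fi

  # Remove the drop-in directory, including override.conf and 10-harvester-upgrade.env.
  rm -rf -- "$dir"

  # Reload unit definitions so systemd drops the override, then restart the agent.
  "$SYSTEMCTL" daemon-reload
  "$SYSTEMCTL" restart rancher-system-agent
  echo "cleaned up: $dir"
}
```

Run `cleanup_agent_workaround` as root on every node before retrying the upgrade. The early return makes it safe to run on nodes where the files were already removed.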

starbops commented 8 months ago

The plan is to keep the workaround as a part of the regular upgrade flow. We don't want the RKE2 server/agent to accidentally restart during the Rancher upgrade, so this is a way to protect them. That is to say, we'll remove the version check around installing and removing the rancher-system-agent system service files, so both steps apply to every upgrade.

harvesterhci-io-github-bot commented 8 months ago

Pre Ready-For-Testing Checklist

albinsun commented 6 months ago

Test result on v1.3-4372cbc4-head below; closing as fixed.

Environment

Steps

  1. Set up a 2-node Harvester cluster

  2. Upgrade to v1.3.0-rc3 and intentionally break the upgrade by removing the Upgrade CR when the upgrade progress goes beyond phase 3 (upgrading system service)

    • Workaround files should still be there :heavy_check_mark: ![image](https://github.com/harvester/harvester/assets/2773781/d81d183f-de8e-491b-ad39-60ff8a73b48b)
  3. Upgrade to v1.3.0-rc3 again

    • Upgrade should succeed :heavy_check_mark:
    • Workaround files should be cleaned up :heavy_check_mark:


harvesterhci-io-github-bot commented 5 months ago

added backport-needed/1.2.2 issue: #5380.