[BUG] Upgrade workaround for rancher-system-agent does not get cleaned up

starbops commented 8 months ago

Describe the bug

In #4149, we added a workaround that prevents the RKE2 server/agent from restarting on the nodes during the upgrade. Although a cleanup procedure was included, there are still loose ends.

When an upgrade fails at a later phase (specifically, after the apply manifest phase has been completed), the workaround above is not cleaned up. This is because the cleanup code only runs in the node upgrade phase, once for each node. As a result, the workaround, i.e., the override.conf for the rancher-system-agent system service, remains on the nodes, and the leftover files cause trouble for any later upgrade attempt.

To Reproduce

Upgrade from v1.1.2 to v1.2.1
Intentionally break the upgrade by removing the Upgrade CR once the upgrade has progressed beyond phase 3 (upgrading system service)
Verify that two files remain on the nodes under /run/systemd/system/rancher-system-agent.service.d/:
- override.conf
- 10-harvester-upgrade.env
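
The verification step above can be sketched as a small shell check to run on each node. This is only an illustration: the function name `find_workaround_leftovers` and the `DROPIN_DIR` variable are hypothetical and not part of Harvester; only the directory and the two file names come from this report.

```shell
# Directory named in this issue; overridable so the check can be tried elsewhere.
DROPIN_DIR="${DROPIN_DIR:-/run/systemd/system/rancher-system-agent.service.d}"

# Hypothetical helper: prints each leftover workaround file.
# Returns 0 if at least one file is present, 1 if none are.
find_workaround_leftovers() {
    found=1
    for name in override.conf 10-harvester-upgrade.env; do
        if [ -e "$DROPIN_DIR/$name" ]; then
            echo "leftover: $DROPIN_DIR/$name"
            found=0
        fi
    done
    return "$found"
}
```

Calling `find_workaround_leftovers` on a node after the broken upgrade should list both files if the bug is reproduced.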

Expected behavior

The files generated by the workaround shouldn't exist after the upgrade, even if the upgrade failed.

Support bundle

Environment

Harvester ISO version: upgrade from v1.1.2 to v1.2.1
Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context

The removal of the workaround could be applied more widely, not just restricted to versions before v1.2.0: https://github.com/harvester/harvester/blob/22d9e700e0e05231fd2b610584da43c0a3ce6fd9/package/upgrade/upgrade_node.sh#L279-L283

The workaround is to remove the /run/systemd/system/rancher-system-agent.service.d/ directory, run a systemd daemon reload, and restart rancher-system-agent on each node before starting the upgrade over.
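
A minimal sketch of that manual workaround, to be run as root on each node. The function name `cleanup_agent_workaround` and the `DROPIN_DIR` and `SYSTEMCTL` variables are hypothetical and exist only to make the steps explicit; the directory and the service name are the ones given above.

```shell
# Directory named in this issue; overridable for a dry run against a scratch path.
DROPIN_DIR="${DROPIN_DIR:-/run/systemd/system/rancher-system-agent.service.d}"
# Command used to talk to systemd; overridable so the steps can be rehearsed safely.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Hypothetical helper that performs the three manual steps in order.
cleanup_agent_workaround() {
    # 1. Remove the directory holding override.conf and 10-harvester-upgrade.env.
    rm -rf "$DROPIN_DIR"
    # 2. Reload systemd so it no longer applies the removed files.
    "$SYSTEMCTL" daemon-reload
    # 3. Restart the agent so it runs without the overrides.
    "$SYSTEMCTL" restart rancher-system-agent
}
```

Running `cleanup_agent_workaround` on every node, and only then starting the upgrade again, matches the order described above.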

starbops commented 8 months ago

The plan is to keep the workaround as a permanent part of the regular upgrade flow. We don't want the RKE2 server/agent to accidentally restart during the Rancher upgrade, so this is a way to protect them. That is to say, we'll remove the version check around installing and removing the rancher-system-agent system service files, so that both steps apply regardless of version.

harvesterhci-io-github-bot commented 8 months ago

Pre Ready-For-Testing Checklist

[x] ~~If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:~~
[x] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at: #4965 #4966
[x] Is there a workaround for the issue? If so, where is it documented? The workaround is at: #4965
[x] Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)? The PR is at: #4966
- [x] Does the PR include the explanation for the fix or the feature?
- [x] ~~Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:~~
[x] ~~If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:~~
[x] ~~If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:~~

[x] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
- ~~The automation skeleton PR is at:~~
- ~~The automation test case PR is at:~~
[x] ~~If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:~~

albinsun commented 6 months ago

Provide test result on v1.3-4372cbc4-head, close as fixed.

Environment

Harvester
- Version: v1.3-4372cbc4-head
- Profile: QEMU/KVM, 2 nodes (8C/16G/500G)
- ui-source: Auto

Steps

Set up a 2-node Harvester cluster
Upgrade to v1.3.0-rc3 and intentionally break the upgrade by removing the Upgrade CR once the upgrade has progressed beyond phase 3 (upgrading system service)
- Workaround files should still be there :heavy_check_mark:
  ![image](https://github.com/harvester/harvester/assets/2773781/d81d183f-de8e-491b-ad39-60ff8a73b48b)
Upgrade to v1.3.0-rc3 again
- Upgrade should succeed :heavy_check_mark:
- Workaround files should be cleaned up :heavy_check_mark:

harvesterhci-io-github-bot commented 5 months ago

added backport-needed/1.2.2 issue: #5380.
