Closed: starbops closed this issue 6 months ago
The plan is to make the workaround a permanent part of the regular upgrade flow. We don't want the RKE2 server/agent to accidentally restart during the Rancher upgrade, so this is a way to protect them. That is to say, we'll remove the version check for installing the workaround and always remove the `rancher-system-agent` systemd drop-in files afterward.
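A minimal sketch of what that unconditional cleanup could look like in the node upgrade script; the function name is hypothetical, but the drop-in path and the reload/restart steps are the ones described later in this issue:

```bash
# Hypothetical helper: always clean up the rancher-system-agent drop-in
# files left by the upgrade workaround, with no version check.
clean_rancher_system_agent_dropins() {
  local dropin_dir="/run/systemd/system/rancher-system-agent.service.d"

  # No-op when the workaround was never installed or is already cleaned up.
  [ -d "$dropin_dir" ] || return 0

  rm -rf "$dropin_dir"
  systemctl daemon-reload
  systemctl restart rancher-system-agent.service
}
```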
[x] ~~If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:~~
[x] Where are the reproduce steps/test steps documented? The reproduce steps/test steps are at: #4965 #4966
[x] Is there a workaround for the issue? If so, where is it documented? The workaround is at: #4965
[x] Has the backend code been merged (harvester, harvester-installer, etc.) (including backport-needed/*)? The PR is at: #4966
[x] Does the PR include the explanation for the fix or the feature?
[x] ~~Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:~~
[x] ~~If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:~~
[x] ~~If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:~~
[x] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
[x] ~~If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:~~
Test result provided on v1.3-4372cbc4-head; closing as fixed.
v1.3-4372cbc4-head (auto)

1. Set up a 2-node Harvester cluster.
2. Upgrade to v1.3.0-rc3 and intentionally break the upgrade by removing the Upgrade CR when the upgrade progress goes beyond phase 3 (upgrading system service).
3. Upgrade to v1.3.0-rc3 again.
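To confirm the fix, a check like the following can be run on each node after the intentionally broken first upgrade (a sketch; the expected state comes from the "Expected behavior" section below):

```bash
# The workaround's drop-in directory should be gone even though the
# first upgrade was interrupted on purpose.
if [ -e /run/systemd/system/rancher-system-agent.service.d ]; then
  echo "FAIL: leftover drop-in files found:"
  ls /run/systemd/system/rancher-system-agent.service.d
else
  echo "PASS: no leftover rancher-system-agent drop-in files"
fi
```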
Added backport-needed/1.2.2. Backport issue: #5380.
Describe the bug
In #4149, we added a workaround to avoid restarting the RKE2 server/agent on the nodes during the upgrade. Although we thought through the cleanup procedure, there are still loose ends.

When an upgrade fails at a later phase (specifically, after the apply-manifest phase has completed), the workaround above is not cleaned up, because the cleanup code only runs in the node upgrade phase for each node. So the workaround, i.e., the `override.conf` for the `rancher-system-agent` systemd service, remains on the nodes. That will cause the next upgrade to get into trouble.

To Reproduce
After a failed upgrade, the following files remain under `/run/systemd/system/rancher-system-agent.service.d/`:

- `override.conf`
- `10-harvester-upgrade.env`
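On an affected node, listing the drop-in directory shows the leftovers (illustrative; the two file names are the ones listed above):

```bash
# Observe the stale workaround files after the failed upgrade.
ls /run/systemd/system/rancher-system-agent.service.d/
# Expected output when the bug reproduces:
#   10-harvester-upgrade.env
#   override.conf
```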
Expected behavior
The files generated by the workaround shouldn't exist after the upgrade, even if the upgrade failed.
Support bundle
Environment
Additional context
The removal of the workaround could be applied more widely, not just restricted to versions before v1.2.0: https://github.com/harvester/harvester/blob/22d9e700e0e05231fd2b610584da43c0a3ce6fd9/package/upgrade/upgrade_node.sh#L279-L283
The workaround is to remove the `/run/systemd/system/rancher-system-agent.service.d/` directory, do a daemon reload, and restart `rancher-system-agent` on each node before starting the upgrade over.
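Concretely, the manual recovery boils down to three commands per node (a sketch of the steps described above):

```bash
# Run on every node before retrying the upgrade.
sudo rm -rf /run/systemd/system/rancher-system-agent.service.d/
sudo systemctl daemon-reload
sudo systemctl restart rancher-system-agent.service
```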