harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0
3.91k stars 329 forks source link

[BUG] A pending fail pod from different hub is generated after upgrade rancher #5382

Closed albinsun closed 1 month ago

albinsun commented 8 months ago

Describe the bug Based on harvester v1.3.0, by performing a Rancher upgrade form v2.7.11 to v2.8.3-rc4, we found that a new generated pod is stuck in pending fail.

Note that new one comes from a different image rancher/harvester-cloud-provider:v0.2.0 compare to the running one registry.rancher.com/rancher/harvester-cloud-provider:v0.2.0 image

To Reproduce

  1. Setup RKE2 cluster based on Harvester v1.3.0 + Rancher v2.7.11

    • :green_circle: Create RKE2 cluster ![image](https://github.com/harvester/harvester/assets/2773781/a6f0d767-c42f-4e19-af74-466d86bc2106)
    • :green_circle: Check deployment with pvc ![image](https://github.com/harvester/harvester/assets/2773781/6dfbfd32-13e0-418a-bb87-98df9ac13484)
    • :green_circle: Check LB ![image](https://github.com/harvester/harvester/assets/2773781/51a6c841-af92-46e6-aa5c-c60074ffe6d1)
  2. Upgrade Rancher to v2.8.3-rc4

    • :green_circle: Cluster is Active ![image](https://github.com/harvester/harvester/assets/2773781/59ee4261-9510-47cc-84b0-be7d0238dea1)
    • :red_circle: Apps are in health ${\color{red} \text{0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..}}$ ![image](https://github.com/harvester/harvester/assets/2773781/d2486c64-1738-4e15-9042-93fb2ac65728)
    • :green_circle: Check deployment with pvc ![image](https://github.com/harvester/harvester/assets/2773781/18317f50-4b3c-463c-a3c1-2847d14f3912)
    • :green_circle: Check LB ![image](https://github.com/harvester/harvester/assets/2773781/dadee14d-f807-454b-aa38-a12e990eb759)

Expected behavior harvester-cloud-provider should be in health after upgrade

Support bundle

  1. Harvester: supportbundle_cloudproviderhub_2024-03-18T06-38-10Z.zip
  2. Rancher: rancher-2024-03-18_06_41_29.tar.gz

Environment

albinsun commented 7 months ago

FYI, hit similiar issue during v1.2.1 but with opposite image source

Context (Rancher Community -> Prime)

  1. harvester-v1.2.1 + rancher-v2.7.10
  2. Upgrade Rancher to v2.7.11 (Prime)

image

image

albinsun commented 7 months ago

Workaround

[!NOTE] Delete the old pod and let the rolling update continue.

image

bk201 commented 6 months ago

cc @starbops, we need to backport https://github.com/harvester/charts/pull/239 to the release branch to fix this problem.

harvesterhci-io-github-bot commented 5 months ago

Pre Ready-For-Testing Checklist

albinsun commented 1 month ago

Test basically fine using Rancher marketplace to update harvester-cloud-provider.

Per discussed with @starbops, there are mainly 2 paths for harvester-cloud-provider chart

  1. Rancher Marketplace
  2. RKE2 self-contained

This ticket fix 1) but user may still hit via 2), no sure if we can inform RKE2 to use latest harvester-cloud-provider when they release or some other cross-team integrations?

cc. @bk201 and @FrankYang0529

Environment

Test Case

:green_circle: Upgrade harvester-cloud-provider via Marketplace

  1. Import Harvester to rancher-v2.8.8 ![image](https://github.com/user-attachments/assets/37828990-7501-4920-bce7-48eae85884d2)
  2. Create v1.28.13+rke2r1 cluster ![image](https://github.com/user-attachments/assets/4a57e9eb-fd87-420e-a32d-6ab12eb5c4f1)
  3. Check harvester-cloud-provider app version less than v0.2.5 * `harvester-cloud-provider-v0.2.4` ![image](https://github.com/user-attachments/assets/a0a45e5f-9706-43d2-a8ba-0e6ecf55c644)
  4. Upgrade harvester-cloud-provider to v0.2.5+ * Upgrade to `103.0.3-up0.2.6` ![image](https://github.com/user-attachments/assets/9d9e89ec-621b-4db6-8835-ecd120251559) ![image](https://github.com/user-attachments/assets/9cac5698-b75d-4165-9596-9b9902863e7d)
  5. Should NOT have pending fail harvester-cloud-provider pod ![image](https://github.com/user-attachments/assets/5676b505-e10d-4d25-87ae-a3787c21a775)

:yellow_circle: harvester-cloud-provider may rollout when upgrade Rancher

  1. Import Harvester to rancher-v2.8.8 ![image](https://github.com/user-attachments/assets/7f408a94-9010-43aa-a3b0-e0a8625f02a2)
  2. Create v1.27.16 +rke2r2 cluster ![image](https://github.com/user-attachments/assets/514de86f-dc05-4d31-a510-80486357bc14)
  3. Check harvester-cloud-provider version less than v0.2.5 ![image](https://github.com/user-attachments/assets/f7e58ff2-f9f7-472c-8bd5-3c833bab357c)
  4. Deploy Nginx w/ PVC ![image](https://github.com/user-attachments/assets/e66e61bd-63a7-406d-9d32-6597bdd26b28)
  5. Upgrade Rancher to v2.9.3-alpha5 ![image](https://github.com/user-attachments/assets/39dfcbbc-d0d0-49b9-ab62-e0d29bcccbf7)
  6. Check harvester-cloud-provider is v0.2.5+ * :warning: `rancher-v2.9.3-alpha5` still use 0.2.4 by default. ![image](https://github.com/user-attachments/assets/03a02b3a-ea0d-4a15-8d87-db928e66b5bf) * Manually upgrade to v0.2.6 ![image](https://github.com/user-attachments/assets/5409b14e-1da6-43a5-a507-b7ecbaf9d9e1) ![image](https://github.com/user-attachments/assets/2ef2d99a-56ca-4632-8940-f0c930a7ec7e)
starbops commented 1 month ago

I noticed that the harvester-cloud-provider's image changed from registry.rancher.com/rancher/harvester-cloud-provider:v0.2.1 to rancher/harvester-cloud-provider:v0.2.1. That's the reason why the deployment was rolled out during the Rancher upgrade, but I'm unsure what caused the difference.