harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Upgrade from v1.0.3 to v1.1.1-rc1, Harvester cluster stuck in pre-draining the second node #3164

Closed TachunLin closed 5 months ago

TachunLin commented 1 year ago

Describe the bug

Built a v1.0.3 Harvester cluster on nodes with fixed IPv4 and dynamic IPv6 addresses, then upgraded to v1.1.1-rc1. The upgrade got stuck while pre-draining the second node.

To Reproduce

Steps to reproduce the behavior:

  1. Open the KVM virtual machine manager

  2. Open the Connection Details -> Virtual Networks

  3. Create a new virtual network workload

  4. Add the following XML content, which reserves fixed IPs for three specific MAC addresses (a virsh sketch for creating this network follows these steps)

    <network>
    <name>workload</name>
    <uuid>ac62e6bf-6869-41a9-a2b7-25c06c7601c9</uuid>
    <forward mode="nat">
      <nat>
        <port start="1024" end="65535"/>
      </nat>
    </forward>
    <bridge name="virbr5" stp="on" delay="0"/>
    <mac address="52:54:00:7b:ed:99"/>
    <domain name="workload"/>
    <ip address="192.168.101.1" netmask="255.255.255.0">
      <dhcp>
        <range start="192.168.101.128" end="192.168.101.254"/>
        <host mac="52:54:00:de:04:4c" name="nic1" ip="192.168.101.184"/>
        <host mac="52:54:00:39:1a:70" name="nic2" ip="192.168.101.185"/>
        <host mac="52:54:00:a8:3c:60" name="nic3" ip="192.168.101.186"/>
      </dhcp>
    </ip>
    <ip family="ipv6" address="fd7d:844d:3e17:f3ae::1" prefix="64">
      <dhcp>
        <range start="fd7d:844d:3e17:f3ae::100" end="fd7d:844d:3e17:f3ae::1ff"/>
      </dhcp>
    </ip>
    </network>
  5. Change the bridge name to a new one

  6. Create a VM and attach it to the ipv6-enabled network (workload)

  7. Launch v1.0.3 Harvester ISO installer to create the first node

  8. Select DHCP node IP and DHCP VIP during the installation; the node will pick up its fixed IP from the DHCP reservation

  9. Create two more VMs attached to the same network (workload)

  10. Launch the v1.0.3 Harvester ISO installer to join the second and third nodes

  11. Select DHCP node IP and DHCP VIP during the installation; each node will pick up its fixed IP from the DHCP reservation

  12. Enable the VLAN network on the harvester-mgmt interface and create a vlan 1 network

  13. Create several images

  14. Create two VMs, one using harvester-mgmt and the other using the vlan 1 network

  15. Back up a VM to S3

  16. Shutdown all VMs

  17. Offline upgrade to the v1.1.1-rc1 release; refer to https://docs.harvesterhci.io/v1.1/upgrade/automatic
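
For steps 2 to 5, the workload network can also be created with virsh instead of the virt-manager UI. A minimal sketch, assuming the XML above is saved as workload.xml (the network and bridge names come from that XML):

# define and start the NAT network that serves the fixed IPv4 and dynamic IPv6 leases
virsh net-define workload.xml      # register the "workload" network from the XML above
virsh net-start workload           # create the virbr5 bridge and bring the network up
virsh net-autostart workload       # optional: start the network on host boot
virsh net-dhcp-leases workload     # after installing the nodes, check that they got the expected IPs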

Expected behavior

The upgrade completes on all nodes instead of getting stuck while pre-draining the second node.

Support bundle

supportbundle_5a745ecd-d809-4871-b688-b24aa8fcde96_2022-11-17T07-26-16Z.zip


starbops commented 1 year ago

Here are some findings from the crime scene:

  1. n1-103 is upgraded, n2-103 is stuck in the pre-draining, and n3-103 is still in the image-preloaded state
...
  Node Statuses:
    n1-103:
      State:  Succeeded
    n2-103:
      State:  Pre-draining
    n3-103:
      State:         Images preloaded
...
n1-103:~ # k -n harvester-system get jobs
NAME                                   COMPLETIONS   DURATION   AGE
default-vlan1                          1/1           9s         31h
harvester-promote-n2-103               1/1           100s       34h
harvester-promote-n3-103               1/1           96s        34h
hvst-upgrade-8kpdd-apply-manifests     1/1           13m        30h
hvst-upgrade-8kpdd-post-drain-n1-103   1/1           5m11s      30h
hvst-upgrade-8kpdd-pre-drain-n1-103    1/1           17s        30h
hvst-upgrade-8kpdd-pre-drain-n2-103    0/1           29h        29h
  2. n2-103 is cordoned by Rancher
n1-103:~ # k get no
NAME     STATUS                     ROLES                       AGE   VERSION
n1-103   Ready                      control-plane,etcd,master   34h   v1.24.7+rke2r1
n2-103   Ready,SchedulingDisabled   control-plane,etcd,master   34h   v1.22.12+rke2r1
n3-103   Ready                      control-plane,etcd,master   34h   v1.22.12+rke2r1
  3. The pre-drain pod is waiting for all the Longhorn volumes to become healthy
+ '[' true ']'
+ '[' 3 -gt 2 ']'
++ kubectl get volumes.longhorn.io/pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 -n longhorn-system -o 'jsonpath={.status.robustness}'
+ robustness=degraded
+ '[' degraded = healthy ']'
+ '[' -f /tmp/skip-pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 ']'
Waiting for volume pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 to be healthy...
+ echo 'Waiting for volume pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 to be healthy...'
+ sleep 10
  4. All the attached Longhorn volumes are in a degraded state
n1-103:~ # k -n longhorn-system get lhv
NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   degraded                 53687091200   n1-103   34h
pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   31h
pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   degraded                 10485760      n3-103   34h
pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   degraded                 5368709120    n3-103   30h
pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            31h
pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            31h
  5. All the attached Longhorn volumes have a ReplicaSchedulingFailure condition
...
    Last Transition Time:  2022-11-16T08:22:32Z
    Message:               replica scheduling failed
    Reason:                ReplicaSchedulingFailure
    Status:                False
    Type:                  scheduled
...
  6. There are no instance-manager-r pods running on n2-103, and the corresponding instancemanager CRs are in an error state
n1-103:~ # k -n longhorn-system get po -o wide | grep instance-manager
instance-manager-e-2d7f17cb                    1/1     Running   0             30h   10.52.2.35        n3-103   <none>           <none>
instance-manager-e-bf28fb23                    1/1     Running   0             29h   10.52.0.139       n1-103   <none>           <none>
instance-manager-r-11da068c                    1/1     Running   0             30h   10.52.2.33        n3-103   <none>           <none>
instance-manager-r-266475ff                    1/1     Running   0             29h   10.52.0.140       n1-103   <none>           <none>
n1-103:~ # k -n longhorn-system get instancemanagers
NAME                          STATE     TYPE      NODE     AGE
instance-manager-e-2d7f17cb   running   engine    n3-103   30h
instance-manager-e-bf28fb23   running   engine    n1-103   30h
instance-manager-e-e93877c4   error     engine    n2-103   30h
instance-manager-r-11da068c   running   replica   n3-103   30h
instance-manager-r-266475ff   running   replica   n1-103   30h
instance-manager-r-da2d04f7   error     replica   n2-103   30h

So there seems to be a deadlock: the pre-drain pod waits for the volumes to become healthy. Those volumes can only become healthy once the corresponding replicas can be spawned on n2-103, which is already cordoned ...

The volume check during pre-draining is there to ensure data availability and minimize the risk of losing data: it assumes all the Longhorn replicas are running well on a certain number of nodes (3 by default) before a node gets drained. The crux is why the original instance-manager-r pod on n2-103 disappeared while pre-draining, since pre-draining only migrates or shuts down VMs and does not touch the other containers on the node.
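
The waiting logic visible in the trace above behaves roughly like the loop below (a reconstruction from the set -x output for a single volume, omitting the replica-count check; the real upgrade script may differ):

# reconstructed sketch: wait until the volume is healthy, unless a skip file is created
vol=pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4
while true; do
  robustness=$(kubectl -n longhorn-system get "volumes.longhorn.io/$vol" -o 'jsonpath={.status.robustness}')
  [ "$robustness" = "healthy" ] && break          # volume is healthy, move on
  [ -f "/tmp/skip-$vol" ] && break                # manual override (see workaround 2 below)
  echo "Waiting for volume $vol to be healthy..."
  sleep 10
done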

Some possibilities:


We found some indirect clues which may imply n2-103 was already drained before or during pre-draining:

[Screenshot taken 2022-11-17 23:25:40]
I1116 07:46:15.390229       1 trace.go:205] Trace[943167108]: "Create" url:/api/v1/namespaces/harvester-system/pods/hvst-upgrade-8kpdd-pre-drain-n1-103-dpjfz/eviction,user-agent:rancher/v0.0.0 (linux/amd64) kubernetes/$Format,audit-id:554c9fe9-79cb-4ba4-a2fe-1d4fc21439ec,client:192.168.101.186,accept:application/json, */*,protocol:HTTP/2.0 (16-Nov-2022 07:46:14.659) (total time: 730ms):

But we're still not sure this is the root cause because the key evidence (the logs of the harvester/rancher/kube-apiserver pods) was lost during the upgrade.

cc @bk201

starbops commented 1 year ago

To let the upgrade proceed, there are three ways to work around this:

  1. Lower the number of replicas of each attached volume from 3 to 2 by editing its YAML, for example (a non-interactive equivalent is sketched after this list):
    n1-103:~ # k -n longhorn-system edit lhv pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4
    ...
    spec:
    ...
    numberOfReplicas: 2
    ...

    The volume is now in a healthy state.

    n2-103:~ # k -n longhorn-system get lhv
    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
    pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   healthy                  53687091200   n1-103   47h
    pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   43h
    pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   degraded                 10485760      n3-103   47h
    pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   degraded                 5368709120    n3-103   42h
    pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            44h
    pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            44h
  2. Create the corresponding skip files to bypass the volume-checking mechanism in the pre-drain pod
    n1-103:~ # k -n harvester-system exec -it hvst-upgrade-8kpdd-pre-drain-n2-103-q8jch -- bash
    n2-103:/ # touch /tmp/skip-pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a

    Note that the volume is still in a degraded state, but the volume check for the second volume now passes.

    n1-103:~ # k -n longhorn-system get lhv
    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
    pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   healthy                  53687091200   n1-103   47h
    pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   43h
    pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   degraded                 10485760      n3-103   47h
    pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   degraded                 5368709120    n3-103   42h
    pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            44h
    pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            44h
  3. Uncordon the node directly
    n1-103:~ # k uncordon n2-103
    node/n2-103 uncordoned

    The instance-manager-e/r pods are able to spawn on n2-103.

    n1-103:~ # k get no
    NAME     STATUS   ROLES                       AGE   VERSION
    n1-103   Ready    control-plane,etcd,master   47h   v1.24.7+rke2r1
    n2-103   Ready    control-plane,etcd,master   47h   v1.22.12+rke2r1
    n3-103   Ready    control-plane,etcd,master   46h   v1.22.12+rke2r1
    n1-103:~ # k -n longhorn-system get po -l longhorn.io/component=instance-manager
    NAME                          READY   STATUS              RESTARTS   AGE
    instance-manager-e-2d7f17cb   1/1     Running             0          43h
    instance-manager-e-bf28fb23   1/1     Running             0          42h
    instance-manager-e-e93877c4   0/1     ContainerCreating   0          5s
    instance-manager-r-11da068c   1/1     Running             0          43h
    instance-manager-r-266475ff   1/1     Running             0          42h
    instance-manager-r-da2d04f7   0/1     ContainerCreating   0          5s
    n1-103:~ # k -n longhorn-system get lhim
    NAME                          STATE     TYPE      NODE     AGE
    instance-manager-e-2d7f17cb   running   engine    n3-103   43h
    instance-manager-e-bf28fb23   running   engine    n1-103   43h
    instance-manager-e-e93877c4   running   engine    n2-103   43h
    instance-manager-r-11da068c   running   replica   n3-103   43h
    instance-manager-r-266475ff   running   replica   n1-103   43h
    instance-manager-r-da2d04f7   running   replica   n2-103   43h

    The remaining degraded volumes are now in a healthy state.

    n1-103:~ # k -n longhorn-system get lhv
    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
    pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   healthy                  53687091200   n1-103   47h
    pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   44h
    pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   healthy                  10485760      n3-103   47h
    pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   healthy                  5368709120    n3-103   43h
    pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            44h
    pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            44h
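
As a non-interactive alternative to the kubectl edit in workaround 1, the same change can be applied with a patch (a sketch; replace the volume name, taken from the listing above, with the one blocking the drain):

n1-103:~ # k -n longhorn-system patch lhv pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 --type merge -p '{"spec":{"numberOfReplicas":2}}'
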
lanfon72 commented 1 year ago

Encountered this bug while upgrading from v1.1.2 to v1.2.0-rc3

supportbundle_8365a58f-9f55-4480-a460-b46525cc2098_2023-07-12T12-47-59Z.zip

TachunLin commented 5 months ago

Encountered this issue while upgrading from v1.2.1 to v1.2.2-rc2 with an RKE2 guest cluster created

Attached the upgrade log and support bundle for more information

Upgrade log hvst-upgrade-65kts-upgradelog-archive-2024-05-03T14-19-17Z.zip

Support bundle supportbundle_04242914-6540-4e2d-8971-a2a5ddd0eb95_2024-05-06T03-00-12Z.zip

bk201 commented 5 months ago

I checked the cluster earlier. It's a two-node cluster and the volumes are configured with 3 replicas. In this case, the user will need to lower the replica count or create volumes from a storage class that uses only 2 replicas (a sketch of such a storage class is below).
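
For the storage-class approach, a minimal sketch of a 2-replica Longhorn StorageClass (the name longhorn-2-replicas is only an example; the parameters are standard Longhorn StorageClass parameters):

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2-replicas        # example name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"            # match the number of nodes in the cluster
  staleReplicaTimeout: "30"
EOF

Only volumes created from this class get 2 replicas; existing volumes still need their replica count lowered as described above.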


TachunLin commented 5 months ago

By lowering the replica count to reflect the number of nodes (here we set it to 2), we can upgrade from v1.2.1 to v1.2.2-rc2 on the two-node bare-metal machines. A bulk variant of the earlier patch is sketched below.
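
If there are many volumes, the change can be applied in one pass (a sketch; it patches every Longhorn volume in the cluster, so narrow the selection if only some volumes should be touched):

# set numberOfReplicas=2 on every Longhorn volume before starting the upgrade
for vol in $(kubectl -n longhorn-system get volumes.longhorn.io -o name); do
  kubectl -n longhorn-system patch "$vol" --type merge -p '{"spec":{"numberOfReplicas":2}}'
done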


Closing this issue since it was due to the environment configuration.