harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Upgrade from v1.0.3 to v1.1.1-rc1, Harvester cluster stuck in pre-draining the second node #3164

Closed TachunLin closed 5 months ago

TachunLin commented 1 year ago

Describe the bug

Built a v1.0.3 Harvester cluster on nodes with fixed IPv4 and dynamic IPv6 addresses, then upgraded to v1.1.1-rc1. The upgrade got stuck while pre-draining the second node.

To Reproduce

Steps to reproduce the behavior:

  1. Open the KVM virtual machine manager

  2. Open the Connection Details -> Virtual Networks

  3. Create a new virtual network workload

  4. Add the following XML content, which reserves fixed IPs for three specific MAC addresses (a virsh sketch for creating this network follows these steps)

    <network>
    <name>workload</name>
    <uuid>ac62e6bf-6869-41a9-a2b7-25c06c7601c9</uuid>
    <forward mode="nat">
      <nat>
        <port start="1024" end="65535"/>
      </nat>
    </forward>
    <bridge name="virbr5" stp="on" delay="0"/>
    <mac address="52:54:00:7b:ed:99"/>
    <domain name="workload"/>
    <ip address="192.168.101.1" netmask="255.255.255.0">
      <dhcp>
        <range start="192.168.101.128" end="192.168.101.254"/>
        <host mac="52:54:00:de:04:4c" name="nic1" ip="192.168.101.184"/>
        <host mac="52:54:00:39:1a:70" name="nic2" ip="192.168.101.185"/>
        <host mac="52:54:00:a8:3c:60" name="nic3" ip="192.168.101.186"/>
      </dhcp>
    </ip>
    <ip family="ipv6" address="fd7d:844d:3e17:f3ae::1" prefix="64">
      <dhcp>
        <range start="fd7d:844d:3e17:f3ae::100" end="fd7d:844d:3e17:f3ae::1ff"/>
      </dhcp>
    </ip>
    </network>
  5. Change the bridge name to a new one

  6. Create a VM and attach it to the ipv6-enabled network (workload)

  7. Launch v1.0.3 Harvester ISO installer to create the first node

  8. Select DHCP node IP and DHCP VIP during the installation; the node will pick up its fixed IP from the DHCP reservation

  9. Create two more VMs attached to the same network (workload)

  10. Launch the v1.0.3 Harvester ISO installer to join the second and third nodes

  11. Select DHCP node IP and DHCP VIP during the installation; each node will pick up its fixed IP from the DHCP reservation

  12. Enable the VLAN network on the harvester-mgmt interface and create a vlan 1 network

  13. Create several images

  14. Create two VMs, one using harvester-mgmt and the other using the vlan 1 network

  15. Back up a VM to S3

  16. Shutdown all VMs

  17. Offline upgrade to the v1.1.1-rc1 release; refer to https://docs.harvesterhci.io/v1.1/upgrade/automatic
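
For steps 2 to 5, the workload network can also be created with virsh instead of the virt-manager UI. A minimal sketch, assuming the XML above is saved as workload.xml (the network and bridge names come from that XML):

# define and start the NAT network that serves the fixed IPv4 and dynamic IPv6 leases
virsh net-define workload.xml      # register the "workload" network from the XML above
virsh net-start workload           # create the virbr5 bridge and bring the network up
virsh net-autostart workload       # optional: start the network on host boot
virsh net-dhcp-leases workload     # after installing the nodes, check that they got the expected IPs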

Expected behavior

The upgrade completes on all nodes instead of getting stuck while pre-draining the second node.

Support bundle

supportbundle_5a745ecd-d809-4871-b688-b24aa8fcde96_2022-11-17T07-26-16Z.zip


starbops commented 1 year ago

Here are some findings from the crime scene:

  1. n1-103 is upgraded, n2-103 is stuck in the pre-draining, and n3-103 is still in the image-preloaded state
...
  Node Statuses:
    n1-103:
      State:  Succeeded
    n2-103:
      State:  Pre-draining
    n3-103:
      State:         Images preloaded
...
n1-103:~ # k -n harvester-system get jobs
NAME                                   COMPLETIONS   DURATION   AGE
default-vlan1                          1/1           9s         31h
harvester-promote-n2-103               1/1           100s       34h
harvester-promote-n3-103               1/1           96s        34h
hvst-upgrade-8kpdd-apply-manifests     1/1           13m        30h
hvst-upgrade-8kpdd-post-drain-n1-103   1/1           5m11s      30h
hvst-upgrade-8kpdd-pre-drain-n1-103    1/1           17s        30h
hvst-upgrade-8kpdd-pre-drain-n2-103    0/1           29h        29h
  2. n2-103 is cordoned by Rancher
n1-103:~ # k get no
NAME     STATUS                     ROLES                       AGE   VERSION
n1-103   Ready                      control-plane,etcd,master   34h   v1.24.7+rke2r1
n2-103   Ready,SchedulingDisabled   control-plane,etcd,master   34h   v1.22.12+rke2r1
n3-103   Ready                      control-plane,etcd,master   34h   v1.22.12+rke2r1
  3. The pre-drain pod is waiting for all the Longhorn volumes to become healthy
+ '[' true ']'
+ '[' 3 -gt 2 ']'
++ kubectl get volumes.longhorn.io/pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 -n longhorn-system -o 'jsonpath={.status.robustness}'
+ robustness=degraded
+ '[' degraded = healthy ']'
+ '[' -f /tmp/skip-pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 ']'
Waiting for volume pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 to be healthy...
+ echo 'Waiting for volume pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 to be healthy...'
+ sleep 10
  4. All the attached Longhorn volumes are in a degraded state
n1-103:~ # k -n longhorn-system get lhv
NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   degraded                 53687091200   n1-103   34h
pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   31h
pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   degraded                 10485760      n3-103   34h
pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   degraded                 5368709120    n3-103   30h
pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            31h
pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            31h
  5. All the attached Longhorn volumes have a ReplicaSchedulingFailure condition
...
    Last Transition Time:  2022-11-16T08:22:32Z
    Message:               replica scheduling failed
    Reason:                ReplicaSchedulingFailure
    Status:                False
    Type:                  scheduled
...
  6. There are no instance-manager-r pods running on n2-103, and the corresponding instancemanager CRs are in an error state
n1-103:~ # k -n longhorn-system get po -o wide | grep instance-manager
instance-manager-e-2d7f17cb                    1/1     Running   0             30h   10.52.2.35        n3-103   <none>           <none>
instance-manager-e-bf28fb23                    1/1     Running   0             29h   10.52.0.139       n1-103   <none>           <none>
instance-manager-r-11da068c                    1/1     Running   0             30h   10.52.2.33        n3-103   <none>           <none>
instance-manager-r-266475ff                    1/1     Running   0             29h   10.52.0.140       n1-103   <none>           <none>
n1-103:~ # k -n longhorn-system get instancemanagers
NAME                          STATE     TYPE      NODE     AGE
instance-manager-e-2d7f17cb   running   engine    n3-103   30h
instance-manager-e-bf28fb23   running   engine    n1-103   30h
instance-manager-e-e93877c4   error     engine    n2-103   30h
instance-manager-r-11da068c   running   replica   n3-103   30h
instance-manager-r-266475ff   running   replica   n1-103   30h
instance-manager-r-da2d04f7   error     replica   n2-103   30h

So there seems to be a deadlock: the pre-drain pod waits for the volumes to become healthy. Those volumes can only become healthy once the corresponding replicas can be spawned on n2-103, which is already cordoned ...

The volume check during pre-draining is there to ensure data availability and minimize the risk of losing data: it assumes all the Longhorn replicas are running well on a certain number of nodes (3 by default) before a node gets drained. The crux is why the original instance-manager-r pod on n2-103 disappeared while pre-draining, since pre-draining only migrates or shuts down VMs and does not touch the other containers on the node.
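
The waiting logic visible in the trace above behaves roughly like the loop below (a reconstruction from the set -x output for a single volume, omitting the replica-count check; the real upgrade script may differ):

# reconstructed sketch: wait until the volume is healthy, unless a skip file is created
vol=pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4
while true; do
  robustness=$(kubectl -n longhorn-system get "volumes.longhorn.io/$vol" -o 'jsonpath={.status.robustness}')
  [ "$robustness" = "healthy" ] && break          # volume is healthy, move on
  [ -f "/tmp/skip-$vol" ] && break                # manual override (see workaround 2 below)
  echo "Waiting for volume $vol to be healthy..."
  sleep 10
done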

Some possibilities:


We found some indirect clues which may imply n2-103 was already drained before or during pre-draining:

[Screenshot taken 2022-11-17 23:25:40]
I1116 07:46:15.390229       1 trace.go:205] Trace[943167108]: "Create" url:/api/v1/namespaces/harvester-system/pods/hvst-upgrade-8kpdd-pre-drain-n1-103-dpjfz/eviction,user-agent:rancher/v0.0.0 (linux/amd64) kubernetes/$Format,audit-id:554c9fe9-79cb-4ba4-a2fe-1d4fc21439ec,client:192.168.101.186,accept:application/json, */*,protocol:HTTP/2.0 (16-Nov-2022 07:46:14.659) (total time: 730ms):

But we're still not sure this is the root cause because the key evidence (the logs of the harvester/rancher/kube-apiserver pods) was lost during the upgrade.

cc @bk201

starbops commented 1 year ago

To let the upgrade proceed, there are three ways to work around this:

  1. Lower the number of replicas of each attached volume from 3 to 2 by editing its YAML, for example (a non-interactive equivalent is sketched after this list):
    n1-103:~ # k -n longhorn-system edit lhv pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4
    ...
    spec:
    ...
    numberOfReplicas: 2
    ...

    The volume is now in a healthy state.

    n2-103:~ # k -n longhorn-system get lhv
    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
    pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   healthy                  53687091200   n1-103   47h
    pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   43h
    pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   degraded                 10485760      n3-103   47h
    pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   degraded                 5368709120    n3-103   42h
    pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            44h
    pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            44h
  2. Create the corresponding skip files to bypass the volume-checking mechanism in the pre-drain pod
    n1-103:~ # k -n harvester-system exec -it hvst-upgrade-8kpdd-pre-drain-n2-103-q8jch -- bash
    n2-103:/ # touch /tmp/skip-pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a

    Note that the volume is still in a degraded state, but the volume check for the second volume now passes.

    n1-103:~ # k -n longhorn-system get lhv
    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
    pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   healthy                  53687091200   n1-103   47h
    pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   43h
    pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   degraded                 10485760      n3-103   47h
    pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   degraded                 5368709120    n3-103   42h
    pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            44h
    pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            44h
  3. Uncordon the node directly
    n1-103:~ # k uncordon n2-103
    node/n2-103 uncordoned

    The instance-manager-e/r pods are able to spawn on n2-103.

    n1-103:~ # k get no
    NAME     STATUS   ROLES                       AGE   VERSION
    n1-103   Ready    control-plane,etcd,master   47h   v1.24.7+rke2r1
    n2-103   Ready    control-plane,etcd,master   47h   v1.22.12+rke2r1
    n3-103   Ready    control-plane,etcd,master   46h   v1.22.12+rke2r1
    n1-103:~ # k -n longhorn-system get po -l longhorn.io/component=instance-manager
    NAME                          READY   STATUS              RESTARTS   AGE
    instance-manager-e-2d7f17cb   1/1     Running             0          43h
    instance-manager-e-bf28fb23   1/1     Running             0          42h
    instance-manager-e-e93877c4   0/1     ContainerCreating   0          5s
    instance-manager-r-11da068c   1/1     Running             0          43h
    instance-manager-r-266475ff   1/1     Running             0          42h
    instance-manager-r-da2d04f7   0/1     ContainerCreating   0          5s
    n1-103:~ # k -n longhorn-system get lhim
    NAME                          STATE     TYPE      NODE     AGE
    instance-manager-e-2d7f17cb   running   engine    n3-103   43h
    instance-manager-e-bf28fb23   running   engine    n1-103   43h
    instance-manager-e-e93877c4   running   engine    n2-103   43h
    instance-manager-r-11da068c   running   replica   n3-103   43h
    instance-manager-r-266475ff   running   replica   n1-103   43h
    instance-manager-r-da2d04f7   running   replica   n2-103   43h

    The remaining degraded volumes are now in a healthy state.

    n1-103:~ # k -n longhorn-system get lhv
    NAME                                       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
    pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   attached   healthy                  53687091200   n1-103   47h
    pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a   attached   degraded                 10737418240   n1-103   44h
    pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b   attached   healthy                  10485760      n3-103   47h
    pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a   attached   healthy                  5368709120    n3-103   43h
    pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add   detached   unknown                  21474836480            44h
    pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe   detached   unknown                  21474836480            44h
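
As a non-interactive alternative to the kubectl edit in workaround 1, the same change can be applied with a patch (a sketch; replace the volume name, taken from the listing above, with the one blocking the drain):

n1-103:~ # k -n longhorn-system patch lhv pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 --type merge -p '{"spec":{"numberOfReplicas":2}}'
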
lanfon72 commented 1 year ago

Encountered this bug while upgrading from v1.1.2 to v1.2.0-rc3

supportbundle_8365a58f-9f55-4480-a460-b46525cc2098_2023-07-12T12-47-59Z.zip

TachunLin commented 5 months ago

Encountered this issue while upgrading from v1.2.1 to v1.2.2-rc2 with an RKE2 guest cluster created

Attached the upgrade log and support bundle for more information

Upgrade log hvst-upgrade-65kts-upgradelog-archive-2024-05-03T14-19-17Z.zip

Support bundle supportbundle_04242914-6540-4e2d-8971-a2a5ddd0eb95_2024-05-06T03-00-12Z.zip

bk201 commented 5 months ago

I checked the cluster earlier. It's a two-node cluster and the volumes are configured with 3 replicas. In this case, the user will need to lower the replica count or create volumes from a storage class that uses only 2 replicas (a sketch of such a storage class is below).
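
For the storage-class approach, a minimal sketch of a 2-replica Longhorn StorageClass (the name longhorn-2-replicas is only an example; the parameters are standard Longhorn StorageClass parameters):

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2-replicas        # example name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"            # match the number of nodes in the cluster
  staleReplicaTimeout: "30"
EOF

Only volumes created from this class get 2 replicas; existing volumes still need their replica count lowered as described above.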


TachunLin commented 5 months ago

By lowering the replica count to reflect the number of nodes (here we set it to 2), we can upgrade from v1.2.1 to v1.2.2-rc2 on the two-node bare-metal machines. A bulk variant of the earlier patch is sketched below.
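
If there are many volumes, the change can be applied in one pass (a sketch; it patches every Longhorn volume in the cluster, so narrow the selection if only some volumes should be touched):

# set numberOfReplicas=2 on every Longhorn volume before starting the upgrade
for vol in $(kubectl -n longhorn-system get volumes.longhorn.io -o name); do
  kubectl -n longhorn-system patch "$vol" --type merge -p '{"spec":{"numberOfReplicas":2}}'
done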


Closing this issue since it was due to the environment configuration.