Here are some findings from the crime scene: n1-103 is upgraded, n2-103 is stuck in pre-draining, and n3-103 is still in the image-preloaded state...
Node Statuses:
n1-103:
State: Succeeded
n2-103:
State: Pre-draining
n3-103:
State: Images preloaded
...
n1-103:~ # k -n harvester-system get jobs
NAME COMPLETIONS DURATION AGE
default-vlan1 1/1 9s 31h
harvester-promote-n2-103 1/1 100s 34h
harvester-promote-n3-103 1/1 96s 34h
hvst-upgrade-8kpdd-apply-manifests 1/1 13m 30h
hvst-upgrade-8kpdd-post-drain-n1-103 1/1 5m11s 30h
hvst-upgrade-8kpdd-pre-drain-n1-103 1/1 17s 30h
hvst-upgrade-8kpdd-pre-drain-n2-103 0/1 29h 29h
n2-103 is cordoned by rancher:

n1-103:~ # k get no
NAME STATUS ROLES AGE VERSION
n1-103 Ready control-plane,etcd,master 34h v1.24.7+rke2r1
n2-103 Ready,SchedulingDisabled control-plane,etcd,master 34h v1.22.12+rke2r1
n3-103 Ready control-plane,etcd,master 34h v1.22.12+rke2r1
The pre-drain pod is looping on a volume health check:

+ '[' true ']'
+ '[' 3 -gt 2 ']'
++ kubectl get volumes.longhorn.io/pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 -n longhorn-system -o 'jsonpath={.status.robustness}'
+ robustness=degraded
+ '[' degraded = healthy ']'
+ '[' -f /tmp/skip-pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 ']'
Waiting for volume pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 to be healthy...
+ echo 'Waiting for volume pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 to be healthy...'
+ sleep 10
n1-103:~ # k -n longhorn-system get lhv
NAME STATE ROBUSTNESS SCHEDULED SIZE NODE AGE
pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 attached degraded 53687091200 n1-103 34h
pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a attached degraded 10737418240 n1-103 31h
pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b attached degraded 10485760 n3-103 34h
pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a attached degraded 5368709120 n3-103 30h
pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add detached unknown 21474836480 31h
pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe detached unknown 21474836480 31h
The volumes have a ReplicaSchedulingFailure condition...
Last Transition Time: 2022-11-16T08:22:32Z
Message: replica scheduling failed
Reason: ReplicaSchedulingFailure
Status: False
Type: scheduled
...
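For reference, the full conditions of a volume and Longhorn's view of the nodes can be dumped directly (the volume name below is a placeholder):

# Dump all status conditions of one of the volumes above (replace the name)
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -o jsonpath='{.status.conditions}'

# Check how Longhorn currently sees the nodes
kubectl -n longhorn-system get nodes.longhorn.io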
There are no instance-manager pods on n2-103, and the corresponding instancemanager CRs are in an error state:

n1-103:~ # k -n longhorn-system get po -o wide | grep instance-manager
instance-manager-e-2d7f17cb 1/1 Running 0 30h 10.52.2.35 n3-103 <none> <none>
instance-manager-e-bf28fb23 1/1 Running 0 29h 10.52.0.139 n1-103 <none> <none>
instance-manager-r-11da068c 1/1 Running 0 30h 10.52.2.33 n3-103 <none> <none>
instance-manager-r-266475ff 1/1 Running 0 29h 10.52.0.140 n1-103 <none> <none>
n1-103:~ # k -n longhorn-system get instancemanagers
NAME STATE TYPE NODE AGE
instance-manager-e-2d7f17cb running engine n3-103 30h
instance-manager-e-bf28fb23 running engine n1-103 30h
instance-manager-e-e93877c4 error engine n2-103 30h
instance-manager-r-11da068c running replica n3-103 30h
instance-manager-r-266475ff running replica n1-103 30h
instance-manager-r-da2d04f7 error replica n2-103 30h
So there seems to be a deadlock: the pre-drain pod waits for the volumes to become healthy, but those volumes can only become healthy once the corresponding replicas can be spawned on n2-103, which is already cordoned...

The volume check during pre-draining is there to ensure data availability and minimize the risk of losing any data: it assumes all the Longhorn replicas are running well on a certain number of nodes (3 by default) before a node gets drained. The crux is why the original instance-manager-r pod on n2-103 is gone while pre-draining; pre-draining only migrates or shuts down VMs, not containers on the node.
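From the trace above, the per-volume wait loop can be reconstructed roughly as follows. This is only a sketch, not the actual upgrade script: treating the 3 -gt 2 test as a check of the volume's replica count is an assumption, while the /tmp/skip-<volume> override comes straight from the trace.

# Sketch of the pre-drain volume check, reconstructed from the trace above
vol=pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4   # example volume from the listing above

while true; do
  # Assumption: the "3 -gt 2" test compares the volume's replica count against 2
  replicas=$(kubectl get "volumes.longhorn.io/$vol" -n longhorn-system \
    -o 'jsonpath={.spec.numberOfReplicas}')
  [ "$replicas" -gt 2 ] || break

  robustness=$(kubectl get "volumes.longhorn.io/$vol" -n longhorn-system \
    -o 'jsonpath={.status.robustness}')
  [ "$robustness" = healthy ] && break           # volume healthy: stop waiting

  # Manual override seen in the trace: creating this file inside the
  # pre-drain pod skips the check for this volume
  [ -f "/tmp/skip-$vol" ] && break

  echo "Waiting for volume $vol to be healthy..."
  sleep 10
done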
Some possibilities:
longhorn-manager time="2022-11-16T10:30:01Z" level=warning msg="panic during collecting metrics" collector=instance_manager error="runtime error: invalid memory address or nil pointer dereference" node=n2-103
We found some indirect clues which may imply n2-103 was already drained before or during pre-draining:

The pod count on n2-103 is lower than on the other nodes, and the remaining pods are all static pods or controlled by daemonsets.

Pods were evicted on n1-103 while it was still running a pre-drain pod (logs are from kube-apiserver):

I1116 07:46:15.390229 1 trace.go:205] Trace[943167108]: "Create" url:/api/v1/namespaces/harvester-system/pods/hvst-upgrade-8kpdd-pre-drain-n1-103-dpjfz/eviction,user-agent:rancher/v0.0.0 (linux/amd64) kubernetes/$Format,audit-id:554c9fe9-79cb-4ba4-a2fe-1d4fc21439ec,client:192.168.101.186,accept: application/json, */*,protocol:HTTP/2.0 (16-Nov-2022 07:46:14.659) (total time: 730ms):
But we're still not sure this is the root cause, because the key evidence (all the logs of the harvester/rancher/kube-apiserver pods) was lost during the upgrade.
cc @bk201
To make the upgrade successful, there are three ways to work around this:

1. Lower the volume's replica count so it can become healthy on the remaining schedulable nodes:

n1-103:~ # k -n longhorn-system edit lhv pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4
...
spec:
...
numberOfReplicas: 2
...
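The same change can also be made non-interactively, e.g.:

# Lower the replica count of the volume without opening an editor
kubectl -n longhorn-system patch lhv pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 \
  --type merge -p '{"spec":{"numberOfReplicas":2}}'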
The volume is now in a healthy state.
n2-103:~ # k -n longhorn-system get lhv
NAME STATE ROBUSTNESS SCHEDULED SIZE NODE AGE
pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 attached healthy 53687091200 n1-103 47h
pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a attached degraded 10737418240 n1-103 43h
pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b attached degraded 10485760 n3-103 47h
pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a attached degraded 5368709120 n3-103 42h
pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add detached unknown 21474836480 44h
pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe detached unknown 21474836480 44h
2. Skip the volume check for a specific volume by creating a skip file inside the pre-drain pod:

n1-103:~ # k -n harvester-system exec -it hvst-upgrade-8kpdd-pre-drain-n2-103-q8jch -- bash
n2-103:/ # touch /tmp/skip-pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a
Note that the volume is still in a degraded state, but the volume check for this second volume now passes.
n1-103:~ # k -n longhorn-system get lhv
NAME STATE ROBUSTNESS SCHEDULED SIZE NODE AGE
pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 attached healthy 53687091200 n1-103 47h
pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a attached degraded 10737418240 n1-103 43h
pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b attached degraded 10485760 n3-103 47h
pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a attached degraded 5368709120 n3-103 42h
pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add detached unknown 21474836480 44h
pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe detached unknown 21474836480 44h
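If more volumes stay degraded, the same override can be created for each of them; a sketch (the pod name is the one from this environment, and note that skipping the check bypasses the data-availability safeguard described above):

# Create a skip file for every currently degraded Longhorn volume inside the pre-drain pod
POD=hvst-upgrade-8kpdd-pre-drain-n2-103-q8jch
for vol in $(kubectl -n longhorn-system get lhv \
  -o jsonpath='{range .items[?(@.status.robustness=="degraded")]}{.metadata.name}{"\n"}{end}'); do
  kubectl -n harvester-system exec "$POD" -- touch "/tmp/skip-$vol"
done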
3. Uncordon n2-103 so the instance-manager pods can be scheduled on it again:

n1-103:~ # k uncordon n2-103
node/n2-103 uncordoned
The instance-manager-e/r pods are able to spawn on n2-103.
n1-103:~ # k get no
NAME STATUS ROLES AGE VERSION
n1-103 Ready control-plane,etcd,master 47h v1.24.7+rke2r1
n2-103 Ready control-plane,etcd,master 47h v1.22.12+rke2r1
n3-103 Ready control-plane,etcd,master 46h v1.22.12+rke2r1
n1-103:~ # k -n longhorn-system get po -l longhorn.io/component=instance-manager
NAME READY STATUS RESTARTS AGE
instance-manager-e-2d7f17cb 1/1 Running 0 43h
instance-manager-e-bf28fb23 1/1 Running 0 42h
instance-manager-e-e93877c4 0/1 ContainerCreating 0 5s
instance-manager-r-11da068c 1/1 Running 0 43h
instance-manager-r-266475ff 1/1 Running 0 42h
instance-manager-r-da2d04f7 0/1 ContainerCreating 0 5s
n1-103:~ # k -n longhorn-system get lhim
NAME STATE TYPE NODE AGE
instance-manager-e-2d7f17cb running engine n3-103 43h
instance-manager-e-bf28fb23 running engine n1-103 43h
instance-manager-e-e93877c4 running engine n2-103 43h
instance-manager-r-11da068c running replica n3-103 43h
instance-manager-r-266475ff running replica n1-103 43h
instance-manager-r-da2d04f7 running replica n2-103 43h
The remaining degraded volumes are now in a healthy state.
n1-103:~ # k -n longhorn-system get lhv
NAME STATE ROBUSTNESS SCHEDULED SIZE NODE AGE
pvc-2e394c2d-573e-4947-8d82-e55c9b207fb4 attached healthy 53687091200 n1-103 47h
pvc-32e5ccc1-57bb-4683-967f-83db4f387b0a attached degraded 10737418240 n1-103 44h
pvc-61e07848-2a2e-4860-9198-b03dd92c9c7b attached healthy 10485760 n3-103 47h
pvc-7f0922a9-fead-4762-9b41-3c9f2354f30a attached healthy 5368709120 n3-103 43h
pvc-9f0115e3-4baa-443f-a2d3-d837a6e03add detached unknown 21474836480 44h
pvc-a5b34e7a-d127-46e0-925e-a0ae96c66afe detached unknown 21474836480 44h
Encountered this bug while upgrading from v1.1.2 to v1.2.0-rc3.

Support bundle: supportbundle_8365a58f-9f55-4480-a460-b46525cc2098_2023-07-12T12-47-59Z.zip
Encountered this issue while upgrading from v1.2.1 to v1.2.2-rc2 with an RKE2 guest cluster created.

The upgrade is also stuck in pre-draining of the first node. Checking the log of hvst-upgrade-65kts-pre-drain-node1-8szcx, it keeps looping on:

Waiting for volume pvc-e248fddb-140c-40a1-a697-3b4982320030 to be healthy...

Node1 is in the Cordoned state.

Attached the upgrade log and support bundle for more information:

Upgrade log: hvst-upgrade-65kts-upgradelog-archive-2024-05-03T14-19-17Z.zip
Support bundle: supportbundle_04242914-6540-4e2d-8971-a2a5ddd0eb95_2024-05-06T03-00-12Z.zip
I checked the cluster earlier. It's a two-node cluster and the volumes are configured with 3 replicas. In this case, the user will need to lower the replica count or create volumes with a storage class that uses only 2 replicas.
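For the storage-class route, a two-replica class could look roughly like this (a sketch using Longhorn's standard parameters; the class name is made up):

# Storage class whose volumes are created with only 2 replicas
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2-replicas
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
EOF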
Encountered this issue while upgrading from v1.2.1 to v1.2.2-rc2 with an RKE2 guest cluster created.

By lowering the replica count to reflect the number of nodes (here we set it to 2), we can upgrade from v1.2.1 to v1.2.2-rc2 on the 2-node bare-metal machines.
Closing this issue since it is due to the environment configuration.
Describe the bug
Build a v1.0.3 Harvester cluster with fixed IPv4 and dynamic IPv6 IPs and upgrade to v1.1.1-rc1.

The upgrade process is stuck in pre-draining the second node. Some pods are in pending status. The instance manager got an error.
To Reproduce

Steps to reproduce the behavior:
Open the KVM virtual machine manager
Open the Connection Details -> Virtual Networks
Create a new virtual network workload
Add the following XML content; add three specific MAC addresses to use fixed IPs (see the example network definition after these steps)
Change the bridge name to a new one
Create a VM and use the ipv6 network (workload)
Launch the v1.0.3 Harvester ISO installer to create the first node
Select DHCP node IP and DHCP VIP during the installation; it will use the fixed IP
Create two more VMs and use the ipv6 network (workload)
Launch the v1.0.3 Harvester ISO installer to join the second and third nodes
Select DHCP node IP and DHCP VIP during the installation; it will use the fixed IP
Enable the network on the harvester-mgmt interface and create a vlan 1 network
Create several images
Create two VMs, one using harvester-mgmt and another using vlan 1
Back up a VM to S3
Shut down all VMs
Offline upgrade to the v1.1.1-rc1 release, refer to https://docs.harvesterhci.io/v1.1/upgrade/automatic
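Example network definition referenced in the steps above. This is only a sketch: the MAC addresses, subnets, and bridge name are placeholders, not the ones used in the test, and the XML can equally be pasted into virt-manager instead of using virsh.

# Define a libvirt network with fixed IPv4 leases for three MACs and a dynamic IPv6 range
# (all addresses, MACs, and the bridge name below are placeholders)
virsh net-define /dev/stdin <<'EOF'
<network>
  <name>workload</name>
  <bridge name="br-workload"/>
  <forward mode="nat"/>
  <ip address="192.168.100.1" netmask="255.255.255.0">
    <dhcp>
      <range start="192.168.100.100" end="192.168.100.200"/>
      <host mac="52:54:00:aa:bb:01" ip="192.168.100.11"/>
      <host mac="52:54:00:aa:bb:02" ip="192.168.100.12"/>
      <host mac="52:54:00:aa:bb:03" ip="192.168.100.13"/>
    </dhcp>
  </ip>
  <ip family="ipv6" address="fd00:100::1" prefix="64">
    <dhcp>
      <range start="fd00:100::100" end="fd00:100::1ff"/>
    </dhcp>
  </ip>
</network>
EOF
virsh net-start workload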
Expected behavior
Support bundle
supportbundle_5a745ecd-d809-4871-b688-b24aa8fcde96_2022-11-17T07-26-16Z.zip
Environment
v1.1.1-rc1
Additional context