[BUG] degraded v2 volume doesn't create new replica even though there is an available disk

yangchiu commented 1 month ago

Describe the bug

Trying to reproduce https://github.com/longhorn/longhorn/issues/9166 using longhorn-tests https://github.com/longhorn/longhorn-tests/pull/2048. After several rounds of replica deletion and rebuilding, the v2 volume stops creating new replicas and stops rebuilding. It is left with only one replica, even though there is an available disk:

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/1019/

http://3.220.120.187:30000

The disk on node ip-10-0-1-158 can be used:

[Screenshot: one3]

But the volume doesn't create a new replica on ip-10-0-1-158:

[Screenshot: one1]

# kubectl get volumes -n longhorn-system
NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE            AGE
pvc-49b8a217-bbe1-428b-8aff-858164a57c2b   v2            attached   degraded                 3221225472   ip-10-0-1-160   63m

# kubectl get replicas -n longhorn-system
NAME                                                  DATA ENGINE   STATE     NODE            DISK                                   INSTANCEMANAGER                                     IMAGE                                             AGE
pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-8592f003   v2            running   ip-10-0-1-160   8434ac92-811f-4878-857f-ec8d2141d237   instance-manager-eba99a978580605061398476e501d534   longhornio/longhorn-instance-manager:v1.7.0-rc3   62m

To Reproduce

Run test case Degraded Volume Replica Rebuilding repeatedly.

Expected behavior

Support bundle for troubleshooting

supportbundle_86dadfe6-2d9d-4da2-bf3f-608b1961d2b1_2024-08-06T08-04-59Z.zip

Environment

Longhorn version: v1.7.0-rc3
Impacted volume (PV):
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.30.0+k3s1
- Number of control plane nodes in the cluster:
- Number of worker nodes in the cluster:
Node config
- OS type and version: sles 15-sp6
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes (Gbps):
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
Number of Longhorn volumes in the cluster:

Additional context

derekbit commented 1 month ago

@c3y1huang It seems related to replica scheduling. Could you help check if there is any clue? Thank you.

c3y1huang commented 1 month ago

From the log, Longhorn somehow thinks there is no available disk candidate.

2024-08-06T07:00:48.404463077Z time="2024-08-06T07:00:48Z" level=warning msg="Unable to create new replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-171d1f78" func="controller.(*VolumeController).replenishReplicas" file="volume_controller.go:2305" accessMode=rwo controller=longhorn-volume error="No available disk candidates to create a new replica of size 3221225472" frontend=blockdev migratable=false node=ip-10-0-1-160 owner=ip-10-0-1-160 state=attached volume=pvc-49b8a217-bbe1-428b-8aff-858164a57c2b
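To pull the relevant fields out of a line like the one above when scanning a support bundle, a minimal sketch could look like the following. It assumes the log is in logrus-style `key=value` text format with double-quoted values, as in the line above; `parse_logfmt` is a hypothetical helper name, not part of Longhorn.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Split a logrus-style text line into key=value fields.

    Assumes values containing spaces are double-quoted, as in the
    longhorn-manager line quoted above.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip bare tokens such as the leading timestamp
            fields[key] = value
    return fields


# The warning line from this issue, copied verbatim.
line = (
    '2024-08-06T07:00:48.404463077Z time="2024-08-06T07:00:48Z" level=warning '
    'msg="Unable to create new replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-171d1f78" '
    'func="controller.(*VolumeController).replenishReplicas" file="volume_controller.go:2305" '
    'accessMode=rwo controller=longhorn-volume '
    'error="No available disk candidates to create a new replica of size 3221225472" '
    'frontend=blockdev migratable=false node=ip-10-0-1-160 owner=ip-10-0-1-160 '
    'state=attached volume=pvc-49b8a217-bbe1-428b-8aff-858164a57c2b'
)

fields = parse_logfmt(line)
print(fields["volume"], "->", fields["error"])
```

Filtering on `fields["error"]` and `fields["volume"]` this way makes it easier to see when the scheduling failures for this volume start and stop.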

However, after about 7 to 8 hours, Longhorn seems to find a disk candidate.

Deleting volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b's replica on node ip-10-0-1-158 failed with error: list index out of range ... (28189)
Deleting replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-cb4a0c6b
Waiting for volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b's replica on node ip-10-0-1-158 rebuilding completed
Completed volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b replica rebuilding on ip-10-0-1-158

This needs more investigation.
