longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] degraded v2 volume doesn't create new replica even though there is an available disk #9197

Open yangchiu opened 1 month ago

yangchiu commented 1 month ago

Describe the bug

Trying to reproduce https://github.com/longhorn/longhorn/issues/9166 using longhorn-tests https://github.com/longhorn/longhorn-tests/pull/2048. After several rounds of replica deletion and rebuilding, the v2 volume stops creating new replicas and no longer performs replica rebuilding. It remains with only one replica, even though there is an available disk:

https://ci.longhorn.io/job/private/job/longhorn-e2e-test/1019/

http://3.220.120.187:30000

The disk on node ip-10-0-1-158 can be used:

[screenshot: one3]

But the volume doesn't create a new replica on ip-10-0-1-158:

[screenshot: one1]

# kubectl get volumes -n longhorn-system
NAME                                       DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE            AGE
pvc-49b8a217-bbe1-428b-8aff-858164a57c2b   v2            attached   degraded                 3221225472   ip-10-0-1-160   63m
# kubectl get replicas -n longhorn-system
NAME                                                  DATA ENGINE   STATE     NODE            DISK                                   INSTANCEMANAGER                                     IMAGE                                             AGE
pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-8592f003   v2            running   ip-10-0-1-160   8434ac92-811f-4878-857f-ec8d2141d237   instance-manager-eba99a978580605061398476e501d534   longhornio/longhorn-instance-manager:v1.7.0-rc3   62m
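
When rerunning the test it may be handy to check this state from a script rather than by eye. Below is a minimal sketch, assuming the text of the two `kubectl get` commands above is captured as strings. The helper names (`parse_kubectl_table`, `degraded_volumes`, `running_replicas`) are hypothetical and not part of Longhorn or longhorn-tests, and the `<volume>-r-` replica naming is inferred only from the output above. Columns are sliced at the header offsets so that the empty SCHEDULED cell and the two-word DATA ENGINE header are handled.

```python
import re


def parse_kubectl_table(text):
    """Parse fixed-width `kubectl get` output into a list of dicts.

    Column boundaries are taken from the header line, so empty cells
    and header names containing a single space are preserved.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0]
    # A column name is a run of words separated by single spaces.
    cols = [(m.group(0), m.start())
            for m in re.finditer(r"\S+(?: \S+)*", header)]
    rows = []
    for ln in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(cols):
            end = cols[i + 1][1] if i + 1 < len(cols) else len(ln)
            row[name] = ln[start:end].strip()
        rows.append(row)
    return rows


def degraded_volumes(volume_rows):
    """Return the names of volumes whose ROBUSTNESS column is 'degraded'."""
    return [r["NAME"] for r in volume_rows if r.get("ROBUSTNESS") == "degraded"]


def running_replicas(volume_name, replica_rows):
    """Return running replica rows whose name starts with '<volume>-r-'.

    The naming pattern is an assumption based on the output in this issue.
    """
    prefix = volume_name + "-r-"
    return [r for r in replica_rows
            if r["NAME"].startswith(prefix) and r.get("STATE") == "running"]
```

For the output above, this would report the volume as degraded with a single running replica on ip-10-0-1-160.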

To Reproduce

Run the test case Degraded Volume Replica Rebuilding repeatedly.

Expected behavior

Support bundle for troubleshooting

supportbundle_86dadfe6-2d9d-4da2-bf3f-608b1961d2b1_2024-08-06T08-04-59Z.zip

Environment

Additional context

derekbit commented 1 month ago

@c3y1huang It seems related to replica scheduling. Could you help check if there is any clue? Thank you.

c3y1huang commented 1 month ago

From the log, Longhorn somehow thinks there is no available disk candidate.

2024-08-06T07:00:48.404463077Z time="2024-08-06T07:00:48Z" level=warning msg="Unable to create new replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-171d1f78" func="controller.(*VolumeController).replenishReplicas" file="volume_controller.go:2305" accessMode=rwo controller=longhorn-volume error="No available disk candidates to create a new replica of size 3221225472" frontend=blockdev migratable=false node=ip-10-0-1-160 owner=ip-10-0-1-160 state=attached volume=pvc-49b8a217-bbe1-428b-8aff-858164a57c2b
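
To see how long this warning keeps recurring in the support bundle, the log line can be split into its `key=value` fields and filtered on the `error` text. A rough sketch, assuming every line follows the same layout as the one above (a leading timestamp followed by `key=value` pairs, with values optionally double-quoted). The function names are hypothetical and not part of any Longhorn tooling.

```python
import re

# key=value, where the value is either a double-quoted string or a bare token.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_log_fields(line):
    """Split a key=value style log line into a dict of strings."""
    fields = {}
    for key, raw in _FIELD.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def is_no_disk_candidate(fields):
    """True if the line reports the scheduling failure quoted in this issue."""
    return "No available disk candidates" in fields.get("error", "")
```

Collecting the `time` field of every matching line for the volume should show when the warning first appears and when it stops.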

However, after about 7 to 8 hours, Longhorn seems to find a disk candidate.

Deleting volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b's replica on node ip-10-0-1-158 failed with error: list index out of range ... (28189)
Deleting replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-cb4a0c6b
Waiting for volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b's replica on node ip-10-0-1-158 rebuilding completed
Completed volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b replica rebuilding on ip-10-0-1-158

This needs more investigation.