Open yangchiu opened 1 month ago
@c3y1huang It seems related to replica scheduling. Could you help check if there is any clue? Thank you.
From the log, Longhorn somehow thinks there is no available disk candidate.
2024-08-06T07:00:48.404463077Z time="2024-08-06T07:00:48Z" level=warning msg="Unable to create new replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-171d1f78" func="controller.(*VolumeController).replenishReplicas" file="volume_controller.go:2305" accessMode=rwo controller=longhorn-volume error="No available disk candidates to create a new replica of size 3221225472" frontend=blockdev migratable=false node=ip-10-0-1-160 owner=ip-10-0-1-160 state=attached volume=pvc-49b8a217-bbe1-428b-8aff-858164a57c2b
However, after about 7 to 8 hours, Longhorn seems to find a disk candidate.
Deleting volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b's replica on node ip-10-0-1-158 failed with error: list index out of range ... (28189)
Deleting replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-cb4a0c6b
Waiting for volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b's replica on node ip-10-0-1-158 rebuilding completed
Completed volume pvc-49b8a217-bbe1-428b-8aff-858164a57c2b replica rebuilding on ip-10-0-1-158
Need more investigation.
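To narrow down the window in which the scheduler reported no candidates, the warnings in the longhorn-manager logs could be filtered by volume. The following is only a rough sketch: it assumes the logs use the logrus text format shown above, and the helper names `parse_logfmt` and `scheduling_failures` are made up for illustration, not part of Longhorn.

```python
import shlex
from datetime import datetime

# The warning line quoted earlier in this thread, used as sample input.
SAMPLE = (
    '2024-08-06T07:00:48.404463077Z time="2024-08-06T07:00:48Z" level=warning '
    'msg="Unable to create new replica pvc-49b8a217-bbe1-428b-8aff-858164a57c2b-r-171d1f78" '
    'func="controller.(*VolumeController).replenishReplicas" file="volume_controller.go:2305" '
    'accessMode=rwo controller=longhorn-volume '
    'error="No available disk candidates to create a new replica of size 3221225472" '
    'frontend=blockdev migratable=false node=ip-10-0-1-160 owner=ip-10-0-1-160 '
    'state=attached volume=pvc-49b8a217-bbe1-428b-8aff-858164a57c2b'
)


def parse_logfmt(line):
    """Return the key=value fields of one logrus text-format line as a dict."""
    fields = {}
    # shlex keeps quoted values such as msg="..." together as a single token.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skips the leading container timestamp, which has no "="
            fields[key] = value
    return fields


def scheduling_failures(lines, volume):
    """Yield (time, fields) for each 'no disk candidate' warning of a volume."""
    for line in lines:
        if "No available disk candidates" not in line:
            continue
        try:
            fields = parse_logfmt(line)
        except ValueError:
            continue  # unbalanced quotes; not a line this sketch can parse
        if fields.get("volume") == volume and "time" in fields:
            when = datetime.strptime(fields["time"], "%Y-%m-%dT%H:%M:%SZ")
            yield when, fields


if __name__ == "__main__":
    volume = "pvc-49b8a217-bbe1-428b-8aff-858164a57c2b"
    for when, fields in scheduling_failures([SAMPLE], volume):
        print(when.isoformat(), fields["node"], fields["error"])
```

Feeding it the lines of a longhorn-manager log file from the support bundle, instead of `[SAMPLE]`, would give the first and last time the warning appeared for this volume, which can then be compared with the time the rebuild finally started.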
Describe the bug
Trying to reproduce https://github.com/longhorn/longhorn/issues/9166 using longhorn-tests https://github.com/longhorn/longhorn-tests/pull/2048. After several rounds of replica deletion and rebuilding, the v2 volume stops creating new replicas and stops rebuilding. It remains with only one replica, even though there is an available disk:
https://ci.longhorn.io/job/private/job/longhorn-e2e-test/1019/
http://3.220.120.187:30000
The disk on node ip-10-0-1-158 can be used, but the volume doesn't create a new replica on ip-10-0-1-158.
To Reproduce
Run test case Degraded Volume Replica Rebuilding repeatedly.
Expected behavior
Support bundle for troubleshooting
supportbundle_86dadfe6-2d9d-4da2-bf3f-608b1961d2b1_2024-08-06T08-04-59Z.zip
Environment
Additional context