longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[BUG] Test case test_node_eviction_multiple_volume failed to reschedule replicas after volume detached #9857

Open yangchiu opened 2 days ago

yangchiu commented 2 days ago

Describe the bug

Test case test_node_eviction_multiple_volume failed to reschedule replicas after volume detached:

https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/1104/testReport/junit/tests/test_node/test_node_eviction_multiple_volume/

To Reproduce

  1. Disable scheduling on node 1.
  2. Create a PV, PVC, and pod with volume 1, which has 2 replicas.
  3. Set 'Eviction Requested' to 'true' and disable scheduling on node 2.
  4. Set 'Eviction Requested' to 'false' and enable scheduling on node 1.
  5. Check that the volume is 'healthy' and wait for the replicas to run on nodes 1 and 3.
  6. Delete the pods to detach volume 1.
  7. Set 'Eviction Requested' to 'false' and enable scheduling on node 2.
  8. Set 'Eviction Requested' to 'true' and disable scheduling on node 1.
  9. Wait for the replicas to run on nodes 2 and 3.
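The scheduling and eviction toggles above map to fields on the Longhorn node CRs. A rough sketch of the key steps with kubectl, assuming Longhorn runs in the `longhorn-system` namespace and the node CRs are named `node1`/`node2`/`node3` (placeholder names, as is the volume name `vol1`):

```shell
# Step 1: disable scheduling on node 1
kubectl -n longhorn-system patch nodes.longhorn.io node1 \
  --type merge -p '{"spec":{"allowScheduling":false}}'

# Step 3: request eviction and disable scheduling on node 2
kubectl -n longhorn-system patch nodes.longhorn.io node2 \
  --type merge -p '{"spec":{"evictionRequested":true,"allowScheduling":false}}'

# Step 8: request eviction and disable scheduling on node 1
kubectl -n longhorn-system patch nodes.longhorn.io node1 \
  --type merge -p '{"spec":{"evictionRequested":true,"allowScheduling":false}}'

# Step 9: watch where the replicas land (replicas carry a label with
# their volume name)
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=vol1 \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeID
```

These commands require a live cluster with Longhorn installed; the test suite performs the equivalent operations through the Longhorn API.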

In v1.7.2, the detached volume automatically re-attaches in step 9 so the replicas can be rescheduled from node 1 to node 2.

But in master-head, the re-attachment and rescheduling never happen.
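The v1.7.2 behavior described above amounts to a simple check: a detached volume still needs an automatic re-attach whenever any of its replicas sits on a node whose eviction is requested. A minimal sketch of that decision as a shell helper (hypothetical, not Longhorn's actual controller code):

```shell
#!/bin/sh
# Hypothetical helper illustrating the eviction decision, not Longhorn's
# real implementation.
# $1: space-separated node IDs currently holding the volume's replicas
# $2: space-separated node IDs with 'Eviction Requested' set to true
needs_eviction_reattach() {
  replica_nodes=$1
  evicting_nodes=$2
  for r in $replica_nodes; do
    for e in $evicting_nodes; do
      if [ "$r" = "$e" ]; then
        echo yes   # at least one replica is on an evicting node
        return 0
      fi
    done
  done
  echo no
}

# Step 9 of the reproduction: replicas sit on node1/node3 and node1 is
# being evicted, so the controller should re-attach the volume to
# rebuild a replica on node2.
needs_eviction_reattach "node1 node3" "node1"
```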

Expected behavior

Support bundle for troubleshooting

Environment

Additional context

Workaround and Mitigation

derekbit commented 2 days ago

@mantissahz Please help investigate the issue. Thank you.

yangchiu commented 2 days ago

Could this be related to https://github.com/longhorn/longhorn/issues/9781?

c3y1huang commented 1 day ago

> Could this be related to #9781?

Yes, it seems to be a regression caused by it. I will handle this at https://github.com/longhorn/longhorn/issues/9781.

cc @derekbit @mantissahz

longhorn-io-github-bot commented 1 day ago

Pre Ready-For-Testing Checklist

innobead commented 2 hours ago

> Could this be related to #9781?
>
> Yes, it seems to be a regression caused by it. I will handle this at #9781.
>
> cc @derekbit @mantissahz

So this is not a regression in the existing versions, but is caused by the recent fix for #9781?

c3y1huang commented 2 hours ago

> So this is not a regression in the existing versions, but is caused by the recent fix for #9781?

Yes, this is caused by the recently merged PR https://github.com/longhorn/longhorn-manager/pull/3270.