openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that is provisioned from an optimized NVME SPDK backend data storage stack.
Apache License 2.0
753 stars 109 forks source link

Reuse Rebuild IO handles #1755

Closed tiagolobocastro closed 1 month ago

tiagolobocastro commented 1 month ago
fix(rebuild): reuse rebuild IO handles

Reuses the rebuild IO handles, rather than attempting to allocate
them per rebuild task.
The main issue with handle allocation on the fly is that the target
may have not cleaned up a previous IO qpair connection, and so the
connect may fail. We started seeing this more on CI because we forgot
to cherry-pick a commit increasing the retry delay.
However, after inspecting a bunch of user support bundles I see that
we still have occasional connect errors. Rather than increasing the
timeout, we attempt here to reuse the handles, thus avoid the
problem almost entirely.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>

refactor(rebuild): rebuild completion is not an error

When the rebuild has been complete, if we wait for it this fails because
the channels are not longer available.
Instead, simply return the rebuild state, since this is what we want anyway.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
tiagolobocastro commented 1 month ago

lgtm. As I understand we'll now make all 16 rebuild tasks use the same connection for reading/writing the rebuild segments.

Yes, each task has the immutable RebuildDescriptor and reuse the handles from there.

Is the rebuild descriptor destroyed today as soon as rebuild finishes and the nexus channels are reconfigured to add rebuilt bdev's handle? We wouldn't want a lingering handle in rebuild descriptor while the rebuilt child bdev handle has been added to main nexus channels too.

Yes that's right, the rebuild backend "runs away" when the rebuild is started. When the rebuild completes the backend terminates, and on the drop, the handles are also dropped. Eventually we can also improve things here by moving the handle to the tasks themselves, allowing us to drop the handles if the rebuild is paused for example. Also that could be the first step to allowing the rebuild to run on multi-cores too.

For now simply moving the handle to descriptor is good enough for solving the connection issues.

tiagolobocastro commented 1 month ago

bors merge

bors-openebs-mayastor[bot] commented 1 month ago

Build succeeded: