openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, backed by an optimized NVMe SPDK data storage stack.
Apache License 2.0

Replicated volumes failing to degraded state under load #1331

Open kukacz opened 1 year ago

kukacz commented 1 year ago

Describe the bug Three-way replicated volumes occasionally fall into a degraded state when parallel load is generated from multiple pods directed at multiple mayastor volumes, possibly sharing the same DiskPool resources.

To Reproduce These failures happen during the first run of the fio test tool on a newly created volume, mounted into a k8s pod managed by a K8s StatefulSet resource. It is the phase when fio is creating its test files; in my case there are four 8 GiB test files per volume, created one by one, while running on e.g. 6 pods in parallel. Usually at least 1 volume fails to Degraded during this operation.
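For reference, the file-creation phase corresponds roughly to an fio invocation like the sketch below. Only the four 8 GiB files per volume and the parallel writers come from the setup described above; the mount path, job name, block size and IO engine are illustrative assumptions.

```
# Illustrative only: recreate the file-layout phase on one mounted mayastor volume.
# /mnt/data, the job name and the block size are assumptions; the four 8 GiB
# files per volume and the sequential-write pattern match the description above.
fio --name=statefulset-prep \
    --directory=/mnt/data \
    --numjobs=4 --size=8g \
    --rw=write --bs=1m \
    --ioengine=libaio --direct=1 \
    --group_reporting
```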

Expected behavior Volumes stay online regardless of the load put on them.

Screenshots N/A

OS info (please complete the following information):

Additional context Mayastor volumes are formatted with the default ext4 filesystem. There are 3 worker nodes labelled "storage-nodes" and 3 labelled "worker-nodes" in the attached support dump. For this test they were not tainted, though, and all served the same role, colocating workload pods and storage pods equally. Also, don't be confused by the "rook" label used in the worker node names; no software storage system other than mayastor was running in that cluster. A fresh failure operation was logged in the attached support collection. Here's the timing description:

The attached support collection logs were split and compressed to fit GitHub's upload size limit. To rebuild, do the reverse of split -b 25M mayastor-2023-03-13--14-28-12-UTC.tar.gz; gzip xaa; gzip xab. xaa.gz xab.gz
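The rebuild boils down to decompressing the parts and concatenating them back in order (file names as attached above):

```
# Reverse of the split/gzip step above: decompress both parts, then join them.
gunzip xaa.gz xab.gz
cat xaa xab > mayastor-2023-03-13--14-28-12-UTC.tar.gz
```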

tiagolobocastro commented 1 year ago

@kukacz at least some of the logs don't seem to have newlines, would you be able to double check this? (we might have done the reverse process incorrectly or it might be a bug with the dump itself)

kukacz commented 1 year ago

@kukacz at least some of the logs don't seem to have newlines, would you be able to double check this? (we might have done the reverse process incorrectly or it might be a bug with the dump itself)

@tiagolobocastro Hi again Tiago! Thanks for looking into the issue. Yes, I've already double-checked that line-endings issue with @Abhinandan-Purkait via private Slack chat earlier. Unfortunately, it seems to originate in the log dump procedure; I've experienced it with kubectl plugin versions 2.0.0 and 2.0.1, each in a different client environment.

tiagolobocastro commented 1 year ago

How bizarre, I don't think I've ever seen this... Anyway I can see faulted replicas which explain the degraded volumes you've mentioned, though it's kinda painful to look at the logs in a single line. Would you be able and willing to try this with the latest release? We now also collect k8s logs which might be useful here if the problem is somehow with our loki logs.
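(For reference, the support bundle with the extra k8s logs comes from the kubectl plugin; a minimal sketch, assuming mayastor runs in the mayastor namespace — exact flags may differ between plugin versions:)

```
# Collect a fresh support bundle with the latest kubectl-mayastor plugin.
# The namespace is an assumption; adjust to where mayastor is deployed.
kubectl mayastor dump system -n mayastor
```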

kukacz commented 1 year ago

... Would you be able and willing to try this with the latest release? We now also collect k8s logs which might be useful here if the problem is somehow with our loki logs.

@tiagolobocastro Of course! I've just repeated the test: switched to the latest mayastor 2.1.0 helm chart, a different public cloud provider, upgraded the kubectl-mayastor plugin to the latest version, and Ubuntu 22.04.2 LTS on all nodes.
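(For completeness, the chart switch was roughly the following sketch; the repo URL follows the upstream chart docs, and the release and namespace names are illustrative:)

```
# Roughly how the 2.1.0 chart was deployed; release and namespace names are illustrative.
helm repo add mayastor https://openebs.github.io/mayastor-extensions/
helm repo update
helm install mayastor mayastor/mayastor -n mayastor --create-namespace --version 2.1.0
```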

Unfortunately, I've experienced the same issue: 2 out of 9 healthy three-way replicated volumes failed into a degraded state while being written to by 4 fio threads. Also, the single-log-line issue remains, but the additional k8s logs are now collected as well. Logs attached: mayastor-2023-05-05--09-09-42-UTC.tar.gz

tiagolobocastro commented 1 year ago

@kukacz which volumes did you see going degraded? Do you have dmesg logs for the initiator nodes? Seems we are having issues in the initiators, example storage-node-b47ua:

2023-05-05T09:03:17.697200Z INFO Target state transition, state: Suspected, target: "nqn.2019-05.io.openebs:a26609bd-a329-4cc3-bf9d-b8dbc528d03c", mpath: "/sys/devices/virtual/nvme-fabrics/ctl/nvme4"

Sadly this is triggering an HA bug which causes the target replacement to loop; this is fixed and will be released in 2.2, the next upcoming release, currently in its testing phase.

kukacz commented 1 year ago

@kukacz which volumes did you see going degraded? Do you have dmesg logs for the initiator nodes? Seems we are having issues in the initiators, example storage-node-b47ua:

Hi @tiagolobocastro. The degraded volumes were a26609bd-a329-4cc3-bf9d-b8dbc528d03c and 4bce86b0-fb93-4da8-8a02-b594f5f0e680. Unfortunately, the dmesg output is gone already. I can capture it with a new deployment and a re-run of the test.
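(The degraded state itself is easy to confirm from the plugin; a minimal check, assuming the 2.x kubectl-mayastor plugin:)

```
# List volume state and replica count; the two UUIDs above show up as Degraded here.
kubectl mayastor get volumes
```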

tiagolobocastro commented 1 year ago

@kukacz yes, those are the ones I see being switched over! So the switchover bug I mentioned will be fixed for 2.2, but the main culprit would still be there: something happening on the target causing connectivity issues with both the initiator and the data replica, hmm.

Would you be able to re-test with helm --set agents.ha.enabled=false ? This will prevent us from trying to move the target (avoiding the bug I mentioned before), and perhaps together with dmesg it will help us pin down the problem. Btw, I assume we're not running close to your network's bandwidth or anything like that?
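(Something along these lines; the release and namespace names are placeholders for your install, only the agents.ha.enabled=false setting is the part that matters:)

```
# Disable the HA agents on the existing install (release/namespace names are placeholders),
# then keep kernel logs from the initiator nodes while the fio test runs.
helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --set agents.ha.enabled=false

# Run on each initiator node for the duration of the test.
dmesg --follow -T | tee /tmp/dmesg-$(hostname).log
```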

tiagolobocastro commented 9 months ago

Any updates from your side @kukacz ? IIRC we discussed this on slack, where I could not reproduce the issue at the time. This may already be fixed: we have increased the default timeouts (a shared infra backend would sometimes trigger IO timeouts) and there were a few reactor deadlocks that we fixed.

kukacz commented 9 months ago

@tiagolobocastro Sorry, I haven't tested this in a long time. Please give me a few days to plan my options to redeploy the scenario and verify. Thank you!

tiagolobocastro commented 9 months ago

thank you @kukacz that'd be awesome :+1:

tiagolobocastro commented 2 weeks ago

@kukacz may I ping you again? Did you manage to redeploy? Thanks