openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

1 node down in a 3 node system caused many volumes to be Degraded/Faulted without a Target node #1714

Open veenadong opened 1 month ago

veenadong commented 1 month ago

Mayastor 2.5.1 on a 3-node system: after taking 1 node down, many volumes go into rebuilding, then fault, then retry the rebuild.
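For context, this is how we were watching the volume states with the kubectl-mayastor plugin (exact flags may differ per plugin version):

# list Mayastor volumes with their status (Online/Degraded/Faulted) and target node
kubectl mayastor get volumes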

mayastor-io-engine-xs9sk_mayastor_io-engine-dd23276f2aaa3971af97201157038ba42b9e57962f307d92e693116527acd457.log:2024-08-07T18:19:38.950624012-07:00 stdout F [2024-08-08T01:19:38.950553630+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.69:8420/nqn.2019-05.io.openebs:b8d8cb2e-0160-4015-931a-53c3b125447cn1: failed to get I/O handle: failed to get IO channel for 10.245.250.69:8420/nqn.2019-05.io.openebs:b8d8cb2e-0160-4015-931a-53c3b125447cn1

Mayastor pods are running:

NAME                                            READY   STATUS    RESTARTS        AGE
mayastor-agent-core-5859d9898c-hfjrd            2/2     Running   8 (134m ago)    4h7m
mayastor-agent-ha-node-hmnn9                    1/1     Running   8 (136m ago)    39h
mayastor-agent-ha-node-zjm5c                    1/1     Running   0               3h1m
mayastor-api-rest-bb7656dd7-26ckt               1/1     Running   0               177m
mayastor-api-rest-bb7656dd7-857cr               1/1     Running   1 (136m ago)    6h19m
mayastor-api-rest-bb7656dd7-rdl5m               1/1     Running   0               177m
mayastor-csi-controller-56bb85db5f-bmd9k        5/5     Running   5 (136m ago)    4h1m
mayastor-csi-node-gqknk                         2/2     Running   16 (136m ago)   39h
mayastor-csi-node-p8l7n                         2/2     Running   0               3h1m
mayastor-etcd-0                                 1/1     Running   1 (135m ago)    4h
mayastor-etcd-1                                 1/1     Running   3 (136m ago)    39h
mayastor-etcd-2                                 0/1     Pending   0               170m
mayastor-io-engine-tp6cm                        2/2     Running   0               3h1m
mayastor-io-engine-xs9sk                        2/2     Running   8 (48m ago)     39h
mayastor-localpv-provisioner-6fd649f5fb-bcn8g   1/1     Running   2 (136m ago)    171m
mayastor-nats-0                                 3/3     Running   3 (136m ago)    4h
mayastor-nats-1                                 3/3     Running   0               170m
mayastor-nats-2                                 3/3     Running   0               170m
mayastor-operator-diskpool-c6b78cbdb-bqrqx      1/1     Running   0               171m
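
Note that mayastor-etcd-2 is stuck Pending, likely because its node is the one taken down. The scheduling status can be confirmed with (assuming the mayastor namespace):

# show events and the scheduling reason for the pending etcd member
kubectl describe pod mayastor-etcd-2 -n mayastor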

The io-engine pod is logging:

[2024-08-08T01:16:25.047340423+00:00 ERROR io_engine::bdev::nvmx::handle:handle.rs:388] I/O completed with PI error
[2024-08-08T01:16:25.047343437+00:00 ERROR io_engine::bdev::nvmx::handle:handle.rs:388] I/O completed with PI error

[2024-08-08T01:16:25.585879277+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -125, trtype:TCP adrfam:IPv4 traddr:10.245.250.71 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986
[2024-08-08T01:16:25.585921213+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command
[2024-08-08T01:16:25.585926825+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x5582015e9850
[2024-08-08T01:16:25.585970057+00:00 ERROR mayastor::spdk:nvme_qpair.c:797] CQ transport error -6 (No such device or address) on qpair id 3
[2024-08-08T01:16:25.585976264+00:00 ERROR io_engine::bdev::nvmx::qpair:qpair.rs:473] I/O qpair async connection polling error: UnknownErrno: Unknown errno self=QPair { qpair: 0x5582015e9850, ctrlr: 0x558201617f80, ctrlr_name: "10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1", state: Connecting, waiters: 15 }
[2024-08-08T01:16:25.585990227+00:00 ERROR io_engine::bdev::nvmx::qpair:qpair.rs:483] I/O qpair async connection failed self=QPair { qpair: 0x5582015e9850, ctrlr: 0x558201617f80, ctrlr_name: "10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1", state: Connecting, waiters: 15 } err=OpenBdev { source: UnknownErrno }
[2024-08-08T01:16:25.586064498+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to get I/O handle: failed to open bdev
[2024-08-08T01:16:25.586085241+00:00 ERROR io_engine::rebuild::rebuild_descriptor:rebuild_descriptor.rs:85] 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to get I/O handle: failed to open bdev

[2024-08-08T01:16:25.586346683+00:00 ERROR io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:368] Rebuild job #705 (running) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': failed to rebuild segment id=14 block=8969216 with error: Failed to get a handle for bdev 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1
[2024-08-08T01:16:25.586468900+00:00  INFO io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:426] Rebuild job #705 (failed: done) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': changing state from Running to Failed; current stats: RebuildStats { blocks_total: 629135326, blocks_recovered: 8958976, blocks_transferred: 333824, blocks_remaining: 620176350, progress: 1, blocks_per_task: 128, block_size: 512, tasks_total: 16, tasks_active: 0, start_time: 2024-08-08T01:15:17.756714484Z, is_partial: false }
[2024-08-08T01:16:25.586486051+00:00  INFO io_engine::rebuild::rebuild_job_backend:rebuild_job_backend.rs:606] Rebuild job #705 (failed: done) 'nvmf://10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986?uuid=16b7a8d2-e89e-4e31-9e8f-571a3726a986' -> 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e' on nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6': backend dropped; final stats: RebuildStats { blocks_total: 629135326, blocks_recovered: 8958976, blocks_transferred: 333824, blocks_remaining: 620176350, progress: 1, blocks_per_task: 128, block_size: 512, tasks_total: 16, tasks_active: 0, start_time: 2024-08-08T01:15:17.756714484Z, is_partial: false }
[2024-08-08T01:16:25.586728047+00:00 ERROR io_engine::bdev::nexus::nexus_bdev_rebuild:nexus_bdev_rebuild.rs:472] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open out-of-sync; rebuilding]: rebuild job failed with error: Failed to get a handle for bdev 10.245.250.71:8420/nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986n1: failed to open bdev: UnknownErrno: Unknown errno
[2024-08-08T01:16:25.591830223+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1049] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open out-of-sync; rebuilding]: closing child...
[2024-08-08T01:16:25.591843753+00:00 ERROR io_engine::core::descriptor:descriptor.rs:53] Bdev '3d646500-3214-4b3b-9849-139b3648e56e' is not claimed by this module 'NEXUS_CAS_MODULE'
[2024-08-08T01:16:25.591848712+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1072] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: destroying block device...
[2024-08-08T01:16:25.591874931+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1075] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: block device destroyed, waiting for removal...
[2024-08-08T01:16:25.591958581+00:00  INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="956e0c1b-59da-46ea-aa74-4f04bc4e75f6" child_device="3d646500-3214-4b3b-9849-139b3648e56e"
[2024-08-08T01:16:25.591966411+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1103] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open (destroying) out-of-sync; rebuilding]: unplugging child...
[2024-08-08T01:16:25.591974582+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:632] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
[2024-08-08T01:16:25.592098633+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:655] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
[2024-08-08T01:16:25.592145370+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1147] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [closed (destroying) out-of-sync]: child successfully unplugged
[2024-08-08T01:16:25.592204944+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1086] Child 'bdev:///3d646500-3214-4b3b-9849-139b3648e56e?uuid=3d646500-3214-4b3b-9849-139b3648e56e @ 956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [closed out-of-sync]: child closed successfully
[2024-08-08T01:16:25.592215300+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:632] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: rebuild, reconfiguring I/O channels...
[2024-08-08T01:16:25.592330041+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:655] Nexus '956e0c1b-59da-46ea-aa74-4f04bc4e75f6' [open]: dynamic reconfiguration event: rebuild, reconfiguring I/O channels completed with result: Ok
[2024-08-08T01:16:29.631726005+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:4380] [nqn.2019-05.io.openebs:a2685ffc-437b-4286-8aa7-ab6bbd64e686] Submitting Keep Alive failed

Attached are the support bundles from the two running nodes: mayastor-2024-08-08--00-35-44-UTC.tar.gz mayastor-2024-08-08--00-34-37-UTC.tar.gz
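
For reference, bundles of this form are typically produced by the plugin's dump command, roughly as below (output directory is a placeholder; exact flags vary by plugin version):

# collect a cluster-wide support bundle into <output-dir>
kubectl mayastor dump system -d <output-dir>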

avishnu commented 1 month ago

What is the value of repl in the storage class? Did the node come back online?
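
For example, the replica count can be read from the StorageClass parameters (substitute your StorageClass name):

# print the configured replica count of the storage class
kubectl get sc <storage-class-name> -o jsonpath='{.parameters.repl}'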

tiagolobocastro commented 1 month ago

Seems you have hit this bug: https://github.com/openebs/mayastor/issues/1710

[2024-08-08T01:16:25.585879277+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -125, trtype:TCP adrfam:IPv4 traddr:10.245.250.71 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:16b7a8d2-e89e-4e31-9e8f-571a3726a986
[2024-08-08T01:16:25.585921213+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command
[2024-08-08T01:16:25.585926825+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x5582015e9850

We have a fix in https://github.com/openebs/mayastor/pull/1711

tiagolobocastro commented 1 month ago

Hmm, no logs are included in the bundle @veenadong. Which plugin version did you use to create the bundle?

veenadong commented 1 month ago

Hmm, no logs are included in the bundle @veenadong. Which plugin version did you use to create the bundle?

We are running 2.5.1 and the plugin version is: Kubectl Plugin (kubectl-mayastor) revision e8f685e3c9e1 (v2.5.0+0)
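
(That revision string comes from the plugin itself, printed by something like the following; the exact flag depends on the build.)

# print the kubectl-mayastor plugin version/revision string
kubectl mayastor --version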

I see the fix (#1711) will be merged into 2.7? What is the ETA for this fix to reach GA?

tiagolobocastro commented 1 month ago

The plan is to release in a couple of weeks, along with previously missed control-plane functionality for updating pool labels after creation.