AlexanderDotH opened this issue 2 months ago
Hi,
Hmm, I wonder if the nvmf connection is dropping or something like that. The logs might provide some help here; would you be able to take a support bundle and upload it here? https://openebs.io/docs/4.0.x/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/advanced-operations/supportability#using-the-supportability-tool
Also a small dmesg snippet from around the time when this happens might give some clues as well.
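For reference, the support bundle is usually taken with the kubectl-mayastor plugin, roughly like this (a sketch; the namespace and output directory are assumptions, see the linked docs for the exact flags):

```
# Collects logs, k8s resources and io-engine state into a tarball (sketch).
kubectl mayastor dump system -n kube-storage -d ./mayastor-support-bundle
```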
Thank you
Hey! Today at around 11:20 AM my postgres pod just got disconnected.
There are no dmesg logs from around the time when it got disconnected. I can still export the logs from all nodes if you want.
My Kubernetes setup:
Host | Cores | RAM (GB) | Disk (GB) |
---|---|---|---|
alex-cloud-mn-1 | 6 | 8 | 60 |
alex-cloud-wn-1 | 8 | 8 | 50 |
alex-cloud-wn-2 | 8 | 8 | 50 |
alex-cloud-wn-3 | 8 | 8 | 50 |
alex-cloud-wn-4 | 8 | 8 | 50 |
alex-cloud-sn-1 | 6 | 8 | 380 |
alex-cloud-sn-2 | 6 | 8 | 380 |
alex-cloud-sn-3 | 6 | 8 | 380 |
If you need direct access to Grafana, let me know! I also have some metrics from 11 AM to 12 PM, and one pod is using more CPU than the others:
I also found evidence by looking at the volumes: the postgres volume was degraded, and it was on exactly the same node as shown in the Grafana metrics.
ID | REPLICAS | TARGET-NODE | ACCESSIBILITY | STATUS | SIZE | THIN-PROVISIONED | ALLOCATED | SNAPSHOTS | SOURCE |
---|---|---|---|---|---|---|---|---|---|
0a07f4dc-974c-4b39-ba47-f10c51f1fbf3 | 3 | alex-cloud-sn-3 | nvmf | Online | 2GiB | false | 2GiB | 0 | |
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | 3 | alex-cloud-sn-1 | nvmf | Degraded | 5GiB | false | 5GiB | 0 | |
613b787b-309f-4102-9829-d4d1674a7f0c | 3 | alex-cloud-sn-3 | nvmf | Online | 2GiB | false | 2GiB | 0 | |
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | 3 | alex-cloud-sn-2 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
9a56b118-9713-42f1-bdda-52d89f91aa84 | 3 | alex-cloud-sn-1 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
b2158a99-0952-45db-b59d-463b3c2b8dd3 | 3 | alex-cloud-sn-1 | nvmf | Online | 2GiB | false | 2GiB | 0 | |
cea5219d-143d-4a77-84e4-c4b92f3fcfaa | 3 | alex-cloud-sn-1 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
84c8620c-fcdf-4fa2-afd3-ac7d4a06143b | 3 | alex-cloud-sn-1 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
d4a61347-db11-414c-aac6-1710162ae357 | 3 | alex-cloud-sn-3 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
8f438727-f1a4-4d1c-b6a2-00388e845bd6 | 3 | alex-cloud-sn-3 | nvmf | Degraded | 5GiB | false | 5GiB | 0 | |
15bb5925-cc23-44b2-920e-3ac6d5ec6929 | 3 | alex-cloud-sn-1 | nvmf | Online | 10GiB | false | 10GiB | 0 | |
a46141de-ef66-4cdf-bd1a-3b2dd1c07fbd | 3 | alex-cloud-sn-3 | nvmf | Online | 8GiB | false | 8GiB | 0 | |
331c0652-0a75-4a2c-8946-3caa0590af06 | 3 | alex-cloud-sn-1 | nvmf | Online | 50GiB | false | 50GiB | 0 |
Here is my dump: mayastor-2024-09-12--14-28-45-UTC.tar.gz
Thanks for the bundle!
@dsharma-dc lately I've been seeing these messages pop up, any clue?
2024-09-12T14:46:32.627058782+02:00 stdout F [2024-09-12T12:46:32.611677145+00:00 ERROR mayastor::spdk:tcp.c:2212] The TCP/IP connection is not negotiated
2024-09-12T14:47:02.648815057+02:00 stdout F [2024-09-12T12:47:02.648483952+00:00 ERROR mayastor::spdk:tcp.c:1605] No pdu coming for tqpair=0x561ca17d9570 within 30 seconds
I also see on this bundle:
"gRPC request 'share_replica' for 'Replica' failed with 'status: AlreadyExists, message: \"Failed to acquire lock for the resource: alex-cloud-sn-2-pool, lock already held\
At around this time, the replica service seems to get stuck:
[2024-09-12T04:54:29.727501359+00:00 WARN io_engine::grpc::v1::replica:replica.rs:83] destroy_replica: gRPC method timed out, args: DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
@AlexanderDotH would you be able to exec into the io-engine pod on node sn-2, into the io-engine container, and run:
io-engine-client bdev list
io-engine-client nexus list
io-engine-client replica list
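For reference, a sketch of running these from outside the node (the kube-storage namespace and the app=io-engine label are assumptions based on this deployment):

```
# Find the io-engine pod scheduled on node sn-2 (label/namespace assumed).
POD=$(kubectl -n kube-storage get pods -l app=io-engine \
  --field-selector spec.nodeName=alex-cloud-sn-2 \
  -o jsonpath='{.items[0].metadata.name}')

# Run the io-engine-client queries inside the io-engine container.
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client bdev list
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client nexus list
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client replica list
```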
Thank you
Sure! Here is the output:
/ # io-engine-client bdev list

UUID | NUM_BLOCKS | BLK_SIZE | CLAIMED_BY | NAME | SHARE_URI |
---|---|---|---|---|---|---|
613b787b-309f-4102-9829-d4d1674a7f0c | 4184030 | 512 | NVMe-oF Target | 613b787b-309f-4102-9829-d4d1674a7f0c | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:613b787b-309f-4102-9829-d4d1674a7f0c | |
f7cec220-a89f-485c-a2b2-80d555d0776f | 692060159 | 512 | lvol | /dev/sda2 | bdev:////dev/sda2 | |
228ae271-673b-4d57-8db8-8a6bfb311f69 | 10485760 | 512 | NVMe-oF Target | 228ae271-673b-4d57-8db8-8a6bfb311f69 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:228ae271-673b-4d57-8db8-8a6bfb311f69 | |
f1168840-bce1-4333-bca7-1170e9a3f045 | 10485760 | 512 | NVMe-oF Target | f1168840-bce1-4333-bca7-1170e9a3f045 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:f1168840-bce1-4333-bca7-1170e9a3f045 | |
84c55ffb-8cf2-4b82-8ccd-779ae1224128 | 10485760 | 512 | NVMe-oF Target | 84c55ffb-8cf2-4b82-8ccd-779ae1224128 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:84c55ffb-8cf2-4b82-8ccd-779ae1224128 | |
3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | 10485760 | 512 | NVMe-oF Target | 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | |
2af3ef1f-4129-456c-b3e7-d511cde9f58a | 4194304 | 512 | NVMe-oF Target | 2af3ef1f-4129-456c-b3e7-d511cde9f58a | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:2af3ef1f-4129-456c-b3e7-d511cde9f58a | |
a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | 4194304 | 512 | NVMe-oF Target | a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | |
c000f370-0d65-405f-9d7e-10fa8a6d07aa | 4194304 | 512 | NVMe-oF Target | c000f370-0d65-405f-9d7e-10fa8a6d07aa | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:c000f370-0d65-405f-9d7e-10fa8a6d07aa | |
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | 10475486 | 512 | NVMe-oF Target | 8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | |
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | 10475486 | 512 | NVMe-oF Target | 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:5b1da3b6-7890-4e54-ac08-9ef12bd50f9e |
/ # io-engine-client nexus list

NAME | UUID | SIZE | STATE | REBUILDS | PATH |
---|---|---|---|---|---|---|
613b787b-309f-4102-9829-d4d1674a7f0c | 2ab837e5-fdd5-47d2-96de-ad5a8aa4e765 | 2147483648 | shutdown | 0 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:613b787b-309f-4102-9829-d4d1674a7f0c | |
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | 9759759c-a0e6-4777-93e7-af6db9bed125 | 5368709120 | shutdown | 0 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | |
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | a50c8ffb-b1d3-4ffd-b16d-085bc1be6ee5 | 5368709120 | shutdown | 0 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:5b1da3b6-7890-4e54-ac08-9ef12bd50f9e |
/ # io-engine-client replica list

POOL | NAME | UUID | THIN | SHARE | SIZE | CAP | ALLOC | URI | IS_SNAPSHOT | IS_CLONE | SNAP_ANCESTOR_SIZE | CLONE_SNAP_ANCESTOR_SIZE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
alex-cloud-sn-2-pool | 228ae271-673b-4d57-8db8-8a6bfb311f69 | 228ae271-673b-4d57-8db8-8a6bfb311f69 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:228ae271-673b-4d57-8db8-8a6bfb311f69?uuid=228ae271-673b-4d57-8db8-8a6bfb311f69 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | f1168840-bce1-4333-bca7-1170e9a3f045 | f1168840-bce1-4333-bca7-1170e9a3f045 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:f1168840-bce1-4333-bca7-1170e9a3f045?uuid=f1168840-bce1-4333-bca7-1170e9a3f045 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | 84c55ffb-8cf2-4b82-8ccd-779ae1224128 | 84c55ffb-8cf2-4b82-8ccd-779ae1224128 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:84c55ffb-8cf2-4b82-8ccd-779ae1224128?uuid=84c55ffb-8cf2-4b82-8ccd-779ae1224128 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9?uuid=3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | 2af3ef1f-4129-456c-b3e7-d511cde9f58a | 2af3ef1f-4129-456c-b3e7-d511cde9f58a | false | nvmf | 2147483648 | 2147483648 | 2147483648 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:2af3ef1f-4129-456c-b3e7-d511cde9f58a?uuid=2af3ef1f-4129-456c-b3e7-d511cde9f58a | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | false | nvmf | 2147483648 | 2147483648 | 2147483648 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:a8f20aa7-70d7-4474-84e9-7ffcdc190d45?uuid=a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | c000f370-0d65-405f-9d7e-10fa8a6d07aa | c000f370-0d65-405f-9d7e-10fa8a6d07aa | false | nvmf | 2147483648 | 2147483648 | 2147483648 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:c000f370-0d65-405f-9d7e-10fa8a6d07aa?uuid=c000f370-0d65-405f-9d7e-10fa8a6d07aa | false | false | 0 | 0 |
Strange, are there also connection issues between the ha-cluster and ha-node agents?
2024-09-12T16:26:54.739700449+02:00 stdout F 2024-09-12T14:26:54.739611Z ERROR grpc::operations::ha_node::client: error: Unavailable: status: Unavailable, message: "error trying to connect: tcp connect error: No route to host (os error 113)", details: [], metadata: MetadataMap { headers: {} }
2024-09-12T16:26:54.739720687+02:00 stdout F   at control-plane/grpc/src/operations/ha_node/client.rs:99
2024-09-12T16:26:54.739730355+02:00 stdout F 2024-09-12T14:26:54.739628Z  INFO agent_ha_cluster::switchover: Sending failed Switchover request back to the work queue, volume.uuid: b2158a99-0952-45db-b59d-463b3c2b8dd3, error: Nvme path replacement failed: Unavailable: status: Unavailable, message: "error trying to connect: tcp connect error: No route to host (os error 113)", details: [], metadata: MetadataMap { headers: {} }
2024-09-12T16:26:54.739733972+02:00 stdout F   at control-plane/agents/src/bin/ha/cluster/switchover.rs:573
I haven't noticed these errors recently. However, looking around I get indications that it might have something to do with how networking works in the cluster.
@AlexanderDotH Is Cilium configured to use encrypted connections? If yes, could you try disabling encryption and see if you observe better behaviour? --set encryption.enabled=false
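For example, a quick way to check and toggle it (a sketch, assuming Cilium was installed via helm as release cilium into kube-system):

```
# Check whether the agent reports encryption as enabled.
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i encryption

# Turn encryption off (release name and namespace are assumptions; adjust as needed).
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set encryption.enabled=false
```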
Encryption is always disabled, but it's a dual-stack cluster with IPv4 and IPv6 with BGP. I also couldn't observe any packet drops or anything like that. Since I opened the issue there hasn't been a single outage, except today, and most of the degraded pods are postgres (Stackgres) cluster pods. Are many read and write actions an issue? Maybe because it's constantly replicating the WAL files between each replica. Network throughput is not an issue, I guess; I ran multiple network benchmarks and it's always around 600-800 Gib/s. I could optimize it further using native routing but it's too complicated for me.
The 3 storage nodes are providing the entire cluster with storage, is this setup more likely to throw errors and degraded performance? About performance: 4/6 cores on each storage node are dedicated to the io-engine. I also tainted the storage nodes to block any random scheduling on them. (OpenEBS has tolerations so it deploys on the storage nodes.)
From what I can see, the agent-ha-cluster tries to call the agent-ha-node, example node is at: "179.61.253.10:50053"
And we get: connect: tcp connect error: No route to host (os error 113)
Could the dual stack cause this?
The 3 storage nodes are providing the entire cluster with storage, is this setup more likely to throw errors and degraded performance?
Hard to say until we find the root cause. Did you isolate the cores, btw?
-- A simple fix, which won't help here, but was found thanks to these logs: https://github.com/openebs/mayastor/pull/1736
That's also weird. In the past I used tuned for core isolation, but in the newest Kubernetes version I simply had to set it inside the helm command.
I'm not familiar with tuned; I set it up on the kernel boot cmdline.
You can check the isolated cores with:
cat /sys/devices/system/cpu/isolated
Maybe we can also check this from within mayastor and report whether we're isolated or not? @dsharma-dc ?
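For completeness, checking and setting the isolation on a node could look roughly like this (a sketch for RHEL-family distros such as Rocky; isolcpus=2-5 is an assumption matching the io-engine coreList used in this deployment):

```
# What the kernel currently considers isolated (empty means nothing is isolated).
cat /sys/devices/system/cpu/isolated
tr ' ' '\n' < /proc/cmdline | grep -E 'isolcpus|nohz_full' || true

# Add the isolation to the boot cmdline via grubby and reboot (assumed core range).
sudo grubby --update-kernel=ALL --args="isolcpus=2-5"
sudo reboot
```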
-- Ok I think I have the cause for the lockout of the pool: it seems like we tried to delete the replica while it was still on the nexus, and this may cause some deadlocking behaviour:
2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00 INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
2024-09-12T06:54:14.719461813+02:00 stdout F [2024-09-12T04:54:14.719375762+00:00 INFO io_engine::lvs::lvs_lvol:lvs_lvol.rs:247] Lvol 'alex-cloud-sn-2-pool/3a6b6004-dbc8-4613-b316-f1f35fce24e0/4711b421-0210-4db5-b88f-c2c55cac52da' [50.00 GiB]: unshared
2024-09-12T06:54:14.719767776+02:00 stdout F [2024-09-12T04:54:14.719701943+00:00 INFO io_engine::bdev::device:device.rs:785] Received SPDK remove event for bdev '4711b421-0210-4db5-b88f-c2c55cac52da'
2024-09-12T06:54:14.719783986+02:00 stdout F [2024-09-12T04:54:14.719730236+00:00 INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"
2024-09-12T06:54:14.719851202+02:00 stdout F [2024-09-12T04:54:14.719744462+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1113] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [open synced]: unplugging child...
2024-09-12T06:54:14.720345689+02:00 stdout F [2024-09-12T04:54:14.719979192+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:657] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
2024-09-12T06:54:14.720361719+02:00 stdout F [2024-09-12T04:54:14.720206068+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:680] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
2024-09-12T06:54:14.7203678+02:00 stdout F [2024-09-12T04:54:14.720225935+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1157] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [closed synced]: child successfully unplugged
I'll raise a separate ticket for this.
In tuned you can save everything, like the kernel boot cmdline, inside profiles, and also use other tooling with it. To isolate cores you can do this: https://arc.net/l/quote/tkgjmvqz
I looked at my profile and those lines are not present, and the content of /sys/devices/system/cpu/isolated is empty.
Here is the weird part: even though I don't have any isolated cores, the io-engine still uses those cores. (Of course, because I specified them inside the helm deployment, but the OS hasn't isolated them and it still works.):
Attached is how I deploy openebs. openebs.zip
Here are the commands I previously ran to set up OpenEBS. Partitioning is on another slide but I think it's not necessary in this case. https://alex-private.notion.site/4-6-4-OpenEBS-0ade457a4a0343638503dcee0a12a7d6
You can also see the live metrics. Every time the io-engine has a high CPU throttle you can assume it's getting disconnected. I'll keep the user account online until there is a fix for this.
Username: openebs Password: openebs
Here is the weird part: even though I don't have any isolated cores, the io-engine still uses those cores. (Of course, because I specified them inside the helm deployment, but the OS hasn't isolated them and it still works.)
It can use those cores because nothing prevents it from using them. The io-engine pod is not using guaranteed QoS, so even with the static CPU manager policy the allowed core list for the process would be the entire list of cores, AIUI.
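A sketch of how to verify that (pod label and namespace are assumptions for this deployment):

```
# QoS class of the io-engine pods; anything other than "Guaranteed" means the
# static CPU manager will not pin exclusive cores to them.
kubectl -n kube-storage get pods -l app=io-engine \
  -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'

# On a storage node: which cores the io-engine process is actually allowed to run on.
grep Cpus_allowed_list /proc/$(pgrep -f io-engine | head -n1)/status
```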
Btw on the nexus list you did above, did you paste the entire list? The nexus 7a9000f5-8729-4010-9db2-86449fa36f4b for volume 331c0652-0a75-4a2c-8946-3caa0590af06 is missing from that list somehow... I don't see the logs for its destruction, which is odd...
I just went through my logs and found that:
2024-09-12T16:43:53.514425914+02:00 stdout F [2024-09-12T14:43:53.514303915+00:00 INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"
I am also unable to find any errors with the nexus 7a9000f5-8729-4010-9db2-86449fa36f4b, but it is mentioned many times inside the log file.
Full log: 3.log
I just pasted and formatted the list as markdown; I didn't remove anything. Which logs can I provide? The files and the data from right after the failure happened are the most accurate. In the meantime I fixed the faulty volumes by deleting them and rebuilding them using Stackgres.
For nexus 7a9000f5-8729-4010-9db2-86449fa36f4b of volume 331c0652-0a75-4a2c-8946-3caa0590af06, there is an almost 10-hour gap from the time the nexus had no children until the nexus was destroyed at 2024-09-12T16:43:53. But this doesn't explain the original issue.
Why do we deadlock though @tiagolobocastro? How did this destroy call get triggered here, which is typically initiated by the control-plane during child fault and retire? Seems to me like a manual replica destroy attempt?
2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00 INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
For the postgres volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e, I see the volume has been republished, which is why the nexus was shut down on node 179.61.253.33 and republished on node 179.61.253.31. The volume remained degraded for some time because it couldn't reconcile the replica count due to lock contention.
Ah I see it in this new log file now, thank you @AlexanderDotH. @dsharma-dc the reason is explained on the other ticket I raised.
But great: because the nexus is now destroyed, the lockout on the pool is removed.
@AlexanderDotH again I see some intermittent networking failures:
2024-09-13T01:21:56.360287893+02:00 stdout F [2024-09-12T23:21:56.356249814+00:00 ERROR io_engine::subsys::registration::registration_grpc:registration_grpc.rs:228] Registration failed: Status { code: Cancelled, message: "Timeout expired", source: Some(tonic::transport::Error(Transport, TimeoutExpired(()))) }
2024-09-13T01:22:01.362367386+02:00 stdout F [2024-09-12T23:22:01.362239928+00:00 INFO io_engine::subsys::registration::registration_grpc:registration_grpc.rs:219] Re-registered '"alex-cloud-sn-2"' with grpc server 179.61.253.33:10124 ...
No problem :). How can I test the connectivity? Which pods should I ping?
Hey, I saw some packet drops today and thought it would be worth checking on OpenEBS, and it happened again. Some pods got disconnected and nearly all volumes are degraded (from kubectl mayastor get volumes).
Also attached is a broader log from the cluster, including networking, but I couldn't find anything. Do you know anything new @tiagolobocastro ?
Sorry I had to upload it to google drive because the dump is around 200MB. https://drive.google.com/file/d/1hLjFSMxf9JYkqWdiJmMW5o6S12ODM5CH/view?usp=sharing
PR to fix the control-plane locking the pool: https://github.com/openebs/mayastor-control-plane/pull/862
No problem :). How can I test the connectivity? Which pods should I ping?
I'm not sure tbh. @Abhinandan-Purkait any ideas on how to identify connection issues between ha-cluster and ha-node?
I'm also again thinking about the fact this is dual stack; let me see if I can set up a dual-stack cluster and see if I also have any issues there. Currently we already know that we bind only to IPv4 (work for IPv6 is in progress), and I wonder if that could have anything to do with it. FYI - https://github.com/openebs/mayastor/issues/1730
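One rough way to probe the agent-ha-node port from inside the cluster (a sketch, reusing the address from the error above; the throwaway busybox pod is just an illustration):

```
# TCP probe of the agent-ha-node gRPC endpoint seen in the error (179.61.253.10:50053).
kubectl -n kube-storage run netcheck --rm -it --image=busybox --restart=Never -- \
  nc -zv -w 3 179.61.253.10 50053
```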
Thanks for the PR, when will it be available via helm? Also, some benchmarking tools would be great, maybe implemented in the mayastor kubectl plugin? There are many fio tests but no tests between all OpenEBS nodes, and no stress testing.
Other question: how does replica rebuild actually work? Can you maybe force a rebuild?
Hey, the locking PR is now released as part of 2.7.1. The IPv6 PR is still ongoing.
We recently did some benchmarking with CloudNativePG benchmarks, but we don't have any ready-made solution of our own. The community might have something to help here; I remember @kukacz was doing something similar at some point.
For the rebuild: when a volume is published, the nexus automatically copies the data from one replica to another.
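As an illustration of how a rebuild gets triggered: scaling the volume's replica count down and back up makes the nexus build a fresh replica from a healthy one (a sketch, assuming the plugin's scale volume subcommand; the UUID is the degraded postgres volume from earlier in this thread):

```
# Drop to 2 replicas, then back to 3; the new replica is rebuilt from a healthy one.
kubectl mayastor scale volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e 2
kubectl mayastor scale volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e 3

# Watch the rebuild / volume state.
kubectl mayastor get volumes
```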
Thank you! I guess I'll wait until the new release is out.
Describe the bug
Some of my volumes are randomly disconnecting for an unknown reason. I have 3 storage nodes with 6 cores (4 of them dedicated to Mayastor) and 8 GB of RAM. Some volumes are mounted via Mayastor to the worker nodes. After some time, when I'm not looking at my cluster, the volumes disconnect from the pods and leave them in a read-only state. The cluster is a fresh installation of native Kubernetes 1.31.0 with kubeadm. After setup everything works fine, and then after some time it doesn't. The Mayastor csi-node logs also say the volume is published and working.
To Reproduce
helm install openebs --namespace kube-storage openebs/openebs --create-namespace \
  --set mayastor.enabled=true \
  --set mayastor.crds.enabled=true \
  --set mayastor.etcd.clusterDomain=alex-cloud.internal \
  --set engines.local.lvm.enabled=false \
  --set engines.local.zfs.enabled=false \
  --set localprovisioner.enabled=false \
  --set 'mayastor.io_engine.coreList={2,3,4,5}' \
  --set zfs-localpv.localpv.tolerations[0].key=role \
  --set zfs-localpv.localpv.tolerations[0].operator=Equal \
  --set zfs-localpv.localpv.tolerations[0].value=storage \
  --set zfs-localpv.localpv.tolerations[0].effect=NoSchedule \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].key=role \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].operator=Equal \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].value=storage \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].effect=NoSchedule \
  --set mayastor.crds.csi.volumeSnapshots.enabled=false \
  --set mayastor.tolerations[0].key=role \
  --set mayastor.tolerations[0].operator=Equal \
  --set mayastor.tolerations[0].value=storage \
  --set mayastor.tolerations[0].effect=NoSchedule \
  --no-hooks
Setup storage pools from a partition on the storage nodes. (For this just one)
apiVersion: "openebs.io/v1beta2" kind: DiskPool metadata: name: alex-cloud-sn-1-pool namespace: kube-storage spec: node: alex-cloud-sn-1 disks: ["/dev/sda2"]
Setup the storage class.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alex-cloud-default-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  fsType: "ext4"
allowVolumeExpansion: true
provisioner: io.openebs.csi-mayastor
Attach the volume to any pod or deployment.
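For example, a minimal claim against that storage class could look like this (a sketch; the PVC name, namespace, and size are illustrative):

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: default
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 2Gi
  storageClassName: alex-cloud-default-sc
EOF
```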
Expected behavior
Stay connected no matter what happens.
Screenshots
Not really possible, but I can provide logs.
OS info (please complete the following information):
Distro: Rocky Linux 9.4 (Blue Onyx)
Kernel version:
OpenEBS version: Newest from helm (2.7.0)
Additional context
We can also jump on a call or something. This drives me crazy. Here is my discord: @alexdoth