openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

Storage keeps disconnecting #1734

Open AlexanderDotH opened 2 months ago

AlexanderDotH commented 2 months ago

Describe the bug: Some of my volumes are randomly disconnecting for an unknown reason. I have 3 storage nodes with 6 cores each (4 of them dedicated to Mayastor) and 8 GB of RAM. Some volumes are mounted via Mayastor on the worker nodes. After some time, when I'm not looking at my cluster, the volumes disconnect from the pods and leave them in a read-only state. The cluster is a fresh installation of native Kubernetes 1.31.0 set up with kubeadm. After setup everything works fine, and then after some time it doesn't. The Mayastor csi-node logs also say the volume is published and working.

To Reproduce

  1. Create a new Kubernetes cluster.
  2. Set up Cilium for networking.
  3. Install OpenEBS Mayastor using the helm chart.

helm install openebs --namespace kube-storage openebs/openebs --create-namespace \
  --set mayastor.enabled=true \
  --set mayastor.crds.enabled=true \
  --set mayastor.etcd.clusterDomain=alex-cloud.internal \
  --set engines.local.lvm.enabled=false \
  --set engines.local.zfs.enabled=false \
  --set localprovisioner.enabled=false \
  --set 'mayastor.io_engine.coreList={2,3,4,5}' \
  --set zfs-localpv.localpv.tolerations[0].key=role \
  --set zfs-localpv.localpv.tolerations[0].operator=Equal \
  --set zfs-localpv.localpv.tolerations[0].value=storage \
  --set zfs-localpv.localpv.tolerations[0].effect=NoSchedule \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].key=role \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].operator=Equal \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].value=storage \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].effect=NoSchedule \
  --set mayastor.crds.csi.volumeSnapshots.enabled=false \
  --set mayastor.tolerations[0].key=role \
  --set mayastor.tolerations[0].operator=Equal \
  --set mayastor.tolerations[0].value=storage \
  --set mayastor.tolerations[0].effect=NoSchedule \
  --no-hooks

  4. Set up storage pools from a partition on the storage nodes. (For this, just one.)

     apiVersion: "openebs.io/v1beta2"
     kind: DiskPool
     metadata:
       name: alex-cloud-sn-1-pool
       namespace: kube-storage
     spec:
       node: alex-cloud-sn-1
       disks: ["/dev/sda2"]

  5. Set up the storage class.

     apiVersion: storage.k8s.io/v1
     kind: StorageClass
     metadata:
       name: alex-cloud-default-sc
       annotations:
         storageclass.kubernetes.io/is-default-class: "true"
     parameters:
       ioTimeout: "30"
       protocol: nvmf
       repl: "3"
       fsType: "ext4"
     allowVolumeExpansion: true
     provisioner: io.openebs.csi-mayastor

  6. Attach the volume to any pod or deployment (a minimal example is sketched below).
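For step 6, a minimal sketch of attaching a volume, assuming the storage class from step 5; the PVC and pod names here are just placeholders:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: alex-cloud-default-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
EOF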

Expected behavior: The volumes stay connected no matter what happens.

Screenshots: Not really applicable, but I can provide logs.

OS info (please complete the following information):

Additional context: We can also jump on a call or something; this drives me crazy. Here is my Discord: @alexdoth

tiagolobocastro commented 2 months ago

Hi,

Hmm, I wonder if the nvmf connection is dropping or something like that. The logs might help here; would you be able to take a support bundle and upload it here? https://openebs.io/docs/4.0.x/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/advanced-operations/supportability#using-the-supportability-tool
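(A rough sketch of generating one with the kubectl-mayastor plugin; the namespace and output directory are assumptions, and the exact flags may differ between plugin versions:)

# Collect a system-wide support bundle into ./mayastor-bundle
kubectl mayastor dump system -n kube-storage -d ./mayastor-bundle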

Also a small dmesg snippet from around the time when this happens might give some clues as well.
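(Something along these lines should do; the filter terms are only suggestions:)

# Human-readable timestamps, filtered for NVMe/TCP and read-only remount messages
dmesg -T | grep -iE 'nvme|tcp|read-only'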

Thank you

AlexanderDotH commented 2 months ago

Hey! Today at around 11:20 AM my Postgres pod just got disconnected.

There are no dmesg logs from around the time it got disconnected. I can still export the logs from all nodes if you want.

My Kubernetes setup:

Host             Cores  RAM (GB)  Disk (GB)
alex-cloud-mn-1  6      8         60
alex-cloud-wn-1  8      8         50
alex-cloud-wn-2  8      8         50
alex-cloud-wn-3  8      8         50
alex-cloud-wn-4  8      8         50
alex-cloud-sn-1  6      8         380
alex-cloud-sn-2  6      8         380
alex-cloud-sn-3  6      8         380

If you need direct access to Grafana, let me know! I also have some metrics from around 11 AM to 12 PM, and one pod is using more CPU than the others:

[screenshot]

I also found evidence by looking at the volumes: the Postgres volume was degraded, and it was on exactly the same node as shown in the Grafana metrics.

ID REPLICAS TARGET-NODE ACCESSIBILITY STATUS SIZE THIN-PROVISIONED ALLOCATED SNAPSHOTS SOURCE
0a07f4dc-974c-4b39-ba47-f10c51f1fbf3 3 alex-cloud-sn-3 nvmf Online 2GiB false 2GiB 0
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e 3 alex-cloud-sn-1 nvmf Degraded 5GiB false 5GiB 0
613b787b-309f-4102-9829-d4d1674a7f0c 3 alex-cloud-sn-3 nvmf Online 2GiB false 2GiB 0
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 3 alex-cloud-sn-2 nvmf Online 5GiB false 5GiB 0
9a56b118-9713-42f1-bdda-52d89f91aa84 3 alex-cloud-sn-1 nvmf Online 5GiB false 5GiB 0
b2158a99-0952-45db-b59d-463b3c2b8dd3 3 alex-cloud-sn-1 nvmf Online 2GiB false 2GiB 0
cea5219d-143d-4a77-84e4-c4b92f3fcfaa 3 alex-cloud-sn-1 nvmf Online 5GiB false 5GiB 0
84c8620c-fcdf-4fa2-afd3-ac7d4a06143b 3 alex-cloud-sn-1 nvmf Online 5GiB false 5GiB 0
d4a61347-db11-414c-aac6-1710162ae357 3 alex-cloud-sn-3 nvmf Online 5GiB false 5GiB 0
8f438727-f1a4-4d1c-b6a2-00388e845bd6 3 alex-cloud-sn-3 nvmf Degraded 5GiB false 5GiB 0
15bb5925-cc23-44b2-920e-3ac6d5ec6929 3 alex-cloud-sn-1 nvmf Online 10GiB false 10GiB 0
a46141de-ef66-4cdf-bd1a-3b2dd1c07fbd 3 alex-cloud-sn-3 nvmf Online 8GiB false 8GiB 0
331c0652-0a75-4a2c-8946-3caa0590af06 3 alex-cloud-sn-1 nvmf Online 50GiB false 50GiB 0
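(For completeness, the per-volume detail for the degraded volume could be pulled with something like the following; subcommand names may vary by plugin version:)

# Drill into one of the degraded volumes and its replica topology
kubectl mayastor get volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e
kubectl mayastor get volume-replica-topology 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e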

Here is my dump: mayastor-2024-09-12--14-28-45-UTC.tar.gz

tiagolobocastro commented 2 months ago

Thanks for the bundle!

@dsharma-dc lately I've been seeing these messages pop up, any clue?

2024-09-12T14:46:32.627058782+02:00 stdout F [2024-09-12T12:46:32.611677145+00:00 ERROR mayastor::spdk:tcp.c:2212] The TCP/IP connection is not negotiated
2024-09-12T14:47:02.648815057+02:00 stdout F [2024-09-12T12:47:02.648483952+00:00 ERROR mayastor::spdk:tcp.c:1605] No pdu coming for tqpair=0x561ca17d9570 within 30 seconds

I also see on this bundle:

"gRPC request 'share_replica' for 'Replica' failed with 'status: AlreadyExists, message: \"Failed to acquire lock for the resource: alex-cloud-sn-2-pool, lock already held\

At around this time, the replica service seems to get stuck:

[2024-09-12T04:54:29.727501359+00:00  WARN io_engine::grpc::v1::replica:replica.rs:83] destroy_replica: gRPC method timed out, args: DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
tiagolobocastro commented 2 months ago

@AlexanderDotH would you be able to exec into the io-engine pod on node sn-2, on the io-engine container, and run:

io-engine-client bdev list
io-engine-client nexus list
io-engine-client replica list
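(A possible way to run these from outside the pod, assuming the io-engine pods carry an app=io-engine label in the kube-storage namespace:)

# Find the io-engine pod scheduled on node sn-2, then run the client inside its io-engine container
POD=$(kubectl -n kube-storage get pods -l app=io-engine \
  --field-selector spec.nodeName=alex-cloud-sn-2 -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client bdev list
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client nexus list
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client replica list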

Thank you

AlexanderDotH commented 2 months ago

Sure! Here is the output:

/ # io-engine-client bdev list
UUID NUM_BLOCKS BLK_SIZE CLAIMED_BY NAME SHARE_URI
613b787b-309f-4102-9829-d4d1674a7f0c 4184030 512 NVMe-oF Target 613b787b-309f-4102-9829-d4d1674a7f0c nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:613b787b-309f-4102-9829-d4d1674a7f0c
f7cec220-a89f-485c-a2b2-80d555d0776f 692060159 512 lvol /dev/sda2 bdev:////dev/sda2
228ae271-673b-4d57-8db8-8a6bfb311f69 10485760 512 NVMe-oF Target 228ae271-673b-4d57-8db8-8a6bfb311f69 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:228ae271-673b-4d57-8db8-8a6bfb311f69
f1168840-bce1-4333-bca7-1170e9a3f045 10485760 512 NVMe-oF Target f1168840-bce1-4333-bca7-1170e9a3f045 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:f1168840-bce1-4333-bca7-1170e9a3f045
84c55ffb-8cf2-4b82-8ccd-779ae1224128 10485760 512 NVMe-oF Target 84c55ffb-8cf2-4b82-8ccd-779ae1224128 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:84c55ffb-8cf2-4b82-8ccd-779ae1224128
3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 10485760 512 NVMe-oF Target 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9
2af3ef1f-4129-456c-b3e7-d511cde9f58a 4194304 512 NVMe-oF Target 2af3ef1f-4129-456c-b3e7-d511cde9f58a nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:2af3ef1f-4129-456c-b3e7-d511cde9f58a
a8f20aa7-70d7-4474-84e9-7ffcdc190d45 4194304 512 NVMe-oF Target a8f20aa7-70d7-4474-84e9-7ffcdc190d45 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:a8f20aa7-70d7-4474-84e9-7ffcdc190d45
c000f370-0d65-405f-9d7e-10fa8a6d07aa 4194304 512 NVMe-oF Target c000f370-0d65-405f-9d7e-10fa8a6d07aa nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:c000f370-0d65-405f-9d7e-10fa8a6d07aa
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 10475486 512 NVMe-oF Target 8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:8fd8ab9e-0aad-4beb-8b48-3715461ec1c9
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e 10475486 512 NVMe-oF Target 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:5b1da3b6-7890-4e54-ac08-9ef12bd50f9e
/ # io-engine-client nexus list
NAME UUID SIZE STATE REBUILDS PATH
613b787b-309f-4102-9829-d4d1674a7f0c 2ab837e5-fdd5-47d2-96de-ad5a8aa4e765 2147483648 shutdown 0 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:613b787b-309f-4102-9829-d4d1674a7f0c
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 9759759c-a0e6-4777-93e7-af6db9bed125 5368709120 shutdown 0 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:8fd8ab9e-0aad-4beb-8b48-3715461ec1c9
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e a50c8ffb-b1d3-4ffd-b16d-085bc1be6ee5 5368709120 shutdown 0 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:5b1da3b6-7890-4e54-ac08-9ef12bd50f9e
/ # io-engine-client replica list
POOL NAME UUID THIN SHARE SIZE CAP ALLOC URI IS_SNAPSHOT IS_CLONE SNAP_ANCESTOR_SIZE CLONE_SNAP_ANCESTOR_SIZE
alex-cloud-sn-2-pool 228ae271-673b-4d57-8db8-8a6bfb311f69 228ae271-673b-4d57-8db8-8a6bfb311f69 false nvmf 5368709120 5368709120 5368709120 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:228ae271-673b-4d57-8db8-8a6bfb311f69?uuid=228ae271-673b-4d57-8db8-8a6bfb311f69 false false 0 0
alex-cloud-sn-2-pool f1168840-bce1-4333-bca7-1170e9a3f045 f1168840-bce1-4333-bca7-1170e9a3f045 false nvmf 5368709120 5368709120 5368709120 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:f1168840-bce1-4333-bca7-1170e9a3f045?uuid=f1168840-bce1-4333-bca7-1170e9a3f045 false false 0 0
alex-cloud-sn-2-pool 84c55ffb-8cf2-4b82-8ccd-779ae1224128 84c55ffb-8cf2-4b82-8ccd-779ae1224128 false nvmf 5368709120 5368709120 5368709120 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:84c55ffb-8cf2-4b82-8ccd-779ae1224128?uuid=84c55ffb-8cf2-4b82-8ccd-779ae1224128 false false 0 0
alex-cloud-sn-2-pool 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 false nvmf 5368709120 5368709120 5368709120 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9?uuid=3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 false false 0 0
alex-cloud-sn-2-pool 2af3ef1f-4129-456c-b3e7-d511cde9f58a 2af3ef1f-4129-456c-b3e7-d511cde9f58a false nvmf 2147483648 2147483648 2147483648 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:2af3ef1f-4129-456c-b3e7-d511cde9f58a?uuid=2af3ef1f-4129-456c-b3e7-d511cde9f58a false false 0 0
alex-cloud-sn-2-pool a8f20aa7-70d7-4474-84e9-7ffcdc190d45 a8f20aa7-70d7-4474-84e9-7ffcdc190d45 false nvmf 2147483648 2147483648 2147483648 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:a8f20aa7-70d7-4474-84e9-7ffcdc190d45?uuid=a8f20aa7-70d7-4474-84e9-7ffcdc190d45 false false 0 0
alex-cloud-sn-2-pool c000f370-0d65-405f-9d7e-10fa8a6d07aa c000f370-0d65-405f-9d7e-10fa8a6d07aa false nvmf 2147483648 2147483648 2147483648 nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:c000f370-0d65-405f-9d7e-10fa8a6d07aa?uuid=c000f370-0d65-405f-9d7e-10fa8a6d07aa false false 0 0
tiagolobocastro commented 2 months ago

Strange. Also, connection issues between the ha-cluster and ha-node agents?

2024-09-12T16:26:54.739700449+02:00 stdout F   2024-09-12T14:26:54.739611Z ERROR grpc::operations::ha_node::client: error: Unavailable: status: Unavailable, message: "error trying to connect: tcp connect error: No route to host (os error 113)", details: [], metadata: MetadataMap { headers: {} }
2024-09-12T16:26:54.739720687+02:00 stdout F     at control-plane/grpc/src/operations/ha_node/client.rs:99
2024-09-12T16:26:54.739725225+02:00 stdout F
2024-09-12T16:26:54.739730355+02:00 stdout F   2024-09-12T14:26:54.739628Z  INFO agent_ha_cluster::switchover: Sending failed Switchover request back to the work queue, volume.uuid: b2158a99-0952-45db-b59d-463b3c2b8dd3, error: Nvme path replacement failed: Unavailable: status: Unavailable, message: "error trying to connect: tcp connect error: No route to host (os error 113)", details: [], metadata: MetadataMap { headers: {} }
2024-09-12T16:26:54.739733972+02:00 stdout F     at control-plane/agents/src/bin/ha/cluster/switchover.rs:573
2024-09-12T16:26:54.739737288+02:00 stdout F
dsharma-dc commented 2 months ago

I haven't noticed these errors recently. However, looking around, I get indications that it might have something to do with how networking works in the cluster. @AlexanderDotH Is Cilium configured to use encrypted connections? If yes, could you try disabling encryption (--set encryption.enabled=false) and see if you observe better behaviour?
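(If Cilium was installed via helm, a sketch of checking and toggling this; the release name and namespace are assumptions:)

# Check whether encryption is currently enabled
cilium config view | grep -i encrypt
# Disable it on the existing installation (release name/namespace assumed)
helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set encryption.enabled=false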

AlexanderDotH commented 2 months ago

Encryption is always disabled, but it's a dual-stack cluster with IPv4 and IPv6 and BGP. I also couldn't observe any packet drops or anything like that. Since I opened the issue there wasn't a single outage until today, and most of the degraded pods are Postgres (StackGres) clusters. Are many read and write operations an issue? Maybe it's because Postgres is constantly replicating the WAL files between the replicas. Network throughput is not an issue, I guess; I ran multiple network benchmarks and it's always around 600-800 Gib/s. I could optimize it further using native routing, but that's too complicated for me. The 3 storage nodes provide the entire cluster with storage; is this setup more likely to throw errors and degrade performance? About performance: 4 of the 6 cores on each storage node are dedicated to the io-engine. I also tainted the storage nodes to block any random scheduling on them (OpenEBS has the matching toleration so it can deploy on the storage nodes).
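(For reference, a node-to-node throughput check between a worker and a storage node, like the benchmarks mentioned above, could look like this; the address is the sn-2 IP seen in the logs and is only an example:)

# On the storage node
iperf3 -s
# From a worker node, 30-second throughput test against the storage node
iperf3 -c 179.61.253.33 -t 30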

tiagolobocastro commented 2 months ago

From what I can see, the agent-ha-cluster tries to call the agent-ha-node (for example the node at "179.61.253.10:50053") and we get: tcp connect error: No route to host (os error 113)
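(A quick reachability check from the node running the agent-ha-cluster might look like this, using the address from the error above:)

# Is the ha-node agent port reachable at all?
nc -zv 179.61.253.10 50053
# Which route would be used to reach that address?
ip route get 179.61.253.10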

Could the dual stack cause this?

The 3 storage nodes are providing the entire cluster with storage, is this setup more likely to throw errors and degraded performance?

Hard to say until we find the root cause. Did you isolate the cores, btw?

-- A simple fix (which won't help here) for something found in these logs: https://github.com/openebs/mayastor/pull/1736

AlexanderDotH commented 2 months ago

That's also weird. In the past I used tuned for core isolation, but in the newest Kubernetes version I simply had to set it inside the helm command.

tiagolobocastro commented 2 months ago

I'm not familiar with tuned; I set it up on the kernel boot cmdline. You can check the isolated cores with: cat /sys/devices/system/cpu/isolated

Maybe we can also check this from within mayastor and report whether we're isolated or not? @dsharma-dc ?
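(For context, a rough sketch of isolating cores 2-5 on the kernel cmdline via GRUB; the file path and the update command depend on the distro:)

# In /etc/default/grub, append isolcpus to the kernel cmdline
# (the core list should match mayastor.io_engine.coreList)
GRUB_CMDLINE_LINUX="... isolcpus=2-5"
# Then regenerate the grub config and reboot (Debian/Ubuntu style)
update-grub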

-- Ok, I think I have the cause for the lockout of the pool: it seems we try to delete the replica while it is still on the nexus, and this may cause some deadlocking behaviour:

2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00  INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
2024-09-12T06:54:14.719461813+02:00 stdout F [2024-09-12T04:54:14.719375762+00:00  INFO io_engine::lvs::lvs_lvol:lvs_lvol.rs:247] Lvol 'alex-cloud-sn-2-pool/3a6b6004-dbc8-4613-b316-f1f35fce24e0/4711b421-0210-4db5-b88f-c2c55cac52da' [50.00 GiB]: unshared
2024-09-12T06:54:14.719767776+02:00 stdout F [2024-09-12T04:54:14.719701943+00:00  INFO io_engine::bdev::device:device.rs:785] Received SPDK remove event for bdev '4711b421-0210-4db5-b88f-c2c55cac52da'
2024-09-12T06:54:14.719783986+02:00 stdout F [2024-09-12T04:54:14.719730236+00:00  INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"
2024-09-12T06:54:14.719851202+02:00 stdout F [2024-09-12T04:54:14.719744462+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1113] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [open synced]: unplugging child...
2024-09-12T06:54:14.720345689+02:00 stdout F [2024-09-12T04:54:14.719979192+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:657] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
2024-09-12T06:54:14.720361719+02:00 stdout F [2024-09-12T04:54:14.720206068+00:00  INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:680] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
2024-09-12T06:54:14.7203678+02:00 stdout F [2024-09-12T04:54:14.720225935+00:00  INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1157] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [closed synced]: child successfully unplugged

I'll raise a separate ticket for this.

AlexanderDotH commented 2 months ago

In tuned you can save everything, including the kernel boot cmdline, inside profiles, and also use other tooling with it. To isolate cores you can do this: https://arc.net/l/quote/tkgjmvqz

I looked at my profile and those lines are not present; the content of /sys/devices/system/cpu/isolated is also empty.

[screenshot]

Here is the weird part: even though I don't isolate any cores, the io-engine uses those cores. (Of course, because I specified them inside the helm deployment, but the OS hasn't set aside those cores and it still works.)

[screenshot]

Attached is how I deploy openebs. openebs.zip

Here are the commands I previously ran to set up OpenEBS. Partitioning is on another slide, but I think it's not necessary in this case. https://alex-private.notion.site/4-6-4-OpenEBS-0ade457a4a0343638503dcee0a12a7d6

AlexanderDotH commented 2 months ago

You can also see the live metrics. Every time the io-engine has high CPU throttling, you can assume it's getting disconnected. I'll keep the user account online until there is a fix for this.

Username: openebs Password: openebs

https://grafana.dasprojekt.haus/d/k8s_views_ns_public/kubernetes-views-namespaces?orgId=1&refresh=30s

tiagolobocastro commented 2 months ago

Here is the weird part: even though I don't isolate any cores, the io-engine uses those cores. (Of course, because I specified them inside the helm deployment, but the OS hasn't set aside those cores and it still works.)

It can use those cores because nothing prevents it from using them. The io-engine pod is not using guaranteed QoS, so even with the static CPU manager policy the allowed core list for the process would be the entire list of cores, AIUI.
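(One way to confirm this is to check the pod's QoS class and the CPUs the process is actually allowed to use; the pod name is a placeholder:)

# QoS class of the io-engine pod (expect Burstable unless requests == limits for all resources)
kubectl -n kube-storage get pod <io-engine-pod> -o jsonpath='{.status.qosClass}{"\n"}'
# CPUs the io-engine process may run on, read from inside the container
kubectl -n kube-storage exec <io-engine-pod> -c io-engine -- grep Cpus_allowed_list /proc/1/status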

Btw, on the nexus list you did above, did you paste the entire list? The nexus 7a9000f5-8729-4010-9db2-86449fa36f4b for volume 331c0652-0a75-4a2c-8946-3caa0590af06 is somehow missing from that list... I don't see the logs for its destruction either, which is odd.

AlexanderDotH commented 2 months ago

I just went through my logs and found this:

2024-09-12T16:43:53.514425914+02:00 stdout F [2024-09-12T14:43:53.514303915+00:00 INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"

I am also unable to find any errors with the nexus 7a9000f5-8729-4010-9db2-86449fa36f4b, but it is mentioned many times inside the log file.

Full log: 3.log

I just pasted and formatted the list as markdown; I didn't remove anything. Which logs can I provide? The files and the data from right after the failure happened are the most accurate. In the meantime I fixed the faulty volumes by deleting them and rebuilding them using StackGres.

dsharma-dc commented 2 months ago

For nexus 7a9000f5-8729-4010-9db2-86449fa36f4b of volume 331c0652-0a75-4a2c-8946-3caa0590af06, there is an almost 10-hour gap from the time the nexus had no child until the nexus is destroyed at 2024-09-12T16:43:53. But this doesn't explain the original issue. Why do we deadlock, though, @tiagolobocastro? And how did this destroy call get triggered here? It is typically initiated by the control-plane during child fault and retire; it seems to me like a manual replica destroy attempt.

2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00 INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }

For the Postgres volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e, I see the volume got republished, which is why the nexus was shut down on node 179.61.253.33 and republished on node 179.61.253.31. The volume remained degraded for some time because it couldn't reconcile the replica count due to lock contention.

tiagolobocastro commented 2 months ago

Ah, I see it in this new log file now, thank you @AlexanderDotH. @dsharma-dc the reason is explained in the other ticket I raised.

But great: because the nexus is now destroyed, the lockout on the pool is removed.

@AlexanderDotH again I see some intermittent networking failures:

2024-09-13T01:21:56.360287893+02:00 stdout F [2024-09-12T23:21:56.356249814+00:00 ERROR io_engine::subsys::registration::registration_grpc:registration_grpc.rs:228] Registration failed: Status { code: Cancelled, message: "Timeout expired", source: Some(tonic::transport::Error(Transport, TimeoutExpired(()))) }
2024-09-13T01:22:01.362367386+02:00 stdout F [2024-09-12T23:22:01.362239928+00:00  INFO io_engine::subsys::registration::registration_grpc:registration_grpc.rs:219] Re-registered '"alex-cloud-sn-2"' with grpc server 179.61.253.33:10124 ...
AlexanderDotH commented 2 months ago

No problem :). How can I test the connectivity? Which pods should I ping?

AlexanderDotH commented 2 months ago

Hey, I saw some packet drops today and thought it would be worth checking on OpenEBS, and it happened again. Some pods got disconnected and nearly all volumes are degraded (from kubectl mayastor get volumes).

[screenshot]

Also attached is a broader log dump from the cluster, including networking, but I couldn't find anything. Do you know anything new, @tiagolobocastro?

Sorry I had to upload it to google drive because the dump is around 200MB. https://drive.google.com/file/d/1hLjFSMxf9JYkqWdiJmMW5o6S12ODM5CH/view?usp=sharing

tiagolobocastro commented 2 months ago

PR to fix the control-plane locking the pool: https://github.com/openebs/mayastor-control-plane/pull/862

tiagolobocastro commented 2 months ago

No problem :). How can I test the connectivity? Which pods should I ping?

I'm not sure tbh. @Abhinandan-Purkait any ideas on how to identify connection issues between ha-cluster and ha-node?

I'm also again thinking about the fact that this is dual stack; let me see if I can set up a dual-stack cluster and check whether I also hit any issues there. We already know that we currently bind only to IPv4 (work for IPv6 is in progress), and I wonder if that could have anything to do with it. FYI - https://github.com/openebs/mayastor/issues/1730

AlexanderDotH commented 2 months ago

Thanks for the PR! When will it be available via helm? Also, some benchmarking tools would be great, maybe implemented in the mayastor kubectl plugin? There are many fio tests, but no tests between all OpenEBS nodes and no stress testing.

Another question: how does replica rebuild actually work? Can you maybe force a rebuild?

tiagolobocastro commented 2 months ago

Hey, the locking PR is now released as part of 2.7.1. The IPv6 PR is still ongoing.

We recently did some benchmarking with CloudNativePG benchmarks, but we don't have a ready-made solution of our own. The community might have something to help here; I remember @kukacz was doing something similar at some point.

For the rebuild, when a volume is published the nexus automatically copies the data from one replica to another.
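(Progress can be followed with the plugin command already used earlier in this thread, for example:)

# Degraded volumes should transition back to Online once their replicas finish rebuilding
watch -n 10 kubectl mayastor get volumes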

AlexanderDotH commented 2 months ago

Thank you! I guess I'll wait until the new release is out.