AlexanderDotH opened this issue 2 months ago
Hi,
Hmm, I wonder if the nvmf connection is dropping or something like that. The logs might provide some help here; would you be able to take a support bundle and upload it here? https://openebs.io/docs/4.0.x/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/advanced-operations/supportability#using-the-supportability-tool
Also a small dmesg snippet from around the time when this happens might give some clues as well.
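For reference, the support bundle is usually taken with the kubectl-mayastor plugin, roughly like this (a sketch; the namespace and output directory are assumptions, see the linked docs for the exact flags):

```
# Collects logs, k8s resources and io-engine state into a tarball (sketch).
kubectl mayastor dump system -n kube-storage -d ./mayastor-support-bundle
```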
Thank you
Hey! Today at around 11:20 AM my postgres pod just got disconnected.
There are no dmesg logs from around the time when it got disconnected. I can still export the logs from all nodes if you want.
My Kubernetes setup:
Host | Cores | RAM (GB) | Disk (GB) |
---|---|---|---|
alex-cloud-mn-1 | 6 | 8 | 60 |
alex-cloud-wn-1 | 8 | 8 | 50 |
alex-cloud-wn-2 | 8 | 8 | 50 |
alex-cloud-wn-3 | 8 | 8 | 50 |
alex-cloud-wn-4 | 8 | 8 | 50 |
alex-cloud-sn-1 | 6 | 8 | 380 |
alex-cloud-sn-2 | 6 | 8 | 380 |
alex-cloud-sn-3 | 6 | 8 | 380 |
If you need direct access to Grafana, let me know! I also have some metrics from 11 AM to 12 PM, and one pod is using more CPU than the others:
I also found evidence by looking at the volumes: the postgres volume was degraded, and it was on exactly the same node as shown in the Grafana metrics.
ID | REPLICAS | TARGET-NODE | ACCESSIBILITY | STATUS | SIZE | THIN-PROVISIONED | ALLOCATED | SNAPSHOTS | SOURCE |
---|---|---|---|---|---|---|---|---|---|
0a07f4dc-974c-4b39-ba47-f10c51f1fbf3 | 3 | alex-cloud-sn-3 | nvmf | Online | 2GiB | false | 2GiB | 0 | |
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | 3 | alex-cloud-sn-1 | nvmf | Degraded | 5GiB | false | 5GiB | 0 | |
613b787b-309f-4102-9829-d4d1674a7f0c | 3 | alex-cloud-sn-3 | nvmf | Online | 2GiB | false | 2GiB | 0 | |
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | 3 | alex-cloud-sn-2 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
9a56b118-9713-42f1-bdda-52d89f91aa84 | 3 | alex-cloud-sn-1 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
b2158a99-0952-45db-b59d-463b3c2b8dd3 | 3 | alex-cloud-sn-1 | nvmf | Online | 2GiB | false | 2GiB | 0 | |
cea5219d-143d-4a77-84e4-c4b92f3fcfaa | 3 | alex-cloud-sn-1 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
84c8620c-fcdf-4fa2-afd3-ac7d4a06143b | 3 | alex-cloud-sn-1 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
d4a61347-db11-414c-aac6-1710162ae357 | 3 | alex-cloud-sn-3 | nvmf | Online | 5GiB | false | 5GiB | 0 | |
8f438727-f1a4-4d1c-b6a2-00388e845bd6 | 3 | alex-cloud-sn-3 | nvmf | Degraded | 5GiB | false | 5GiB | 0 | |
15bb5925-cc23-44b2-920e-3ac6d5ec6929 | 3 | alex-cloud-sn-1 | nvmf | Online | 10GiB | false | 10GiB | 0 | |
a46141de-ef66-4cdf-bd1a-3b2dd1c07fbd | 3 | alex-cloud-sn-3 | nvmf | Online | 8GiB | false | 8GiB | 0 | |
331c0652-0a75-4a2c-8946-3caa0590af06 | 3 | alex-cloud-sn-1 | nvmf | Online | 50GiB | false | 50GiB | 0 |
Here is my dump: mayastor-2024-09-12--14-28-45-UTC.tar.gz
Thanks for the bundle!
@dsharma-dc lately I've been seeing these messages pop up, any clue?
2024-09-12T14:46:32.627058782+02:00 stdout F [2024-09-12T12:46:32.611677145+00:00 ERROR mayastor::spdk:tcp.c:2212] The TCP/IP connection is not negotiated
2024-09-12T14:47:02.648815057+02:00 stdout F [2024-09-12T12:47:02.648483952+00:00 ERROR mayastor::spdk:tcp.c:1605] No pdu coming for tqpair=0x561ca17d9570 within 30 seconds
I also see on this bundle:
"gRPC request 'share_replica' for 'Replica' failed with 'status: AlreadyExists, message: \"Failed to acquire lock for the resource: alex-cloud-sn-2-pool, lock already held\
At around this time, the replica service seems to get stuck:
[2024-09-12T04:54:29.727501359+00:00 WARN io_engine::grpc::v1::replica:replica.rs:83] destroy_replica: gRPC method timed out, args: DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
@AlexanderDotH would you be able to exec into the io-engine pod on node sn-2, into the io-engine container, and run:
io-engine-client bdev list
io-engine-client nexus list
io-engine-client replica list
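For reference, a sketch of running these from outside the node (the kube-storage namespace and the app=io-engine label are assumptions based on this deployment):

```
# Find the io-engine pod scheduled on node sn-2 (label/namespace assumed).
POD=$(kubectl -n kube-storage get pods -l app=io-engine \
  --field-selector spec.nodeName=alex-cloud-sn-2 \
  -o jsonpath='{.items[0].metadata.name}')

# Run the io-engine-client queries inside the io-engine container.
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client bdev list
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client nexus list
kubectl -n kube-storage exec -it "$POD" -c io-engine -- io-engine-client replica list
```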
Thank you
Sure! Here is the output:
/ # io-engine-client bdev list

UUID | NUM_BLOCKS | BLK_SIZE | CLAIMED_BY | NAME | SHARE_URI |
---|---|---|---|---|---|---|
613b787b-309f-4102-9829-d4d1674a7f0c | 4184030 | 512 | NVMe-oF Target | 613b787b-309f-4102-9829-d4d1674a7f0c | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:613b787b-309f-4102-9829-d4d1674a7f0c | |
f7cec220-a89f-485c-a2b2-80d555d0776f | 692060159 | 512 | lvol | /dev/sda2 | bdev:////dev/sda2 | |
228ae271-673b-4d57-8db8-8a6bfb311f69 | 10485760 | 512 | NVMe-oF Target | 228ae271-673b-4d57-8db8-8a6bfb311f69 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:228ae271-673b-4d57-8db8-8a6bfb311f69 | |
f1168840-bce1-4333-bca7-1170e9a3f045 | 10485760 | 512 | NVMe-oF Target | f1168840-bce1-4333-bca7-1170e9a3f045 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:f1168840-bce1-4333-bca7-1170e9a3f045 | |
84c55ffb-8cf2-4b82-8ccd-779ae1224128 | 10485760 | 512 | NVMe-oF Target | 84c55ffb-8cf2-4b82-8ccd-779ae1224128 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:84c55ffb-8cf2-4b82-8ccd-779ae1224128 | |
3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | 10485760 | 512 | NVMe-oF Target | 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | |
2af3ef1f-4129-456c-b3e7-d511cde9f58a | 4194304 | 512 | NVMe-oF Target | 2af3ef1f-4129-456c-b3e7-d511cde9f58a | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:2af3ef1f-4129-456c-b3e7-d511cde9f58a | |
a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | 4194304 | 512 | NVMe-oF Target | a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | |
c000f370-0d65-405f-9d7e-10fa8a6d07aa | 4194304 | 512 | NVMe-oF Target | c000f370-0d65-405f-9d7e-10fa8a6d07aa | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:c000f370-0d65-405f-9d7e-10fa8a6d07aa | |
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | 10475486 | 512 | NVMe-oF Target | 8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | |
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | 10475486 | 512 | NVMe-oF Target | 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:5b1da3b6-7890-4e54-ac08-9ef12bd50f9e |
/ # io-engine-client nexus list

NAME | UUID | SIZE | STATE | REBUILDS | PATH |
---|---|---|---|---|---|---|
613b787b-309f-4102-9829-d4d1674a7f0c | 2ab837e5-fdd5-47d2-96de-ad5a8aa4e765 | 2147483648 | shutdown | 0 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:613b787b-309f-4102-9829-d4d1674a7f0c | |
8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | 9759759c-a0e6-4777-93e7-af6db9bed125 | 5368709120 | shutdown | 0 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:8fd8ab9e-0aad-4beb-8b48-3715461ec1c9 | |
5b1da3b6-7890-4e54-ac08-9ef12bd50f9e | a50c8ffb-b1d3-4ffd-b16d-085bc1be6ee5 | 5368709120 | shutdown | 0 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:5b1da3b6-7890-4e54-ac08-9ef12bd50f9e |
/ # io-engine-client replica list

POOL | NAME | UUID | THIN | SHARE | SIZE | CAP | ALLOC | URI | IS_SNAPSHOT | IS_CLONE | SNAP_ANCESTOR_SIZE | CLONE_SNAP_ANCESTOR_SIZE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
alex-cloud-sn-2-pool | 228ae271-673b-4d57-8db8-8a6bfb311f69 | 228ae271-673b-4d57-8db8-8a6bfb311f69 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:228ae271-673b-4d57-8db8-8a6bfb311f69?uuid=228ae271-673b-4d57-8db8-8a6bfb311f69 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | f1168840-bce1-4333-bca7-1170e9a3f045 | f1168840-bce1-4333-bca7-1170e9a3f045 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:f1168840-bce1-4333-bca7-1170e9a3f045?uuid=f1168840-bce1-4333-bca7-1170e9a3f045 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | 84c55ffb-8cf2-4b82-8ccd-779ae1224128 | 84c55ffb-8cf2-4b82-8ccd-779ae1224128 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:84c55ffb-8cf2-4b82-8ccd-779ae1224128?uuid=84c55ffb-8cf2-4b82-8ccd-779ae1224128 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | 3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | false | nvmf | 5368709120 | 5368709120 | 5368709120 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9?uuid=3e6a9b9f-7da4-498a-96b9-11ebe8fd14f9 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | 2af3ef1f-4129-456c-b3e7-d511cde9f58a | 2af3ef1f-4129-456c-b3e7-d511cde9f58a | false | nvmf | 2147483648 | 2147483648 | 2147483648 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:2af3ef1f-4129-456c-b3e7-d511cde9f58a?uuid=2af3ef1f-4129-456c-b3e7-d511cde9f58a | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | false | nvmf | 2147483648 | 2147483648 | 2147483648 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:a8f20aa7-70d7-4474-84e9-7ffcdc190d45?uuid=a8f20aa7-70d7-4474-84e9-7ffcdc190d45 | false | false | 0 | 0 | |
alex-cloud-sn-2-pool | c000f370-0d65-405f-9d7e-10fa8a6d07aa | c000f370-0d65-405f-9d7e-10fa8a6d07aa | false | nvmf | 2147483648 | 2147483648 | 2147483648 | nvmf://179.61.253.33:8420/nqn.2019-05.io.openebs:c000f370-0d65-405f-9d7e-10fa8a6d07aa?uuid=c000f370-0d65-405f-9d7e-10fa8a6d07aa | false | false | 0 | 0 |
Strange, are there also connection issues between the ha-cluster and ha-node agents?
2024-09-12T16:26:54.739700449+02:00 stdout F 2024-09-12T14:26:54.739611Z ERROR grpc::operations::ha_node::client: error: Unavailable: status: Unavailable, message: "error trying to connect: tcp connect error: No route to host (os error 113)", details: [], metadata: MetadataMap { headers: {} }
2024-09-12T16:26:54.739720687+02:00 stdout F   at control-plane/grpc/src/operations/ha_node/client.rs:99
2024-09-12T16:26:54.739730355+02:00 stdout F 2024-09-12T14:26:54.739628Z  INFO agent_ha_cluster::switchover: Sending failed Switchover request back to the work queue, volume.uuid: b2158a99-0952-45db-b59d-463b3c2b8dd3, error: Nvme path replacement failed: Unavailable: status: Unavailable, message: "error trying to connect: tcp connect error: No route to host (os error 113)", details: [], metadata: MetadataMap { headers: {} }
2024-09-12T16:26:54.739733972+02:00 stdout F   at control-plane/agents/src/bin/ha/cluster/switchover.rs:573
I haven't noticed these errors recently. However, looking around I get indications that it might have something to do with how networking works in the cluster.
@AlexanderDotH Is Cilium configured to use encrypted connections? If yes, could you try disabling encryption and see if you observe better behaviour? --set encryption.enabled=false
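For example, a quick way to check and toggle it (a sketch, assuming Cilium was installed via helm as release cilium into kube-system):

```
# Check whether the agent reports encryption as enabled.
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i encryption

# Turn encryption off (release name and namespace are assumptions; adjust as needed).
helm upgrade cilium cilium/cilium -n kube-system --reuse-values \
  --set encryption.enabled=false
```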
Encryption is always disabled, but it's a dual-stack cluster with IPv4 and IPv6 with BGP. I also couldn't observe any packet drops or anything like that. Since I opened the issue there hasn't been a single outage, except today, and most of the degraded pods are postgres (Stackgres) cluster pods. Are many read and write actions an issue? Maybe because it's constantly replicating the WAL files between each replica. Network throughput is not an issue, I guess; I ran multiple network benchmarks and it's always around 600-800 Gib/s. I could optimize it further using native routing but it's too complicated for me.
The 3 storage nodes are providing the entire cluster with storage, is this setup more likely to throw errors and degraded performance? About performance: 4/6 cores on each storage node are dedicated to the io-engine. I also tainted the storage nodes to block any random scheduling on them. (OpenEBS has tolerations so it deploys on the storage nodes.)
From what I can see, the agent-ha-cluster tries to call the agent-ha-node, example node is at: "179.61.253.10:50053"
And we get: connect: tcp connect error: No route to host (os error 113)
Could the dual stack cause this?
The 3 storage nodes are providing the entire cluster with storage, is this setup more likely to throw errors and degraded performance?
Hard to say until we find the root cause. Did you isolate the cores, btw?
-- A simple fix, which won't help here, but was found thanks to these logs: https://github.com/openebs/mayastor/pull/1736
That's also weird. In the past I used tuned for core isolation, but in the newest Kubernetes version I simply had to set it inside the helm command.
I'm not familiar with tuned; I set it up on the kernel boot cmdline.
You can check the isolated cores with:
cat /sys/devices/system/cpu/isolated
Maybe we can also check this from within mayastor and report whether we're isolated or not? @dsharma-dc ?
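For completeness, checking and setting the isolation on a node could look roughly like this (a sketch for RHEL-family distros such as Rocky; isolcpus=2-5 is an assumption matching the io-engine coreList used in this deployment):

```
# What the kernel currently considers isolated (empty means nothing is isolated).
cat /sys/devices/system/cpu/isolated
tr ' ' '\n' < /proc/cmdline | grep -E 'isolcpus|nohz_full' || true

# Add the isolation to the boot cmdline via grubby and reboot (assumed core range).
sudo grubby --update-kernel=ALL --args="isolcpus=2-5"
sudo reboot
```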
-- Ok I think I have the cause for the lockout of the pool: it seems like we tried to delete the replica while it was still on the nexus, and this may cause some deadlocking behaviour:
2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00 INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
2024-09-12T06:54:14.719461813+02:00 stdout F [2024-09-12T04:54:14.719375762+00:00 INFO io_engine::lvs::lvs_lvol:lvs_lvol.rs:247] Lvol 'alex-cloud-sn-2-pool/3a6b6004-dbc8-4613-b316-f1f35fce24e0/4711b421-0210-4db5-b88f-c2c55cac52da' [50.00 GiB]: unshared
2024-09-12T06:54:14.719767776+02:00 stdout F [2024-09-12T04:54:14.719701943+00:00 INFO io_engine::bdev::device:device.rs:785] Received SPDK remove event for bdev '4711b421-0210-4db5-b88f-c2c55cac52da'
2024-09-12T06:54:14.719783986+02:00 stdout F [2024-09-12T04:54:14.719730236+00:00 INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"
2024-09-12T06:54:14.719851202+02:00 stdout F [2024-09-12T04:54:14.719744462+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1113] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [open synced]: unplugging child...
2024-09-12T06:54:14.720345689+02:00 stdout F [2024-09-12T04:54:14.719979192+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:657] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels...
2024-09-12T06:54:14.720361719+02:00 stdout F [2024-09-12T04:54:14.720206068+00:00 INFO io_engine::bdev::nexus::nexus_bdev:nexus_bdev.rs:680] Nexus '331c0652-0a75-4a2c-8946-3caa0590af06' [open]: dynamic reconfiguration event: unplug, reconfiguring I/O channels completed with result: Ok
2024-09-12T06:54:14.7203678+02:00 stdout F [2024-09-12T04:54:14.720225935+00:00 INFO io_engine::bdev::nexus::nexus_child:nexus_child.rs:1157] Child 'bdev:///4711b421-0210-4db5-b88f-c2c55cac52da?uuid=4711b421-0210-4db5-b88f-c2c55cac52da @ 331c0652-0a75-4a2c-8946-3caa0590af06' [closed synced]: child successfully unplugged
I'll raise a separate ticket for this.
In tuned you can save everything, like the kernel boot cmdline, inside profiles, and also use other tooling with it. To isolate cores you can do this: https://arc.net/l/quote/tkgjmvqz
I looked at my profile and those lines are not present, and the content of /sys/devices/system/cpu/isolated is empty.
Here is the weird part: even though I don't have any isolated cores, the io-engine still uses those cores. (Of course, because I specified them inside the helm deployment, but the OS hasn't isolated them and it still works.):
Attached is how I deploy openebs. openebs.zip
Here are the commands I previously ran to set up OpenEBS. Partitioning is on another slide but I think it's not necessary in this case. https://alex-private.notion.site/4-6-4-OpenEBS-0ade457a4a0343638503dcee0a12a7d6
You can also see the live metrics. Every time the io-engine has a high CPU throttle you can assume it's getting disconnected. I'll keep the user account online until there is a fix for this.
Username: openebs Password: openebs
Here is the weird part: even though I don't have any isolated cores, the io-engine still uses those cores. (Of course, because I specified them inside the helm deployment, but the OS hasn't isolated them and it still works.)
It can use those cores because nothing prevents it from using them. The io-engine pod is not using guaranteed QoS, so even with the static CPU manager policy the allowed core list for the process would be the entire list of cores, AIUI.
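A sketch of how to verify that (pod label and namespace are assumptions for this deployment):

```
# QoS class of the io-engine pods; anything other than "Guaranteed" means the
# static CPU manager will not pin exclusive cores to them.
kubectl -n kube-storage get pods -l app=io-engine \
  -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'

# On a storage node: which cores the io-engine process is actually allowed to run on.
grep Cpus_allowed_list /proc/$(pgrep -f io-engine | head -n1)/status
```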
Btw on the nexus list you did above, did you paste the entire list? The nexus 7a9000f5-8729-4010-9db2-86449fa36f4b for volume 331c0652-0a75-4a2c-8946-3caa0590af06 is missing from that list somehow... I don't see the logs for its destruction, which is odd...
I just went through my logs and found that:
2024-09-12T16:43:53.514425914+02:00 stdout F [2024-09-12T14:43:53.514303915+00:00 INFO io_engine::bdev::nexus::nexus_bdev_children:nexus_bdev_children.rs:899] Unplugging nexus child device nexus_name="331c0652-0a75-4a2c-8946-3caa0590af06" child_device="4711b421-0210-4db5-b88f-c2c55cac52da"
I am also unable to find any errors with the nexus 7a9000f5-8729-4010-9db2-86449fa36f4b, but it is mentioned many times inside the log file.
Full log: 3.log
I just pasted and formatted the list as markdown; I didn't remove anything. Which logs can I provide? The files and the data from right after the failure happened are the most accurate. In the meantime I fixed the faulty volumes by deleting them and rebuilding them using Stackgres.
For nexus 7a9000f5-8729-4010-9db2-86449fa36f4b of volume 331c0652-0a75-4a2c-8946-3caa0590af06, there is an almost 10-hour gap from the time the nexus had no children until the nexus was destroyed at 2024-09-12T16:43:53. But this doesn't explain the original issue.
Why do we deadlock though @tiagolobocastro? How did this destroy call get triggered here, which is typically initiated by the control-plane during child fault and retire? Seems to me like a manual replica destroy attempt?
2024-09-12T06:54:14.717272539+02:00 stdout F [2024-09-12T04:54:14.717133800+00:00 INFO io_engine::grpc::v1::replica:replica.rs:402] DestroyReplicaRequest { uuid: "4711b421-0210-4db5-b88f-c2c55cac52da", pool: Some(PoolName("alex-cloud-sn-2-pool")) }
For the postgres volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e, I see the volume has been republished, which is why the nexus was shut down on node 179.61.253.33 and republished on node 179.61.253.31. The volume remained degraded for some time because it couldn't reconcile the replica count due to lock contention.
Ah I see it in this new log file now, thank you @AlexanderDotH. @dsharma-dc the reason is explained on the other ticket I raised.
But great: because the nexus is now destroyed, the lockout on the pool is removed.
@AlexanderDotH again I see some intermittent networking failures:
2024-09-13T01:21:56.360287893+02:00 stdout F [2024-09-12T23:21:56.356249814+00:00 ERROR io_engine::subsys::registration::registration_grpc:registration_grpc.rs:228] Registration failed: Status { code: Cancelled, message: "Timeout expired", source: Some(tonic::transport::Error(Transport, TimeoutExpired(()))) }
2024-09-13T01:22:01.362367386+02:00 stdout F [2024-09-12T23:22:01.362239928+00:00 INFO io_engine::subsys::registration::registration_grpc:registration_grpc.rs:219] Re-registered '"alex-cloud-sn-2"' with grpc server 179.61.253.33:10124 ...
No problem :). How can I test the connectivity? Which pods should I ping?
Hey, I saw some packet drops today and thought it would be worth checking on OpenEBS, and it happened again. Some pods got disconnected and nearly all volumes are degraded (from kubectl mayastor get volumes).
Also attached is a broader log from the cluster, including networking, but I couldn't find anything. Do you know anything new @tiagolobocastro ?
Sorry I had to upload it to google drive because the dump is around 200MB. https://drive.google.com/file/d/1hLjFSMxf9JYkqWdiJmMW5o6S12ODM5CH/view?usp=sharing
PR to fix the control-plane locking the pool: https://github.com/openebs/mayastor-control-plane/pull/862
No problem :). How can I test the connectivity? Which pods should I ping?
I'm not sure tbh. @Abhinandan-Purkait any ideas on how to identify connection issues between ha-cluster and ha-node?
I'm also again thinking about the fact this is dual stack; let me see if I can set up a dual-stack cluster and see if I also have any issues there. Currently we already know that we bind only to IPv4 (work for IPv6 is in progress), and I wonder if that could have anything to do with it. FYI - https://github.com/openebs/mayastor/issues/1730
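One rough way to probe the agent-ha-node port from inside the cluster (a sketch, reusing the address from the error above; the throwaway busybox pod is just an illustration):

```
# TCP probe of the agent-ha-node gRPC endpoint seen in the error (179.61.253.10:50053).
kubectl -n kube-storage run netcheck --rm -it --image=busybox --restart=Never -- \
  nc -zv -w 3 179.61.253.10 50053
```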
Thanks for the PR, when will it be available via helm? Also, some benchmarking tools would be great, maybe implemented in the mayastor kubectl plugin? There are many fio tests but no tests between all OpenEBS nodes, and no stress testing.
Other question: how does replica rebuild actually work? Can you maybe force a rebuild?
Hey, the locking PR is now released as part of 2.7.1. The IPv6 PR is still ongoing.
We recently did some benchmarking with CloudNativePG benchmarks, but we don't have any ready-made solution of our own. The community might have something to help here; I remember @kukacz was doing something similar at some point.
For the rebuild: when a volume is published, the nexus automatically copies the data from one replica to another.
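As an illustration of how a rebuild gets triggered: scaling the volume's replica count down and back up makes the nexus build a fresh replica from a healthy one (a sketch, assuming the plugin's scale volume subcommand; the UUID is the degraded postgres volume from earlier in this thread):

```
# Drop to 2 replicas, then back to 3; the new replica is rebuilt from a healthy one.
kubectl mayastor scale volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e 2
kubectl mayastor scale volume 5b1da3b6-7890-4e54-ac08-9ef12bd50f9e 3

# Watch the rebuild / volume state.
kubectl mayastor get volumes
```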
Thank you! I guess I'll wait until the new release is out.
Describe the bug
Some of my volumes are randomly disconnecting for an unknown reason. I have 3 storage nodes with 6 cores (4 of them dedicated to Mayastor) and 8 GB of RAM. Some volumes are mounted via Mayastor to the worker nodes. After some time, when I'm not looking at my cluster, the volumes disconnect from the pods and leave them in a read-only state. The cluster is a fresh installation of native Kubernetes 1.31.0 with kubeadm. After setup everything works fine, and then after some time it doesn't. The Mayastor csi-node logs also say the volume is published and working.
To Reproduce
helm install openebs --namespace kube-storage openebs/openebs --create-namespace \
  --set mayastor.enabled=true \
  --set mayastor.crds.enabled=true \
  --set mayastor.etcd.clusterDomain=alex-cloud.internal \
  --set engines.local.lvm.enabled=false \
  --set engines.local.zfs.enabled=false \
  --set localprovisioner.enabled=false \
  --set 'mayastor.io_engine.coreList={2,3,4,5}' \
  --set zfs-localpv.localpv.tolerations[0].key=role \
  --set zfs-localpv.localpv.tolerations[0].operator=Equal \
  --set zfs-localpv.localpv.tolerations[0].value=storage \
  --set zfs-localpv.localpv.tolerations[0].effect=NoSchedule \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].key=role \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].operator=Equal \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].value=storage \
  --set zfs-localpv.zfsController.provisioner.tolerations[0].effect=NoSchedule \
  --set mayastor.crds.csi.volumeSnapshots.enabled=false \
  --set mayastor.tolerations[0].key=role \
  --set mayastor.tolerations[0].operator=Equal \
  --set mayastor.tolerations[0].value=storage \
  --set mayastor.tolerations[0].effect=NoSchedule \
  --no-hooks
Setup storage pools from a partition on the storage nodes. (For this just one)
apiVersion: "openebs.io/v1beta2" kind: DiskPool metadata: name: alex-cloud-sn-1-pool namespace: kube-storage spec: node: alex-cloud-sn-1 disks: ["/dev/sda2"]
Setup the storage class.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alex-cloud-default-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  ioTimeout: "30"
  protocol: nvmf
  repl: "3"
  fsType: "ext4"
allowVolumeExpansion: true
provisioner: io.openebs.csi-mayastor
Attach the volume to any pod or deployment.
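For example, a minimal claim against that storage class could look like this (a sketch; the PVC name, namespace, and size are illustrative):

```
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: default
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 2Gi
  storageClassName: alex-cloud-default-sc
EOF
```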
Expected behavior
Stay connected no matter what happens.
Screenshots
Not really possible, but I can provide logs.
OS info (please complete the following information):
Distro: Rocky Linux 9.4 (Blue Onyx)
Kernel version:
OpenEBS version: Newest from helm (2.7.0)
Additional context
We can also jump on a call or something. This drives me crazy. Here is my discord: @alexdoth