dmrub opened 4 months ago
I also see errors in the dmesg output on the k8s-m1 node:
[Mo Jun 3 12:00:36 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:36 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:36 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:36 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:37 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:37 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:37 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:37 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:38 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:38 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:41 2024] net_ratelimit: 8 callbacks suppressed
[Mo Jun 3 12:00:41 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:41 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:42 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:42 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:42 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:42 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:43 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:43 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:43 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:44 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:47 2024] net_ratelimit: 7 callbacks suppressed
[Mo Jun 3 12:00:47 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun 3 12:00:47 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun 3 12:00:47 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
10.96.29.122 is the IP of the linstor-controller service:
$ kubectl get svc -A -o wide | grep -F 10.96.29.122
piraeus-datastore linstor-controller ClusterIP 10.96.29.122 <none> 3371/TCP,3370/TCP 28d app.kubernetes.io/component=linstor-controller,app.kubernetes.io/instance=linstorcluster,app.kubernetes.io/name=piraeus-datastore
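The "no destination available" messages mean kube-proxy (in IPVS mode) still has the service VIP programmed but no healthy backend behind it, which is consistent with the linstor-controller pod not being Ready at the time. A hedged way to confirm this (assuming `ipvsadm` is installed on the node; the namespace and service name are taken from the svc listing above):

```shell
# Guarded so the snippet is a harmless no-op where the tools are absent.
SVC_IP=10.96.29.122   # ClusterIP of linstor-controller, from the svc listing above
if command -v kubectl >/dev/null 2>&1; then
  # "<none>" in the ENDPOINTS column means IPVS has no real servers to forward to
  kubectl get endpoints -n piraeus-datastore linstor-controller -o wide || true
fi
if command -v ipvsadm >/dev/null 2>&1; then
  # List the IPVS virtual server and its (possibly empty) real-server table
  sudo ipvsadm -Ln -t "$SVC_IP:3371" || true
fi
```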
Something is still using the resource on node m2, so it cannot start on m0. Check the output of mount
on m2 to see where the volume is in use.
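To locate the stale mount quickly, filtering the mount table for DRBD devices or the PVC name usually narrows it down (a sketch; the PVC name is taken from the resource listing below, adjust as needed):

```shell
# On k8s-m2: look for the volume in the mount table
PVC=pvc-40a7bc3f-d655-4606-a671-863913f657c0
mount | grep -e drbd -e "$PVC" || echo "no matching mounts"
# kubelet bind-mounts of the volume also show up in /proc/mounts
grep "$PVC" /proc/mounts || true
```

As long as something holds the DRBD device open on m2, it cannot be demoted and attached elsewhere.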
This was a Grafana pod, part of the Kubernetes monitoring deployment. Kubernetes tried to restart it several times due to issues with the LINSTOR storage until it eventually started successfully on the k8s-m2 node. I have now scaled down the corresponding deployment, but the problem is still there:
$ kubectl scale deployment -n monitoring kube-prometheus-stack-grafana --replicas 0
$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- /bin/bash
root@linstor-controller-797bc7456f-8mgws:/# linstor r l
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m0 ┊ ┊ Unused ┊ StandAlone(k8s-m1) ┊ Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m1 ┊ ┊ Unused ┊ Connecting(k8s-m0) ┊ UpToDate ┊ 2024-05-06 16:11:31 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:35 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-29 15:11:23 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-29 15:11:27 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-05-29 15:11:26 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-06-04 13:08:55 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-06-04 13:08:52 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-06-04 13:08:56 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m0 ┊ ┊ InUse ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:28 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:34 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-23 16:37:38 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-05-23 16:37:43 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-23 16:37:44 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
You can try running kubectl exec ds/linstor-satellite.k8s-m0 -- drbdadm adjust pvc-40a7bc3f-d655-4606-a671-863913f657c0
to kick things back into working order.
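For context, `drbdadm adjust` re-applies the configured state of the resource to the kernel, which includes re-establishing any connections a StandAlone peer has torn down; afterwards the connection state can be checked directly. A sketch (the `-n piraeus-datastore` namespace is an assumption based on the other commands in this thread):

```shell
# Guarded so the snippet degrades gracefully without cluster access
if command -v kubectl >/dev/null 2>&1; then
  kubectl exec -n piraeus-datastore ds/linstor-satellite.k8s-m0 -- \
    drbdadm adjust pvc-40a7bc3f-d655-4606-a671-863913f657c0 || true
  # Verify the peers left StandAlone/Connecting and are Connected again
  kubectl exec -n piraeus-datastore ds/linstor-satellite.k8s-m0 -- \
    drbdsetup status pvc-40a7bc3f-d655-4606-a671-863913f657c0 || true
fi
```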
Thanks, but when I went to execute the command, I realized that Linstor had already somehow repaired itself:
$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- /bin/bash
root@linstor-controller-797bc7456f-8mgws:/# linstor r l
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:31 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:35 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-29 15:11:23 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-29 15:11:27 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-05-29 15:11:26 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2024-06-04 13:08:55 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-06-04 13:08:52 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m2 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-06-04 13:08:56 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m0 ┊ ┊ Unused ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:28 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m1 ┊ ┊ Unused ┊ Ok ┊ Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m2 ┊ ┊ InUse ┊ Ok ┊ UpToDate ┊ 2024-05-06 16:11:34 ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I also tested running the Grafana deployment with a nodeSelector pinning it to each node in turn, without problems. Can you give me any clues as to what actually happened and what your command would have done?
I see errors in the pod events.
The cluster consists of three master/worker nodes: k8s-m0, k8s-m1, and k8s-m2.
I see only one LINSTOR error.
There is an issue with the PVC used by the above pod when the pod is running on node k8s-m0.
When I run
dmesg -T
on the k8s-m0 node, I get the output shown above.
Create and attach SOS report:
sos_2024-06-03_14-45-58.tar.gz
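For completeness, LINSTOR can generate such an SOS report itself via its `sos-report` subcommands; a hedged sketch, with the namespace assumed as in the commands above:

```shell
if command -v kubectl >/dev/null 2>&1; then
  # Create the report on the controller, then fetch the tarball into the
  # pod's working directory (copy it out afterwards with kubectl cp)
  kubectl exec -n piraeus-datastore deploy/linstor-controller -- \
    linstor sos-report create || true
  kubectl exec -n piraeus-datastore deploy/linstor-controller -- \
    linstor sos-report download || true
fi
```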