Error “can't read superblock on /dev/drbd1001.” when trying to start a pod

dmrub commented 4 months ago

I see errors in the pod events:

$ kubectl describe pod -n monitoring kube-prometheus-stack-grafana-5fd67c647-7m5hs
Name:             kube-prometheus-stack-grafana-5fd67c647-7m5hs
Namespace:        monitoring
Priority:         0
Service Account:  kube-prometheus-stack-grafana
Node:             k8s-m0/10.1.27.210
Start Time:       Mon, 03 Jun 2024 08:55:48 +0200
Labels:           app.kubernetes.io/instance=kube-prometheus-stack
                  app.kubernetes.io/name=grafana
                  pod-template-hash=5fd67c647
Annotations:      checksum/config: 4eb333c36d59bdac7f568835d4c71dc0e5b1ca4f0bbc06eb5379f9ca352b8914
                  checksum/sc-dashboard-provider-config: 593c0a8778b83f11fe80ccb21dfb20bc46705e2be3178df1dc4c89d164c8cd9c
                  checksum/secret: 032056e9c62bbe9d1daa41ee49cd3d9524c076f51ca4c65adadf4ef08ef28712
                  kubectl.kubernetes.io/default-container: grafana
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/kube-prometheus-stack-grafana-5fd67c647
Init Containers:
  init-chown-data:
    Container ID:    
    Image:           docker.io/library/busybox:1.31.1
    Image ID:        
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Command:
      chown
      -R
      472:472
      /var/lib/grafana
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/grafana from storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
Containers:
  grafana-sc-dashboard:
    Container ID:    
    Image:           quay.io/kiwigrid/k8s-sidecar:1.26.1
    Image ID:        
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    State:           Waiting
      Reason:        PodInitializing
    Ready:           False
    Restart Count:   0
    Environment:
      METHOD:        WATCH
      LABEL:         grafana_dashboard
      LABEL_VALUE:   1
      FOLDER:        /tmp/dashboards
      RESOURCE:      both
      NAMESPACE:     ALL
      REQ_USERNAME:  <set to the key 'admin-user' in secret 'kube-prometheus-stack-grafana'>      Optional: false
      REQ_PASSWORD:  <set to the key 'admin-password' in secret 'kube-prometheus-stack-grafana'>  Optional: false
      REQ_URL:       http://localhost:3000/api/admin/provisioning/dashboards/reload
      REQ_METHOD:    POST
    Mounts:
      /tmp/dashboards from sc-dashboard-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
  grafana-sc-datasources:
    Container ID:    
    Image:           quay.io/kiwigrid/k8s-sidecar:1.26.1
    Image ID:        
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    State:           Waiting
      Reason:        PodInitializing
    Ready:           False
    Restart Count:   0
    Environment:
      METHOD:        WATCH
      LABEL:         grafana_datasource
      LABEL_VALUE:   1
      FOLDER:        /etc/grafana/provisioning/datasources
      RESOURCE:      both
      REQ_USERNAME:  <set to the key 'admin-user' in secret 'kube-prometheus-stack-grafana'>      Optional: false
      REQ_PASSWORD:  <set to the key 'admin-password' in secret 'kube-prometheus-stack-grafana'>  Optional: false
      REQ_URL:       http://localhost:3000/api/admin/provisioning/datasources/reload
      REQ_METHOD:    POST
    Mounts:
      /etc/grafana/provisioning/datasources from sc-datasources-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
  grafana:
    Container ID:    
    Image:           docker.io/grafana/grafana:10.4.1
    Image ID:        
    Ports:           3000/TCP, 9094/TCP, 9094/UDP
    Host Ports:      0/TCP, 0/TCP, 0/UDP
    SeccompProfile:  RuntimeDefault
    State:           Waiting
      Reason:        PodInitializing
    Ready:           False
    Restart Count:   0
    Liveness:        http-get http://:3000/api/health delay=60s timeout=30s period=10s #success=1 #failure=10
    Readiness:       http-get http://:3000/api/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_IP:                       (v1:status.podIP)
      GF_SECURITY_ADMIN_USER:      <set to the key 'admin-user' in secret 'kube-prometheus-stack-grafana'>      Optional: false
      GF_SECURITY_ADMIN_PASSWORD:  <set to the key 'admin-password' in secret 'kube-prometheus-stack-grafana'>  Optional: false
      GF_PATHS_DATA:               /var/lib/grafana/
      GF_PATHS_LOGS:               /var/log/grafana
      GF_PATHS_PLUGINS:            /var/lib/grafana/plugins
      GF_PATHS_PROVISIONING:       /etc/grafana/provisioning
    Mounts:
      /etc/grafana/grafana.ini from config (rw,path="grafana.ini")
      /etc/grafana/provisioning/dashboards/sc-dashboardproviders.yaml from sc-dashboard-provider (rw,path="provider.yaml")
      /etc/grafana/provisioning/datasources from sc-datasources-volume (rw)
      /tmp/dashboards from sc-dashboard-volume (rw)
      /var/lib/grafana from storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p2xnp (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-prometheus-stack-grafana
    Optional:  false
  storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  kube-prometheus-stack-grafana
    ReadOnly:   false
  sc-dashboard-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  sc-dashboard-provider:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-prometheus-stack-grafana-config-dashboards
    Optional:  false
  sc-datasources-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-p2xnp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age   From                     Message
  ----     ------              ----  ----                     -------
  Normal   Scheduled           49m   default-scheduler        Successfully assigned monitoring/kube-prometheus-stack-grafana-5fd67c647-7m5hs to k8s-m0
  Warning  FailedAttachVolume  49m   attachdetach-controller  Multi-Attach error for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0" Volume is already used by pod(s) kube-prometheus-stack-grafana-5fd67c647-dcqrx
  Warning  FailedMount         49m   kubelet                  MountVolume.SetUp failed for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0" : rpc error: code = Internal desc = NodePublishVolume failed for pvc-40a7bc3f-d655-4606-a671-863913f657c0: mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t ext4 -o _netdev /dev/drbd1001 /var/lib/kubelet/pods/1fc6504d-b1e0-43d1-8012-32a69210124a/volumes/kubernetes.io~csi/pvc-40a7bc3f-d655-4606-a671-863913f657c0/mount
Output: mount: /var/lib/kubelet/pods/1fc6504d-b1e0-43d1-8012-32a69210124a/volumes/kubernetes.io~csi/pvc-40a7bc3f-d655-4606-a671-863913f657c0/mount: can't read superblock on /dev/drbd1001.
       dmesg(1) may have more information after failed mount system call.
  Normal   SuccessfulAttachVolume  4m5s (x46 over 49m)  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0"
  Warning  FailedMount             6s (x31 over 49m)    kubelet                  MountVolume.WaitForAttach failed for volume "pvc-40a7bc3f-d655-4606-a671-863913f657c0" : volume pvc-40a7bc3f-d655-4606-a671-863913f657c0 has GET error for volume attachment csi-1e1f5fcc627e2e7c3ca2595b3593e4d17c9188e63cec5e7d8f1cbe950418e564: volumeattachments.storage.k8s.io "csi-1e1f5fcc627e2e7c3ca2595b3593e4d17c9188e63cec5e7d8f1cbe950418e564" is forbidden: User "system:node:k8s-m0" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node 'k8s-m0' and this object

The cluster consists of three master/worker nodes k8s-m0, k8s-m1 and k8s-m2

$ kubectl get nodes -o wide
NAME     STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
k8s-m0   Ready    control-plane   28d   v1.28.9   10.1.27.210   <none>        Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.7.3
k8s-m1   Ready    control-plane   28d   v1.28.9   10.1.27.211   <none>        Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.7.3
k8s-m2   Ready    control-plane   28d   v1.28.9   10.1.27.212   <none>        Ubuntu 22.04.4 LTS   5.15.0-107-generic   containerd://1.7.3

I see only one linstor error

$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- /bin/bash

root@linstor-controller-797bc7456f-8mgws:/# linstor error-reports list
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Id                    ┊ Datetime            ┊ Node                                  ┊ Exception                               ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ 664F7013-00000-000000 ┊ 2024-06-02 22:00:12 ┊ C|linstor-controller-797bc7456f-8mgws ┊ SocketException: Network is unreachable ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
root@linstor-controller-797bc7456f-8mgws:/# linstor error-reports show 664F7013-00000-000000
ERROR REPORT 664F7013-00000-000000

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Controller
Version:                            1.27.0
Build ID:                           8250eddde5f533facba39b4d1f77f1ef85f8521d
Build time:                         2024-04-02T07:12:21+00:00
Error time:                         2024-06-02 22:00:12
Node:                               linstor-controller-797bc7456f-8mgws
Thread:                             SslConnector
Access context information

Identity:                           PUBLIC
Role:                               PUBLIC
Domain:                             PUBLIC

Peer:                               Node: 'k8s-m2'

============================================================

Reported error:
===============

Category:                           Exception
Class name:                         SocketException
Class canonical name:               java.net.SocketException
Generated at:                       Method 'pollConnect', Source file 'Net.java, Unknown line number

Error message:                      Network is unreachable

Error context:
        I/O exception while attempting to connect to the peer
Call backtrace:

    Method                                   Native Class:Line number
    pollConnect                              Y      sun.nio.ch.Net:unknown
    pollConnectNow                           N      sun.nio.ch.Net:672
    finishConnect                            N      sun.nio.ch.SocketChannelImpl:946
    establishConnection                      N      com.linbit.linstor.netcom.TcpConnectorService:993
    run                                      N      com.linbit.linstor.netcom.TcpConnectorService:728
    run                                      N      java.lang.Thread:840

END OF ERROR REPORT.

There is an issue with PVC used by the above pod when the pod is running on node k8s-m0

root@linstor-controller-797bc7456f-8mgws:/# linstor r l
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node   ┊ Port ┊ Usage  ┊ Conns              ┊      State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m0 ┊      ┊ Unused ┊ StandAlone(k8s-m1) ┊   Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m1 ┊      ┊ Unused ┊ Connecting(k8s-m0) ┊   UpToDate ┊ 2024-05-06 16:11:31 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m2 ┊      ┊ InUse  ┊ Ok                 ┊   UpToDate ┊ 2024-05-06 16:11:35 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m0 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-29 15:11:23 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-29 15:11:27 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊ TieBreaker ┊ 2024-05-29 15:11:26 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m0 ┊      ┊ InUse  ┊ Ok                 ┊   UpToDate ┊ 2024-05-06 16:11:28 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊   Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-06 16:11:34 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m0 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-23 16:37:38 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊ TieBreaker ┊ 2024-05-23 16:37:43 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-23 16:37:44 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

When I run dmesg -T on the k8s-m0 node, I get the following output:


[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Preparing cluster-wide state change 352557914 (2->-1 3/1)
[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Aborting cluster-wide state change 352557914 (0ms) rv = -10
[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Declined by peer k8s-m2 (id: 1), see the kernel log there
[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Preparing cluster-wide state change 2847615758 (2->-1 3/1)
[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Aborting cluster-wide state change 2847615758 (0ms) rv = -10
[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Declined by peer k8s-m2 (id: 1), see the kernel log there
[Mo Jun  3 05:16:06 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0: Auto-promote failed: State change was refused by peer node (-10)
[Mo Jun  3 05:16:06 2024] EXT4-fs (drbd1001): INFO: recovery required on readonly filesystem
[Mo Jun  3 05:16:06 2024] EXT4-fs (drbd1001): write access will be enabled during recovery
[Mo Jun  3 05:16:07 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0/0 drbd1001: Rejected WRITE request, not in Primary role. open_cnt:1 [mount:600441:2024-06-03_03:15:46.918]
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 0, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1028, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1043, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1059, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 9267, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 524320, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1572880, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1572896, lost async page write
[Mo Jun  3 05:16:07 2024] Buffer I/O error on dev drbd1001, logical block 1581106, lost async page write
[Mo Jun  3 05:16:07 2024] JBD2: recovery failed
[Mo Jun  3 05:16:07 2024] EXT4-fs (drbd1001): error loading journal
[Mo Jun  3 05:16:10 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0 k8s-m2: Preparing remote state change 502811644
[Mo Jun  3 05:16:10 2024] drbd pvc-40a7bc3f-d655-4606-a671-863913f657c0 k8s-m2: Committing remote state change 502811644 (primary_nodes=0)

Create and attach SOS report:

kubectl exec -n piraeus-datastore -it deploy/linstor-controller -- linstor sos-report create

sos_2024-06-03_14-45-58.tar.gz

dmrub commented 4 months ago

I see also errors in dmesg output on the k8s-m1 node

[Mo Jun  3 12:00:36 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:36 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:36 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:36 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:37 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:37 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:37 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:37 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:38 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:38 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:41 2024] net_ratelimit: 8 callbacks suppressed
[Mo Jun  3 12:00:41 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:41 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:42 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:42 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:42 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:42 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:43 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:43 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:43 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:44 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:47 2024] net_ratelimit: 7 callbacks suppressed
[Mo Jun  3 12:00:47 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available
[Mo Jun  3 12:00:47 2024] IPVS: rr: TCP 10.96.29.122:3371 - no destination available
[Mo Jun  3 12:00:47 2024] IPVS: rr: TCP [fd12::7128]:3371 - no destination available

10.96.29.122 is the IP of the linstor-controller service:

$ kubectl get svc -A -o wide | grep -F 10.96.29.122
piraeus-datastore   linstor-controller                               ClusterIP      10.96.29.122    <none>        3371/TCP,3370/TCP              28d   app.kubernetes.io/component=linstor-controller,app.kubernetes.io/instance=linstorcluster,app.kubernetes.io/name=piraeus-datastore

WanzenBug commented 4 months ago

Something is still using the resource on node m2, so it cannot start on m0. Check the output of mount on m2 to see where the volume is in use.

dmrub commented 4 months ago

This was a Grafana pod as part of the Kubernetes monitoring deployment. Kubernetes tried to restart it several times due to issues with Linstor Storage until it was successfully started on the k8s-m2 node. Now I have scaled down the corresponding deployment and the problem is still there:

$ kubectl scale deployment -n monitoring kube-prometheus-stack-grafana --replicas 0
$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- /bin/bash
root@linstor-controller-797bc7456f-8mgws:/# linstor r l
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node   ┊ Port ┊ Usage  ┊ Conns              ┊      State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m0 ┊      ┊ Unused ┊ StandAlone(k8s-m1) ┊   Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m1 ┊      ┊ Unused ┊ Connecting(k8s-m0) ┊   UpToDate ┊ 2024-05-06 16:11:31 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-06 16:11:35 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m0 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-29 15:11:23 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-29 15:11:27 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊ TieBreaker ┊ 2024-05-29 15:11:26 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m0 ┊      ┊ Unused ┊ Ok                 ┊ TieBreaker ┊ 2024-06-04 13:08:55 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-06-04 13:08:52 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-06-04 13:08:56 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m0 ┊      ┊ InUse  ┊ Ok                 ┊   UpToDate ┊ 2024-05-06 16:11:28 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊   Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-06 16:11:34 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m0 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-23 16:37:38 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m1 ┊      ┊ Unused ┊ Ok                 ┊ TieBreaker ┊ 2024-05-23 16:37:43 ┊
┊ pvc-cae1b7e0-d80d-47a8-8161-53063a5ccf36 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok                 ┊   UpToDate ┊ 2024-05-23 16:37:44 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

WanzenBug commented 4 months ago

You can try running kubctl exec ds/linstor-satellite.k8s-m0 -- drbdadm adjust pvc-40a7bc3f-d655-4606-a671-863913f657c0 to kick things back into working order.

dmrub commented 4 months ago

Thanks, but when I went to execute the command, I realized that Linstor had already somehow repaired itself:

$ kubectl exec -ti -n piraeus-datastore deployments/linstor-controller -- /bin/bash
root@linstor-controller-797bc7456f-8mgws:/# linstor r l
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node   ┊ Port ┊ Usage  ┊ Conns ┊      State ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m0 ┊      ┊ Unused ┊ Ok    ┊   Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m1 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-05-06 16:11:31 ┊
┊ pvc-40a7bc3f-d655-4606-a671-863913f657c0 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-05-06 16:11:35 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m0 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-05-29 15:11:23 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m1 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-05-29 15:11:27 ┊
┊ pvc-335fe40b-7250-4b2a-a0b3-c9eb1780e528 ┊ k8s-m2 ┊      ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2024-05-29 15:11:26 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m0 ┊      ┊ Unused ┊ Ok    ┊ TieBreaker ┊ 2024-06-04 13:08:55 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m1 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-06-04 13:08:52 ┊
┊ pvc-2492b46b-6466-4e2d-8820-b5fa9299ad9c ┊ k8s-m2 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-06-04 13:08:56 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m0 ┊      ┊ Unused ┊ Ok    ┊   UpToDate ┊ 2024-05-06 16:11:28 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m1 ┊      ┊ Unused ┊ Ok    ┊   Diskless ┊ 2024-05-06 16:11:35 ┊
┊ pvc-4297b5a5-4c61-4638-a63d-729f5021d46f ┊ k8s-m2 ┊      ┊ InUse  ┊ Ok    ┊   UpToDate ┊ 2024-05-06 16:11:34 ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

I also tested to run grafana deployment with nodeSelector on all nodes without problems. Can you give me any clues as to what actually happened and what your command should have done?

piraeusdatastore / piraeus-operator

Error “can't read superblock on /dev/drbd1001.” when trying to start a pod #671