openebs / openebs

Most popular & widely deployed Open Source Container Native Storage platform for Stateful Persistent Applications on Kubernetes.
https://www.openebs.io
Apache License 2.0

cspi went to offline state when the ndm got rebooted several times #3389

Closed nsathyaseelan closed 2 years ago

nsathyaseelan commented 3 years ago

CSPI pools went to the offline state after the worker nodes were rebooted, which restarted the NDM operator pods several times. This caused the following sequence:

  1. NDM was not able to identify that cStor was installed on the disk /dev/sdb.
  2. NDM proceeded with creating blockdevices. Since a partition table UUID was already present on the disk and the partition (/dev/sdb1) had type zfs_member, NDM updated the devlinks of the blockdevice to point at the partition.
  3. As a result, there is a blockdevice that is claimed and active but points to /dev/sdb1 instead of /dev/sdb.
  4. CSPI took this blockdevice devlink (/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0-part1, i.e. sdb1) and tried to use it in the pool; since the partition was already part of the pool on sdb, this caused the failure (see the verification commands below).
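
A quick way to confirm this state is to compare the on-disk signatures against the devlinks recorded in the claimed blockdevice. This is a minimal sketch using the device and resource names from this report; the jsonpath assumes the standard NDM BlockDevice schema:

blkid /dev/sdb1    # expect TYPE="zfs_member" on the partition
kubectl get bd -n openebs blockdevice-2f2596d46e0aafb32e6704e92d77842d -o jsonpath='{.spec.devlinks}'    # should reference the whole disk, not the -part1 link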
root@gitlab-k8s-master:~# kubectl get cspi -n openebs
NAME                     HOSTNAME                       FREE     CAPACITY   READONLY   PROVISIONEDREPLICAS   HEALTHYREPLICAS   STATUS    AGE
cstor-disk-gitlab-2bqs   gitlab-k8s-node3.mayalabs.io   65G      193G       false      5                     5                 ONLINE    270d
cstor-disk-gitlab-9zwv   gitlab-k8s-node1.mayalabs.io   107G     192400M    false      3                     3                 ONLINE    270d
cstor-disk-gitlab-fktx   gitlab-k8s-node4.mayalabs.io   141G     192900M    false      4                     4                 OFFLINE   270d
cstor-disk-gitlab-x2qg   gitlab-k8s-node2.mayalabs.io   74400M   192400M    false      3                     3                 ONLINE    270d

root@gitlab-k8s-master:~# kubectl get cspi -n openebs cstor-disk-gitlab-fktx -o yaml
apiVersion: cstor.openebs.io/v1
kind: CStorPoolInstance
metadata:
  creationTimestamp: "2020-08-15T13:01:28Z"
  finalizers:
  - cstorpoolcluster.openebs.io/finalizer
  - openebs.io/pool-protection
  generation: 177935
  labels:
    kubernetes.io/hostname: gitlab-k8s-node4.mayalabs.io
    openebs.io/cas-type: cstor
    openebs.io/cstor-pool-cluster: cstor-disk-gitlab
    openebs.io/version: 2.8.0
  managedFields:
  - apiVersion: cstor.openebs.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:kubernetes.io/hostname: {}
          f:openebs.io/cas-type: {}
          f:openebs.io/cstor-pool-cluster: {}
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:hostName: {}
        f:nodeSelector:
          .: {}
          f:kubernetes.io/hostname: {}
        f:poolConfig:
          .: {}
          f:auxResources: {}
          f:dataRaidGroupType: {}
          f:priorityClassName: {}
          f:resources: {}
          f:roThresholdLimit: {}
      f:status:
        .: {}
        f:capacity:
          .: {}
          f:zfs: {}
        f:readOnly: {}
      f:versionDetails:
        .: {}
        f:status:
          .: {}
          f:dependentsUpgraded: {}
    manager: cspc-operator
    operation: Update
    time: "2020-08-15T13:01:28Z"
  - apiVersion: cstor.openebs.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:openebs.io/version: {}
      f:versionDetails:
        f:desired: {}
    manager: upgrade
    operation: Update
    time: "2021-04-16T02:51:23Z"
  - apiVersion: cstor.openebs.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:spec:
        f:dataRaidGroups: {}
      f:status:
        f:capacity:
          f:free: {}
          f:total: {}
          f:used: {}
          f:zfs:
            f:logicalUsed: {}
        f:conditions: {}
        f:healthyReplicas: {}
        f:phase: {}
        f:provisionedReplicas: {}
      f:versionDetails:
        f:status:
          f:current: {}
          f:lastUpdateTime: {}
          f:state: {}
    manager: pool-manager
    operation: Update
    time: "2021-05-13T03:05:23Z"
  name: cstor-disk-gitlab-fktx
  namespace: openebs
  ownerReferences:
  - apiVersion: cstor.openebs.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: CStorPoolCluster
    name: cstor-disk-gitlab
    uid: 495d1e82-b44c-42a1-a416-01f5fe23f0c9
  resourceVersion: "188193339"
  selfLink: /apis/cstor.openebs.io/v1/namespaces/openebs/cstorpoolinstances/cstor-disk-gitlab-fktx
  uid: 69f4e285-a24d-447e-bb32-351428b90806
spec:
  dataRaidGroups:
  - blockDevices:
    - blockDeviceName: blockdevice-2f2596d46e0aafb32e6704e92d77842d
      devLink: /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0
    - blockDeviceName: blockdevice-420814416e700121bff8b9569c99ab8f
      devLink: /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0
  hostName: gitlab-k8s-node4.mayalabs.io
  nodeSelector:
    kubernetes.io/hostname: gitlab-k8s-node4.mayalabs.io
  poolConfig:
    auxResources: {}
    dataRaidGroupType: stripe
    priorityClassName: ""
    resources: {}
    roThresholdLimit: 85
status:
  capacity:
    free: 141G
    total: 192900M
    used: 51900M
    zfs:
      logicalUsed: 106G
  conditions:
  - lastTransitionTime: "2021-02-19T20:26:51Z"
    lastUpdateTime: "2021-04-12T08:52:58Z"
    message: failed to importcstor-495d1e82-b44c-42a1-a416-01f5fe23f0c9pool
    reason: PoolLost
    status: "True"
    type: PoolLost
  - lastTransitionTime: "2021-05-13T02:45:15Z"
    lastUpdateTime: "2021-05-13T03:05:23Z"
    message: |
      Pool expansion is in progress because of blockdevice/raid group addition error: Failed to add raidGroup{v1.RaidGroup{CStorPoolInstanceBlockDevices:[]v1.CStorPoolInstanceBlockDevice{v1.CStorPoolInstanceBlockDevice{BlockDeviceName:"blockdevice-2f2596d46e0aafb32e6704e92d77842d", Capacity:0x0, DevLink:"/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0"}, v1.CStorPoolInstanceBlockDevice{BlockDeviceName:"blockdevice-420814416e700121bff8b9569c99ab8f", Capacity:0x0, DevLink:"/dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0"}}}}.. error exit status 1 invalid vdev specification
      use '-f' to override the following errors:
      /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:1:0-part1 is part of active pool 'cstor-495d1e82-b44c-42a1-a416-01f5fe23f0c9'
      /dev/disk/by-path/pci-0000:00:10.0-scsi-0:0:2:0-part1 is part of active pool 'cstor-495d1e82-b44c-42a1-a416-01f5fe23f0c9'
    reason: PoolExpansionInProgress
    status: "True"
    type: PoolExpansion
  healthyReplicas: 4
  phase: OFFLINE
  provisionedReplicas: 4
  readOnly: false
versionDetails:
  desired: 2.8.0
  status:
    current: 2.8.0
    dependentsUpgraded: true
    lastUpdateTime: "2021-04-16T02:51:23Z"
    state: Reconciled
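
The failing vdev addition in the PoolExpansion condition above can be cross-checked from inside the pool-manager pod for this CSPI. The label selector and container name below are assumptions about a standard cStor install; verifying the pod name with a plain `kubectl get pods -n openebs` works just as well:

kubectl get pods -n openebs -l openebs.io/cstor-pool-instance=cstor-disk-gitlab-fktx
kubectl exec -n openebs <pool-manager-pod> -c cstor-pool -- zpool status cstor-495d1e82-b44c-42a1-a416-01f5fe23f0c9    # the whole disks should already be vdevs of this active pool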

Possible Solution:

OpenEBS version: 2.8.0

Kubernetes version:

root@gitlab-k8s-master:~# kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
akhilerm commented 3 years ago

Cause of issue:

When the node gets rebooted and the NDM pod comes up, the NDM pod fetches device information before the udev database has been updated. This causes NDM to fail to detect that cStor is installed on the parent disk (say /dev/sdb), so it goes on to create a blockdevice resource for the cStor partition. But while scanning the details of the partition device (/dev/sdb1), it finds the zfs signature on the device and updates the blockdevice resource of the parent disk with the devlinks of the partition.
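
For reference, the udev properties involved can be inspected by hand once the udev event queue has settled (standard udev tooling; device names taken from this report):

udevadm settle    # wait for pending udev events to be processed
udevadm info --query=property --name=/dev/sdb | grep -E 'ID_PART_TABLE|ID_FS_TYPE'
udevadm info --query=property --name=/dev/sdb1 | grep ID_FS_TYPE    # expect ID_FS_TYPE=zfs_member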

CSPI then takes this devlink from the blockdevice resource and tries to use it (/dev/sdb1) in the pool, which fails because the complete disk (/dev/sdb) is already part of the pool.

Solution: Restart the NDM pod. When the pod is restarted, it will identify that cStor is installed on the complete disk, update the devlink to the correct value, and the CSPI will come back online.
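
A minimal sketch of that workaround follows; NDM pod labels and names vary across installs, so the grep and the placeholder pod name are illustrative:

kubectl get pods -n openebs -o wide | grep ndm    # find the NDM pod on the affected node
kubectl delete pod -n openebs <ndm-pod-on-gitlab-k8s-node4>    # the DaemonSet recreates it
kubectl get bd -n openebs -o wide    # devlinks should now point at the whole disk
kubectl get cspi -n openebs    # the CSPI should report ONLINE again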

github-actions[bot] commented 2 years ago

Issues go stale after 90d of inactivity. Please comment or re-open the issue if you are still interested in getting this issue fixed.

akhilerm commented 2 years ago

Closing this issue, as this is a timing issue after the node gets rebooted. A restart of the NDM pod fixes the issue.