rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Openshift 4.7 pods fail to stop after csi-rbdplugin restart, VM hangs and does not shut down #7861

Closed PiotrKlimczak closed 3 years ago

PiotrKlimczak commented 3 years ago

Deviation from expected behavior: After a csi-rbdplugin restart, pods using storage on the node where csi-rbdplugin was restarted hang indefinitely. Force-deleting the pods does not solve the problem: the compute machine (VM) then hangs and fails to shut down. This is 100% reproducible.

Expected behavior: A csi-rbdplugin restart should not affect pods or the VM in any way.

How to reproduce it (minimal and precise): Restart the csi-rbdplugin pod on a compute node (a minimal repro sketch follows).
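For reference, a minimal reproduction sketch; pod and node names below are placeholders, and the label app=csi-rbdplugin is assumed to match the default Rook CSI DaemonSet:

# find the csi-rbdplugin pod running on the target compute node
kubectl -n rook-ceph get pods -l app=csi-rbdplugin -o wide

# delete it; the DaemonSet recreates it on the same node
kubectl -n rook-ceph delete pod csi-rbdplugin-<id>

# any pod on that node with an RBD-backed PVC now hangs on I/O;
# force deletion does not help and the VM later fails to shut down
kubectl -n <app-namespace> delete pod <workload-pod> --grace-period=0 --force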

File(s) to submit (CephCluster spec):

spec:
  security:
    kms: {}
  crashCollector: {}
  monitoring:
    rulesNamespace: rook-ceph
  logCollector: {}
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
      osd:
        interval: 1m0s
      status:
        interval: 1m0s
    livenessProbe:
      mgr: {}
      mon: {}
      osd: {}
  mon:
    count: 3
  network:
    ipFamily: IPv4
  dataDirHostPath: /var/lib/rook
  priorityClassNames:
    all: system-node-critical
  dashboard:
    enabled: true
    ssl: false
  cleanupPolicy:
    sanitizeDisks:
      dataSource: zero
      iteration: 1
      method: quick
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    managePodBudgets: true
    osdMaintenanceTimeout: 30
  mgr:
    count: 1
    modules:
      - enabled: true
        name: pg_autoscaler
  waitTimeoutForHealthyOSDInMinutes: 10
  storage:
    useAllDevices: true
    useAllNodes: true
  cephVersion:
    image: 'ceph/ceph:v16.2.3'

Also, nothing unusual in dmesg:

[  169.971660] libceph: loaded (mon/osd proto 15/24)
[  169.988393] libceph: mon1 (1)172.30.168.158:6789 session established
[  169.991240] libceph: mon1 (1)172.30.168.158:6789 socket closed (con state OPEN)
[  169.993104] libceph: mon1 (1)172.30.168.158:6789 session lost, hunting for new mon
[  171.008982] libceph: mon0 (1)172.30.97.243:6789 socket closed (con state CONNECTING)
[  173.248569] libceph: mon0 (1)172.30.97.243:6789 socket closed (con state CONNECTING)
[  175.232554] libceph: mon0 (1)172.30.97.243:6789 socket closed (con state CONNECTING)
[  176.579911] libceph: mon1 (1)172.30.168.158:6789 session established
[  176.582763] libceph: client558630 fsid a272fa00-b5a7-4dbc-9912-0c306ecd356d
[ 1635.246891] libceph: mon1 (1)172.30.168.158:6789 session lost, hunting for new mon <----- THIS IS WHERE I HAVE MANUALLY RESTARTED CSI RBD PLUGIN
[ 2558.373682] libceph: osd2 (1)10.129.4.9:6801 socket closed (con state OPEN)
[ 2558.376331] libceph: osd1 (1)10.128.4.10:6801 socket closed (con state OPEN)
[ 2562.467780] libceph: osd0 (1)10.129.2.10:6801 socket closed (con state OPEN)
[ 2689.442135] libceph: osd1 (1)10.128.4.10:6801 socket closed (con state CONNECTING)
[ 2689.444714] libceph: osd2 (1)10.129.4.9:6801 socket closed (con state CONNECTING)
[ 2693.538020] libceph: osd0 (1)10.129.2.10:6801 socket closed (con state CONNECTING)
[ 2820.512486] libceph: osd1 (1)10.128.4.10:6801 socket closed (con state CONNECTING)
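The repeated "socket closed (con state CONNECTING)" lines suggest the kernel RBD client on the node keeps failing to re-establish its mon/OSD connections after the plugin restart. A diagnostic sketch to run on the affected node (assuming shell access, e.g. via oc debug node/<node> followed by chroot /host; no ceph CLI needed):

# RBD images the kernel still has mapped
lsblk | grep rbd
cat /sys/bus/rbd/devices/*/name

# mounts backing the hung pods
mount | grep rbd

# accumulating kernel-client errors and blocked tasks
dmesg | grep -E 'libceph|rbd|hung task'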

Operator config map:

  ROOK_CSI_ENABLE_RBD: 'true'
  ROOK_OBC_WATCH_OPERATOR_NAMESPACE: 'true'
  ROOK_ENABLE_DISCOVERY_DAEMON: 'false'
  CSI_PROVISIONER_PRIORITY_CLASSNAME: system-cluster-critical
  ROOK_CSI_ALLOW_UNSUPPORTED_VERSION: 'false'
  CSI_PLUGIN_PRIORITY_CLASSNAME: system-node-critical
  CSI_FORCE_CEPHFS_KERNEL_CLIENT: 'true'
  ROOK_CSI_ENABLE_GRPC_METRICS: 'false'
  CSI_CEPHFS_FSGROUPPOLICY: ReadWriteOnceWithFSType
  CSI_RBD_FSGROUPPOLICY: ReadWriteOnceWithFSType
  ROOK_CSI_ENABLE_CEPHFS: 'true'
  ROOK_ENABLE_FLEX_DRIVER: 'false'
  CSI_ENABLE_CEPHFS_SNAPSHOTTER: 'true'
  CSI_ENABLE_RBD_SNAPSHOTTER: 'true'
  CSI_ENABLE_VOLUME_REPLICATION: 'false'

Operator deployment:

      containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: rook-ceph-operator
          env:
            - name: ROOK_CURRENT_NAMESPACE_ONLY
              value: 'false'
            - name: FLEXVOLUME_DIR_PATH
              value: /etc/kubernetes/kubelet-plugins/volume/exec
            - name: ROOK_LOG_LEVEL
              value: INFO
            - name: ROOK_DISCOVER_DEVICES_INTERVAL
              value: 60m
            - name: ROOK_HOSTPATH_REQUIRES_PRIVILEGED
              value: 'true'
            - name: ROOK_ENABLE_SELINUX_RELABELING
              value: 'true'
            - name: ROOK_ENABLE_FSGROUP
              value: 'true'
            - name: ROOK_DISABLE_DEVICE_HOTPLUG
              value: 'false'
            - name: DISCOVER_DAEMON_UDEV_BLACKLIST
              value: '(?i)dm-[0-9]+,(?i)rbd[0-9]+,(?i)nbd[0-9]+'
            - name: ROOK_ENABLE_MACHINE_DISRUPTION_BUDGET
              value: 'false'
            - name: ROOK_UNREACHABLE_NODE_TOLERATION_SECONDS
              value: '5'
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace

I have checked all the logs I could, but have not found anything useful or different from pods on nodes where there was no restart. The CSI RBD plugin appears to start again correctly and does not report any errors in its logs.
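For comparing a healthy node with an unhealthy one, this is roughly the log-collection sketch used (assuming the default rook-ceph namespace and the standard container names of the csi-rbdplugin DaemonSet):

# CSI plugin logs from the affected node
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c csi-rbdplugin
kubectl -n rook-ceph logs <csi-rbdplugin-pod> -c driver-registrar

# kubelet's view of the mount/unmount calls on the node
journalctl -u kubelet | grep -i -E 'rbd|csi'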

Environment:

We have updated everything in the hope that this might be fixed in a newer version, but the problem still persists. Honestly, I have no idea where to look, as I cannot find anything useful or different in the logs when comparing an "unhealthy" VM with a healthy one.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.