rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Sudden OSD Down State on Node resulting in unresponsive OSD pods #12902

Closed. nics90 closed this issue 11 months ago.

nics90 commented 1 year ago

Is this a bug report or feature request?

Deviation from expected behavior:

In our production environments where Rook Ceph is deployed, we encountered a situation where all OSDs (Object Storage Daemons) on a specific node suddenly went into a "down" state, despite the OSD pods themselves remaining in a "running" state. This unexpected behavior triggered backfilling of the affected PGs (Placement Groups), which led to extended recovery times.

a) While the OSD pods appeared to be running, further examination revealed that they were actually stuck and unresponsive.

b) Upon connecting to one of the stuck OSD pods, we reviewed the ceph-volume logs, which displayed the following error messages:

[Attached screenshots: ceph-volume log error messages]

c) Attempting to execute system-level commands on the node, such as `fdisk -l`, resulted in hangs and unresponsiveness.

d) As a last resort, we had to reboot the node to recover. Following the reboot, all OSDs automatically returned to an operational state.
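For reference, steps a) through c) roughly correspond to commands like the following; the pod name, namespace, and log path are illustrative and may differ per deployment:

```sh
# Exec into one of the stuck OSD pods (pod name is illustrative)
kubectl -n rook-ceph exec -it rook-ceph-osd-0-<pod-suffix> -- bash

# Inside the pod: review the ceph-volume log (default path; may differ per deployment)
cat /var/log/ceph/ceph-volume.log

# On the affected node: system-level device commands hung and never returned
fdisk -l
```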

Expected behavior: OSDs should not fail suddenly, and even if they do, the OSD pods should at least crash so the actual failure is visible.

How to reproduce it (minimal and precise):

We are not sure how to reproduce it, but note that our Ceph cluster runs on a network separate from the management network.

File(s) to submit:

Cluster Status to submit:

Environment:

* Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):

On Prem Kubernetes Cluster

* Storage backend status (e.g. for Ceph use `ceph health` in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest-release/Troubleshooting/ceph-toolbox/#interactive-toolbox)):

```
  cluster:
    id:     de770eed-b929-4acb-9c62-7bbb674d55cf
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,e (age 10d)
    mgr: b(active, since 10d), standbys: a
    osd: 50 osds: 50 up (since 7h), 50 in (since 7h)

  data:
    pools:   2 pools, 513 pgs
    objects: 2.21M objects, 8.1 TiB
    usage:   24 TiB used, 90 TiB / 114 TiB avail
    pgs:     513 active+clean

  io:
    client: 14 KiB/s rd, 23 MiB/s wr, 3 op/s rd, 959 op/s wr
```
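The status above was captured from the toolbox pod; assuming the default toolbox deployment name and namespace, the invocation looks like:

```sh
# Run ceph status from the Rook toolbox (deployment/namespace names assume the defaults)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
```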

travisn commented 1 year ago

The errors show that the device is not found. If you can repro, check whether `lsblk` can see the devices on the host. There is not much Rook or Ceph can do if the devices are not found.
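For example, something like this on the affected node (exact commands and device names will vary):

```sh
# Check whether the kernel still sees the block devices backing the OSDs
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Kernel logs often show why a device disappeared (e.g. controller or path errors)
dmesg | tail -n 50
```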

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 11 months ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.