rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Sudden OSD Down State on Node resulting in unresponsive OSD pods #12902

Closed. nics90 closed this issue 11 months ago.

nics90 commented 1 year ago

Is this a bug report or feature request?

Deviation from expected behavior:

In our production environments where Rook Ceph is deployed, we encountered a situation where all OSDs (Object Storage Daemons) on a specific node suddenly went into a "down" state, despite the OSD pods themselves remaining in a "running" state. This unexpected behavior triggered backfilling of the affected PGs (Placement Groups), which led to extended recovery times.

a) While the OSD pods appeared to be running, further examination revealed that they were actually stuck and unresponsive.

b) Upon connecting to one of the stuck OSD pods, we reviewed the ceph-volume logs, which displayed the following error messages:

[Attached screenshots: ceph-volume log error messages]

c) Attempting to execute system-level commands on the node, such as `fdisk -l`, resulted in hangs and unresponsiveness.

d) As a last resort, we had to reboot the node to recover. Following the reboot, all OSDs automatically returned to an operational state.
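For reference, steps a) through c) roughly correspond to commands like the following; the pod name, namespace, and log path are illustrative and may differ per deployment:

```sh
# Exec into one of the stuck OSD pods (pod name is illustrative)
kubectl -n rook-ceph exec -it rook-ceph-osd-0-<pod-suffix> -- bash

# Inside the pod: review the ceph-volume log (default path; may differ per deployment)
cat /var/log/ceph/ceph-volume.log

# On the affected node: system-level device commands hung and never returned
fdisk -l
```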

Expected behavior: OSDs should not fail suddenly, and even if they do, the OSD pods should at least crash so the actual failure is visible.

How to reproduce it (minimal and precise):

We are not sure how to reproduce it, but note that our Ceph cluster runs on a network separate from the management network.

File(s) to submit:

Cluster Status to submit:

Environment:

* Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):

On Prem Kubernetes Cluster

* Storage backend status (e.g. for Ceph use `ceph health` in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest-release/Troubleshooting/ceph-toolbox/#interactive-toolbox)):

```
  cluster:
    id:     de770eed-b929-4acb-9c62-7bbb674d55cf
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,e (age 10d)
    mgr: b(active, since 10d), standbys: a
    osd: 50 osds: 50 up (since 7h), 50 in (since 7h)

  data:
    pools:   2 pools, 513 pgs
    objects: 2.21M objects, 8.1 TiB
    usage:   24 TiB used, 90 TiB / 114 TiB avail
    pgs:     513 active+clean

  io:
    client: 14 KiB/s rd, 23 MiB/s wr, 3 op/s rd, 959 op/s wr
```
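The status above was captured from the toolbox pod; assuming the default toolbox deployment name and namespace, the invocation looks like:

```sh
# Run ceph status from the Rook toolbox (deployment/namespace names assume the defaults)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
```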

travisn commented 1 year ago

The errors show that the device is not found. If you can repro, check whether `lsblk` can see the devices on the host. There is not much Rook or Ceph can do if the devices are not found.
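For example, something like this on the affected node (exact commands and device names will vary):

```sh
# Check whether the kernel still sees the block devices backing the OSDs
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

# Kernel logs often show why a device disappeared (e.g. controller or path errors)
dmesg | tail -n 50
```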

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 11 months ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.