Closed: nics90 closed this issue 11 months ago
The errors show that the device is not found. If you can reproduce this, can `lsblk` see the devices on the host? There is not much Rook or Ceph can do if the devices are not found.
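As a concrete sketch of that host-side check, one could filter `lsblk` output for whole disks and compare the result against the devices the OSDs were prepared on. The `list_disks` helper below is a hypothetical illustration, not Rook tooling:

```shell
# Sketch of the host-side check suggested above (assumes Linux lsblk).
# list_disks is a hypothetical helper: it reads "NAME TYPE" pairs on stdin
# and prints only whole disks, skipping partitions and LVM members.
list_disks() {
  awk '$2 == "disk" { print $1 }'
}

# On the affected node you would run:
#   lsblk -rno NAME,TYPE | list_disks
# If a device an OSD was prepared on is missing from this list, neither
# Rook nor Ceph can recover it; check dmesg for controller or path errors.
```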
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
**Is this a bug report or feature request?**
Bug Report

**Deviation from expected behavior:**
In our production environments where Rook Ceph is deployed, all OSDs (Object Storage Daemons) on a specific node suddenly went into a "down" state, even though the OSD pods themselves remained in a "Running" state. This triggered backfilling of the affected PGs (Placement Groups), which led to extended recovery times.
a) While the OSD pods appeared to be running, further examination revealed that they were actually stuck and unresponsive.
b) Upon connecting to one of the stuck OSD pods, we reviewed the ceph-volume logs, which displayed the following error messages:
c) Attempting to execute system-level commands, such as `fdisk -l`, resulted in hangs and unresponsiveness.
d) As a last resort, we had to reboot the node to recover. Following the reboot, all OSDs automatically returned to an operational state.
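The mismatch in (a) can be spotted by comparing `ceph osd tree` against the pod list. The `down_osds` helper below is a hypothetical sketch that assumes the default plain-text column layout of `ceph osd tree` (ID CLASS WEIGHT NAME STATUS ...):

```shell
# down_osds reads "ceph osd tree" output on stdin and prints the names of
# OSDs whose STATUS column reads "down". The column positions are an
# assumption based on the default output format; verify on your version.
down_osds() {
  awk '$4 ~ /^osd\./ && $5 == "down" { print $4 }'
}

# Live usage (via the rook-ceph toolbox pod):
#   kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree | down_osds
# Comparing this list with "kubectl get pods" showing the OSD pods as
# Running surfaces the stuck-but-alive condition described above.
```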
**Expected behavior:** OSDs should not fail suddenly, and even if they do, the OSD pods should crash so that the actual failure is visible.
**How to reproduce it (minimal and precise):**
We are not sure how to reproduce it, but our Ceph cluster runs on a network separate from the management network.
**File(s) to submit:**

* Cluster CR (custom resource), typically called `cluster.yaml`, if necessary

**Logs to submit:**

* Operator's logs, if necessary
* Crashing pod(s) logs, if necessary

To get logs, use `kubectl -n <namespace> logs <pod name>`
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read GitHub documentation if you need help.

**Cluster Status to submit:**

* Output of krew commands, if necessary

To get the health of the cluster, use `kubectl rook-ceph health`
To get the status of the cluster, use `kubectl rook-ceph ceph status`
For more details, see the Rook Krew Plugin

**Environment:**

* Kernel (e.g. `uname -a`):
* Rook version (use `rook version` inside of a Rook Pod):
* Storage backend version (e.g. for ceph do `ceph -v`):
* Kubernetes version (use `kubectl version`):
* Kubernetes cluster type: On Prem Kubernetes Cluster
```
  cluster:
    id:     de770eed-b929-4acb-9c62-7bbb674d55cf
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,c,e (age 10d)
    mgr: b(active, since 10d), standbys: a
    osd: 50 osds: 50 up (since 7h), 50 in (since 7h)

  data:
    pools:   2 pools, 513 pgs
    objects: 2.21M objects, 8.1 TiB
    usage:   24 TiB used, 90 TiB / 114 TiB avail
    pgs:     513 active+clean

  io:
    client: 14 KiB/s rd, 23 MiB/s wr, 3 op/s rd, 959 op/s wr
```