Closed: Milstein closed this 2 weeks ago
@dystewart noticed that the storage console is reporting an error state... but I think this is because the ceph cluster itself is reporting HEALTH_ERR (and I don't think this error is relevant to us):
```
bash-5.1$ ceph --user healthchecker-nerc-ocp-infra-1-rbd status
  cluster:
    id:     6de96983-eef7-4690-9a6d-9124d3707a30
    health: HEALTH_ERR
            935 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            There are daemons running an older version of ceph
            8 large omap objects
            mons mon01,mon02,mon03,mon04,mon05 are using a lot of disk space
            19 scrub errors
            Possible data damage: 1 pg inconsistent
            4 pgs not deep-scrubbed in time
            27 pools have too few placement groups
            25 pools have too many placement groups
            1 daemons have recently crashed

  services:
    mon: 5 daemons, quorum mon02,mon01,mon03,mon04,mon05 (age 10d)
    mgr: mon03(active, since 10d), standbys: mon05, mon04
    mds: 1/1 daemons up, 2 standby
    osd: 1863 osds: 1862 up (since 13h), 1861 in (since 7d); 9 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   53 pools, 38297 pgs
    objects: 2.76G objects, 9.1 PiB
    usage:   14 PiB used, 8.4 PiB / 23 PiB avail
    pgs:     489779/19247975008 objects misplaced (0.003%)
             38218 active+clean
             68    active+clean+scrubbing+deep
             5     active+remapped+backfill_wait
             4     active+remapped+backfilling
             1     active+clean+scrubbing
             1     active+clean+inconsistent

  io:
    client:   358 MiB/s rd, 877 MiB/s wr, 3.98k op/s rd, 1.31k op/s wr
    recovery: 27 MiB/s, 6 objects/s

  progress:
    Global Recovery Event (25h)
      [===========================.] (remaining: 21s)
```
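To see which specific checks are behind the HEALTH_ERR (and which PGs, pools, or daemons each one touches), `ceph health detail` expands the one-line summary above; a quick sketch, assuming the same healthchecker user has the caps to run it:

```
# Expand the health summary into per-check detail, including affected
# PG ids, pool names, and daemon names
bash-5.1$ ceph --user healthchecker-nerc-ocp-infra-1-rbd health detail

# Same data as JSON, easier to filter in scripts
bash-5.1$ ceph --user healthchecker-nerc-ocp-infra-1-rbd health detail --format json-pretty
```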
I had previously opened a ticket when the ceph cluster was in a HEALTH_ERR state, and received this reply:
This is a large cluster with many users. Any problem with any of them is reflected in the general health state report, even when it does not affect you or your users. The general health state report is a call to action for the Ceph cluster administrators. In this case, one placement group has some issues; this, by itself, does not affect the vast majority of our storage/users.
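One way to double-check that claim is to map the inconsistent PG back to a pool, since the part of the PG id before the dot is the pool id; a rough sketch, assuming a user with sufficient caps (the PG id below is a placeholder):

```
# List PGs currently flagged inconsistent (ids look like <pool_id>.<pg_seq>)
bash-5.1$ ceph pg ls inconsistent

# Map pool ids to pool names
bash-5.1$ ceph osd lspools

# Inspect the inconsistent objects inside a specific PG (placeholder id)
bash-5.1$ rados list-inconsistent-obj 2.1ab --format=json-pretty
```

If the pool id doesn't belong to one of our pools, the inconsistency really is outside our scope, as the reply suggests.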
@Milstein resolved?
This may be a transient issue, but we need to keep track of it in our observability setup.
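On that note, one option for the observability side is to scrape ceph-mgr's Prometheus exporter rather than relying on the console's error banner; a minimal sketch, assuming the prometheus mgr module is enabled and the hostname below (a placeholder) is reachable:

```
# The mgr prometheus module exposes ceph_health_status on port 9283 by default:
# 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
bash-5.1$ curl -s http://ceph-mgr.example.net:9283/metrics | grep '^ceph_health_status'
```

Newer Ceph releases also export per-check detail as metrics, which would let us ignore cluster-wide noise like the legacy omap warnings while still alerting on anything scoped to our pools.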
Issue related to mounting the PVC:
Output:
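For anyone hitting the same thing, the mount error usually surfaces as pod events; a generic sketch for pulling the details on OpenShift (pod/namespace names are placeholders):

```
# FailedMount / FailedAttachVolume events carry the underlying rbd/CSI error
bash-5.1$ oc describe pod <pod-name> -n <namespace>

# Or scan recent events in the namespace for mount-related warnings
bash-5.1$ oc get events -n <namespace> --sort-by=.lastTimestamp | grep -i -e mount -e attach

# Confirm the PVC is Bound and see which PV backs it
bash-5.1$ oc get pvc -n <namespace>
```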