nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

NERC RHOAI failing to mount volume #608

Closed. Milstein closed this issue 2 weeks ago.

Milstein commented 3 weeks ago

Issue related to mounting the PVC:

FailedMount
MountVolume.MountDevice failed for volume
"pvc-02726871-26e3-4d22-95a8-9269466e6ddc" : rpc error: code = Aborted desc =
an operation with the given Volume ID
0001-0011-openshift-storage-0000000000000025-1bb33e7a-c5be-4a5d-b794-daef53801f80
already exists

Output: [screenshot]
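In ceph-csi, the "an operation with the given Volume ID ... already exists" error usually means the driver's per-volume operation lock is still held, i.e. an earlier map/stage attempt for the same volume is stuck or was interrupted. A rough way to check from our side (a sketch; the openshift-storage namespace, pod label, and container name are assumed from a standard ODF install):

# List the csi-rbdplugin pods and note which one runs on the node hosting
# the affected workbench pod:
oc -n openshift-storage get pods -l app=csi-rbdplugin -o wide

# Grep that plugin's logs for the stuck operation, using the volume ID from
# the event above:
oc -n openshift-storage logs <csi-rbdplugin-pod> -c csi-rbdplugin \
    | grep 0001-0011-openshift-storage-0000000000000025-1bb33e7a-c5be-4a5d-b794-daef53801f80

# If the operation never completes on its own, deleting that plugin pod
# releases the in-memory lock; the DaemonSet recreates the pod immediately:
oc -n openshift-storage delete pod <csi-rbdplugin-pod>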

larsks commented 3 weeks ago

@dystewart noticed that the storage console is reporting an error state...

[Screenshot from 2024-06-06 13-09-02: storage console showing the error state]

...but I think this is because the ceph cluster itself is reporting HEALTH_ERR (and I don't think this error is relevant to us):

bash-5.1$ ceph --user healthchecker-nerc-ocp-infra-1-rbd status
  cluster:
    id:     6de96983-eef7-4690-9a6d-9124d3707a30
    health: HEALTH_ERR
            935 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            There are daemons running an older version of ceph
            8 large omap objects
            mons mon01,mon02,mon03,mon04,mon05 are using a lot of disk space
            19 scrub errors
            Possible data damage: 1 pg inconsistent
            4 pgs not deep-scrubbed in time
            27 pools have too few placement groups
            25 pools have too many placement groups
            1 daemons have recently crashed

  services:
    mon: 5 daemons, quorum mon02,mon01,mon03,mon04,mon05 (age 10d)
    mgr: mon03(active, since 10d), standbys: mon05, mon04
    mds: 1/1 daemons up, 2 standby
    osd: 1863 osds: 1862 up (since 13h), 1861 in (since 7d); 9 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   53 pools, 38297 pgs
    objects: 2.76G objects, 9.1 PiB
    usage:   14 PiB used, 8.4 PiB / 23 PiB avail
    pgs:     489779/19247975008 objects misplaced (0.003%)
             38218 active+clean
             68    active+clean+scrubbing+deep
             5     active+remapped+backfill_wait
             4     active+remapped+backfilling
             1     active+clean+scrubbing
             1     active+clean+inconsistent

  io:
    client:   358 MiB/s rd, 877 MiB/s wr, 3.98k op/s rd, 1.31k op/s wr
    recovery: 27 MiB/s, 6 objects/s

  progress:
    Global Recovery Event (25h)
      [===========================.] (remaining: 21s)
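To see exactly which placement group is inconsistent (and the details behind the scrub errors), health detail can be run with the same client user; a sketch, output will vary:

bash-5.1$ ceph --user healthchecker-nerc-ocp-infra-1-rbd health detail
# prints each health check (PG_DAMAGED, OSD_SCRUB_ERRORS, etc.) with the
# specific PG IDs and pools involved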

larsks commented 3 weeks ago

I had previously opened a ticket when the ceph cluster was in HEALTH_ERR state, and received this reply:

This is a large cluster with many users. Any problem with any of them reflects on the general health state report even when it does not affect you or your users. The general health state report is a call for action for the Ceph cluster administrators. In this case one placement group has got some issues. This, by itself, does not affect the vast majority of our storage/users.

joachimweyl commented 2 weeks ago

@Milstein resolved?

Milstein commented 2 weeks ago

This may be a transient issue, but we need to keep track of it in our observability setup.
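For tracking it on our side, a sketch (assumes cluster-admin access and that the cluster monitoring stack scrapes the ODF/Ceph metrics):

# Watch for recurring FailedMount events across all namespaces:
oc get events -A --field-selector reason=FailedMount --watch

# ceph_health_status from the Ceph mgr prometheus module (0=OK, 1=WARN,
# 2=ERR) is a simple signal to graph or alert on for the HEALTH_ERR case
# seen above.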