rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

After OSD removal, cluster has unknown PGs #14278

Open tman5 opened 1 month ago

tman5 commented 1 month ago

After removing underlying k8s nodes and removing their OSDs, rook-ceph is still reporting health issues:

bash-4.4$ ceph status
  cluster:
    id:     5bb49f5d-4fad-4b9a-ae5c-48b21aa1bfea
    health: HEALTH_WARN
            Reduced data availability: 3 pgs inactive
            391 slow ops, oldest one blocked for 26815 sec, daemons [osd.14,osd.19,osd.9] have slow ops.

  services:
    mon: 3 daemons, quorum ao,ap,aq (age 4h)
    mgr: a(active, since 5h), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 15 up (since 3h), 15 in (since 6h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 265 pgs
    objects: 150.35k objects, 575 GiB
    usage:   1.7 TiB used, 2.7 TiB / 4.4 TiB avail
    pgs:     1.132% pgs unknown
             262 active+clean
             3   unknown

  io:
    client:   13 KiB/s rd, 5.2 KiB/s wr, 13 op/s rd, 8 op/s wr

  progress:
    Global Recovery Event (118m)
      [==========================..] (remaining: 4m)

bash-4.4$ ceph health detail
HEALTH_WARN Reduced data availability: 3 pgs inactive; 380 slow ops, oldest one blocked for 26039 sec, daemons [osd.14,osd.19,osd.9] have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive
    pg 2.14 is stuck inactive for 5h, current state unknown, last acting []
    pg 2.30 is stuck inactive for 5h, current state unknown, last acting []
    pg 10.4 is stuck inactive for 5h, current state unknown, last acting []
[WRN] SLOW_OPS: 380 slow ops, oldest one blocked for 26039 sec, daemons [osd.14,osd.19,osd.9] have slow ops.

bash-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                      STATUS  REWEIGHT  PRI-AFF
 -1         4.46487  root default                                                            
-19         0.29999      host 1                          
  5    ssd  0.29999          osd.5                                      up   1.00000  1.00000
-31         0.29999      host 2                         
  6    ssd  0.29999          osd.6                                      up   1.00000  1.00000
-34         0.29999      host 3                          
  7    ssd  0.29999          osd.7                                      up   1.00000  1.00000
-28         0.29999      host 4                           
  8    ssd  0.29999          osd.8                                      up   1.00000  1.00000
-40         0.29999      host 5                           
 12    ssd  0.29999          osd.12                                     up   1.00000  1.00000
-22         0.29999      host 6                          
 10    ssd  0.29999          osd.10                                     up   1.00000  1.00000
-25         0.29999      host 7                          
  9    ssd  0.29999          osd.9                                      up   1.00000  1.00000
-37         0.29999      host 8                           
 11    ssd  0.29999          osd.11                                     up   1.00000  1.00000
-43         0.29999      host 9                          
 13    ssd  0.29999          osd.13                                     up   1.00000  1.00000
-46         0.29999      host 10                           
 14    ssd  0.29999          osd.14                                     up   1.00000  1.00000
-55         0.29300      host 11                           
 17    ssd  0.29300          osd.17                                     up   1.00000  1.00000
-49         0.29300      host 12                           
 15    ssd  0.29300          osd.15                                     up   1.00000  1.00000
-52         0.29300      host 13                           
 16    ssd  0.29300          osd.16                                     up   1.00000  1.00000
-61         0.29300      host 14                           
 18    ssd  0.29300          osd.18                                     up   1.00000  1.00000
-58         0.29300      host 15                           
 19    ssd  0.29300          osd.19                                     up   1.00000  1.00000

Pods cannot use PVCs at the moment with these errors:

  Normal   Scheduled               19m                 default-scheduler        Successfully assigned coder/coder-onboarding-workspace-596c77bbc8-l9sn7 to host13
  Normal   SuccessfulAttachVolume  19m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e1e9ba45-8b06-4b8b-ad68-08f3787cb8f5"
  Warning  FailedMount             17m                 kubelet                  MountVolume.MountDevice failed for volume "pvc-e1e9ba45-8b06-4b8b-ad68-08f3787cb8f5" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             49s (x15 over 17m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-e1e9ba45-8b06-4b8b-ad68-08f3787cb8f5" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-d80a62d2-29cd-401b-b774-5d4a4a5f6efc already exists

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 6229 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'replicapool' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 7365 lfor 0/4238/5052 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 13 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7366 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 4 'ceph-filesystem-metadata' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 6229 lfor 0/0/52 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 5 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 14 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7367 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 6 'ceph-filesystem-data0' replicated size 3 min_size 2 crush_rule 16 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 6229 lfor 0/0/54 flags hashpspool stripe_width 0 application cephfs
pool 7 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 15 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7368 lfor 0/221/219 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 8 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 17 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7369 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 9 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 18 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7370 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 10 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 19 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7372 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 11 '.rgw.root' replicated size 3 min_size 2 crush_rule 20 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 7373 flags hashpspool stripe_width 0 pg_num_min 8 application rgw,rook-ceph-rgw
pool 12 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 7374 lfor 0/0/98 flags hashpspool,ec_overwrites stripe_width 8192 application rgw,rook-ceph-rgw

BlaineEXE commented 1 month ago

Firstly, in the future, please follow the Rook bug template questionnaire that is provided when using the "new issue" button. This helps us better understand and triage issues.

Now, I don't completely understand the scenario, but from context, I believe the issue being described is this: you have removed a node (or multiple nodes), and you no longer want the disks on those removed nodes to be used for Rook/Ceph storage. Is that correct?

If so, this is intended behavior for both Ceph and Rook as a data safety mechanism. Neither Ceph nor Rook can know for sure if a node removal event means that the OSDs are gone forever or not. Many k8s platforms remove "Node" resources as part of normal k8s update management, so node removal does not 1:1 imply OSD removal.

If you have removed the node and the OSDs are not going to be brought back online, the OSD purge workflow can be used to tell Ceph that it no longer needs to track the disks.

https://rook.io/docs/rook/latest-release/Storage-Configuration/Advanced/ceph-osd-mgmt/#remove-an-osd
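
For reference, a rough sketch of that workflow (assuming the default rook-ceph namespace, with <ID> standing in for each OSD that lived on the removed nodes; the ceph commands run from the rook-ceph-tools pod):

    kubectl -n rook-ceph scale deployment rook-ceph-osd-<ID> --replicas=0   # stop the OSD pod if it still exists
    ceph osd down osd.<ID>                                                  # mark the OSD down
    ceph osd purge <ID> --yes-i-really-mean-it                              # remove it from the OSD map, CRUSH map, and auth
    kubectl -n rook-ceph delete deployment rook-ceph-osd-<ID>               # clean up the leftover OSD deployment

The linked page also covers Rook's supported ways of doing this (e.g. via the kubectl rook-ceph plugin), which are generally preferable to issuing raw ceph commands by hand.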

Does this help answer the problem you are bringing up, or have I misread something?

tman5 commented 1 month ago

Correct. The nodes are already destroyed and the OSDs no longer show up in the cluster. We ran that workflow, but the pods are still having issues mounting, with the errors above:

  Normal   Scheduled               19m                 default-scheduler        Successfully assigned coder/coder-onboarding-workspace-596c77bbc8-l9sn7 to host13
  Normal   SuccessfulAttachVolume  19m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-e1e9ba45-8b06-4b8b-ad68-08f3787cb8f5"
  Warning  FailedMount             17m                 kubelet                  MountVolume.MountDevice failed for volume "pvc-e1e9ba45-8b06-4b8b-ad68-08f3787cb8f5" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount             49s (x15 over 17m)  kubelet                  MountVolume.MountDevice failed for volume "pvc-e1e9ba45-8b06-4b8b-ad68-08f3787cb8f5" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0009-rook-ceph-0000000000000002-d80a62d2-29cd-401b-b774-5d4a4a5f6efc already exists

BlaineEXE commented 1 month ago

If you attempted to purge all 3 OSDs at the same time, you are likely experiencing some data loss. Ceph's default is to use 3 replicas, and in a cluster configured with one OSD per node like this, any 3 nodes/disks are [statistically] likely to contain some data that is not replicated on any other node/disk. I don't see any other failure domains in the OSD hierarchy, which unfortunately suggests this is likely.

[addendum] If this is the case, it may be necessary to find some way of adding one of the removed OSDs back into the cluster to allow PGs to be read from it and replicated onto other disks.
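
To confirm whether that is what happened, a rough check from the toolbox (PG IDs taken from the health output above) would be:

    ceph pg dump_stuck inactive   # lists the stuck PGs (2.14, 2.30, 10.4 in the output above)
    ceph pg map 2.14              # shows the up/acting OSD sets; an empty acting set ([]) means no copy is currently online
    ceph osd lspools              # confirms which pools PGs 2.x and 10.x belong to (replicapool and ceph-objectstore.rgw.otp here)

If the acting sets stay empty and the purged disks are truly gone, the data in those PGs cannot be rebuilt from the remaining OSDs.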

BlaineEXE commented 1 month ago

Another possibility is that Ceph's ongoing data recovery is saturating the network links and starving clients of their ability to perform IO. When I see daemons [osd.ID, ...] have slow ops, that is my usual first suspect. Ceph may eventually recover from its current state on its own, at which point client IO should no longer be bottlenecked.
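
If recovery pressure does turn out to be the bottleneck, a rough way to check and temporarily dial it back from the toolbox is:

    ceph status                                     # watch the recovery progress and client io lines
    ceph config set osd osd_max_backfills 1         # limit concurrent backfills per OSD
    ceph config set osd osd_recovery_max_active 1   # limit concurrent recovery ops per OSD

These are only throttles to give client IO some headroom; they can be reverted with ceph config rm once the cluster is healthy. Note that on releases using the mClock scheduler, these settings may only take effect if osd_mclock_override_recovery_settings is enabled.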