rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

Proposal: make the retry timeout on ceph osd remove a configurable flag. #14247

Closed. bdowling closed this issue 4 months ago.

bdowling commented 4 months ago

Is this a bug report or feature request?

Feature request: provide a flag, e.g. ceph osd remove --retry-timeout=20, to make the retry timeout on OSD purge configurable.

https://github.com/rook/rook/blob/161306a48411a154a3647f407af8902ee31e8a0c/pkg/daemon/ceph/osd/remove.go#L116

What is use case behind this feature:

When purging OSDs cleanly from a node before it is recycled, the 1-minute retry timeout adds noticeable lag to the overall process: the task sees that the PGs are in an unclean state for a brief period, then waits a full minute before checking again. A more frequent check could speed the process up.

When removing 30-40 OSDs from a node, this adds up to considerable time.

2024-05-21 19:06:18.959676 I | cephosd: validating status of osd.14
2024-05-21 19:06:18.959688 I | cephosd: osd.14 is marked 'DOWN'
2024-05-21 19:06:18.959698 D | exec: Running command: ceph osd find 14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:06:19.224896 I | cephosd: marking osd.14 out
2024-05-21 19:06:19.224915 D | exec: Running command: ceph osd out osd.14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:06:19.501663 D | exec: Running command: ceph osd safe-to-destroy 14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:06:19.766462 W | cephosd: osd.14 is NOT ok to destroy, retrying in 1m until success
2024-05-21 19:07:19.767545 D | exec: Running command: ceph osd safe-to-destroy 14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:07:20.034507 W | cephosd: osd.14 is NOT ok to destroy, retrying in 1m until success
2024-05-21 19:08:20.034729 D | exec: Running command: ceph osd safe-to-destroy 14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:08:20.304655 W | cephosd: osd.14 is NOT ok to destroy, retrying in 1m until success
2024-05-21 19:09:20.363784 D | exec: Running command: ceph osd safe-to-destroy 14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:09:20.630038 W | cephosd: osd.14 is NOT ok to destroy, retrying in 1m until success
2024-05-21 19:10:20.631344 D | exec: Running command: ceph osd safe-to-destroy 14 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:10:20.903050 I | cephosd: osd.14 is safe to destroy, proceeding

2024-05-21 19:10:24.184827 I | cephosd: marking osd.22 out
2024-05-21 19:10:24.184845 D | exec: Running command: ceph osd out osd.22 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:10:24.458924 D | exec: Running command: ceph osd safe-to-destroy 22 --connect-timeout=15 --cluster=rook-ceph --conf=/var/lib/rook/rook-ceph/rook-ceph.config --name=client.admin --keyring=/var/lib/rook/rook-ceph/client.admin.keyring --format json
2024-05-21 19:10:24.725558 W | cephosd: osd.22 is NOT ok to destroy, retrying in 1m until success
...


travisn commented 4 months ago

Maybe we just reduce the retry interval to 15 or 20s? I don't see a need to keep the default at a full minute, and we can keep it simple without exposing yet another setting.

bdowling commented 4 months ago

That would certainly work for me, and it would be a simpler patch than figuring out how to pass the setting around.

bdowling commented 4 months ago

FWIW, I've been dealing with a lot of node shuffles, and this mini script lets me help the osd-purge task along. I could probably just recreate it, but using this helps speed things up.

I tried to find the conditions that are causing the PGs to be unstable throughout the purge job process. For example, I had some other OSDs that had been down for a longer time, and when I ran the purge it went through them all really fast, never hitting the timeout. I am marking the OSDs down and out before running the osd-purge job and waiting a few minutes for the PGs to stabilize, but it still periodically reports PGs peering each time a ceph osd purge is run by the job, and I'm not sure why that is.

I'm guessing that because the CRUSH map changes there is an additional rebalance. Or does it wait to actually do the rebalance until after the OSD is deleted?

bash-4.4$ echo -n "Next OSD: "; while read osd; do while ! (ceph osd safe-to-destroy osd.$osd &&  ceph osd purge osd.$osd); do date; sleep 2; done; echo -n "Next OSD: "; done
Next OSD: 108
OSD(s) 108 are safe to destroy without reducing data durability.
purged osd.108
Next OSD: 89
OSD(s) 89 are safe to destroy without reducing data durability.
purged osd.89
Next OSD: 143
Error EAGAIN: OSD(s) 143 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
Wed May 22 21:36:26 UTC 2024
Error EAGAIN: OSD(s) 143 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
Wed May 22 21:36:28 UTC 2024
Error EAGAIN: OSD(s) 143 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
Wed May 22 21:36:31 UTC 2024
Error EAGAIN: OSD(s) 143 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
Wed May 22 21:36:33 UTC 2024
Error EAGAIN: OSD(s) 143 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.
Wed May 22 21:36:35 UTC 2024
OSD(s) 143 are safe to destroy without reducing data durability.
purged osd.143
Next OSD:
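
The same loop broken out with comments, for readability (identical commands to the one-liner above; it reads OSD IDs from stdin):

# For each OSD ID read from stdin, keep retrying until Ceph says it is
# safe to destroy, then purge it and prompt for the next ID.
echo -n "Next OSD: "
while read osd; do
  while ! (ceph osd safe-to-destroy osd.$osd && ceph osd purge osd.$osd); do
    date      # timestamp the failed attempt
    sleep 2   # retry every 2 seconds instead of every minute
  done
  echo -n "Next OSD: "
done
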
bdowling commented 4 months ago

I guess my other question would be: can the purges be done in more of a batch fashion? E.g., if ceph osd safe-to-destroy 1 2 3 4 5 6 says they are all good to go, why not just purge all of them at the same time? Does it take into account whether the named OSDs together hold all N replicas of some PG?
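
For illustration, a sketch of the batch flow meant here (the IDs are just examples; the purge invocation mirrors the loop above):

# Check the whole set of OSDs in one safe-to-destroy call,
# then purge each of them only if that check passes.
ids="1 2 3 4 5 6"
if ceph osd safe-to-destroy $ids; then
  for id in $ids; do
    ceph osd purge osd.$id
  done
fi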

travisn commented 4 months ago

The OSDs are processed one-by-one just for simplicity in implementation. See the code here. If they are all safe to destroy, I would think there is not much difference in how long it takes overall whether we check safe-to-destroy for each individual OSD, or all at the same time.

From your description of the delays removing OSDs, it seems Ceph doesn't think they are safe to destroy, so the data must need to be backfilled first. To speed up the OSD removal, what about this (a rough command version follows the list)?

  1. Determine which OSDs you want to remove
  2. Mark them all out (but leave them running)
  3. Wait for the PGs to become fully active+clean again
  4. Then scale down the OSDs and purge them
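
Roughly, in commands (a sketch; the OSD IDs are examples, and the deployment names assume Rook's usual rook-ceph-osd-<ID> naming in the rook-ceph namespace):

# 1. OSDs chosen for removal (example IDs)
OSDS="14 22 33"

# 2. Mark them all out, but leave the OSD daemons running
for id in $OSDS; do ceph osd out osd.$id; done

# 3. Wait until backfill finishes and the OSDs report safe to destroy,
#    i.e. the PGs are fully active+clean again
until ceph osd safe-to-destroy $OSDS; do
  ceph pg stat   # watch recovery/backfill progress
  sleep 30
done

# 4. Scale down the OSD deployments, then run the osd-purge job as usual
for id in $OSDS; do
  kubectl -n rook-ceph scale deployment rook-ceph-osd-$id --replicas=0
done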

Btw did you want to open a PR to change the wait timeout here? Or I don't mind opening it.

bdowling commented 4 months ago

> From your description of the delays removing OSDs, it seems Ceph doesn't think they are safe to destroy, so the data must need to be backfilled first. To speed up the OSD removal, what about this?

Error EAGAIN: OSD(s) 143 have no reported stats, and not all PGs are active+clean; we cannot draw any conclusions.

From the errors I see, which are mostly like the one above, I don't think it is the OSDs being removed that are in conflict; it's just that other PGs are rebalancing, so Ceph doesn't want to "draw any conclusions".

When the OSD is actually in use, it is a different error code such as:

Error EBUSY: OSD(s) 1 have 148 pgs currently mapped to them.

bdowling commented 4 months ago

@travisn thanks! I was getting worried that CI was going to make that two-line PR a never-ending one. 😉