rook / rook

Storage Orchestration for Kubernetes
https://rook.io
Apache License 2.0

osd: Legacy LVM-based OSDs on PVCs crash on resize init container (backport #14100) #14105

Closed mergify[bot] closed 3 weeks ago

mergify[bot] commented 3 weeks ago

OSDs on LVM-mode PVCs fail to come up and crash in the expand-bluefs init container. As a workaround, removing that init container avoids the crash and allows the OSDs to start. This change disables OSD resize for this case so that others do not hit the same crash during upgrade.

I am not able to repro this issue with the currently available types of OSDs. All new OSDs on PVCs are created in raw mode, even when they are encrypted or have a metadata device. But this could affect old OSDs that were created long ago (as far back as Rook v1.1) and have been upgraded since.

An error is first seen in the "osd init" init container, where an argument is missing:

Error: ceph-username is required for osd
rook error: ceph-username is required for osd
Usage:
  rook ceph osd init [flags]

But this does not fail the container, as the subsequent containers are allowed to continue starting. The expand container then fails with the error below, because the ceph config was not initialized due to the previous init container issue:

inferring bluefs devices from bluestore path
unable to read label for /var/lib/ceph/osd/ceph-1: (2) No such file or directory
2024-04-04T13:22:38.461+0000 7f41cddbf900 -1 bluestore(/var/lib/ceph/osd/ceph-1/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-1/block: (2) No such file or directory
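
For anyone triaging a similar report, the two failure signatures quoted above can be checked mechanically in collected init container logs. The following is only a rough sketch: the function name `classify_log` and the labels are hypothetical, and only the matched substrings are taken from the logs in this description.

```python
from __future__ import annotations

# Substrings copied from the log excerpts in this PR description.
# The label names on the left are hypothetical, chosen for illustration.
SIGNATURES = {
    "osd-init-missing-arg": "ceph-username is required for osd",
    "expand-bluefs-no-block": "_read_bdev_label failed to open",
}


def classify_log(text: str) -> list[str]:
    """Return the label of every known failure signature found in text."""
    return [label for label, needle in SIGNATURES.items() if needle in text]
```

A log that matches both labels is consistent with the sequence described here: the missing argument in "osd init" first, then the expand-bluefs failure.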

This seems related to the removal of some variables that were thought to be obsolete in #11331. However, since we can't find a repro and confirm that adding those back actually fixes the issue, the most reliable and low-risk solution seems to be to remove the resize init container completely, and then encourage users to replace these legacy OSDs.
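
The intent of the change can be sketched roughly as follows. This is an illustration of the decision only, not the actual Rook code (which is written in Go): the `OSDInfo` type, its fields, and the function name are all hypothetical, and only the container name `expand-bluefs` comes from this description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OSDInfo:
    """Hypothetical description of an OSD, for illustration only."""
    on_pvc: bool      # the OSD is backed by a PVC
    lvm_backed: bool  # legacy LVM mode (True) as opposed to raw mode (False)


def without_resize_for_legacy(osd: OSDInfo, init_containers: list[str]) -> list[str]:
    """Drop the resize init container for legacy LVM-mode OSDs on PVCs."""
    if osd.on_pvc and osd.lvm_backed:
        return [name for name in init_containers if name != "expand-bluefs"]
    # Every other OSD keeps its init containers unchanged.
    return list(init_containers)
```

Under these assumptions, a raw-mode OSD on a PVC keeps the resize step, while a legacy LVM-mode OSD on a PVC starts without it.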

Checklist: