@sfxworks Since you have useAllDevices: true and useAllNodes: true, Rook will always scan all the devices and attempt to start OSDs on them.
Did you delete the data from the disk after the OSD was removed? If not, Rook will see the disk previously configured and attempt to start the same OSD again. It will remain in a failed state, however, since the OSD auth was removed. So it is recommended to clean or remove the disk after purging the OSD, or else update the device filter to exclude that disk.
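For reference, a minimal sketch of that cleanup, assuming the purged OSD's disk is /dev/sdX (a placeholder) and roughly following the wipe steps the Rook teardown docs describe:

```
# Hypothetical device path; replace with the disk that backed the purged OSD.
DISK="/dev/sdX"
# Destroy the partition table so Rook no longer recognizes the old OSD.
sgdisk --zap-all "$DISK"
# Clear any leftover bluestore labels at the start of the device.
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync
```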
The disk was removed and reformatted long before this new node was added. I was also referencing that doc in the reproduction steps. The exception is that the disk is still on the node, but since Rook skips devices that are already formatted, I would expect it to be skipped even if the device was previously an OSD.
@sfxworks After you purge the bad OSD, you don't see that OSD's auth from the toolbox with ceph auth ls, right?
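For example (a sketch, assuming the default rook-ceph namespace and a toolbox deployment named rook-ceph-tools):

```
# Check from the toolbox whether the purged OSD's auth entry is really gone.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph auth ls | grep 'osd\.4'
# No output means the entry was removed as expected.
```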
From this message, there is certainly something left over from the prior OSD:
entity osd.4 exists but key does not match
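If that auth entry is still present, deleting it by hand should clear the key mismatch; a sketch, run from the toolbox:

```
# Remove the stale cephx key so a new OSD can register under the same ID.
ceph auth del osd.4
```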
I was not able to reproduce this in a test cluster. After purging an OSD, a new OSD could be created with the same ID; it is expected that Ceph reuses OSD IDs after they are purged. If you see that everything was purged as expected, could you please try the latest Rook v1.6 and Ceph v16 releases to see if that helps?
I should also mention there was a metadata device associated with the OSD. Unfortunately I cannot test an upgrade, as I have switched storage providers; the intent was to report everything I could here before the migration.
I am seeing a similar issue. @travisn directed me to this issue.
I have previously purged/removed osd.5, and it continually fails to be reused with new disks.
Always giving me:
debug 2021-06-01T18:00:37.831+0000 7f68beecdf40 -1 auth: unable to find a keyring on /var/lib/ceph/osd/ceph-5/keyring: (2) No such file or directory
debug 2021-06-01T18:00:37.831+0000 7f68beecdf40 -1 AuthRegistry(0x5612b4582940) no keyring found at /var/lib/ceph/osd/ceph-5/keyring, disabling cephx
Prep steps (a sketch of steps 1 and 2 follows below):
1) Verify OSD removal in 'ceph auth' and 'ceph osd' via ceph-tools.
2) Remove the OSD deployment within Kubernetes.
3) Disk detection and creation appear to work normally; however, the deployment fails due to the above error.
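A sketch of those first two steps, assuming default names (rook-ceph namespace, rook-ceph-tools toolbox, and Rook's usual rook-ceph-osd-5 deployment name):

```
# Step 1: confirm osd.5 is gone from both the auth list and the OSD map.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph auth ls | grep 'osd\.5'
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree | grep 'osd\.5'
# Step 2: delete the leftover OSD deployment so the operator can recreate it.
kubectl -n rook-ceph delete deployment rook-ceph-osd-5
```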
Whether using a new disk or a manually cleaned disk, my results are always the same for the disk trying to reuse the osd.5 ID.
My issue has been resolved. Removing the resource constraints I had placed allowed the disk creation to function normally.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Is this a bug report or feature request? Bug Report
Deviation from expected behavior:
Rook continues to add a deployment for an OSD that was removed manually per the guide at https://github.com/rook/rook/blob/master/Documentation/ceph-osd-mgmt.md#purge-the-osd-manually
Expected behavior:
Rook does not remake an OSD deployment for a disk that was purged from the cluster while removeOSDsIfOutAndSafeToRemove is set to true.
How to reproduce it (minimal and precise):
1. Opt to remove an OSD manually.
2. Follow the instructions in the doc (sketched below).
3. Restart the node / mount the disk with a random filesystem.
4. Note that Rook remakes the OSD deployment.
5. Note that when adding a new node, the OSD count starts at the removed number, making keys not match.
This currently blocks me from adding new OSDs to the cluster.
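For context, the manual purge from that guide boils down to commands like these, run from the toolbox (a sketch; osd.4 / ID 4 is used for illustration):

```
# Mark the OSD out so data migrates off it.
ceph osd out osd.4
# Wait for rebalancing to finish (watch ceph status), then purge it.
ceph osd purge 4 --yes-i-really-mean-it
# Verify it is gone from the OSD map and the auth list.
ceph osd tree
ceph auth ls
```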
File(s) to submit:
- Cluster CR (custom resource), typically called cluster.yaml, if necessary
- Operator's logs, if necessary: https://pastebin.com/9xrDMCn1
- Crashing pod(s) logs, if necessary. Old OSD:
OSD_ID=4
OSD_UUID=ad8bbdc2-85e2-4202-b278-45b3f931b5cb
OSD_STORE_FLAG=--bluestore
OSD_DATA_DIR=/var/lib/ceph/osd/ceph-4
CV_MODE=lvm
DEVICE=/dev/ceph-block-dbs-8fb0c077-97b6-4b87-8dab-5c5bd693f931/osd-block-db-7aac0b82-6d1b-4836-984e-7d76a3d6e276
METADATA_DEVICE=
WAL_DEVICE=
[[ lvm == \l\v\m ]]
++ mktemp -d
TMP_DIR=/tmp/tmp.E5Jwqolt78
ceph-volume lvm activate --no-systemd --bluestore 4 ad8bbdc2-85e2-4202-b278-45b3f931b5cb
Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in init
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 151, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 42, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 370, in main
self.activate(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 294, in activate
activate_bluestore(lvs, args.no_systemd)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 149, in activate_bluestore
raise RuntimeError('could not find a bluestore OSD to activate')
RuntimeError: could not find a bluestore OSD to activate
New prepare job for the new node: https://pastebin.com/yu3h4SZm. It prepares one disk and then errors because of the key issue. On the next run it just skips everything (https://pastebin.com/3U3xCKSK), leaving the node only partially prepared.
Environment:
- OS/Kernel (uname -a): Linux pfdc-store-2 5.11.21-hardened1-2-hardened #1 SMP PREEMPT Fri, 14 May 2021 21:06:07 +0000 x86_64 GNU/Linux and Linux ryzen1 5.8.0-53-generic #60-Ubuntu SMP Thu May 6 07:46:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Rook version (rook version inside of a Rook Pod): rook: v1.5.8, go: go1.13.8
- Storage backend version (ceph -v): ceph version 15.2.9 (357616cbf726abb779ca75a551e8d02568e15b17) octopus (stable)
- Kubernetes version (kubectl version): Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"windows/amd64"}; Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:03:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
- Storage backend status (ceph health in the Rook Ceph toolbox):