jiangcuo / Proxmox-Port

Proxmox VE arm64 riscv64 loongarch64
GNU Affero General Public License v3.0

RSP5 OSD Down after reboot #58

Open · wuast94 opened this issue 3 months ago

wuast94 commented 3 months ago

Describe the bug
When I create a Ceph OSD it works without problems, but as soon as I reboot the node the OSD won't come back up.

To Reproduce
Steps to reproduce the behavior (a rough CLI equivalent is sketched after the list):

  1. Install Proxmox and Ceph (using your repos of course)
  2. Create OSD
  3. Reboot
  4. OSD gone
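
For reference, a rough CLI equivalent of these steps, as a sketch only: it assumes the standard pveceph tooling and the /dev/sda disk from the task output below.

pveceph install              # install the Ceph packages from the configured (ported) repo
pveceph osd create /dev/sda  # create the bluestore OSD on the spare disk
reboot                       # after the reboot the OSD service stays down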

ENV (please complete the following information):

Additional context

systemctl status ceph-osd@*.service doesn't return anything, and journalctl -xeu ceph-osd@2.service shows no entries either.

I double-checked that I am on the right host and using the right OSD number.
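
A compact sketch of these checks in one block (OSD id 2 and the paths are taken from this report):

systemctl status 'ceph-osd@*.service'   # no output for any OSD unit
journalctl -xeu ceph-osd@2.service      # no journal entries
ls -la /var/lib/ceph/osd/ceph-2         # the OSD data directory is empty after the reboot
lsblk                                   # sda shows no LVM child until the VG is activated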

Output of the OSD creation task:


The ZFS modules cannot be auto-loaded.
Try running 'modprobe zfs' as root to manually load them.
command '/sbin/zpool list -HPLv' failed: exit code 1

create OSD on /dev/sda (bluestore)
wiping block device /dev/sda
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 1.044 s, 201 MB/s
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 8331d767-af24-40da-bac0-ccbaf0fcda92
Running command: vgcreate --force --yes ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099 /dev/sda
 stdout: Physical volume "/dev/sda" successfully created.
 stdout: Volume group "ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099" successfully created
Running command: lvcreate --yes -l 476924 -n osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92 ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099
 stdout: Logical volume "osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92" created.
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
Running command: /bin/chown -h ceph:ceph /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92
Running command: /bin/chown -R ceph:ceph /dev/dm-0
Running command: /bin/ln -s /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92 /var/lib/ceph/osd/ceph-2/block
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap
 stderr: 2024-03-13T09:48:04.607+0100 7fb083f180 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2024-03-13T09:48:04.607+0100 7fb083f180 -1 AuthRegistry(0x7fac063e30) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: got monmap epoch 5
--> Creating keyring file for osd.2
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/keyring
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/
Running command: /bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-2/ --osd-uuid 8331d767-af24-40da-bac0-ccbaf0fcda92 --setuser ceph --setgroup ceph
 stderr: 2024-03-13T09:48:05.071+0100 7f93c67040 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
 stderr: 2024-03-13T09:48:05.075+0100 7f93c67040 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
 stderr: 2024-03-13T09:48:05.075+0100 7f93c67040 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
 stderr: 2024-03-13T09:48:05.079+0100 7f93c67040 -1 bluestore(/var/lib/ceph/osd/ceph-2/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: /dev/sda
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92 --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /bin/ln -snf /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92 /var/lib/ceph/osd/ceph-2/block
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /bin/chown -R ceph:ceph /dev/dm-0
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /bin/systemctl enable ceph-volume@lvm-2-8331d767-af24-40da-bac0-ccbaf0fcda92
 stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-2-8331d767-af24-40da-bac0-ccbaf0fcda92.service -> /lib/systemd/system/ceph-volume@.service.
Running command: /bin/systemctl enable --runtime ceph-osd@2
 stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@2.service -> /lib/systemd/system/ceph-osd@.service.
Running command: /bin/systemctl start ceph-osd@2
--> ceph-volume lvm activate successful for osd ID: 2
--> ceph-volume lvm create successful for: /dev/sda
TASK OK
jiangcuo commented 3 months ago

Are there OSD logs in /var/log/ceph?
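
A quick way to pull those logs; the file names below are the Ceph defaults, so treat them as an assumption:

ls /var/log/ceph/                          # per-daemon logs plus ceph-volume.log
tail -n 50 /var/log/ceph/ceph-volume.log   # activation attempts run at boot
tail -n 50 /var/log/ceph/ceph-osd.2.log    # only written once the OSD has actually started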

wuast94 commented 3 months ago

From the ceph-volume log:
[2024-03-13 13:19:12,115][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm trigger 1-18b2426f-90d1-4992-847c-a52b7ef19dc7
[2024-03-13 13:19:12,120][ceph_volume.util.system][WARNING] Executable lvs not found on the host, will return lvs as-is
[2024-03-13 13:19:12,120][ceph_volume.process][INFO  ] Running command: lvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -S tags={ceph.osd_id=1,ceph.osd_fsid=18b2426f-90d1-4992-847c-a52b7ef19dc7} -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2024-03-13 13:19:12,151][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm trigger 2-611efe97-8305-4a23-9559-33dd95bce599
[2024-03-13 13:19:12,154][ceph_volume.util.system][WARNING] Executable lvs not found on the host, will return lvs as-is
[2024-03-13 13:19:12,154][ceph_volume.process][INFO  ] Running command: lvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -S tags={ceph.osd_id=2,ceph.osd_fsid=611efe97-8305-4a23-9559-33dd95bce599} -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2024-03-13 13:19:12,192][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/main.py", line 46, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/trigger.py", line 70, in main
    Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 281, in main
    self.activate(args)
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 197, in activate
    raise RuntimeError('could not find osd.%s with osd_fsid %s' %
RuntimeError: could not find osd.1 with osd_fsid 18b2426f-90d1-4992-847c-a52b7ef19dc7
[2024-03-13 13:19:12,220][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/main.py", line 46, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/trigger.py", line 70, in main
    Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 281, in main
    self.activate(args)
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 197, in activate
    raise RuntimeError('could not find osd.%s with osd_fsid %s' %
RuntimeError: could not find osd.2 with osd_fsid 611efe97-8305-4a23-9559-33dd95bce599
[2024-03-13 13:19:12,365][ceph_volume.main][INFO  ] Running command: ceph-volume  lvm trigger 2-8331d767-af24-40da-bac0-ccbaf0fcda92
[2024-03-13 13:19:12,368][ceph_volume.util.system][WARNING] Executable lvs not found on the host, will return lvs as-is
[2024-03-13 13:19:12,368][ceph_volume.process][INFO  ] Running command: lvs --noheadings --readonly --separator=";" -a --units=b --nosuffix -S tags={ceph.osd_id=2,ceph.osd_fsid=8331d767-af24-40da-bac0-ccbaf0fcda92} -o lv_tags,lv_path,lv_name,vg_name,lv_uuid,lv_size
[2024-03-13 13:19:12,428][ceph_volume.process][INFO  ] stdout ceph.block_device=/dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92,ceph.block_uuid=OUftbF-UGG7-RZfB-tgrn-2KtY-JJe4-5RT0jM,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=594dd1f3-8f66-4a84-bb9b-ab7b6437e739,ceph.cluster_name=ceph,ceph.crush_device_class=,ceph.encrypted=0,ceph.osd_fsid=8331d767-af24-40da-bac0-ccbaf0fcda92,ceph.osd_id=2,ceph.osdspec_affinity=,ceph.type=block,ceph.vdo=0";"/dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92";"osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92";"ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099";"OUftbF-UGG7-RZfB-tgrn-2KtY-JJe4-5RT0jM";"2000364240896
[2024-03-13 13:19:12,428][ceph_volume.devices.lvm.activate][INFO  ] auto detecting objectstore
[2024-03-13 13:19:12,432][ceph_volume.devices.lvm.activate][DEBUG ] Found block device (osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92) with encryption: False
[2024-03-13 13:19:12,432][ceph_volume.devices.lvm.activate][DEBUG ] Found block device (osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92) with encryption: False
[2024-03-13 13:19:12,432][ceph_volume.process][INFO  ] Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
[2024-03-13 13:19:12,433][ceph_volume.process][INFO  ] Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92 --path /var/lib/ceph/osd/ceph-2 --no-mon-config
[2024-03-13 13:19:12,464][ceph_volume.process][INFO  ] stderr failed to read label for /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92: (2) No such file or directory
2024-03-13T13:19:12.460+0100 7fb6a1a040 -1 bluestore(/dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92) _read_bdev_label failed to open /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/osd-block-8331d767-af24-40da-bac0-ccbaf0fcda92: (2) No such file or directory
[2024-03-13 13:19:12,467][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line 153, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/main.py", line 46, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py", line 194, in dispatch
    instance.main()
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/trigger.py", line 70, in main
    Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 281, in main
    self.activate(args)
  File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 205, in activate
    return activate_bluestore(lvs, args.no_systemd)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/activate.py", line 112, in activate_bluestore
    process.run(prime_command)
  File "/usr/lib/python3/dist-packages/ceph_volume/process.py", line 147, in run
    raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 1

wuast94 commented 3 months ago

More context:

lvs --version

  LVM version:     2.03.16(2) (2022-05-18)
  Library version: 1.02.185 (2022-05-18)
  Driver version:  4.48.0
  Configuration:   ./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --libdir=/lib/aarch64-linux-gnu --sbindir=/sbin --with-usrlibdir=/usr/lib/aarch64-linux-gnu --with-optimisation=-O2 --with-cache=internal --with-device-uid=0 --with-device-gid=6 --with-device-mode=0660 --with-default-pid-dir=/run --with-default-run-dir=/run/lvm --with-default-locking-dir=/run/lock/lvm --with-thin=internal --with-thin-check=/usr/sbin/thin_check --with-thin-dump=/usr/sbin/thin_dump --with-thin-repair=/usr/sbin/thin_repair --with-udev-prefix=/ --enable-applib --enable-blkid_wiping --enable-cmdlib --enable-dmeventd --enable-editline --enable-lvmlockd-dlm --enable-lvmlockd-sanlock --enable-lvmpolld --enable-notify-dbus --enable-pkgconfig --enable-udev_rules --enable-udev_sync --disable-readline

lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda           8:0    0   1.8T  0 disk 
nvme0n1     259:0    0 238.5G  0 disk

vgchange -ay

1 logical volume(s) in volume group "ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099" now active

lsblk after vgchange -ay

NAME                                                                                                  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                                     8:0    0   1.8T  0 disk 
└─ceph--b9cc563f--5758--4ead--bbec--74c6aafb7099-osd--block--8331d767--af24--40da--bac0--ccbaf0fcda92 254:0    0   1.8T  0 lvm

/var/lib/ceph/osd/ceph-2 is empty
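
A minimal check, using the VG and LV names from the task output above, showing why activation fails at boot: the /dev/<vg>/<lv> node that the block symlink points at does not exist until the volume group is activated.

ls -l /var/lib/ceph/osd/ceph-2/                      # empty, as noted above
ls /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/   # "No such file or directory" before activation
vgchange -ay
ls /dev/ceph-b9cc563f-5758-4ead-bbec-74c6aafb7099/   # osd-block-8331d767-... is now present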

wuast94 commented 3 months ago

Found a workaround: after a restart, running vgchange -ay activates the logical volumes, and then all the automation takes over.

If the restart was longer ago and the automation has already run into problems, running ceph-volume lvm activate --all afterwards brings the OSD back up again.

Adding @reboot /usr/sbin/vgchange -ay >> /var/log/vgchange.log 2>&1 to my crontab fixes the issue for me.
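
The same recovery as a single sequence, a sketch of the workaround above (the OSD id follows this report):

vgchange -ay                      # bring the inactive ceph-* logical volumes online
ceph-volume lvm activate --all    # re-run activation for OSDs whose boot-time trigger failed
systemctl status ceph-osd@2       # the OSD service should now be running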

This is a workaround that fixes my specific error, but I think something is off that also impacts hot-plugging etc.

I hope all this information helps to get this fixed 😊