Is that happening for all your containers?
How's Ceph run on those systems? Docker-backed Ceph has caused this kind of issue in the past, though even then I'd have expected to see the rbd device show up in someone's mount table.
It happens to some of them, randomly.
We're running a cluster of MicroCeph (snap).
# python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.open('/dev/rbd0', os.O_EXCL)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/rbd0'
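For context, O_EXCL on an already-existing block device requests an exclusive claim from the kernel (the same claim that mount or mkfs takes), so EBUSY here means something in the kernel still holds the device, not necessarily a userspace process. A minimal sketch of that probe as a reusable check, assuming the same /dev/rbd0 as above:

import errno
import os

def rbd_in_use(dev="/dev/rbd0"):
    # O_EXCL without O_CREAT on a block device asks for an exclusive
    # claim; the open fails with EBUSY while any other holder exists.
    try:
        fd = os.open(dev, os.O_RDONLY | os.O_EXCL)
    except OSError as e:
        if e.errno == errno.EBUSY:
            return True
        raise
    os.close(fd)
    return False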
but I can't find what's keeping it busy.
# lsof 2>&1 | grep rbd0 | grep -v 'no pwd entry'
rbd0-task 492026 root cwd DIR 8,194 4096 2 /
rbd0-task 492026 root rtd DIR 8,194 4096 2 /
rbd0-task 492026 root txt unknown /proc/492026/exe
jbd2/rbd0 492043 root cwd DIR 8,194 4096 2 /
jbd2/rbd0 492043 root rtd DIR 8,194 4096 2 /
jbd2/rbd0 492043 root txt unknown /proc/492043/exe
ps shows:
root 492026 2 0 Aug07 ? 00:00:00 [rbd0-tasks]
root 492043 2 0 Aug07 ? 00:00:00 [jbd2/rbd0-8]
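Both of those are kernel threads (PPID 2, bracketed names), so lsof can't point at a userspace holder. A hedged sketch of the other places worth checking, assuming standard /proc and /sys layouts; the scan itself is illustrative, not something from this thread:

import os
import stat

DEV = "/dev/rbd0"
rdev = os.stat(DEV).st_rdev

# 1. Mounted in this mount namespace?
with open("/proc/mounts") as f:
    print([line.rstrip() for line in f if line.startswith(DEV + " ")])

# 2. Claimed by a stacked device (dm, md, ...)?
print(os.listdir("/sys/block/rbd0/holders"))

# 3. Held open by any process?
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        for fd in os.listdir(f"/proc/{pid}/fd"):
            st = os.stat(f"/proc/{pid}/fd/{fd}")
            if stat.S_ISBLK(st.st_mode) and st.st_rdev == rdev:
                print(pid, fd)
    except OSError:
        continue  # process exited mid-scan, or no permission

Note that a mount pinned inside another mount namespace (e.g. a container's) won't show up in this namespace's /proc/mounts, which would also fit the symptoms here.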
# cat /proc/492026/stack
[<0>] rescuer_thread+0x321/0x3c0
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# cat /proc/492043/stack
[<0>] kjournald2+0x219/0x280
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# rbd info remote/[REDACTED]
rbd image '[REDACTED]':
    size 20 GiB in 5120 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 6a464890b0c9c
    block_name_prefix: rbd_data.6a464890b0c9c
    format: 2
    features: layering
    op_features:
    flags:
    create_timestamp: Wed Aug 7 18:40:22 2024
    access_timestamp: Wed Aug 7 18:40:22 2024
    modify_timestamp: Wed Aug 7 18:40:22 2024
    parent: remote/image_b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287_ext4@readonly
    overlap: 10 GiB
# rbd status -p remote [REDACTED]
Watchers:
watcher=[REDACTED_SAME_SERVER_IP]:0/401004797 client.385955 cookie=18446462598732841706
# cat /sys/kernel/debug/ceph/ad848cbe-c127-4fc9-aeca-4a297799a866.client385955/osdc | grep 6a46
18446462598732841706 osd13 4.68c2cd33 4.13 [13,4,21]/13 [13,4,21]/13 e2479 rbd_header.6a464890b0c9c 0x20 0 WC/0
# rados stat -p remote rbd_header.6a464890b0c9c
remote/rbd_header.6a464890b0c9c mtime 2024-08-07T18:40:28.000000+0000, size 0
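So the kernel RBD client on this same host still holds a watch on the image header, which lines up with the EBUSY. For what it's worth, the same information can be pulled out of that debugfs file programmatically; a sketch, with the fsid and client id copied from the path above (both are cluster-specific):

# The osdc debugfs file lists in-flight and linger (watch) requests.
OSDC = ("/sys/kernel/debug/ceph/"
        "ad848cbe-c127-4fc9-aeca-4a297799a866.client385955/osdc")

with open(OSDC) as f:
    for line in f:
        parts = line.split()
        # In the watch lines the object name follows the osdmap epoch
        # column, as in the grep output above.
        if len(parts) > 7 and parts[7].startswith("rbd_header."):
            print(f"cookie {parts[0]} -> {parts[7]} via {parts[1]}")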
That's starting to sound more and more like a kernel bug...
Any chance you can try a newer kernel? Maybe try the 22.04 HWE kernel to get onto 6.5?
I'll upgrade and check.
@trunet any update on this one?
Looks like this is a MicroCeph issue, but I didn't have time to troubleshoot it properly.
In any case, we have clusters deployed using Ceph natively and those are working fine. We'll re-provision the affected ones with normal Ceph.
Will close it. Thanks!
Required information
Issue description
The container stops, but the command returns an error, and after that the container can't be deleted without manual Ceph workarounds.
Steps to reproduce
incus start my-container
incus stop my-container
incus delete my-container
Information to attach
dmesg - NOTHING
incus info NAME --show-log:
Log:
architecture: x86_64
config:
  cloud-init.user-data: |+
    #cloud-config
  image.aliases: "24.04"
  image.architecture: amd64
  image.description: Ubuntu 24.04 noble (20240729_20:58:30)
  image.os: Ubuntu
  image.release: noble
  image.requirements.cgroup: v2
  image.serial: "20240729_20:58:30"
  image.type: squashfs
  image.variant: cloud
  limits.cpu.allowance: 50%
  limits.memory: 1GiB
  migration.stateful: "true"
  volatile.base_image: b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287
  volatile.cloud-init.instance-id: cd4c191d-1a13-4e57-a1f9-f81f97eb65c3
  volatile.eth0.hwaddr: 00:16:3e:7d:83:84
  volatile.eth0.last_state.ip_addresses: [REDACTED]
  volatile.eth0.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: d567218b-d0e1-4987-8387-d36aca06f6ae
  volatile.uuid.generation: d567218b-d0e1-4987-8387-d36aca06f6ae
devices:
  audit:
    path: /opt/vault-audit
    pool: remote
    source: [REDACTED]
    type: disk
  data:
    path: /opt/vault
    pool: remote
    source: [REDACTED]
    type: disk
  eth0:
    network: ovn-vault
    type: nic
  root:
    path: /
    pool: remote
    size: 20GiB
    type: disk
ephemeral: false
profiles: