lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Can't delete container with ceph storage - exit status 16 (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy) #1087

Closed: trunet closed this 1 month ago

trunet commented 3 months ago

Required information

Issue description

The container stops, but the command errors out, and the container can't be deleted without manual Ceph workarounds (see the sketch after the steps below).

Steps to reproduce

  1. incus start my-container
  2. incus stop my-container
    Error: Failed unmounting instance: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
    rbd: unmap failed: (16) Device or resource busy)
    Try `incus info --show-log my-container` for more info
  3. incus delete my-container
    Error: Failed deleting instance "[REDACTED]" in project "default": Error deleting storage volume: Failed to delete volume: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
    rbd: unmap failed: (16) Device or resource busy)
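
The kind of manual Ceph workaround alluded to above is, as a rough sketch (not necessarily the exact steps used here), to force the kernel client to drop the mapping and then retry the delete; `rbd unmap` accepts a krbd `force` option for this:

# WARNING: "-o force" discards the kernel client's outstanding I/O for the
# device, so only use it once the container is stopped and nothing should
# still be writing to the volume.
rbd --id admin --cluster ceph --pool remote unmap -o force container_my-container
incus delete my-container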

Information to attach

# rbd showmapped
...
0   remote             container_my-container                     -     /dev/rbd0
...

# grep rbd0 /proc/*/mountinfo
[EMPTY]

# grep rbd0 /proc/self/mountinfo
[EMPTY]
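
A few extra checks (not part of the original report) that can show what still holds the device when nothing appears in any mountinfo, e.g. after a lazy/detached unmount or a stacked user of the block device:

# Stacked users of the device (device-mapper, etc.)
ls /sys/block/rbd0/holders/

# An entry here means the ext4 superblock on rbd0 is still active,
# even if the mount is no longer visible in any mount table.
ls -d /sys/fs/ext4/rbd0

# Processes that still have the block device itself open
fuser -v /dev/rbd0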

Log:

 - [x] Container configuration (`incus config show NAME --expanded`)

architecture: x86_64
config:
  cloud-init.user-data: |+
    #cloud-config
    write_files:
      - path: /etc/sssd/add_group_access_from_cloudinit.conf
        content: |
          [REDACTED]
        owner: 'root:root'
        permissions: '0600'
  image.aliases: 24.04
  image.architecture: amd64
  image.description: Ubuntu 24.04 noble (20240729_20:58:30)
  image.os: Ubuntu
  image.release: noble
  image.requirements.cgroup: v2
  image.serial: "20240729_20:58:30"
  image.type: squashfs
  image.variant: cloud
  limits.cpu.allowance: 50%
  limits.memory: 1GiB
  migration.stateful: "true"
  volatile.base_image: b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287
  volatile.cloud-init.instance-id: cd4c191d-1a13-4e57-a1f9-f81f97eb65c3
  volatile.eth0.hwaddr: 00:16:3e:7d:83:84
  volatile.eth0.last_state.ip_addresses: [REDACTED]
  volatile.eth0.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: d567218b-d0e1-4987-8387-d36aca06f6ae
  volatile.uuid.generation: d567218b-d0e1-4987-8387-d36aca06f6ae
devices:
  audit:
    path: /opt/vault-audit
    pool: remote
    source: [REDACTED]
    type: disk
  data:
    path: /opt/vault
    pool: remote
    source: [REDACTED]
    type: disk
  eth0:
    network: ovn-vault
    type: nic
  root:
    path: /
    pool: remote
    size: 20GiB
    type: disk
ephemeral: false
profiles:

stgraber commented 3 months ago

Is that happening for all your containers?

How is Ceph run on those systems? Docker-backed Ceph has caused this kind of issue in the past, but even then I'd have expected the rbd device to show up in someone's mount table.

trunet commented 3 months ago

It happens to some containers, seemingly at random.

We're running a cluster of MicroCeph (snap).

trunet commented 3 months ago

# python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.open('/dev/rbd0', os.O_EXCL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/rbd0'

but I can't find what's keeping it busy.

trunet commented 3 months ago

# lsof 2>&1 | grep rbd0 | grep -v 'no pwd entry'
rbd0-task 492026                             root  cwd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  rtd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  txt   unknown                                          /proc/492026/exe
jbd2/rbd0 492043                             root  cwd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  rtd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  txt   unknown                                          /proc/492043/exe

ps shows:

root      492026       2  0 Aug07 ?        00:00:00   [rbd0-tasks]
root      492043       2  0 Aug07 ?        00:00:00   [jbd2/rbd0-8]

# cat /proc/492026/stack
[<0>] rescuer_thread+0x321/0x3c0
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# cat /proc/492043/stack
[<0>] kjournald2+0x219/0x280
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
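
For context: `jbd2/rbd0-8` is the ext4 journal kthread for /dev/rbd0, and it only exists while an ext4 superblock on that device is still alive, so the filesystem apparently was never fully released even though nothing shows up in mountinfo. On kernels that expose jbd2 procfs stats, the live journal can be confirmed with the following check (not something run in this thread):

# Stats for the still-active ext4 journal on rbd0 (the name matches the
# jbd2/rbd0-8 kthread); this entry disappears once the superblock is torn down.
cat /proc/fs/jbd2/rbd0-8/info
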
# rbd info remote/[REDACTED]
rbd image '[REDACTED]':
    size 20 GiB in 5120 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 6a464890b0c9c
    block_name_prefix: rbd_data.6a464890b0c9c
    format: 2
    features: layering
    op_features:
    flags:
    create_timestamp: Wed Aug  7 18:40:22 2024
    access_timestamp: Wed Aug  7 18:40:22 2024
    modify_timestamp: Wed Aug  7 18:40:22 2024
    parent: remote/image_b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287_ext4@readonly
    overlap: 10 GiB

# rbd status -p remote [REDACTED]
Watchers:
    watcher=[REDACTED_SAME_SERVER_IP]:0/401004797 client.385955 cookie=18446462598732841706

# cat /sys/kernel/debug/ceph/ad848cbe-c127-4fc9-aeca-4a297799a866.client385955/osdc | grep 6a46
18446462598732841706    osd13   4.68c2cd33  4.13    [13,4,21]/13    [13,4,21]/13    e2479   rbd_header.6a464890b0c9c    0x20    0   WC/0

# rados stat -p remote rbd_header.6a464890b0c9c
remote/rbd_header.6a464890b0c9c mtime 2024-08-07T18:40:28.000000+0000, size 0
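
If the remaining watcher were a stale one left behind by a dead client (here it is the local kernel client that still holds the mapping), a generic Ceph-side workaround is to blocklist the watcher's address so the OSDs drop it, then retry the unmap. A sketch, not something tried in this thread; note that blocklisting cuts off every session from that client address:

# Address and nonce taken from the "rbd status" output above.
ceph osd blocklist add [REDACTED_SAME_SERVER_IP]:0/401004797
rbd --id admin --cluster ceph --pool remote unmap container_my-container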

stgraber commented 3 months ago

That's starting to sound more and more like a kernel bug...

stgraber commented 3 months ago

Any chance you can try a newer kernel? Maybe try the 22.04 HWE kernel to get onto 6.5?
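
For reference, moving a 22.04 host onto the HWE kernel (6.5 at the time) is a single package plus a reboot; a sketch assuming stock Ubuntu packaging:

# Pulls in the current 22.04 hardware-enablement kernel and boots into it.
apt install linux-generic-hwe-22.04
reboot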

trunet commented 3 months ago

I'll upgrade and check

stgraber commented 1 month ago

@trunet any update on this one?

trunet commented 1 month ago

Looks like this is a MicroCeph issue, but I didn't have time to troubleshoot it properly.

In any case, we have clusters deployed with Ceph natively and those are working fine. We'll re-provision these with regular Ceph.

I'll close this. Thanks!