lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Can't delete container with ceph storage - exit status 16 (rbd: sysfs write failed rbd: unmap failed: (16) Device or resource busy) #1087

Closed: trunet closed this 1 month ago

trunet commented 3 months ago

Required information

Issue description

The container stops, but the command errors out, and the container can't be deleted without manual Ceph workarounds (see the sketch after the steps below).

Steps to reproduce

  1. incus start my-container
  2. incus stop my-container
    Error: Failed unmounting instance: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
    rbd: unmap failed: (16) Device or resource busy)
    Try `incus info --show-log my-container` for more info
  3. incus delete my-container
    Error: Failed deleting instance "[REDACTED]" in project "default": Error deleting storage volume: Failed to delete volume: Failed to run: rbd --id admin --cluster ceph --pool remote unmap container_my-container: exit status 16 (rbd: sysfs write failed
    rbd: unmap failed: (16) Device or resource busy)
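
The kind of manual Ceph workaround alluded to above is, as a rough sketch (not necessarily the exact steps used here), to force the kernel client to drop the mapping and then retry the delete; `rbd unmap` accepts a krbd `force` option for this:

# WARNING: "-o force" discards the kernel client's outstanding I/O for the
# device, so only use it once the container is stopped and nothing should
# still be writing to the volume.
rbd --id admin --cluster ceph --pool remote unmap -o force container_my-container
incus delete my-container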

Information to attach

# rbd showmapped
...
0   remote             container_my-container                     -     /dev/rbd0
...

# grep rbd0 /proc/*/mountinfo
[EMPTY]

# grep rbd0 /proc/self/mountinfo
[EMPTY]
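
A few extra checks (not part of the original report) that can show what still holds the device when nothing appears in any mountinfo, e.g. after a lazy/detached unmount or a stacked user of the block device:

# Stacked users of the device (device-mapper, etc.)
ls /sys/block/rbd0/holders/

# An entry here means the ext4 superblock on rbd0 is still active,
# even if the mount is no longer visible in any mount table.
ls -d /sys/fs/ext4/rbd0

# Processes that still have the block device itself open
fuser -v /dev/rbd0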

Log:

 - [x] Container configuration (`incus config show NAME --expanded`)

architecture: x86_64
config:
  cloud-init.user-data: |+
    #cloud-config
    write_files:
      - path: /etc/sssd/add_group_access_from_cloudinit.conf
        content: |
          [REDACTED]
        owner: 'root:root'
        permissions: '0600'
  image.aliases: 24.04
  image.architecture: amd64
  image.description: Ubuntu 24.04 noble (20240729_20:58:30)
  image.os: Ubuntu
  image.release: noble
  image.requirements.cgroup: v2
  image.serial: "20240729_20:58:30"
  image.type: squashfs
  image.variant: cloud
  limits.cpu.allowance: 50%
  limits.memory: 1GiB
  migration.stateful: "true"
  volatile.base_image: b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287
  volatile.cloud-init.instance-id: cd4c191d-1a13-4e57-a1f9-f81f97eb65c3
  volatile.eth0.hwaddr: 00:16:3e:7d:83:84
  volatile.eth0.last_state.ip_addresses: [REDACTED]
  volatile.eth0.name: eth0
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: d567218b-d0e1-4987-8387-d36aca06f6ae
  volatile.uuid.generation: d567218b-d0e1-4987-8387-d36aca06f6ae
devices:
  audit:
    path: /opt/vault-audit
    pool: remote
    source: [REDACTED]
    type: disk
  data:
    path: /opt/vault
    pool: remote
    source: [REDACTED]
    type: disk
  eth0:
    network: ovn-vault
    type: nic
  root:
    path: /
    pool: remote
    size: 20GiB
    type: disk
ephemeral: false
profiles:

stgraber commented 3 months ago

Is that happening for all your containers?

How is Ceph run on those systems? Docker-backed Ceph has caused this kind of issue in the past, but even then I'd have expected the rbd device to show up in someone's mount table.

trunet commented 3 months ago

It happens to some containers, seemingly at random.

We're running a cluster of MicroCeph (snap).

trunet commented 3 months ago

# python3
Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.open('/dev/rbd0', os.O_EXCL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/rbd0'

but I can't find what's keeping it busy.

trunet commented 3 months ago

# lsof 2>&1 | grep rbd0 | grep -v 'no pwd entry'
rbd0-task 492026                             root  cwd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  rtd       DIR              8,194       4096          2 /
rbd0-task 492026                             root  txt   unknown                                          /proc/492026/exe
jbd2/rbd0 492043                             root  cwd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  rtd       DIR              8,194       4096          2 /
jbd2/rbd0 492043                             root  txt   unknown                                          /proc/492043/exe

ps shows:

root      492026       2  0 Aug07 ?        00:00:00   [rbd0-tasks]
root      492043       2  0 Aug07 ?        00:00:00   [jbd2/rbd0-8]

# cat /proc/492026/stack
[<0>] rescuer_thread+0x321/0x3c0
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
# cat /proc/492043/stack
[<0>] kjournald2+0x219/0x280
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
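
For context: `jbd2/rbd0-8` is the ext4 journal kthread for /dev/rbd0, and it only exists while an ext4 superblock on that device is still alive, so the filesystem apparently was never fully released even though nothing shows up in mountinfo. On kernels that expose jbd2 procfs stats, the live journal can be confirmed with the following check (not something run in this thread):

# Stats for the still-active ext4 journal on rbd0 (the name matches the
# jbd2/rbd0-8 kthread); this entry disappears once the superblock is torn down.
cat /proc/fs/jbd2/rbd0-8/info
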
# rbd info remote/[REDACTED]
rbd image '[REDACTED]':
    size 20 GiB in 5120 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 6a464890b0c9c
    block_name_prefix: rbd_data.6a464890b0c9c
    format: 2
    features: layering
    op_features:
    flags:
    create_timestamp: Wed Aug  7 18:40:22 2024
    access_timestamp: Wed Aug  7 18:40:22 2024
    modify_timestamp: Wed Aug  7 18:40:22 2024
    parent: remote/image_b0127e4d2d45b502024a667dbebb5869327cc7ebfceba529c0ff556d376f9287_ext4@readonly
    overlap: 10 GiB

# rbd status -p remote [REDACTED]
Watchers:
    watcher=[REDACTED_SAME_SERVER_IP]:0/401004797 client.385955 cookie=18446462598732841706

# cat /sys/kernel/debug/ceph/ad848cbe-c127-4fc9-aeca-4a297799a866.client385955/osdc | grep 6a46
18446462598732841706    osd13   4.68c2cd33  4.13    [13,4,21]/13    [13,4,21]/13    e2479   rbd_header.6a464890b0c9c    0x20    0   WC/0

# rados stat -p remote rbd_header.6a464890b0c9c
remote/rbd_header.6a464890b0c9c mtime 2024-08-07T18:40:28.000000+0000, size 0
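
If the remaining watcher were a stale one left behind by a dead client (here it is the local kernel client that still holds the mapping), a generic Ceph-side workaround is to blocklist the watcher's address so the OSDs drop it, then retry the unmap. A sketch, not something tried in this thread; note that blocklisting cuts off every session from that client address:

# Address and nonce taken from the "rbd status" output above.
ceph osd blocklist add [REDACTED_SAME_SERVER_IP]:0/401004797
rbd --id admin --cluster ceph --pool remote unmap container_my-container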

stgraber commented 3 months ago

That's starting to sound more and more like a kernel bug...

stgraber commented 3 months ago

Any chance you can try a newer kernel? Maybe try the 22.04 HWE kernel to get onto 6.5?
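
For reference, moving a 22.04 host onto the HWE kernel (6.5 at the time) is a single package plus a reboot; a sketch assuming stock Ubuntu packaging:

# Pulls in the current 22.04 hardware-enablement kernel and boots into it.
apt install linux-generic-hwe-22.04
reboot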

trunet commented 3 months ago

I'll upgrade and check

stgraber commented 1 month ago

@trunet any update on this one?

trunet commented 1 month ago

Looks like this is a MicroCeph issue, but I didn't have time to troubleshoot it properly.

In any case, we have clusters deployed with Ceph natively and those are working fine. We'll re-provision these with regular Ceph.

I'll close this. Thanks!