lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

VM disk hotplug issue (running out of hotplug slots) #1086

Closed srd424 closed 1 week ago

srd424 commented 1 month ago


Issue description

virtiofs hotplug seems to fail whenever I try to use it:

steved@xubuntu:~$ incus config device add vchost1 ostree3 disk source=/vol/ostree path=/vol/contpool/ostree2
Error: Failed to start device "ostree3": Failed to add the virtiofs device: Bus 'qemu_pcie21' not found

I'm wondering if this is a logic error in the code here:

https://github.com/lxc/incus/blob/3ce8af031db73cba39b9af465ebadd0dbc3c7cff/internal/server/instance/drivers/driver_qemu.go#L2311-L2332

I don't fully understand it, but if I look in the raw QEMU config for this VM, the existing entries for all the virtiofs/9p devices show the bus as "qemu_pcie2", which makes me think this code should be doing something else!

stgraber commented 1 month ago
stgraber@dakara:~$ incus launch images:ubuntu/24.04 v1 --vm
Launching v1
stgraber@dakara:~$ incus config device add v1 etc disk source=/etc/ path=/mnt/etc
Device etc added to v1
stgraber@dakara:~$ incus exec v1 -- df -h /mnt/etc
Filesystem      Size  Used Avail Use% Mounted on
incus_etc        90G   25G   61G  29% /mnt/etc
stgraber@dakara:~$ 

Can you show a full incus config show --expanded ostree3?

Also, any chance you can test on an up to date version of Incus (6.0.1 for LTS, 6.3 for non-LTS)?

srd424 commented 1 month ago

I'm a bit time/resource constrained at the moment but will have a go. Looking at the code, I did wonder if having other virtiofs shares already defined at startup was part of the problem?


srd424 commented 1 month ago

I can't easily upgrade to 6.0.1 - I have some VMs running I really don't want to stop. I will look at spinning up another incus install somewhere though.

For the moment, I did create a fresh VM, started it, and added two virtiofs mounts OK. I then shut it down and restarted it, added one more OK, and then the fourth failed: Error: Failed to start device "test2": Failed to add the virtiofs device: Bus 'qemu_pcie9' not found

This might be a question of working backwards from the code to work out what the failing condition is...

srd424 commented 1 month ago

Here's the PCI topology after the tests described above:

           +-01.0-[01]--+-00.0  Red Hat, Inc. Virtio memory balloon [1af4:1045]
           |            +-00.1  Red Hat, Inc. Virtio RNG [1af4:1044]
           |            +-00.2  Red Hat, Inc. Virtio input [1af4:1052]
           |            +-00.3  Red Hat, Inc. Virtio input [1af4:1052]
           |            +-00.4  Red Hat, Inc. Virtio socket [1af4:1053]
           |            +-00.5  Red Hat, Inc. Virtio console [1af4:1043]
           |            \-00.6  Red Hat, Inc. QEMU XHCI Host Controller [1b36:000d]
           +-01.1-[02]----00.0  Red Hat, Inc. Virtio SCSI [1af4:1048]
           +-01.2-[03]--+-00.0  Red Hat, Inc. Virtio filesystem [1af4:1049]
           |            +-00.1  Red Hat, Inc. Virtio filesystem [1af4:1049]
           |            +-00.2  Red Hat, Inc. Virtio file system [1af4:105a]
           |            +-00.3  Red Hat, Inc. Virtio filesystem [1af4:1049]
           |            +-00.4  Red Hat, Inc. Virtio file system [1af4:105a]
           |            \-00.5  Red Hat, Inc. Virtio filesystem [1af4:1049]
           +-01.3-[04]----00.0  Red Hat, Inc. Virtio GPU [1af4:1050]
           +-01.4-[05]----00.0  Red Hat, Inc. Virtio network device [1af4:1041]
           +-01.5-[06]--
           +-01.6-[07]----00.0  Red Hat, Inc. Virtio file system [1af4:105a]
           +-01.7-[08]--
           +-02.0-[09]--
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918]
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930]
srd424 commented 1 month ago

Can you show a full incus config show --expanded ostree3?

Oops, sorry, missed this. I assume you meant for the VM, not the (failed) virtiofs disk device:

architecture: x86_64
config:
  limits.cpu: "4"
  limits.memory: 8GiB
  raw.qemu.conf: |
    [machine]
    kernel = /var/lib/incus/kernels/vchost-vmlinuz
    initrd = /var/lib/incus/kernels/vchost-initrd.img
    append = "root=LABEL=root rootflags=subvol=/rootA console=ttyS0 SYSTEMD_FSTAB=/config/fstab systemd.log_level=info systemd.hostname=vchost1 ip=192.168.160.161::192.168.128.1:255.255.128.0:vchost1:lan0:off debug=y"
  volatile.cloud-init.instance-id: 95ccb25d-8eaf-495e-b32f-80a28f534d06
  volatile.eth0.host_name: tapc34f7557
  volatile.eth0.hwaddr: 00:16:3e:f4:99:73
  volatile.last_state.power: RUNNING
  volatile.last_state.ready: "false"
  volatile.uuid: b9d00062-f6cb-4779-b7c5-1da216171475
  volatile.uuid.generation: b9d00062-f6cb-4779-b7c5-1da216171475
  volatile.vsock_id: "2799179731"
devices:
  cachepermpool:
    source: /dev/inthdd/cachepermpool
    type: disk
  cpool:
    source: /dev/ssd/vchost1-cpool
    type: disk
  eth0:
    nictype: bridged
    parent: balbr0
    type: nic
  home-data-net:
    source: /dev/inthdd/home-data-net
    type: disk
  iso:
    source: /dev/inthdd/iso
    type: disk
  nixstore:
    source: /dev/inthdd/nixstore-img
    type: disk
  ostree:
    readonly: "true"
    source: /dev/ssd/ostree
    type: disk
  pip-cache:
    source: /dev/inthdd/pip-cache
    type: disk
  root:
    path: /
    pool: vchost
    type: disk
  rootimg:
    readonly: "true"
    source: /dev/inthdd/vchost-rootfs
    type: disk
  sd-dropbox:
    source: /dev/ssd/sd-dropbox
    type: disk
  sharedconf:
    path: /vol/sharedconf
    source: /vol/clusterconf/shared
    type: disk
  user-cache:
    source: /dev/inthdd/user-cache
    type: disk
  vcconfig:
    path: /media/root-ro/config
    source: /vol/clusterconf
    type: disk
  xu-build:
    source: /dev/inthdd/xu-build
    type: disk
  xu-home:
    source: /dev/inthdd/xu-home
    type: disk
  xu-home-ssd:
    source: /dev/inthdd/xu-home-ssd
    type: disk
  xu-spool:
    source: /dev/inthdd/xu-spool
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
stgraber commented 1 month ago

I can't easily upgrade to 6.0.1 - I have some VMs running I really don't want to stop. I will look at spinning up another incus install somewhere though.

We don't restart workloads on upgrade, only the control plane (API) goes down during the upgrade.

stgraber commented 1 month ago

Anyway, it's likely caused by the high-ish number of devices. We have a reserve of, I believe, 8 hotplug slots, but it's supposed to be a sliding thing, basically allowing you to hotplug 8 more devices than whatever you started the VM with. The above suggests that this logic may not be behaving as intended and you're somehow running out of slots.

srd424 commented 1 month ago

We don't restart workloads on upgrade, only the control plane (API) goes down during the upgrade.

Oh, worth knowing, thanks! I did wonder if that was the case but couldn't quickly turn up the right docs.

Anyway, it's likely caused by the high-ish number of devices. We have a reserve of, I believe, 8 hotplug slots, but it's supposed to be a sliding thing, basically allowing you to hotplug 8 more devices than whatever you started the VM with. The above suggests that this logic may not be behaving as intended and you're somehow running out of slots.

My brain isn't working brilliantly at the moment, but poking around in the code, I did wonder if it should be trying to add devices to one of the existing buses rather than creating a new bus? I also noticed the block devices don't seem to end up in qemu.conf; I assume they're now set up using the QEMU monitor when the VM is started? I wondered if the virtiofs handling could take the same approach; at the moment I think there are different code paths for hotplug vs pre-configured mounts? But I very much did get lost in the code...

srd424 commented 1 month ago

BTW, can confirm this does happen on 6.0.1 too.

stgraber commented 1 month ago

My brain isn't working brilliantly at the moment, but poking around in the code, I did wonder if it should be trying to add devices to one of the existing buses rather than creating a new bus? I also noticed the block devices don't seem to end up in qemu.conf; I assume they're now set up using the QEMU monitor when the VM is started? I wondered if the virtiofs handling could take the same approach; at the moment I think there are different code paths for hotplug vs pre-configured mounts? But I very much did get lost in the code...

I'll need to look at the logic again, but I thought we made all the disks be hotplug as we want the ability to add/remove them.

The way things are supposed to work is that at startup we allocate PCIe root addresses for all the stuff that we're going to hotplug through QMP. Then we allocate an additional 8 PCIe root addresses to allow for things to be added later on.

We can't alter the PCIe root complex once the VM is running, so given that limited hotplug/hot-remove does work, the core of the logic seems fine. I suspect we just have an issue where we're somehow not properly pre-allocating some stuff, basically making your boot-time disks use the "spare" slots already, at which point you'd run out of slots and get the error.

So basically a few different things:

stgraber commented 1 week ago

Looking into this one now

stgraber commented 1 week ago

Starting with an empty VM and attempting to add 10 disks which each require a PCIe address (using io.bus=nvme to force that), I'm getting:

stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Device disk01 added to v1
Device disk02 added to v1
Device disk03 added to v1
Error: Failed to start device "disk04": Failed to call monitor hook for block device: Failed adding block device for disk device "disk04": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk05": Failed to call monitor hook for block device: Failed adding block device for disk device "disk05": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk06": Failed to call monitor hook for block device: Failed adding block device for disk device "disk06": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk07": Failed to call monitor hook for block device: Failed adding block device for disk device "disk07": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk08": Failed to call monitor hook for block device: Failed adding block device for disk device "disk08": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk09": Failed to call monitor hook for block device: Failed adding block device for disk device "disk09": Failed adding device: Bus 'qemu_pcie9' not found
Error: Failed to start device "disk10": Failed to call monitor hook for block device: Failed adding block device for disk device "disk10": Failed adding device: Bus 'qemu_pcie9' not found
stgraber@castiana:~$ incus restart v1
stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Error: The device already exists
Error: The device already exists
Error: The device already exists
Device disk04 added to v1
Device disk05 added to v1
Device disk06 added to v1
Error: Failed to start device "disk07": Failed to call monitor hook for block device: Failed adding block device for disk device "disk07": Failed adding device: Bus 'qemu_pcie12' not found
Error: Failed to start device "disk08": Failed to call monitor hook for block device: Failed adding block device for disk device "disk08": Failed adding device: Bus 'qemu_pcie12' not found
Error: Failed to start device "disk09": Failed to call monitor hook for block device: Failed adding block device for disk device "disk09": Failed adding device: Bus 'qemu_pcie12' not found
Error: Failed to start device "disk10": Failed to call monitor hook for block device: Failed adding block device for disk device "disk10": Failed adding device: Bus 'qemu_pcie12' not found
stgraber@castiana:~$ incus restart v1
stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Device disk07 added to v1
Device disk08 added to v1
Device disk09 added to v1
Error: Failed to start device "disk10": Failed to call monitor hook for block device: Failed adding block device for disk device "disk10": Failed adding device: Bus 'qemu_pcie15' not found
stgraber@castiana:~$ incus restart v1
stgraber@castiana:~$ for i in $(seq -w 10); do incus config device add v1 disk$i disk source=/tmp/test.img io.bus=nvme; done
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Error: The device already exists
Device disk10 added to v1
stgraber@castiana:~$ 

So we can see that we can add at most 3 additional devices before running out of slots.

I'll have to tweak things a bit because the number is supposed to be 4, not 3, and we definitely want a much nicer error when hitting the limit of remaining hotplug slots.

stgraber commented 1 week ago

Doing a quick test here after doubling the number of hotplug slots to 8, we can see that there's something off in the logic as the first hotplug slot isn't used:

root@v1:~# lspci -tnnnvvv
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller [8086:29c0]
           +-01.0-[01]--+-00.0  Red Hat, Inc. Virtio 1.0 memory balloon [1af4:1045]
           |            +-00.1  Red Hat, Inc. Virtio 1.0 RNG [1af4:1044]
           |            +-00.2  Red Hat, Inc. Virtio 1.0 input [1af4:1052]
           |            +-00.3  Red Hat, Inc. Virtio 1.0 input [1af4:1052]
           |            +-00.4  Red Hat, Inc. Virtio 1.0 socket [1af4:1053]
           |            +-00.5  Red Hat, Inc. Virtio 1.0 console [1af4:1043]
           |            \-00.6  Red Hat, Inc. QEMU XHCI Host Controller [1b36:000d]
           +-01.1-[02]----00.0  Red Hat, Inc. Virtio 1.0 SCSI [1af4:1048]
           +-01.2-[03]--+-00.0  Red Hat, Inc. Virtio 1.0 filesystem [1af4:1049]
           |            \-00.1  Red Hat, Inc. Virtio 1.0 filesystem [1af4:1049]
           +-01.3-[04]----00.0  Red Hat, Inc. Virtio 1.0 GPU [1af4:1050]
           +-01.4-[05]----00.0  Red Hat, Inc. Virtio 1.0 network device [1af4:1041]
           +-01.5-[06]--
           +-01.6-[07]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-01.7-[08]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.0-[09]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.1-[0a]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.2-[0b]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.3-[0c]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-02.4-[0d]----00.0  Red Hat, Inc. QEMU NVM Express Controller [1b36:0010]
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller [8086:2918]
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller [8086:2930]
root@v1:~# 
stgraber commented 1 week ago

Okay, so that's where we get to what you were pointing out earlier: the logic is a bit limited in that it assumes that every device in the devices list uses a PCIe slot.

It doesn't consider the fact that devices that were present at boot time don't count towards the hotplug quota, nor that a number of devices simply don't need a PCIe address at all.

The cleanest option would be to fetch a list of addresses from QEMU directly, I'm going to look at what query-pci may be able to get us in that regard.

stgraber commented 1 week ago

Interesting, we actually do have a mapping for query-pci already; we're just not using it anywhere.

stgraber commented 1 week ago

Got a reliable way to handle things which is also much simpler than the current logic: win-win.