```
$ fallocate -l 20G image.img
$ make-bcache -C image2.img
```
This looks like a bogus setup. What backing device is `image2.img` caching? What's the name of the actual bcache device? If anything, the bcache device (e.g. `/dev/bcache0`) would be the device to pass to the VM. It's no surprise that the cache "device" `image2.img` can't be opened by multipathd in the guest if it's actually attached to a bcache device in the host.
In general, `multipath: error getting device (-EBUSY)` doesn't indicate a lack of "support" on the multipath side. In 99% of cases, like here, it indicates a configuration problem. Specifically, the device that multipath is supposed to open is already open by some other entity (here, bcache in the host).
Hi Martin, good to hear from you.
This is just a test setup to help reproduce the problem; I did the steps a little differently than they happen in the field. Specifically, this is seen in our MAAS setup, where we configure the bcache disk on the host and then reprovision the machine.
A more field-like test environment would be to pass in a regular disk as a multipath disk, set it up as bcache in the guest, and then reboot.
As an example, it looks like this on boot:
Then in the system we make it a bcache device (for example, as sketched below), and upon reboot you see the situation explained above.
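A minimal sketch of that step, assuming the two multipath maps come up as `mpatha` (the 20G disk) and `mpathb` (the 1G disk); the names are illustrative only:
```
# Hypothetical device names: 20G map as backing device, 1G map as cache
make-bcache -B /dev/mapper/mpatha -C /dev/mapper/mpathb
```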
Apologies for the wording; I didn't know a better term to use than "support". This seems like something that should maybe work? I know it might be rare, but there might be other people trying to use bcache in a JBOD setup. I looked through the multipath docs for any mention of how to set up a bcache device on top of multipathing and came up empty-handed.
Do you have any advice on things to try to help set up the system?
Oh, also, I apologize: I made a typo in the description. The real steps are
```
$ fallocate -l 1G image2.img
$ make-bcache -C image2.img
```
which I've updated in the description now. Ctrl+C/Ctrl+V issues.
As I said, `multipath: error getting device (-EBUSY)` means that some other entity has opened the device. You may see the opener either in the guest or in the host using `lsof`, or in the `holders` directory in sysfs.
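For example (assuming the path devices show up as `sdc` and `sdd` in the guest; adjust the names as needed):
```
# In the guest: anything stacked on top of the path devices shows up here
ls -l /sys/block/sdc/holders /sys/block/sdd/holders

# In the guest or on the host: user-space processes holding the devices open
lsof /dev/sdc /dev/sdd
```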
Still, passing `image2.img` to the VM is wrong. You need to pass the bcache device.
Hi, I've looked into this more, and I think it is the bcache udev rule running a script on the block device that is holding it up. It's annoying because this doesn't show up in lsof or related tools.
Once I get more time I'll dig deeper into what the udev rule is really doing.
@mwilck I get that passing image2.img is a weird thing to do, but I believe it's the correct thing to do for recreating the scenario that we are seeing in MAAS.
I don't understand what you're trying to do. Do you want to set up the bcache device inside the VM, from backend / frontend devices that you created in the host?
The goal here is to simulate configuring a disk that has 2 physical lanes (thus using multipathing) as a bcache device. This is what is happening in our deployments.
I think my current method is a good way to accomplish that scenario, but do let me know if that’s wrong.
Ok, so you have two multipath devices, one as the bcache backend and one as the actual cache device, and you want to set up the actual bcache device in the VM from these two devices. I'm slowly getting it.
I am also assuming that the messages
```
[ 8758.157075] device-mapper: table: 252:3: multipath: error getting device (-EBUSY)
[ 8758.158039] device-mapper: ioctl: error adding target to table
```
are emitted in the VM/guest.
Can you see anything under `/sys/block/sd[cd]/holders`, or does `lsof /dev/sd[cd]` provide any evidence? Please also check on the host whether `image2.img` has been opened by any process other than qemu.
I think you will also be able to verify that the devices simply can't be opened in the guest:
```
# python
>>> import os
>>> os.open("/dev/sdc", os.O_RDONLY|os.O_EXCL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/sdc'
```
Side note: in general, I don't recommend using simple file-backed IO plus `file.locking=off` for simulating multipath. It might work, or it might not; it feels like begging for trouble to me.
I recommend using actual block devices in the host. My personal favourite is LIO with a vhost-scsi setup with two vhost controllers. This shows up in the VM as two SCSI controllers seeing the same devices; it's not much harder to set up, and it's clean.
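A rough sketch of that kind of setup with targetcli (the backstore name, WWNs, and backing device below are made up for illustration; it assumes the `vhost_scsi` kernel module and targetcli's vhost fabric, and the exact paths may differ between targetcli versions):
```
# Load the vhost-scsi target fabric
modprobe vhost_scsi

# One block backstore that both vhost controllers will export
targetcli /backstores/block create name=mpdisk dev=/dev/vg0/mpdisk

# Two vhost targets, each exporting the same LUN
targetcli /vhost create naa.5001405000000001
targetcli /vhost/naa.5001405000000001/tpg1/luns create /backstores/block/mpdisk
targetcli /vhost create naa.5001405000000002
targetcli /vhost/naa.5001405000000002/tpg1/luns create /backstores/block/mpdisk

# In the guest's qemu command line, attach both controllers, e.g.:
#   -device vhost-scsi-pci,wwpn=naa.5001405000000001
#   -device vhost-scsi-pci,wwpn=naa.5001405000000002
```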
On my host, after running the kvm command:
```
$ ps aux | grep image2 | grep -v grep
mitch 142838 41.7 3.7 4041248 598652 pts/0 Sl+ 18:20 0:21 kvm -m 2048 -boot c -cdrom ./oracular-live-server-amd64.iso -device virtio-scsi-pci,id=scsi -drive file=image.img,if=none,id=sda,format=raw,file.locking=off -device scsi-hd,drive=sda,serial=0001 -drive if=none,id=sdb,file=image.img,format=raw,file.locking=off -device scsi-hd,drive=sdb,serial=0001 -drive file=image2.img,if=none,id=sdc,format=raw,file.locking=off -device scsi-hd,drive=sdc,serial=0002 -drive if=none,id=sdd,file=image2.img,format=raw,file.locking=off -device scsi-hd,drive=sdd,serial=0002 -netdev user,id=net0,hostfwd=tcp::2222-:22 -device virtio-net-pci,netdev=net0
```
So no process other than kvm seems to be touching image2. Also, after running KVM:
```
$ lsof image2.img
COMMAND    PID  USER   FD   TYPE DEVICE   SIZE/OFF    NODE NAME
kvm     142838 mitch   15u   REG  252,1 1073741824 5774609 image2.img
kvm     142838 mitch   16u   REG  252,1 1073741824 5774609 image2.img
```
I don't see anything in the holders directory.
I can confirm with your python example that the device is busy.
I'm taking a quick look now; `/usr/lib/udev/rules.d/69-bcache.rules` runs `RUN+="bcache-register $tempnode"`, so I'll see exactly what registering means for a bcache device. That's what I suspect is holding up the device.
Also, thank you for the side note. I love improving how I test and run things, so I'll definitely be looking into your setup :)
Sorry for the delay, I was out traveling and am catching up.
Well, the issue is pretty apparent now.
The bcache-tools udev rule runs bcache-register [0] against the block devices.
This mainly does two things: it opens `/sys/fs/bcache/register` in write-only mode,
```
fd = open("/sys/fs/bcache/register", O_WRONLY);
```
and writes the device path to it with dprintf, which registers the device:
```
if (dprintf(fd, "%s\n", argv[1]) < 0)
```
[0] - https://github.com/g2p/bcache-tools/blob/master/bcache-register.c
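In other words, registering roughly amounts to this (illustrative; `/dev/sdc` stands in for whichever block device the rule matched):
```
# Writing a device path to the bcache control file makes the kernel claim it
echo /dev/sdc > /sys/fs/bcache/register
```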
Ultimately I think the bcache udev rules should be moved later in the boot process, to allow multipath to create the device maps before the devices are registered.
Is there a preferred way to make other udev rules wait for multipath to create the device map?
Ignore my rambling; it seems to be more than just the udev rules. Looking into it more right now.
The bcache rules should refrain from touching the device if SYSTEMD_READY=0 is set, which is the case for the legs of a multipath map.
I know this isn't much help, but I did get bcache working on top of multipath using Fedora. To do this, I installed a Fedora 40 VM with two paths to a 20G SCSI device and two paths to a 1G SCSI device, just like you did. Then I rebuilt the initramfs to start up the multipath devices in early boot. After that I formatted and started up the bcache device. On reboots, the multipath devices get created first, and the bcache device is built on top of them. I suspect that the difference between our experiences comes down to the initramfs. If multipath is not claiming the devices in the initramfs, then bcache can use the devices and there's nothing that multipath can do. Are you able to make multipathd start up in the initramfs?
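(On Fedora, pulling multipath into the initramfs is typically something along these lines, assuming dracut's multipath module; the exact invocation may vary:)
```
# Rebuild the initramfs with the multipath dracut module included
dracut -f --add multipath
```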
@bmarzins: so you work with `locking=off`, too? That's interesting. I really thought it was unsafe. But maybe that's no longer true.
Wrt bcache, I think the problem might be that `69-bcache.rules` doesn't honor `SYSTEMD_READY`. @mitchdz, try adding the following at the beginning of `69-bcache.rules`:
```
ENV{SYSTEMD_READY}=="0", GOTO="bcache_end"
```
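For context, the guard relies on the `bcache_end` label that (as the GOTO implies) already terminates the rules file, so the file ends up looking roughly like this (abbreviated sketch, not the literal upstream content):
```
# /usr/lib/udev/rules.d/69-bcache.rules (abbreviated)
ENV{SYSTEMD_READY}=="0", GOTO="bcache_end"

# ... existing probing / bcache-register rules ...

LABEL="bcache_end"
```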
@mwilck your suggestion to add the following line to the top of the udev rules:
```
ENV{SYSTEMD_READY}=="0", GOTO="bcache_end"
```
seems to have worked well for me. After boot I see the multipath maps set up properly.
```
$ sudo multipath -ll
mpatha (0QEMU_QEMU_HARDDISK_0001) dm-0 QEMU,QEMU HARDDISK
size=20G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 0:0:0:0 sda 8:0  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:1:0 sdc 8:32 active ready running
mpathb (0QEMU_QEMU_HARDDISK_0002) dm-1 QEMU,QEMU HARDDISK
size=1.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 0:0:2:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:3:0 sdd 8:48 active ready running
```
In addition to that, it seems the bcache device is still recognized.
```
$ sudo bcache-super-show /dev/mapper/mpathb
sb.magic                ok
sb.first_sector         8 [match]
sb.csum                 40DECC2B836701A3 [match]
sb.version              3 [cache device]
dev.label               (empty)
dev.uuid                1a629857-fb9e-497b-bc2e-1c81d48687fe
dev.sectors_per_block   1
dev.sectors_per_bucket  1024
dev.cache.first_sector  1024
dev.cache.cache_sectors 2096128
dev.cache.total_sectors 2097152
dev.cache.ordered       yes
dev.cache.discard       no
dev.cache.pos           0
dev.cache.replacement   0 [lru]
cset.uuid               461c9448-d5a0-4314-8ee8-716264f89d88
```
Let me also play with the initramfs to see if that could be a solution too, as you mentioned @bmarzins.
Just for completeness: adding `dm_multipath` to `/etc/initramfs-tools/modules` and then regenerating the initramfs with `update-initramfs -u` did not seem to help.
I suppose it depends on whether you have bcache in the initramfs as well. I generally recommend using the same setup in the initramfs and in the fully booted system. Anything else makes booting very fragile.
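On Ubuntu you can check what actually ended up in the initramfs with lsinitramfs, e.g. (adjust the kernel version as needed):
```
# List multipath- and bcache-related files inside the current initramfs
lsinitramfs /boot/initrd.img-$(uname -r) | grep -E 'multipath|bcache'
```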
If the bcache udev rules ignore `SYSTEMD_READY`, it's basically random / timing-dependent which subsystem gets to grab a given device first. Only the first one can open the device; all later ones will fail to do so. If you aren't lucky, that can have the effect that neither the multipath device nor the bcache device is operational.
`SYSTEMD_READY` is the generic mechanism to make sure the block device stack is set up correctly. It's particularly important for multipath because, unlike other block layers (e.g. MD RAID), multipath path devices can't be distinguished by other means such as on-disk metadata.
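One way to verify that the flag actually ends up set on the path devices is to query the udev database (assuming `sdc` is one of the paths):
```
# Path members of a multipath map should carry SYSTEMD_READY=0
udevadm info /dev/sdc | grep -E 'SYSTEMD_READY|DM_MULTIPATH'
```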
> @bmarzins: so you work with `locking=off`, too? That's interesting. I really thought it was unsafe. But maybe that's no longer true.

I meant "just like you did" in the more general sense of "I pointed two virtual SCSI devices at the same underlying storage". To actually set up the machine I used virt-install with --disk options that looked like `--disk path=/var/lib/libvirt/images/image.img,format=raw,bus=scsi,serial=0001,cache=none,shareable=on`. I'm not actually sure this is completely safe, but it's fast to set up and works fine for testing.
@mitchdz are we good here? Should someone post a PR to bcache-tools?
Yes, let’s close this. I’ve brought this information to bcache-tools. I’m closing now.
Hi multipath-tools team,
I was wondering what the state of bcache support is.
I'm attempting to use it in Ubuntu, where we have seen issues, and I have raised an upstream bcache bug. I'm not sure of the exact cause of what is going on here, but I figured I would open this issue to see if you have any suggestions on things to look into.
These results are from Ubuntu Oracular, which uses multipath-tools version 0.9.9-1ubuntu2.
The lucky part is that I'm able to reproduce these results very easily in a VM. My reproduction steps on Ubuntu are as follows:
Install like any regular installation. Once complete, make your bcache device like so:
Then add that as another drive to the kvm invocation. In addition, you can boot from the multipathed disk by changing your boot drive to d.
On boot, you will see:
And if you try to force the devmap to reload, you will see: