opensvc / multipath-tools


bcache support #96

Closed: mitchdz closed this issue 1 month ago

mitchdz commented 2 months ago

Hi multipath-tools team,

I'd like to know what the current state of bcache support is.

I'm attempting to use it on Ubuntu, where we have seen issues, and I raised an upstream bcache bug. I'm not sure of the exact cause, but I figured I would open this issue to see if you have any suggestions on things to look into.

These results are from Ubuntu Oracular, which uses multipath-tools version 0.9.9-1ubuntu2.

Fortunately, I'm able to reproduce this very easily in a VM. My reproduction steps on Ubuntu are as follows:

$ wget https://cdimage.ubuntu.com/ubuntu-server/daily-live/pending/oracular-live-server-amd64.iso
$ fallocate -l 20G image.img
$ kvm -m 2048 -boot d \
    -cdrom ./oracular-live-server-amd64.iso \
    -device virtio-scsi-pci,id=scsi \
    -drive file=image.img,if=none,id=sda,format=raw,file.locking=off \
    -device scsi-hd,drive=sda,serial=0001 \
    -drive if=none,id=sdb,file=image.img,format=raw,file.locking=off \
    -device scsi-hd,drive=sdb,serial=0001 \
    -netdev user,id=net0 -device virtio-net-pci,netdev=net0

Install as you would any regular installation. Once complete, create the bcache cache device like so:

$ fallocate -l 1G image2.img
$ make-bcache -C image2.img

Then pass that image to the kvm invocation as two additional drives. You can also boot from the multipath'd disk by changing the boot option from d to c:

$ kvm -m 2048 -boot c \
    -cdrom ./oracular-live-server-amd64.iso \
    -device virtio-scsi-pci,id=scsi \
    -drive file=image.img,if=none,id=sda,format=raw,file.locking=off \
    -device scsi-hd,drive=sda,serial=0001 \
    -drive if=none,id=sdb,file=image.img,format=raw,file.locking=off \
    -device scsi-hd,drive=sdb,serial=0001 \
    -drive file=image2.img,if=none,id=sdc,format=raw,file.locking=off \
    -device scsi-hd,drive=sdc,serial=0002 \
    -drive if=none,id=sdd,file=image2.img,format=raw,file.locking=off \
    -device scsi-hd,drive=sdd,serial=0002 \
    -netdev user,id=net0 -device virtio-net-pci,netdev=net0

On boot, you will see the failure (screenshot omitted).

And if you try to force the devmap to reload, you will see:

$ sudo multipath -r
[ 8758.157075] device-mapper: table: 252:3: multipath: error getting device (-EBUSY)
[ 8758.158039] device-mapper: ioctl: error adding target to table
[ 8758.256206] device-mapper: table: 252:3: multipath: error getting device (-EBUSY)
[ 8758.256758] device-mapper: ioctl: error adding target to table
mwilck commented 2 months ago
$ fallocate -l 20G image.img
$ make-bcache -C image2.img

This looks like a bogus setup. What backing device is image2.img caching? What's the name of the actual bcache device? If anything, the bcache device (e.g. /dev/bcache0) would be the device to pass to the VM. It's no surprise that the cache "device" image2.img can't be opened by multipathd in the guest if it's actually attached to a bcache device on the host.

In general, multipath: error getting device (-EBUSY) doesn't indicate a lack of "support" on the multipath side. In 99% of cases, like here, it indicates a configuration problem. Specifically, the device that multipath is supposed to open is already open by some other entity (here, bcache in the host).

mitchdz commented 2 months ago

Hi Martin, good to hear from you.

This is just a test setup to help reproduce the problem; I did the steps a little differently than what happens in the field. Specifically, this is seen in our MAAS setup, where we configure the bcache disk on the host and then reprovision the machine.

A more field-like test would be to pass in a regular disk as a multipathed disk, set it up as bcache in the guest, and then reboot.

On boot it looks like this (screenshot omitted).

Within the system we then make it a bcache device (screenshot omitted), and upon reboot you see the situation described above (screenshot omitted).

Apologies for the wording; I didn't know a better term than "support". This seems like something that should work. I know it might be rare, but there may be other people trying to use bcache in a JBOD setup. I looked through the multipath docs for any mention of how to set up a bcache device on top of multipathing and came up empty-handed.

Do you have any advice on things to try to get this setup working?

mitchdz commented 2 months ago

Oh, also: I apologize, I realize I made a typo in the description. The real steps are

$ fallocate -l 1G image2.img
$ make-bcache -C image2.img

which I've now updated in the description. Ctrl+C/Ctrl+V issues.

mwilck commented 2 months ago

As I said, multipath: error getting device (-EBUSY) means that some other entity has opened the device. You may find the opener, either in the guest or on the host, using lsof or by looking in the holders directory in sysfs.
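
For example, something like this (a sketch; sdc/sdd are just the device names from this reproduction and may differ on your system):

$ ls /sys/block/sdc/holders /sys/block/sdd/holders
$ sudo lsof /dev/sdc /dev/sdd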

mwilck commented 2 months ago

Still, passing image2.img to the VM is wrong. You need to pass the bcache device.

mitchdz commented 2 months ago

Hi, I've looked into this more, and I think it's the bcache udev rule running a script on the block device that is holding it up. It's annoying because this doesn't show up in lsof or related tools.

Once I get more time I’ll dig deeper on what the udev rule is really doing.

@mwilck I get that passing image2.img is a weird thing to do, but I believe it's the correct thing to do to recreate the scenario we are seeing in MAAS.

mwilck commented 2 months ago

I don't understand what you're trying to do. Do you want to set up the bcache device inside the VM, from backend / frontend devices that you created in the host?

mitchdz commented 2 months ago

The goal here is to simulate configuring a disk that has two physical paths (thus using multipathing) as a bcache device. This is what is happening in our deployments.

mitchdz commented 2 months ago

I think my current method is a good way to accomplish that scenario, but do let me know if that’s wrong.

mwilck commented 2 months ago

Ok, so you have two multipath devices, one as the bcache backend and one as the actual cache device, and you want to set up the actual bcache device in the VM from these two devices. I'm slowly getting it.

I am also assuming that the messages

[ 8758.157075] device-mapper: table: 252:3: multipath: error getting device (-EBUSY)
[ 8758.158039] device-mapper: ioctl: error adding target to table

are emitted in the VM/guest.

Can you see anything under /sys/block/sd[cd]/holders, or does lsof /dev/sd[cd] provide any evidence? Please also check on the host whether image2.img has been opened by any process other than qemu.

I think you will also be able to verify that the devices simply can't be opened in the guest:

# python
>>> import os
>>> os.open("/dev/sdc", os.O_RDONLY|os.O_EXCL)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 16] Device or resource busy: '/dev/sdc'
mwilck commented 2 months ago

Side note: In general, I don't recommend using simple file-backed IO plus file.locking=off for simulating multipath. It might work, or it might not. It feels like begging for trouble to me.

I recommend using actual block devices on the host. My personal favourite is LIO with a vhost-scsi setup with two vhost controllers. This shows up in the VM as two SCSI controllers seeing the same devices; it's not much harder to set up, and it's clean.
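
For illustration, a rough sketch of such a setup (hedged: the backing device path and the WWN placeholders are hypothetical, and exact targetcli paths may differ between versions):

$ sudo targetcli /backstores/block create name=testlun dev=/dev/vg0/testlun   # shared backing block device (hypothetical LV)
$ sudo targetcli /vhost create                                                # first vhost controller (prints a naa WWN)
$ sudo targetcli /vhost create                                                # second vhost controller
$ sudo targetcli /vhost/naa.<wwn1>/tpg1/luns create /backstores/block/testlun # export the LUN on controller 1
$ sudo targetcli /vhost/naa.<wwn2>/tpg1/luns create /backstores/block/testlun # export the LUN on controller 2

The guest then gets one -device vhost-scsi-pci,wwpn=naa.<wwn1> and one -device vhost-scsi-pci,wwpn=naa.<wwn2> on its command line, and sees the same LUN through two SCSI controllers.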

mitchdz commented 2 months ago

On my host after running the kvm command:

$ ps aux | grep image2 | grep -v grep

mitch     142838 41.7  3.7 4041248 598652 pts/0  Sl+  18:20   0:21 kvm -m 2048 -boot c -cdrom ./oracular-live-server-amd64.iso -device virtio-scsi-pci,id=scsi -drive file=image.img,if=none,id=sda,format=raw,file.locking=off -device scsi-hd,drive=sda,serial=0001 -drive if=none,id=sdb,file=image.img,format=raw,file.locking=off -device scsi-hd,drive=sdb,serial=0001 -drive file=image2.img,if=none,id=sdc,format=raw,file.locking=off -device scsi-hd,drive=sdc,serial=0002 -drive if=none,id=sdd,file=image2.img,format=raw,file.locking=off -device scsi-hd,drive=sdd,serial=0002 -netdev user,id=net0,hostfwd=tcp::2222-:22 -device virtio-net-pci,netdev=net0

So no process seems to be messing with image2. Also after running KVM:

$ lsof image2.img 
COMMAND    PID  USER   FD   TYPE DEVICE   SIZE/OFF    NODE NAME
kvm     142838 mitch   15u   REG  252,1 1073741824 5774609 image2.img
kvm     142838 mitch   16u   REG  252,1 1073741824 5774609 image2.img

I don't seem to see anything in the holders folder.


I can confirm with your Python example that the device is busy (screenshot omitted).

I'm taking a quick look now: /usr/lib/udev/rules.d/69-bcache.rules runs RUN+="bcache-register $tempnode", so I'll see exactly what registering means for a bcache device. That's what I suspect is holding up the device.
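
To look at the rule itself (the path is from this Ubuntu install and may differ elsewhere):

$ grep -n 'bcache-register' /usr/lib/udev/rules.d/69-bcache.rules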

mitchdz commented 2 months ago

Also, thank you for the side note. I love improving how I test and run things, so will definitely be looking into your setup :)

mitchdz commented 2 months ago

Sorry for the delay; I was out traveling and am catching up.

Well, the issue is pretty apparent now.

The bcache-tools udev rule runs bcache-register[0] against the block devices.

This mainly does two things. It opens the bcache sysfs register file in write-only mode:

    fd = open("/sys/fs/bcache/register", O_WRONLY);

and then registers the block device by writing its path there with dprintf:

    if (dprintf(fd, "%s\n", argv[1]) < 0)

[0] - https://github.com/g2p/bcache-tools/blob/master/bcache-register.c
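
In other words, the net effect of the rule is essentially this manual registration (a hedged equivalent; /dev/sdc is just an example path):

$ echo /dev/sdc | sudo tee /sys/fs/bcache/register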

mitchdz commented 2 months ago

Ultimately, I think the bcache-tools udev rules should run later in the boot process, so that multipath can create the device maps before the devices are registered.

mitchdz commented 2 months ago

Is there a preferred way to make other udev rules wait for multipath to create the device map?

mitchdz commented 2 months ago

Ignore my rambling; it seems to be more than just the udev rules. Looking into it more right now.

mwilck commented 2 months ago

The bcache rules should refrain from touching the device if SYSTEMD_READY=0 is set, which is the case for the path devices (legs) of a multipath map.
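
A quick way to check whether the shipped rules honor that flag (path from the Ubuntu install above; no output would mean they don't):

$ grep -n 'SYSTEMD_READY' /usr/lib/udev/rules.d/69-bcache.rules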

bmarzins commented 2 months ago

I know this isn't much help, but I did get bcache working on top of multipath using Fedora. To do this, I installed a fedora 40 VM with two paths to a 20G scsi device and two paths to a 1G scsi device, just like you did. Then I rebuilt the initramfs to start up the multipath devices in early boot. After that I formatted and started up the bcache device. On reboots, the multipath devices get created first, and the bcache device is built on top of them. I suspect that the difference between our experience comes down to the initramfs. If multipath is not claiming the devices in the initramfs, then bcache can use the devices and there's nothing that multipath can do. Are you able to make multipathd start up in the initramfs?
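
For reference, with dracut on Fedora-like systems something along these lines should be enough to pull multipath into the initramfs (hedged; the dracut module name is "multipath" as far as I know):

$ sudo dracut --force --add multipath
$ sudo reboot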

mwilck commented 2 months ago

@bmarzins: so you work with locking=off, too? That's interesting. I really thought it was unsafe. But maybe that's no longer true.

Wrt bcache, I think the problem might be that 69-bcache.rules doesn't honor SYSTEMD_READY. @mitchdz, try adding the following at the beginning of 69-bcache.rules:

ENV{SYSTEMD_READY}=="0", GOTO="bcache_end"
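
One way to apply that persistently without editing the packaged file (a hedged sketch using the standard udev override directory; it assumes the shipped rules file ends with a LABEL="bcache_end" line, which the GOTO above relies on):

$ sudo cp /usr/lib/udev/rules.d/69-bcache.rules /etc/udev/rules.d/
$ sudo sed -i '1i ENV{SYSTEMD_READY}=="0", GOTO="bcache_end"' /etc/udev/rules.d/69-bcache.rules
$ sudo udevadm control --reload
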
mitchdz commented 2 months ago

@mwilck your suggestion to add the following line to the top of the udev rules:

ENV{SYSTEMD_READY}=="0", GOTO="bcache_end"

Seems to have worked well for me; after boot I see the multipath maps set up properly.

$ sudo multipath -ll
mpatha (0QEMU_QEMU_HARDDISK_0001) dm-0 QEMU,QEMU HARDDISK
size=20G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 0:0:0:0 sda 8:0  active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:1:0 sdc 8:32 active ready running
mpathb (0QEMU_QEMU_HARDDISK_0002) dm-1 QEMU,QEMU HARDDISK
size=1.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 0:0:2:0 sdb 8:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 0:0:3:0 sdd 8:48 active ready running

In addition to that, it seems the bcache device is still recognized.


$ sudo bcache-super-show /dev/mapper/mpathb
sb.magic        ok
sb.first_sector     8 [match]
sb.csum         40DECC2B836701A3 [match]
sb.version      3 [cache device]

dev.label       (empty)
dev.uuid        1a629857-fb9e-497b-bc2e-1c81d48687fe
dev.sectors_per_block   1
dev.sectors_per_bucket  1024
dev.cache.first_sector  1024
dev.cache.cache_sectors 2096128
dev.cache.total_sectors 2097152
dev.cache.ordered   yes
dev.cache.discard   no
dev.cache.pos       0
dev.cache.replacement   0 [lru]

cset.uuid       461c9448-d5a0-4314-8ee8-716264f89d88

mitchdz commented 2 months ago

Let me also play with the initramfs to see if that could be a solution too, as you mentioned, @bmarzins.

mitchdz commented 2 months ago

Just for completeness: adding dm_multipath to /etc/initramfs-tools/modules and then rebuilding the initramfs with update-initramfs -u did not seem to help.
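
For what it's worth, on Ubuntu/Debian the early-boot multipathd integration comes, as far as I know, from the multipath-tools-boot package rather than from the module list alone (hedged suggestion):

$ sudo apt install multipath-tools-boot
$ sudo update-initramfs -u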

mwilck commented 2 months ago

I suppose it depends on whether you have bcache in the initramfs as well. I generally recommend using the same setup in the initramfs as in the fully booted system; anything else makes booting very fragile.

If the bcache udev rules ignore SYSTEMD_READY, it's basically random / timing-dependent which subsystem gets to grab a given device first. Only the first one can open the device; all later ones will fail to do so. If you aren't lucky, that can have the effect that neither the multipath device nor the bcache device is operational.

SYSTEMD_READY is the generic mechanism to make sure the block device stack is set up correctly. It's particularly important for multipath, because unlike members of other block layers (e.g. MD RAID), multipath path devices can't be told apart by on-disk metadata.
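
To see how multipath itself uses this mechanism, you can search the installed udev rules for the flag (rule file names vary between distributions):

$ grep -rn 'SYSTEMD_READY' /usr/lib/udev/rules.d/ | grep -i -e multipath -e dm-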

bmarzins commented 2 months ago

@bmarzins: so you work with locking=off, too? That's interesting. I really thought it was unsafe. But maybe that's no longer true.

I meant "just like you did", in a more general sense of "I pointed two virtual scsi devices at the same underlying storage". To actually set up the machine I used virt-install with --disk options that looked like "--disk path=/var/lib/libvirt/images/image.img,format=raw,bus=scsi,serial=0001,cache=none,shareable=on". I'm not actually sure this is completely safe, but it's fast to set up and works fine for testing.

mwilck commented 1 month ago

@mitchdz are we good here? Should someone post a PR to bcache-tools?

mitchdz commented 1 month ago

Yes, let’s close this. I’ve brought this information to bcache-tools. I’m closing now.