openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

[SOLVED] ZFS doesn't play nice with virtiofs #14932

Closed deajan closed 1 year ago

deajan commented 1 year ago

I'm trying to share some ZFS datasets with a qemu guest using virtiofs. So far, every time I launch my guest, it shows an empty folder (it actually shares the mountpoint directory instead of the ZFS filesystem mounted there).

Several times it also seemed like zfs would "unmount" the dataset by itself once I launched my virtual guest.

After trying the following (on a test system, of course):

zfs umount -a
rm /backup/* -rf
zfs mount -a

I can list my zfs dataset with:

zfs list
[...]
backup/dataset/mydataset     140K  34.8T      140K  /backup/dataset/mydataset
[...]

Content is also visible.

But I cannot use that dataset with virtiofs:

virsh start mymachine
error: internal error: the virtiofs export directory '/backup/dataset/mydataset/' does not exist

I'm really sorry for the noise, since this is probably a virtiofs bug and not a zfs one, but zfs does behave strangely here: it shouldn't just "unmount" my datasets when they are accessed via virtiofs. Do you have any experience with virtiofs? Do FUSE daemons get to use zfs mountpoints properly? Anything to configure, perhaps?
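For reference, a check like the following on the host (same dataset as above) should show whether the dataset is still mounted at that path when the virsh error appears:

zfs get mounted,mountpoint backup/dataset/mydataset
findmnt -T /backup/dataset/mydataset

If the mount is in place, findmnt reports fstype zfs with the dataset as source; if it only shows the parent filesystem, the dataset really did get unmounted.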

Best regards

System information

filename:       /lib/modules/5.14.0-284.11.1.el9_2.x86_64/extra/zfs.ko.xz
version:        2.1.11-2
license:        CDDL
author:         OpenZFS
description:    ZFS
alias:          devname:zfs
alias:          char-major-10-249
rhelversion:    9.2
srcversion:     8081FD700719F8F0FB60578
depends:        spl,znvpair,icp,zlua,zzstd,zunicode,zcommon,zavl

Relevant virtiofs config in the virtual guest XML:

    <filesystem type='mount' accessmode='passthrough'>
      <driver type='virtiofs' queue='1024'/>
      <binary path='/usr/libexec/virtiofsd' xattr='on'>
        <cache mode='always'/>
      </binary>
      <source dir='/backup/dataset/mydataset/'/>
      <target dir='sometarget'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x0'/>
    </filesystem>
filip-paczynski commented 1 year ago

Hi, I've never used virtiofs myself, but I was looking for something like that some time ago (back then only 9p was available).

Just a guess, but maybe --announce-submounts will help? My reasoning is that from the Linux point of view /backup is one filesystem, /backup/dataset is another filesystem mounted within the parent, and so on... Maybe try the simplest case first (spelled out below): zfs set mountpoint=/backup_test_virtio backup/dataset/mydataset and then use /backup_test_virtio as the virtio-fs source.

Also, check if mount lists the dataset in question after zfs mount -a.
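Spelled out, the simple test I have in mind would be something like this (dataset name taken from your report, the test path is only an example):

zfs set mountpoint=/backup_test_virtio backup/dataset/mydataset
zfs get mounted,mountpoint backup/dataset/mydataset

and then point the virtio-fs source at the new path in the guest XML:

<source dir='/backup_test_virtio'/>

If the dataset was already mounted, changing the mountpoint property should remount it at the new location automatically.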

GregorKopka commented 1 year ago

Running 2.1.11-r0-funtoo here and I have no problems handing ZFS-backed directories into qemu VMs, both Linux and Windows guests.

Remember that you have to mount the shared folder in the guest! Otherwise you'll only see an empty mountpoint when looking in the guest, while the content is there when looking on the host.
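In a Linux guest that is just something like this (the tag comes from your <target dir='sometarget'/>, the mountpoint is only an example):

mkdir -p /mnt/sometarget
mount -t virtiofs sometarget /mnt/sometarget

or the equivalent fstab entry.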

What is your actual issue: the content vanishing on the host or not showing up in the guest?

deajan commented 1 year ago

So, first, thanks for the answers, and sorry for the delay in response. Here's what I did so far:

There's no --announce-submounts option in libvirt, so I made an ugly wrapper script around virtiofsd:

#!/bin/sh
/usr/libexec/virtiofsd_dist "$@" --announce-submounts


This way, I get to add `--announce-submounts` to virtiofsd:

ps aux | grep virtiofsd

root     12079  0.0  0.0    6400  2328 pts/19   S+   18:36   0:00 grep --color=auto virtiof
root     63816  0.0  0.0    5776  3892 ?        S    18:23   0:00 /usr/libexec/virtiofsd --fd=98 -o source=/backup/restic_stash/,cache=always,xattr --announce-submounts
root     63818  0.0  0.0 2238128  5332 ?        Sl   18:23   0:00 /usr/libexec/virtiofsd --fd=98 -o source=/backup/restic_stash/,cache=always,xattr --announce-submounts


On the host, I created a fresh VM with the following libvirt config:
<filesystem type='mount' accessmode='passthrough'>
  <driver type='virtiofs' queue='1024'/>
  <binary path='/usr/libexec/virtiofsd' xattr='on'>
    <cache mode='always'/>
  </binary>
  <source dir='/backup/restic_stash/'/>
  <target dir='restic_stash'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x1a' function='0x0'/>
</filesystem>

Host config:

zfs list

NAME                        USED  AVAIL  REFER  MOUNTPOINT
backup                      871G  34.4T   283G  /backup
backup/restic_stash        73.6G  34.4T   186K  /backup/restic_stash
backup/restic_stash/userA   140K  34.4T   140K  /backup/restic_stash/userA
backup/restic_stash/userB  73.6G  34.4T  73.6G  /backup/restic_stash/userB

mount | grep zfs

backup on /backup type zfs (rw,noatime,seclabel,xattr,noacl)
backup/restic_stash on /backup/restic_stash type zfs (rw,noatime,seclabel,noxattr,noacl)
backup/restic_stash/userB on /backup/restic_stash/userB type zfs (rw,noatime,seclabel,noxattr,noacl)
backup/restic_stash/userA on /backup/restic_stash/userA type zfs (rw,noatime,seclabel,noxattr,noacl)

touch /backup/restic_stash/restic_stash_file

touch /backup/restic_stash/userA/userA_file

touch /backup/restic_stash/userB/userB_file


On the VM side

cat /etc/fstab | grep virtiofs

restic_stash /restic_stash virtiofs defaults,noatime,nodiratime,nodev,noexec,nosuid,nofail 0 2

mount | grep virtiofs

restic_stash on /restic_stash type virtiofs (rw,nosuid,nodev,noexec,noatime,nodiratime,seclabel)

ls -alh /restic_stash/

total 512
drwx------.  2 restic root   2 Jul  7 18:18 .
dr-xr-xr-x. 20 root   root 270 May 31 23:42 ..

umount /restic_stash

ls -alh /restic_stash/

total 0
drwxr-xr-x.  2 root root   6 May 16 13:31 .
dr-xr-xr-x. 20 root root 270 May 31 23:42 ..

dmesg | egrep -i "virtiofs|restic_stash"

[    9.119442] systemd-fstab-generator[474]: Checking was requested for "restic_stash", but it is not a device.
[   10.535127] virtiofs virtio0: virtio_fs_setup_dax: No cache capability

As you can see, the virtiofs filesystem is mounted (once I unmount it, the ls output changes slightly), but there are no files in there.
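Another check worth doing at this point, just to confirm the guest mount really is virtiofs and the host path really is the dataset (plain findmnt, nothing exotic):

On the guest:

findmnt /restic_stash

On the host:

findmnt -T /backup/restic_stash

The guest side should report fstype virtiofs with source restic_stash, the host side fstype zfs with source backup/restic_stash.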

Strangely enough, since the zfs 2.1.12-1 update, the host zfs filesystem doesn't get unmounted anymore when I start the VM.

I've also redone the above tests after disabling SELinux on both the host and guest systems.
Lastly, I redid the test without the `--announce-submounts` hack.

For completeness, here's the full command line of my VM:

/usr/libexec/qemu-kvm -name guest=mymachine.local,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-115-mymachine.lo/master-key.aes"} -machine pc-q35-rhel9.2.0,usb=off,dump-guest-core=off,memory-backend=pc.ram -accel kvm -cpu Icelake-Server,ds=on,ss=on,dtes64=on,vmx=on,pdcm=on,hypervisor=on,tsc-adjust=on,avx512ifma=on,sha-ni=on,rdpid=on,fsrm=on,md-clear=on,stibp=on,arch-capabilities=on,xsaves=on,ibpb=on,ibrs=on,amd-stibp=on,amd-ssbd=on,rdctl-no=on,ibrs-all=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,tsx-ctrl=on,hle=off,rtm=off,mpx=off,intel-pt=off -m 2048 -object {"qom-type":"memory-backend-memfd","id":"pc.ram","share":true,"x-use-canonical-path-for-ramblock-id":false,"size":2147483648} -overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -object {"qom-type":"iothread","id":"iothread1"} -object {"qom-type":"iothread","id":"iothread2"} -uuid 41c28c02-482e-4325-becc-23db4642395d -smbios type=0,vendor=npf -smbios type=1,manufacturer=NetPerfect,product=vmv3tls -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=97,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device {"driver":"pcie-root-port","port":8,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x1"} -device {"driver":"pcie-root-port","port":9,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x1.0x1"} -device {"driver":"pcie-root-port","port":10,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x1.0x2"} -device {"driver":"pcie-root-port","port":11,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x1.0x3"} -device {"driver":"pcie-root-port","port":12,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x1.0x4"} -device {"driver":"pcie-root-port","port":13,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x1.0x5"} -device {"driver":"pcie-root-port","port":14,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x1.0x6"} -device {"driver":"pcie-root-port","port":15,"chassis":8,"id":"pci.8","bus":"pcie.0","addr":"0x1.0x7"} -device {"driver":"pcie-root-port","port":16,"chassis":9,"id":"pci.9","bus":"pcie.0","multifunction":true,"addr":"0x2"} -device {"driver":"pcie-root-port","port":17,"chassis":10,"id":"pci.10","bus":"pcie.0","addr":"0x2.0x1"} -device {"driver":"pcie-root-port","port":18,"chassis":11,"id":"pci.11","bus":"pcie.0","addr":"0x2.0x2"} -device {"driver":"pcie-root-port","port":19,"chassis":12,"id":"pci.12","bus":"pcie.0","addr":"0x2.0x3"} -device {"driver":"pcie-root-port","port":20,"chassis":13,"id":"pci.13","bus":"pcie.0","addr":"0x2.0x4"} -device {"driver":"pcie-root-port","port":21,"chassis":14,"id":"pci.14","bus":"pcie.0","addr":"0x2.0x5"} -device {"driver":"pcie-root-port","port":22,"chassis":15,"id":"pci.15","bus":"pcie.0","addr":"0x2.0x6"} -device {"driver":"pcie-pci-bridge","id":"pci.16","bus":"pci.1","addr":"0x0"} -device {"driver":"pcie-root-port","port":23,"chassis":17,"id":"pci.17","bus":"pcie.0","addr":"0x2.0x7"} -device {"driver":"qemu-xhci","p2":15,"p3":15,"id":"usb","bus":"pci.3","addr":"0x0"} -device {"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.4","addr":"0x0"} -blockdev {"driver":"file","filename":"/data/private_vm/mymachine.local-disk0.qcow2","aio":"native","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev 
{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-2-storage","backing":null} -device {"driver":"virtio-blk-pci","iothread":"iothread2","num-queues":2,"bus":"pci.5","addr":"0x0","drive":"libvirt-2-format","id":"virtio-disk0","bootindex":1,"write-cache":"on"} -device {"driver":"ide-cd","bus":"ide.0","id":"sata0-0-0"} -chardev socket,id=chr-vu-fs0,path=/var/lib/libvirt/qemu/domain-115-mymachine.lo/fs0-fs.sock -device {"driver":"vhost-user-fs-pci","id":"fs0","chardev":"chr-vu-fs0","queue-size":1024,"tag":"restic_stash","bus":"pcie.0","addr":"0x1a"} -netdev {"type":"tap","fd":"98","vhost":true,"vhostfd":"106","id":"hostnet0"} -device {"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:4b:59:52","bus":"pci.2","addr":"0x0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -chardev socket,id=charchannel0,fd=70,server=on,wait=off -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"} -audiodev {"id":"audio1","driver":"none"} -device {"driver":"i6300esb","id":"watchdog0","bus":"pci.16","addr":"0x1"} -watchdog-action reset -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.6","addr":"0x0"} -object {"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"} -device {"driver":"virtio-rng-pci","rng":"objrng0","id":"rng0","bus":"pci.7","addr":"0x0"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on



I'm honestly puzzled.
deajan commented 1 year ago

Oh, and for what it's worth, using an xfs directory instead of a zfs one works.

filip-paczynski commented 1 year ago

I tested this on a Pop!_OS VM:

filip-paczynski commented 1 year ago

...after checking the mount output on the host, I noticed that I do not have the seclabel option:

# mount | grep app_storage
pool-storage/app_storage on /pool-storage/app_storage type zfs (rw,xattr,noacl)
pool-storage/app_storage/subfs1 on /pool-storage/app_storage/subfs1 type zfs (rw,xattr,noacl)
pool-storage/app_storage/subfs2 on /pool-storage/app_storage/subfs2 type zfs (rw,xattr,noacl)

Perhaps seclabel limits the visibility of your mountpoints? Sadly, I know nothing about SELinux.
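If it is SELinux-related, something like this on the host should at least show whether enforcement is on and what labels the mountpoints carry (just a guess on my part, I cannot interpret the output):

# getenforce
# ls -Zd /backup/restic_stash /backup/restic_stash/userA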

deajan commented 1 year ago

@filip-paczynski Would you mind giving me the full host-side virtiofsd command line so I can compare?

filip-paczynski commented 1 year ago

> @filip-paczynski Would you mind giving me the full host-side virtiofsd command line so I can compare?

Sure:

# ps ax | grep virtiofs
131048 ?        S      0:00 /bin/sh /usr/local/bin/virtiofsd-wrapper.sh --fd=37 -o source=/pool-storage/app_storage,xattr
131050 ?        S      0:00  \_ /usr/lib/virtiofsd --announce-submounts --fd=37 -o source=/pool-storage/app_storage,xattr
131054 ?        Sl     0:00      \_ /usr/lib/virtiofsd --announce-submounts --fd=37 -o source=/pool-storage/app_storage,xattr
filip-paczynski commented 1 year ago

The problem with inode numbers being duplicated also occurs with XFS. XFS layout on the host side:

# mount | grep xfs_
/home/filip.paczynski/tmp/xfs-root on /xfs_test type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/home/filip.paczynski/tmp/xfs-subfs1 on /xfs_test/subfs1 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
/home/filip.paczynski/tmp/xfs-subfs2 on /xfs_test/subfs2 type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

Testing find on VM:

# mount -t virtiofs xfs_test /xfs_test/
# find /xfs_test/
/xfs_test/
find: File system loop detected; ‘/xfs_test/subfs1’ is part of the same file system loop as ‘/xfs_test/’.
find: File system loop detected; ‘/xfs_test/subfs2’ is part of the same file system loop as ‘/xfs_test/’.
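A quick way to see the collision from the guest is to compare device and inode numbers of the mountpoints (paths from the test above):

# stat -c 'dev=%d ino=%i  %n' /xfs_test /xfs_test/subfs1 /xfs_test/subfs2

If each submount root shows up with the same dev/ino pair as /xfs_test itself, that is exactly the condition that makes find report a filesystem loop.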
deajan commented 1 year ago

Well, I have no idea why, but after updating and restarting both host and guests, I now get the same results as you, without the odd behavior I had before.

# find /restic_stash
/restic_stash/
/restic_stash/restic_stash_file
find: File system loop detected; ‘/restic_stash/userA’ is part of the same file system loop as ‘/restic_stash/’.
find: File system loop detected; ‘/restic_stash/userB’ is part of the same file system loop as ‘/restic_stash/’.

Since you were able to reproduce the same problem at the XFS level, I guess there's no need to keep this issue open. Sorry for the noise.

Any idea where to open a new issue?

filip-paczynski commented 1 year ago

I guess the initial issue with ZFS filesystems not being visible on the VM side was related to SELinux, the seclabel option, or a similar security-related feature. It is not a ZFS issue.

The problem with duplicated inode numbers, which makes find consider subfs* a hardlink to the parent directory, is another matter. Theoretically, such a configuration should be handled by virtiofsd --announce-submounts. However, I do not fully understand what they mean by "device number", or how this should be handled on the VM side.

From the virtiofsd docs:

--announce-submounts solves that problem because it reports a different device number for every submount it encounters.

I have no experience with virtiofsd and I only run KVM on my "local" machine (Xen on servers).

I guess one could ask in the virtiofsd forums/discussions why inode numbers still end up duplicated on the VM side despite --announce-submounts.

Glad I could help.

deajan commented 1 year ago

@filip-paczynski From what I understand, --announce-submounts only allows sync operations to be sent to each FS, in order to avoid inconsistencies when unmounting?

Strangely enough, I didn't change anything about seclabel, but a zfs + kernel upgrade seems to have resolved the issue, at least the zfs-related part.

I've also tried running virtiofsd with the --inode-file-handles=mandatory option, but the find output stays the same.

Anyway, thanks for your help.

deajan commented 1 year ago

Side notes: using xattr=on divides IOPS by 3 in my tests, and --inode-file-handles=never does not resolve any of the above loop problems.
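For reference, a simple way to reproduce that comparison is a short fio run inside the guest against the virtiofs mount, once with the share exported with xattr='on' and once with xattr='off' (the parameters below are just an example, not my exact test):

fio --name=randrw --directory=/restic_stash --rw=randrw --bs=4k --size=256M --ioengine=libaio --iodepth=16 --runtime=60 --time_based --group_reporting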

Not usable for me right now. Thank you for your time, and sorry for the noise in the ZFS issue tracker.

zeigerpuppy commented 1 year ago

May be related... I have noticed that virtiofs with caching enabled leaves open file handles, eventually resulting in "too many open files". I'm not sure if this is due to an interaction with ZFS.

Disabling caching has a significant performance hit, which could maybe be remedied by the upcoming direct_io options?

https://gitlab.com/virtio-fs/virtiofsd/-/issues/121
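A crude way to watch this on the host (assuming the process is named virtiofsd and only one instance is running, so pgrep returns the right PID):

# ls /proc/$(pgrep -o virtiofsd)/fd | wc -l

Re-running that while the guest works on the share should show whether the handle count keeps growing.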

deajan commented 1 year ago

For what it's worth, I found the issue. It's not virtiofs-related but zfs-related.

In my setup, when I use zfs mount mydataset, the dataset seems to be mounted for the current user only. If I use systemctl restart zfs-mount instead, it is mounted for all users, including the virtiofs bridge, and everything works fine.
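My guess is that this is about mount namespaces/propagation rather than users; one way to check (as root, assuming the virtiofsd PID is easy to find) is to compare what a shell sees with what the virtiofsd process sees in its own mount namespace:

findmnt /backup/restic_stash
nsenter -t $(pgrep -o virtiofsd) -m findmnt /backup/restic_stash

If the second command shows the parent filesystem instead of the dataset, the mount simply was not propagated into that namespace.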

Not sure whether this is a bug or a feature yet, but I've opened a discussion here