The qemu config includes a specific block size:
<blockio physical_block_size='16384'/>
while the zfs create command doesn't specify a block size (it lacks -b; I'm not sure what the default is, it might also be inherited). Maybe this is a case of a mismatch between qemu's opinion and the actual ZFS block size? Please check the actual zvol block size with:
zfs get volblocksize rpool/zvols/ubuntu-23.10
AFAIK <blockio physical_block_size='16384'/> lets the guest know that writes are best done in 16k units, but it can still write in 512-byte sectors. E.g. ext4 with a 4k block size gets mounted with stripe=4.
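If it helps, a quick way to check from inside the guest whether ext4 actually picked up that hint (a rough sketch; the vda device and partition names are just examples):
cat /sys/block/vda/queue/{logical_block_size,physical_block_size}   # what the virtual disk advertises
tune2fs -l /dev/vda2 | grep -i raid    # shows "RAID stride"/"RAID stripe width" if ext4 recorded them
grep vda2 /proc/mounts                 # ext4 lists stripe=N among the mount options when it is in effect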
> while the zfs create command doesn't specify a block size (it lacks -b; I'm not sure what the default is, it might also be inherited)

zvols don't inherit the blocksize and they default to 16K, so it matches.
> AFAIK <blockio physical_block_size='16384'/> lets the guest know that writes are best done in 16k units, but it can still write in 512-byte sectors. E.g. ext4 with a 4k block size gets mounted with stripe=4.

I admit I'm not very familiar with the qemu/kvm way of doing things, but in the case of Xen and Linux guests, volblocksize gets translated into the physical sector size on the guest VM side.
> zvols don't inherit the blocksize and they default to 16K, so it matches.

Yes, they do: I use it on a "swap" dataset under which all the swaps reside, and they do inherit volblocksize.
Additionally, UEFI might have a problem with sector sizes other than the usual 512 or 4096. A way to test it, excluding ZFS, would be to create an image file and set it up via losetup -b 16384.
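For example, something along these lines (a rough sketch; the backing file path and size are arbitrary):
truncate -s 16G /var/tmp/blk16k.img                            # sparse backing file
losetup -f --show --sector-size 16384 /var/tmp/blk16k.img      # prints the loop device, e.g. /dev/loop0
# then point the VM's disk at the printed /dev/loopN and retry the install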
PS. I'm not a ZFS developer/contributor, just someone with experience with ZFS zvols + Xen VMs.
> Yes, they do: I use it on a "swap" dataset under which all the swaps reside, and they do inherit volblocksize.

I don't know how this is possible, because datasets don't have any volblocksize property; they have recordsize, from which zvols don't inherit:
[root@arch-phoenix ~]# zfs get all rpool/zvols | grep size
rpool/zvols recordsize 128K default
rpool/zvols dnodesize auto inherited from rpool
[root@arch-phoenix ~]# zfs create -s -V 100G -o compression=lz4 rpool/zvols/ubuntu-23.10
[root@arch-phoenix ~]# zfs get all rpool/zvols/ubuntu-23.10 | grep size
rpool/zvols/ubuntu-23.10 volsize 100G local
rpool/zvols/ubuntu-23.10 volblocksize 16K default
> Additionally, UEFI might have a problem with sector sizes other than the usual 512 or 4096.

I've already tried with CSM (BIOS) and it doesn't work either. Also, 16K is the default volblocksize, so it is supposed to work (and it does work if 16K is the recordsize of the dataset containing a raw image). Even removing <blockio physical_block_size='16384'/> leads to the same unbootable system.
Next time I reboot I will try with the (default) zvol_use_blk_mq=0.
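For reference, checking the live value and pinning it across reboots would look roughly like this (a sketch; the modprobe.d file name is just the usual convention):
cat /sys/module/zfs/parameters/zvol_use_blk_mq                       # 1 = blk-mq path, 0 = legacy BIO path
echo "options zfs zvol_use_blk_mq=0" >> /etc/modprobe.d/zfs.conf     # applied the next time the module loads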
Sorry, you're right about inheriting volblocksize; I thought it uses recordsize, but apparently it doesn't. Once again, my apologies for the mistake.
NAME PROPERTY VALUE SOURCE
belt/swaps recordsize 16K local
belt/swaps volblocksize - -
belt/swaps/vmtemplates/debian11 recordsize - -
belt/swaps/vmtemplates/debian11 volblocksize 8K default
However, it seems that 8K is the default for zvols - at least on 2.1.X
> However, it seems that 8K is the default for zvols - at least on 2.1.X

It changed in 2.2.
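A quick way to see which default a given system actually uses is to create a throwaway zvol without -b and read the property back (a sketch; the dataset name is just an example):
zfs create -s -V 1G rpool/zvols/volblock-test
zfs get volblocksize rpool/zvols/volblock-test     # 8K on 2.1.x, 16K on 2.2+
zfs destroy rpool/zvols/volblock-test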
I've just tried with zvol_use_blk_mq=0 and it works flawlessly, so the issue is definitely with zvol_use_blk_mq=1.
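For anyone else reproducing this, a rough way to confirm which code path a zvol is actually on after import (zd0 is an example device name, and the mq/ directory check is my assumption about how blk-mq request queues show up in sysfs):
cat /sys/module/zfs/parameters/zvol_use_blk_mq
ls -d /sys/block/zd0/mq 2>/dev/null && echo "zd0 is on blk-mq" || echo "zd0 is on the BIO path"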
Thanks for opening this issue - I just self-assigned to take a look into this. The <blockio physical_block_size='16384'> is interesting, as I was only able to test on 4k drives when I was writing the blk-mq code.
I might wildly speculate that this and #14533 are the same bug, since it seems like they're related to having mismatched sector sizes triggering an edge case.
zvol defaults changed between 2.1 and the yet-unreleased 2.2 (in 72f0521aba) because the tradeoff for raidz/draid/compression improvements was viewed as worth it over the larger size, I think. (I don't believe even 2.1.13 has that backported?)
> The <blockio physical_block_size='16384'> is interesting, as I was only able to test on 4k drives when I was writing the blk-mq code.

So you only tested it with volblocksize=4K? Because <blockio physical_block_size='16384'> is not the issue here: it doesn't work either if I omit it.
P.S. My NVMe block size is 4K as well; I decided to stick to the default volblocksize=16K because it's a good compromise between performance, compression, etc.
> I might wildly speculate that this and https://github.com/openzfs/zfs/issues/14533 are the same bug, since it seems like they're related to having mismatched sector sizes triggering an edge case.

Uhm, I'm not sure: the guest's ext4 gets mounted with stripe=4 and raw images on a recordsize=16K dataset work flawlessly, and recordsize=16K works flawlessly here too as long as I omit the zvol_use_blk_mq=1 zfs module parameter. It must be something specific to the multi-queue code.
By the way here is some more context:
[niko@arch-phoenix ~]$ zpool get all rpool
NAME PROPERTY VALUE SOURCE
rpool size 3.62T -
rpool capacity 6% -
rpool altroot - default
rpool health ONLINE -
rpool guid 5183011630881450907 -
rpool version - default
rpool bootfs rpool/ROOT/archlinux local
rpool delegation on default
rpool autoreplace off default
rpool cachefile - default
rpool failmode wait default
rpool listsnapshots off default
rpool autoexpand off default
rpool dedupratio 1.00x -
rpool free 3.40T -
rpool allocated 228G -
rpool readonly off -
rpool ashift 12 local
rpool comment - default
rpool expandsize - -
rpool freeing 0 -
rpool fragmentation 0% -
rpool leaked 0 -
rpool multihost off default
rpool checkpoint - -
rpool load_guid 806659593871550708 -
rpool autotrim on local
rpool compatibility openzfs-2.2-linux local
rpool bcloneused 206M -
rpool bclonesaved 227M -
rpool bcloneratio 2.10x -
rpool feature@async_destroy enabled local
rpool feature@empty_bpobj active local
rpool feature@lz4_compress active local
rpool feature@multi_vdev_crash_dump enabled local
rpool feature@spacemap_histogram active local
rpool feature@enabled_txg active local
rpool feature@hole_birth active local
rpool feature@extensible_dataset active local
rpool feature@embedded_data active local
rpool feature@bookmarks enabled local
rpool feature@filesystem_limits enabled local
rpool feature@large_blocks enabled local
rpool feature@large_dnode active local
rpool feature@sha512 enabled local
rpool feature@skein enabled local
rpool feature@edonr enabled local
rpool feature@userobj_accounting active local
rpool feature@encryption active local
rpool feature@project_quota active local
rpool feature@device_removal enabled local
rpool feature@obsolete_counts enabled local
rpool feature@zpool_checkpoint enabled local
rpool feature@spacemap_v2 active local
rpool feature@allocation_classes enabled local
rpool feature@resilver_defer enabled local
rpool feature@bookmark_v2 enabled local
rpool feature@redaction_bookmarks enabled local
rpool feature@redacted_datasets enabled local
rpool feature@bookmark_written enabled local
rpool feature@log_spacemap active local
rpool feature@livelist active local
rpool feature@device_rebuild enabled local
rpool feature@zstd_compress active local
rpool feature@draid enabled local
rpool feature@zilsaxattr active local
rpool feature@head_errlog active local
rpool feature@blake3 enabled local
rpool feature@block_cloning active local
rpool feature@vdev_zaps_v2 active local
[niko@arch-phoenix ~]$ zfs get all rpool/zvols
NAME PROPERTY VALUE SOURCE
rpool/zvols type filesystem -
rpool/zvols creation Wed Oct 4 16:21 2023 -
rpool/zvols used 192K -
rpool/zvols available 3.29T -
rpool/zvols referenced 192K -
rpool/zvols compressratio 1.00x -
rpool/zvols mounted no -
rpool/zvols quota none default
rpool/zvols reservation none default
rpool/zvols recordsize 128K default
rpool/zvols mountpoint none inherited from rpool
rpool/zvols sharenfs off default
rpool/zvols checksum on default
rpool/zvols compression zstd inherited from rpool
rpool/zvols atime on default
rpool/zvols devices off inherited from rpool
rpool/zvols exec on default
rpool/zvols setuid on default
rpool/zvols readonly off default
rpool/zvols zoned off default
rpool/zvols snapdir hidden default
rpool/zvols aclmode discard default
rpool/zvols aclinherit restricted default
rpool/zvols createtxg 45982 -
rpool/zvols canmount off local
rpool/zvols xattr sa inherited from rpool
rpool/zvols copies 1 default
rpool/zvols version 5 -
rpool/zvols utf8only on -
rpool/zvols normalization formD -
rpool/zvols casesensitivity sensitive -
rpool/zvols vscan off default
rpool/zvols nbmand off default
rpool/zvols sharesmb off default
rpool/zvols refquota none default
rpool/zvols refreservation none default
rpool/zvols guid 8359952347687084503 -
rpool/zvols primarycache all default
rpool/zvols secondarycache all default
rpool/zvols usedbysnapshots 0B -
rpool/zvols usedbydataset 192K -
rpool/zvols usedbychildren 0B -
rpool/zvols usedbyrefreservation 0B -
rpool/zvols logbias latency default
rpool/zvols objsetid 184 -
rpool/zvols dedup off default
rpool/zvols mlslabel none default
rpool/zvols sync standard default
rpool/zvols dnodesize auto inherited from rpool
rpool/zvols refcompressratio 1.00x -
rpool/zvols written 192K -
rpool/zvols logicalused 69K -
rpool/zvols logicalreferenced 69K -
rpool/zvols volmode default default
rpool/zvols filesystem_limit none default
rpool/zvols snapshot_limit none default
rpool/zvols filesystem_count none default
rpool/zvols snapshot_count none default
rpool/zvols snapdev hidden default
rpool/zvols acltype posix inherited from rpool
rpool/zvols context none default
rpool/zvols fscontext none default
rpool/zvols defcontext none default
rpool/zvols rootcontext none default
rpool/zvols relatime on inherited from rpool
rpool/zvols redundant_metadata all default
rpool/zvols overlay on default
rpool/zvols encryption aes-256-gcm -
rpool/zvols keylocation none default
rpool/zvols keyformat passphrase -
rpool/zvols pbkdf2iters 350000 -
rpool/zvols encryptionroot rpool -
rpool/zvols keystatus available -
rpool/zvols special_small_blocks 0 default
rpool/zvols org.zfsbootmenu:keysource roool/ROOT/archlinux inherited from rpool
It might be worth exploring whether this has something to do with either of the following (a quick check is sketched below):
- autotrim=on on the pool
- the guest having discard="unmap"
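For instance, ruling out autotrim could be as simple as this (a sketch; rpool is the pool name used earlier in the thread):
zpool get autotrim rpool
zpool set autotrim=off rpool     # then redo the install and see whether it still breaks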
@darkbasic
> So you only tested it with volblocksize=4K?

By 4k drives I mean I only tested on drives with a 4k physical sector size. I did test with lots of different volblocksize values.

> Because <blockio physical_block_size='16384'> is not the issue here: it doesn't work either if I omit it.

If you omit it, does the VM then default the virtual block device to having 4k physical sectors? Or more specifically, what does the VM report as the physical sector size it sees for its underlying block device (which under the covers is the zvol)? For example, on my VM for /dev/vda:
$ cat /sys/class/block/vda/queue/{hw_sector_size,physical_block_size}
512
512
I'm also confused about something - you listed the properties for rpool/zvols, which from the name sounds like a zvol, but it's being reported as a filesystem:
[niko@arch-phoenix ~]$ zfs get all rpool/zvols
NAME PROPERTY VALUE SOURCE
rpool/zvols type filesystem -
...
Are you expecting rpool/zvols to be a volume?
> Are you expecting rpool/zvols to be a volume?

Judging from https://github.com/openzfs/zfs/issues/15351#issuecomment-1749486202, I would say he isn't; the output for rpool/zvols is likely meant to show the non-default properties on the parent dataset that will be inherited by the zvol.
@darkbasic: maybe you could try to determine the on-disk differences between fresh installs with both settings of zvol_use_blk_mq? Something like diffs between the filesystems in the VM images, to detect broken files... which could provide some clues to what is going wrong...
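Roughly along these lines, for example (a sketch; the zd* device names and mount points are made up, and two fresh installs will always differ in logs, machine-id and the like, so packaged files are the interesting place to look):
mkdir -p /mnt/blkmq /mnt/nomq
mount -o ro /dev/zd0p2  /mnt/blkmq     # install done with zvol_use_blk_mq=1
mount -o ro /dev/zd16p2 /mnt/nomq      # install done with zvol_use_blk_mq=0
diff -rq /mnt/blkmq/usr /mnt/nomq/usr | head     # corrupted packaged files should show up here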
> Judging from #15351 (comment), I would say he isn't; the output for rpool/zvols is likely meant to show the non-default properties on the parent dataset that will be inherited by the zvol.
Exactly.
> If you omit it, does the VM then default the virtual block device to having 4k physical sectors? Or more specifically, what does the VM report as the physical sector size it sees for its underlying block device?
This is the output with <blockio physical_block_size='16384'/>:
niko@niko-pc-q35-8-1:~$ cat /sys/class/block/vda/queue/{hw_sector_size,physical_block_size}
512
16384
This is the output without it:
ubuntu@ubuntu:~$ cat /sys/class/block/vda/queue/{hw_sector_size,physical_block_size}
512
512
> maybe you could try to determine the on-disk differences between fresh installs with both settings of zvol_use_blk_mq? Something like diffs between the filesystems in the VM images, to detect broken files... which could provide some clues to what is going wrong...
I will give it a try. If I can't find any pattern I will upload the first GB of both images and you can have a look too.
@darkbasic out of an abundance of caution, I put out a PR (#15378) to not allow the user to enable blk-mq until we can nail down this issue. I'm going to try to reproduce it with a brd ramdisk today.
Also - what did you use for your pool configuration? Is the pool for your zvol backed by disks or file-based vdevs?
> Also - what did you use for your pool configuration? Is the pool for your zvol backed by disks or file-based vdevs?
It's backed by a single NVMe drive: a 4 TB WD Black SN850X formatted with a 4K logical block address size instead of the default 512 bytes.
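For reference, checking and switching an NVMe drive's LBA format is usually done with nvme-cli along these lines (a sketch; the device name and format index are examples, and reformatting wipes the drive):
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"     # lists supported formats; "in use" marks the current one
nvme format /dev/nvme0n1 --lbaf=1                  # switch to whichever index corresponds to 4096-byte LBAs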
@darkbasic I was able to reproduce the failure using a pool consisting of a 16GB brd (ramdisk) block device. I exported a zvol from the pool and ran my tests (a rough sketch of a similar setup follows the list):
- non-blk-mq, 16k physical block
- blk-mq, 16k physical block
- blk-mq, 4k physical block
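Something like the following approximates that ramdisk-backed setup (a sketch with illustrative names, not the exact commands from the test run):
modprobe brd rd_nr=1 rd_size=16777216      # one 16 GiB ramdisk at /dev/ram0 (rd_size is in KiB)
zpool create testpool /dev/ram0
zfs create -s -V 12G -o volblocksize=16k testpool/vm
# attach /dev/zvol/testpool/vm to the VM as its disk and run the installer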
I'm now going to check the MBR on the working vs non-working installs.
The MBRs are byte-for-byte the same. Partitioning looks the same:
Device Start End Sectors Size Type
/dev/zd0p1 2048 4095 2048 1M BIOS boot
/dev/zd0p2 4096 23066623 23062528 11G Linux filesystem
Partition 1 is byte-for-byte the same. Partition 2 differs.
Just speculating - maybe there's some problem with our blk-mq code related to partitioning? I'll keep looking.
@tonyhutter sorry for the cross posting (I'll delete the message afterwards) but I would like to avoid opening another bug report if it's some kind of hardware fault. I have another server (Debian 12 + zfs 2.1) where one of its VMs exhibited the following behavior twice in the past month:
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 51670824 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Oct 11 03:04:26 kernel: Aborting journal on device vda2-8.
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 51644416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 51644416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: Buffer I/O error on dev vda2, logical block 0, lost sync page write
Oct 11 03:04:26 kernel: Buffer I/O error on dev vda2, logical block 6324224, lost sync page write
Oct 11 03:04:26 kernel: EXT4-fs (vda2): previous I/O error to superblock detected
Oct 11 03:04:26 kernel: JBD2: Error -5 detected when updating journal superblock for vda2-8.
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: Buffer I/O error on dev vda2, logical block 0, lost sync page write
Oct 11 03:04:26 kernel: EXT4-fs (vda2): I/O error while writing superblock
Oct 11 03:04:26 kernel: EXT4-fs error (device vda2): ext4_journal_check_start:83: Detected aborted journal
Oct 11 03:04:26 kernel: EXT4-fs (vda2): Remounting filesystem read-only
Oct 11 03:04:26 kernel: EXT4-fs error (device vda2): ext4_journal_check_start:83: Detected aborted journal
The VM is backed by raw files stored in a dataset. The zfs pool is on a single Optane drive, which has quite some write endurance, so it should hardly fail. The VM uses <blockio physical_block_size="4096"/>, the pool has ashift=9, and the dataset has recordsize=4K, encryption and lz4 compression.
Scrubbing doesn't find any error. Maybe the memory? But shouldn't it corrupt the filesystem and result in scrub errors?
I didn't have any problem before upgrading the host operating system (and thus the zfs version) a few months ago, but I've also upgraded the BIOS through several major revisions which might have changed the memory timings/speeds or anything else.
EDIT: it's a bug, but in QEMU: https://gitlab.com/qemu-project/qemu/-/issues/1404
Some interesting results today - I installed Ubuntu on a blk-mq-enabled zvol and tried to boot it. It failed as expected. I then exported the pool, disabled blk-mq, re-imported the pool, and was able to boot Ubuntu from the previously-installed-to zvol. So it appears the writes to the zvol are correct under blk-mq; it's just the reads back that are the issue. I'm still digging.
@darkbasic this fix works for me: https://github.com/openzfs/zfs/pull/15439 Can you please give it a try on your system?
@tonyhutter I confirm it works, thanks!
Is there a way to set zvol_use_blk_mq=0 on an existing zvol? This bug affects TrueNAS 23.10.1, and so far it looks like my options are:
It would be ideal to set zvol_use_blk_mq=0 now and then flip zvol_use_blk_mq back to 1 once a version of TrueNAS with a fixed version of ZFS is released.
@seanthegeek zvol_use_blk_mq is read at import time, so you can export your pool, write zvol_use_blk_mq=0, and then re-import it, and it will use the old BIO codepath.
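In concrete terms, that sequence would look roughly like this (a sketch; the pool name is an example, and it assumes nothing is currently using the pool's datasets):
zpool export tank
echo 0 > /sys/module/zfs/parameters/zvol_use_blk_mq
zpool import tank
cat /sys/module/zfs/parameters/zvol_use_blk_mq     # verify it now reads 0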
@seanthegeek As far as I can see, the patch for this issue was merged into ZFS 2.2.1. Current versions of TrueNAS should already include 2.2.2+. But I wonder if the corruption could have happened earlier and is just being noticed after the upgrade?
@amotin The corruption seems to be on read, not write, because I can switch boot environments to downgrade TrueNAS and the VMs will boot again.
@seanthegeek Could you run zfs -V to be sure which ZFS version we are talking about?
BTW, there is a TrueNAS 23.10.1.1 hot-fix release now.
@amotin TrueNAS SCALE 23.10.1.1 sudo zfs -V output:
zfs-2.2.2-1
zfs-kmod-2.2.2-1
It has the same issue with Debian or Ubuntu VMs not booting, starting in TrueNAS SCALE 23.10.1, which had the same ZFS version.
The last version that boots the VMs properly is TrueNAS SCALE 23.10.0.1:
zfs-2.2.0-rc4
zfs-kmod-2.2.0-rc4
Each of my VM zvols is stored in an unencrypted zpool of mirrored data SSDs with ZSTD-5 compression. Each zvol was created using the Add Zvol wizard in the dataset view, with passphrase encryption configured. The VMs are created using the Create Virtual Machine web UI. The boot method is set to the default (UEFI). Once Debian is installed using the standard Debian install process, the VMs can reboot successfully until the host system is upgraded beyond TrueNAS SCALE 23.10.0.1. The bug reported in the TrueNAS JIRA was introduced sometime between zfs-2.2.0-rc4 and zfs-2.2.2-1. I don't think it is related to this bug, so I'll keep the discussion about that going in JIRA.
System information
Describe the problem you're observing
Installing a distro in virt-manager leads to an unbootable virtual machine if the storage is backed by ZVOLs. I'm using zvol_use_blk_mq=1. ~~But I'm not sure if this is the culprit because I didn't try disabling it yet.~~ I've just tried with zvol_use_blk_mq=0 and it works flawlessly, so the culprit is definitely zvol_use_blk_mq=1. This is the libvirt disk:
Describe how to reproduce the problem
Download the latest snapshot of Ubuntu 23.10. Create a zvol:
Install Ubuntu in the zvol using virt-manager. Reboot the VM and watch it fail to boot:
Using raw images instead leads to a perfectly bootable system.
Include any warning/errors/backtraces from the system logs