openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZVOL data corruption with zvol_use_blk_mq=1 #15351

Closed darkbasic closed 1 year ago

darkbasic commented 1 year ago

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version
Kernel Version 6.6.0-rc4
Architecture amd64
OpenZFS Version git branch zfs-2.2-release (229ca7d738ccbf4c55076977467ee93e20b6f01b) + 6.6 compatibility patches

Describe the problem you're observing

Installing a distro in virt-manager leads to an unbootable virtual machine if the storage is backed by ZVOLs. I'm using zvol_use_blk_mq=1. ~~But I'm not sure if this is the culprit because I didn't try disabling it yet.~~ _I've just tried with zvol_use_blk_mq=0 and it works flawlessly, so the culprit is definitely zvol_use_blk_mq=1._

This is the libvirt disk:

<disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" error_policy="stop" discard="unmap" io="native"/>
  <source dev="/dev/zvol/rpool/zvols/ubuntu-23.10"/>
  <blockio physical_block_size='16384'/>
  <target dev="vda" bus="virtio"/>
  <boot order="2"/>
</disk>

Describe how to reproduce the problem

Download the latest snapshot of Ubuntu 23.10. Create a zvol:

zfs create -s -V 100G -o compression=lz4 rpool/zvols/ubuntu-23.10

Install Ubuntu on the zvol using virt-manager. Reboot the VM and watch it fail to boot:

[screenshot: the VM failing to boot]

Using raw images instead leads to a perfectly bootable system.

Include any warning/errors/backtraces from the system logs

filip-paczynski commented 1 year ago

The qemu config includes a specific blocksize:

  <blockio physical_block_size='16384'/>

while the zfs create command doesn't specify a blocksize (it lacks -b; I'm not sure what the default is, it might also be inherited). Maybe this is a case of a mismatch between QEMU's opinion and the ZFS blocksize? Please check the actual zvol blocksize with:

zfs get volblocksize rpool/zvols/ubuntu-23.10
darkbasic commented 1 year ago

AFAIK <blockio physical_block_size='16384'/> lets the guest know that writes are best done in 16k chunks, but it can still write in 512-byte sectors. E.g. ext4 with a 4k block size gets mounted with stripe=4.

while the zfs create command doesn't specify a blocksize (it lacks -b; I'm not sure what the default is, it might also be inherited)

zvols don't inherit the blocksize and they default to 16K, so it matches.

filip-paczynski commented 1 year ago

AFAIK <blockio physical_block_size='16384'/> lets the guest know that writes are best done in 16k chunks, but it can still write in 512-byte sectors. E.g. ext4 with a 4k block size gets mounted with stripe=4.

I admit I'm not very familiar with the QEMU/KVM way of doing things, but in the case of Xen with a Linux guest, volblocksize gets translated into the physical sector size on the guest VM side.

zvols don't inherit the blocksize and they default to 16K, so it matches.

yes, they do, I use it on a "swap" dataset under which all the swaps reside and they do inherit volblocksize

Additionally, UEFI might have a problem with sector sizes different than usual 512 or 4096.

The way to test it, excluding ZFS, would be to create an image file and set it up via losetup -b 16384
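A minimal sketch of that loop-device test (the backing file path is illustrative; note that the kernel's loop driver may reject sector sizes larger than the page size, in which case this particular test may not be possible):

truncate -s 100G /tmp/blocktest.img                              # sparse backing file
losetup --find --show --sector-size 16384 /tmp/blocktest.img     # attach it with 16k sectors, prints /dev/loopN

The resulting /dev/loopN could then be handed to the VM in place of the zvol.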

PS. I'm not ZFS developer/contributor, just someone with experience with ZFS zvols + xen vms.

darkbasic commented 1 year ago

yes, they do, I use it on a "swap" dataset under which all the swaps reside and they do inherit volblocksize

I don't know how this is possible, because datasets don't have a volblocksize property; they have recordsize, which zvols don't inherit:

[root@arch-phoenix ~]# zfs get all rpool/zvols | grep size
rpool/zvols  recordsize                 128K                       default
rpool/zvols  dnodesize                  auto                       inherited from rpool
[root@arch-phoenix ~]# zfs create -s -V 100G -o compression=lz4 rpool/zvols/ubuntu-23.10
[root@arch-phoenix ~]# zfs get all rpool/zvols/ubuntu-23.10 | grep size
rpool/zvols/ubuntu-23.10  volsize                    100G                       local
rpool/zvols/ubuntu-23.10  volblocksize               16K                        default

Additionally, UEFI might have a problem with sector sizes different than usual 512 or 4096.

I've already tried with CSM (BIOS) and it doesn't work either. Also, 16K is the default volblocksize, so it's supposed to work (and it does work when 16K is the recordsize of the dataset containing a raw image). Even removing <blockio physical_block_size='16384'/> leads to the same unbootable system.

Next time I reboot I will try with the (default) zvol_use_blk_mq=0.
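For reference, a minimal sketch of making that setting persistent across reboots, assuming the usual modprobe.d convention (the file name is just a convention; if the zfs module is loaded from the initramfs, the initramfs may need to be regenerated):

echo "options zfs zvol_use_blk_mq=0" >> /etc/modprobe.d/zfs.conf   # read the next time the zfs module loads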

filip-paczynski commented 1 year ago

Sorry, you're right about volblocksize not being inherited. I thought it came from recordsize, but apparently it doesn't. Once again, my apologies for the mistake.

NAME                             PROPERTY      VALUE     SOURCE
belt/swaps                       recordsize    16K       local
belt/swaps                       volblocksize  -         -
belt/swaps/vmtemplates/debian11  recordsize    -         -
belt/swaps/vmtemplates/debian11  volblocksize  8K        default

However, it seems that 8K is the default for zvols - at least on 2.1.X

darkbasic commented 1 year ago

However, it seems that 8K is the default for zvols - at least on 2.1.X

It changed in 2.2.

darkbasic commented 1 year ago

I've just tried with zvol_use_blk_mq=0 and it works flawlessly so the issue is definitely with zvol_use_blk_mq=1.

tonyhutter commented 1 year ago

Thanks for opening this issue - I've just self-assigned it to look into this. The <blockio physical_block_size='16384'> is interesting, as I was only able to test on 4k drives when I was writing the blk-mq code.

rincebrain commented 1 year ago

I might wildly speculate that this and #14533 are the same bug, since it seems like they're related to having mismatched sector sizes triggering an edge case.

zvol defaults changed between 2.1 and the yet-unreleased 2.2 (in 72f0521aba) because the tradeoff for raidz/draid/compression improvements was viewed as worth it over the larger size, I think. (I don't believe even 2.1.13 has that backported?)

darkbasic commented 1 year ago

The <blockio physical_block_size='16384'> is interesting, as I was only able to test on 4k drives when I was writing the blk-mq code.

So you only tested it with volblocksize=4K? Because <blockio physical_block_size='16384'> is not the issue here, it doesn't work either if I omit it.

P.S. My NVMe block size is 4K as well, I decided to stick to the default volblocksize=16K because it's a good compromise between performance/compression/etc.

I might wildly speculate that this and https://github.com/openzfs/zfs/issues/14533 are the same bug, since it seems like they're related to having mismatched sector sizes triggering an edge case.

Uhm, I'm not sure:

It must be something specific to the multi-queue code.

darkbasic commented 1 year ago

By the way here is some more context:

[niko@arch-phoenix ~]$ zpool get all rpool
NAME   PROPERTY                       VALUE                          SOURCE
rpool  size                           3.62T                          -
rpool  capacity                       6%                             -
rpool  altroot                        -                              default
rpool  health                         ONLINE                         -
rpool  guid                           5183011630881450907            -
rpool  version                        -                              default
rpool  bootfs                         rpool/ROOT/archlinux           local
rpool  delegation                     on                             default
rpool  autoreplace                    off                            default
rpool  cachefile                      -                              default
rpool  failmode                       wait                           default
rpool  listsnapshots                  off                            default
rpool  autoexpand                     off                            default
rpool  dedupratio                     1.00x                          -
rpool  free                           3.40T                          -
rpool  allocated                      228G                           -
rpool  readonly                       off                            -
rpool  ashift                         12                             local
rpool  comment                        -                              default
rpool  expandsize                     -                              -
rpool  freeing                        0                              -
rpool  fragmentation                  0%                             -
rpool  leaked                         0                              -
rpool  multihost                      off                            default
rpool  checkpoint                     -                              -
rpool  load_guid                      806659593871550708             -
rpool  autotrim                       on                             local
rpool  compatibility                  openzfs-2.2-linux              local
rpool  bcloneused                     206M                           -
rpool  bclonesaved                    227M                           -
rpool  bcloneratio                    2.10x                          -
rpool  feature@async_destroy          enabled                        local
rpool  feature@empty_bpobj            active                         local
rpool  feature@lz4_compress           active                         local
rpool  feature@multi_vdev_crash_dump  enabled                        local
rpool  feature@spacemap_histogram     active                         local
rpool  feature@enabled_txg            active                         local
rpool  feature@hole_birth             active                         local
rpool  feature@extensible_dataset     active                         local
rpool  feature@embedded_data          active                         local
rpool  feature@bookmarks              enabled                        local
rpool  feature@filesystem_limits      enabled                        local
rpool  feature@large_blocks           enabled                        local
rpool  feature@large_dnode            active                         local
rpool  feature@sha512                 enabled                        local
rpool  feature@skein                  enabled                        local
rpool  feature@edonr                  enabled                        local
rpool  feature@userobj_accounting     active                         local
rpool  feature@encryption             active                         local
rpool  feature@project_quota          active                         local
rpool  feature@device_removal         enabled                        local
rpool  feature@obsolete_counts        enabled                        local
rpool  feature@zpool_checkpoint       enabled                        local
rpool  feature@spacemap_v2            active                         local
rpool  feature@allocation_classes     enabled                        local
rpool  feature@resilver_defer         enabled                        local
rpool  feature@bookmark_v2            enabled                        local
rpool  feature@redaction_bookmarks    enabled                        local
rpool  feature@redacted_datasets      enabled                        local
rpool  feature@bookmark_written       enabled                        local
rpool  feature@log_spacemap           active                         local
rpool  feature@livelist               active                         local
rpool  feature@device_rebuild         enabled                        local
rpool  feature@zstd_compress          active                         local
rpool  feature@draid                  enabled                        local
rpool  feature@zilsaxattr             active                         local
rpool  feature@head_errlog            active                         local
rpool  feature@blake3                 enabled                        local
rpool  feature@block_cloning          active                         local
rpool  feature@vdev_zaps_v2           active                         local
[niko@arch-phoenix ~]$ zfs get all rpool/zvols
NAME         PROPERTY                   VALUE                      SOURCE
rpool/zvols  type                       filesystem                 -
rpool/zvols  creation                   Wed Oct  4 16:21 2023      -
rpool/zvols  used                       192K                       -
rpool/zvols  available                  3.29T                      -
rpool/zvols  referenced                 192K                       -
rpool/zvols  compressratio              1.00x                      -
rpool/zvols  mounted                    no                         -
rpool/zvols  quota                      none                       default
rpool/zvols  reservation                none                       default
rpool/zvols  recordsize                 128K                       default
rpool/zvols  mountpoint                 none                       inherited from rpool
rpool/zvols  sharenfs                   off                        default
rpool/zvols  checksum                   on                         default
rpool/zvols  compression                zstd                       inherited from rpool
rpool/zvols  atime                      on                         default
rpool/zvols  devices                    off                        inherited from rpool
rpool/zvols  exec                       on                         default
rpool/zvols  setuid                     on                         default
rpool/zvols  readonly                   off                        default
rpool/zvols  zoned                      off                        default
rpool/zvols  snapdir                    hidden                     default
rpool/zvols  aclmode                    discard                    default
rpool/zvols  aclinherit                 restricted                 default
rpool/zvols  createtxg                  45982                      -
rpool/zvols  canmount                   off                        local
rpool/zvols  xattr                      sa                         inherited from rpool
rpool/zvols  copies                     1                          default
rpool/zvols  version                    5                          -
rpool/zvols  utf8only                   on                         -
rpool/zvols  normalization              formD                      -
rpool/zvols  casesensitivity            sensitive                  -
rpool/zvols  vscan                      off                        default
rpool/zvols  nbmand                     off                        default
rpool/zvols  sharesmb                   off                        default
rpool/zvols  refquota                   none                       default
rpool/zvols  refreservation             none                       default
rpool/zvols  guid                       8359952347687084503        -
rpool/zvols  primarycache               all                        default
rpool/zvols  secondarycache             all                        default
rpool/zvols  usedbysnapshots            0B                         -
rpool/zvols  usedbydataset              192K                       -
rpool/zvols  usedbychildren             0B                         -
rpool/zvols  usedbyrefreservation       0B                         -
rpool/zvols  logbias                    latency                    default
rpool/zvols  objsetid                   184                        -
rpool/zvols  dedup                      off                        default
rpool/zvols  mlslabel                   none                       default
rpool/zvols  sync                       standard                   default
rpool/zvols  dnodesize                  auto                       inherited from rpool
rpool/zvols  refcompressratio           1.00x                      -
rpool/zvols  written                    192K                       -
rpool/zvols  logicalused                69K                        -
rpool/zvols  logicalreferenced          69K                        -
rpool/zvols  volmode                    default                    default
rpool/zvols  filesystem_limit           none                       default
rpool/zvols  snapshot_limit             none                       default
rpool/zvols  filesystem_count           none                       default
rpool/zvols  snapshot_count             none                       default
rpool/zvols  snapdev                    hidden                     default
rpool/zvols  acltype                    posix                      inherited from rpool
rpool/zvols  context                    none                       default
rpool/zvols  fscontext                  none                       default
rpool/zvols  defcontext                 none                       default
rpool/zvols  rootcontext                none                       default
rpool/zvols  relatime                   on                         inherited from rpool
rpool/zvols  redundant_metadata         all                        default
rpool/zvols  overlay                    on                         default
rpool/zvols  encryption                 aes-256-gcm                -
rpool/zvols  keylocation                none                       default
rpool/zvols  keyformat                  passphrase                 -
rpool/zvols  pbkdf2iters                350000                     -
rpool/zvols  encryptionroot             rpool                      -
rpool/zvols  keystatus                  available                  -
rpool/zvols  special_small_blocks       0                          default
rpool/zvols  org.zfsbootmenu:keysource  roool/ROOT/archlinux       inherited from rpool
darkbasic commented 1 year ago

It might be worth exploring whether this has something to do with either:

tonyhutter commented 1 year ago

@darkbasic

So you only tested it with volblocksize=4K

By 4k drives I mean I only tested on drives with a 4k physical sector size. I did test with lots of different volblocksize values.

Because blockio physical_block_size='16384' is not the issue here, it doesn't work either if I omit it.

If you omit it, does the VM then default the virtual block device to 4k physical sectors? Or more specifically, what does the VM report as the physical sector size it sees for its underlying block device (which under the covers is the zvol)? For example, on my VM for /dev/vda:

$ cat /sys/class/block/vda/queue/{hw_sector_size,physical_block_size}
512
512

I'm also confused about something - you listed the properties for rpool/zvols, which from the name sounds like a zvol, but it's being reported as a filesystem:

[niko@arch-phoenix ~]$ zfs get all rpool/zvols
NAME         PROPERTY                   VALUE                      SOURCE
rpool/zvols  type                       filesystem                 -
...

Are you expecting rpool/zvols to be a volume?

GregorKopka commented 1 year ago

Are you expecting rpool/zvols to be a volume?

Going by https://github.com/openzfs/zfs/issues/15351#issuecomment-1749486202 I would say that he didn't; the output for rpool/zvols is likely meant to show the non-default properties on the parent dataset that will be inherited by the zvol.

@darkbasic: maybe you could try to determine the on-disk differences between fresh installs with both settings of zvol_use_blk_mq? Something like diffs between the filesystems in the VM images, to detect broken files... which could provide some clues to what is going wrong...
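A rough sketch of that comparison, assuming both installs are kept around as separate zvols (dataset names and partition numbers are illustrative):

mkdir -p /mnt/good /mnt/bad
mount -o ro /dev/zvol/rpool/zvols/ubuntu-good-part2 /mnt/good   # install done with zvol_use_blk_mq=0
mount -o ro /dev/zvol/rpool/zvols/ubuntu-bad-part2 /mnt/bad     # install done with zvol_use_blk_mq=1
diff -qr /mnt/good /mnt/bad                                     # list files whose contents differ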

darkbasic commented 1 year ago

Going by #15351 (comment) I would say that he didn't; the output for rpool/zvols is likely meant to show the non-default properties on the parent dataset that will be inherited by the zvol.

Exactly.

If you omit it, does the VM then default the virtual block device to 4k physical sectors? Or more specifically, what does the VM report as the physical sector size it sees for its underlying block device?

This is the output with <blockio physical_block_size='16384'/>:

niko@niko-pc-q35-8-1:~$ cat /sys/class/block/vda/queue/{hw_sector_size,physical_block_size}
512
16384

This is the output without it:

ubuntu@ubuntu:~$ cat /sys/class/block/vda/queue/{hw_sector_size,physical_block_size}
512
512

maybe you could try to determine the on-disk differences between fresh installs with both settings of zvol_use_blk_mq? Something like diffs between the filesystems in the VM images, to detect broken files... which could provide some clues to what is going wrong...

I will give it a try. If I can't find any pattern I will upload the first GB of both images and you can have a look too.

tonyhutter commented 1 year ago

@darkbasic out of an abundance of caution, I put out a PR (#15378) to not allow the user to enable blk-mq until we can nail down this issue. I'm going to try to reproduce it with a brd ramdisk today.

Also - what did you use for your pool configuration? Is the pool for your zvol backed by disks or file-based vdevs?

darkbasic commented 1 year ago

Also - what did you use for your pool configuration? Is the pool for your zvol backed by disks or file-based vdevs?

It's backed by a single NVMe drive: a 4 TB WD Black SN850X formatted with a 4K logical block address size instead of the default 512 bytes.
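As a side note, one way to confirm the formatted LBA size, assuming nvme-cli is installed and the device name matches:

nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"      # the entry marked "in use" shows the data size
cat /sys/block/nvme0n1/queue/logical_block_size     # should report 4096 after the reformat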

tonyhutter commented 1 year ago

@darkbasic I was able to reproduce the failure using a pool consisting of a 16GB brd (ramdisk) block device. I exported a zvol from the pool and ran my tests:

non-blk-mq 16k physical block

blk-mq 16k physical block

blk-mq 4k physical block

I'm now going to check the MBR on the working vs non-working installs.
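For anyone trying to reproduce this, a rough sketch of a ramdisk-backed setup along these lines (sizes and names are illustrative, not the exact commands used):

modprobe brd rd_nr=1 rd_size=16777216      # one 16 GiB ramdisk at /dev/ram0 (rd_size is in KiB)
zpool create testpool /dev/ram0
zfs create -s -V 12G testpool/vm           # sparse zvol to install the guest into
# point the VM at /dev/zvol/testpool/vm and install with zvol_use_blk_mq=1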

tonyhutter commented 1 year ago

The MBRs are byte-for-byte the same. The partitioning looks the same:

Device     Start      End  Sectors Size Type
/dev/zd0p1  2048     4095     2048   1M BIOS boot
/dev/zd0p2  4096 23066623 23062528  11G Linux filesystem

Partition 1 is byte-for-byte the same. Partition 2 differs.

Just speculating - maybe there's some problem with our blk-mq code related to partitioning? I'll keep looking.
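One way to narrow that down, assuming partition 2 has been dumped to image files from a working and a non-working install (file names are illustrative), is to locate the first differing byte offsets:

cmp -l p2-blkmq.img p2-non-blkmq.img | head   # prints the offset and the differing byte values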

darkbasic commented 1 year ago

@tonyhutter sorry for the cross-posting (I'll delete the message afterwards), but I would like to avoid opening another bug report if it's some kind of hardware fault. I have another server (Debian 12 + ZFS 2.1) where one of its VMs has exhibited the following behavior twice in the past month:

Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 51670824 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Oct 11 03:04:26 kernel: Aborting journal on device vda2-8.
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 51644416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 51644416 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: Buffer I/O error on dev vda2, logical block 0, lost sync page write
Oct 11 03:04:26 kernel: Buffer I/O error on dev vda2, logical block 6324224, lost sync page write
Oct 11 03:04:26 kernel: EXT4-fs (vda2): previous I/O error to superblock detected
Oct 11 03:04:26 kernel: JBD2: Error -5 detected when updating journal superblock for vda2-8.
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: blk_update_request: I/O error, dev vda, sector 1050624 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Oct 11 03:04:26 kernel: Buffer I/O error on dev vda2, logical block 0, lost sync page write
Oct 11 03:04:26 kernel: EXT4-fs (vda2): I/O error while writing superblock
Oct 11 03:04:26 kernel: EXT4-fs error (device vda2): ext4_journal_check_start:83: Detected aborted journal
Oct 11 03:04:26 kernel: EXT4-fs (vda2): Remounting filesystem read-only
Oct 11 03:04:26 kernel: EXT4-fs error (device vda2): ext4_journal_check_start:83: Detected aborted journal

The VM is backed by raw files stored in a dataset. The ZFS pool is on a single Optane drive, which has plenty of write endurance, so it's unlikely to be failing. The VM uses <blockio physical_block_size="4096"/>, the pool has ashift=9, and the dataset has recordsize=4K, encryption, and lz4 compression. Scrubbing doesn't find any errors. Maybe the memory? But shouldn't that corrupt the filesystem and show up as scrub errors? I didn't have any problems before upgrading the host operating system (and thus the ZFS version) a few months ago, but I've also upgraded the BIOS through several major revisions, which might have changed the memory timings/speeds or anything else.

EDIT: it turned out to be a bug, but in QEMU: https://gitlab.com/qemu-project/qemu/-/issues/1404

tonyhutter commented 1 year ago

Some interesting results today: I installed Ubuntu on a blk-mq-enabled zvol and tried to boot it. It failed as expected. I then exported the pool, disabled blk-mq, re-imported the pool, and was able to boot Ubuntu from the previously-installed-to zvol. So it appears the writes to the zvol are correct under blk-mq; it's just the reads back that are the issue. I'm still digging.
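A simple way to double-check that interpretation, assuming the pool and zvol names from earlier, would be to hash the zvol while blk-mq is enabled and again after re-importing with it disabled; differing hashes would point squarely at the blk-mq read path:

sha256sum /dev/zvol/rpool/zvols/ubuntu-23.10   # with zvol_use_blk_mq=1
# export the pool, set zvol_use_blk_mq=0, re-import, then:
sha256sum /dev/zvol/rpool/zvols/ubuntu-23.10   # compare against the first hash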

tonyhutter commented 1 year ago

@darkbasic this fix works for me: https://github.com/openzfs/zfs/pull/15439 Can you please give it a try on your system?

darkbasic commented 1 year ago

@tonyhutter I confirm it works, thanks!

seanthegeek commented 9 months ago

Is there a way to set zvol_use_blk_mq=0 on an existing zvol? This bug affects TrueNAS 23.10.1, and so far it looks like my options are:

It would be ideal to set zvol_use_blk_mq=0 now and then flip zvol_use_blk_mq back to 1 once a version of TrueNAS with a fixed version of ZFS is released.

tonyhutter commented 9 months ago

@seanthegeek zvol_use_blk_mq is read at import time, so you can export your pool, set zvol_use_blk_mq=0, and then re-import it; it will use the old BIO codepath.
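In other words, something like this (pool name is illustrative; the parameter lives under /sys/module/zfs/parameters on a typical setup):

zpool export tank
echo 0 > /sys/module/zfs/parameters/zvol_use_blk_mq
zpool import tank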

amotin commented 9 months ago

@seanthegeek As I see it, the patch for this issue was merged into ZFS 2.2.1, and current versions of TrueNAS should already include 2.2.2+. But I wonder if the corruption could have happened earlier and just been noticed during the upgrade?

seanthegeek commented 9 months ago

@amotin The corruption seems to happen on read, not write, because I can switch boot environments to downgrade TrueNAS and the VMs will boot again.

amotin commented 9 months ago

@seanthegeek Could you run zfs -V to be sure what ZFS version we are talking about?

BTW, there is TrueNAS 23.10.1.1 hot-fix release now.

seanthegeek commented 9 months ago

@amotin TrueNAS SCALE 23.10.1.1 sudo zfs -V output:

zfs-2.2.2-1
zfs-kmod-2.2.2-1

It has the same issue with Debian or Ubuntu VMs not booting starting in TrueNAS SCALE 23.10.1, which had the same ZFS version.

The last version that boots the VMs properly is TrueNAS SCALE 23.10.0.1.

zfs-2.2.0-rc4
zfs-kmod-2.2.0-rc4

Each of my VM zvols is stored in an unencrypted zpool of mirrored data SSDs with ZSTD-5 compression. Each zvol was created using the Add Zvol wizard in the dataset view, with passphrase encryption configured. The VMs are created using the Create Virtual Machine web UI. The boot method is set to the default (UEFI). Once Debian is installed using the standard Debian install process, the VMs can reboot successfully until the host system is upgraded beyond TrueNAS SCALE 23.10.0.1. So the bug reported in the TrueNAS JIRA was introduced sometime between zfs-2.2.0-rc4 and zfs-2.2.2-1. I don't think it is related to this bug, so I'll keep the discussion about that going in JIRA.