openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZVOL ignores special_small_blocks #16101

Open · tcpluess opened this issue 2 months ago

tcpluess commented 2 months ago
System information:
Type                  Version/Name
Distribution Name     Debian
Distribution Version  12
Kernel Version        Linux pve1 6.5.13-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.13-1 (2024-02-05T13:50Z) x86_64 GNU/Linux
Architecture          x86_64
OpenZFS Version       zfs-2.2.3

Describe the problem you're observing

My ZFS pool contains an SSD special device, and I use recordsize=1M. I created a new dataset with special_small_blocks=1M, so every file I put into this dataset is placed entirely on the SSD. Indeed, looking at zpool list -v, I can confirm that files written to this dataset are located only on the SSD. This is good if one wants particularly fast access to certain files.
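
For reference, that dataset setup amounts to something like the following (the dataset name "tank/fast" is illustrative, not my actual one):

# zfs create -o recordsize=1M -o special_small_blocks=1M tank/fast
# zfs get recordsize,special_small_blocks tank/fast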

Now I created a ZFS volume, also inside this dataset, with a 16K block size. I expected that data written to the volume, being stored in 16K blocks and therefore much smaller than the 1M threshold, would also end up only on the special device. However, this is not the case; instead, the data in the ZVOL is handled like any other data, i.e. it is distributed amongst the hard disks, and I only see some metadata being put onto the SSD.
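
The volume was created along these lines (volume name and size are illustrative; the zvol inherits special_small_blocks=1M from its parent dataset):

# zfs create -V 10G -o volblocksize=16K tank/fast/vol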

I thought that setting special_small_blocks=<recordsize> is an elegant way of dividing a pool into "fast" datasets that live entirely on the special device and "normal" datasets with only their metadata held on the special device. While this indeed works for ordinary files, it does not work for ZVOLs. Why is this?

Describe how to reproduce the problem

I created a pool with the following config (ashift=12 for all devices, compression=zstd, special_small_blocks=0K):
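
A pool with this layout can be created along these lines (a sketch matching the settings above, not the verbatim command; special_small_blocks=0 is the default, so it needs no explicit option):

# zpool create -o ashift=12 -O compression=zstd tank mirror sdg sdh special mirror sdc sdd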

# zpool status
  pool: tank
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
    special   
      mirror-1  ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0

errors: No known data errors

# zfs get recordsize tank
NAME  PROPERTY    VALUE    SOURCE
tank  recordsize  128K     default

Next, I created two datasets. The "ssdonly" dataset has special_small_blocks set equal to the recordsize, so I expect everything put into this dataset to end up only on the special device. The "hdd" dataset, on the other hand, has special_small_blocks=0, so only its metadata ends up on the special device. To see how data is distributed amongst the individual vdevs, I also add the output of zpool list -v below:

# zfs create -o special_small_blocks=128K tank/ssdonly
# zfs create tank/hdd
# zfs get special_small_blocks tank/ssdonly
NAME          PROPERTY              VALUE                 SOURCE
tank/ssdonly  special_small_blocks  128K                  local
# zfs get special_small_blocks tank/hdd
NAME      PROPERTY              VALUE                 SOURCE
tank/hdd  special_small_blocks  0                     default
# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank        2.18T   952K  2.18T        -         -     0%     0%  1.00x    ONLINE  -
  mirror-0  1.81T      0  1.81T        -         -     0%  0.00%      -    ONLINE
    sdg     1.82T      -      -        -         -      -      -      -    ONLINE
    sdh     1.82T      -      -        -         -      -      -      -    ONLINE
special         -      -      -        -         -      -      -      -         -
  mirror-1   372G   952K   372G        -         -     0%  0.00%      -    ONLINE
    sdc      373G      -      -        -         -      -      -      -    ONLINE
    sdd      372G      -      -        -         -      -      -      -    ONLINE

Good. Now I create a bunch of random test files of increasing size in the "ssdonly" dataset. We can clearly see that ALL data is put onto the special device and none onto the regular devices, as expected, because the "ssdonly" dataset is configured to do so:

# dd if=/dev/urandom of=/tank/ssdonly/1k.txt bs=1k count=1
# dd if=/dev/urandom of=/tank/ssdonly/2k.txt bs=1k count=2
# dd if=/dev/urandom of=/tank/ssdonly/4k.txt bs=1k count=4
# dd if=/dev/urandom of=/tank/ssdonly/8k.txt bs=1k count=8
# dd if=/dev/urandom of=/tank/ssdonly/16k.txt bs=1k count=16
# dd if=/dev/urandom of=/tank/ssdonly/32k.txt bs=1k count=32
# dd if=/dev/urandom of=/tank/ssdonly/64k.txt bs=1k count=64
# dd if=/dev/urandom of=/tank/ssdonly/128k.txt bs=1k count=128
# dd if=/dev/urandom of=/tank/ssdonly/256k.txt bs=1k count=256
# dd if=/dev/urandom of=/tank/ssdonly/1g.txt bs=1k count=1M
# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank        2.18T   644M  2.18T        -         -     0%     0%  1.00x    ONLINE  -
  mirror-0  1.81T      0  1.81T        -         -     0%  0.00%      -    ONLINE
    sdg     1.82T      -      -        -         -      -      -      -    ONLINE
    sdh     1.82T      -      -        -         -      -      -      -    ONLINE
special         -      -      -        -         -      -      -      -         -
  mirror-1   372G   644M   371G        -         -     0%  0.16%      -    ONLINE
    sdc      373G      -      -        -         -      -      -      -    ONLINE
    sdd      372G      -      -        -         -      -      -      -    ONLINE

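Placement can also be double-checked per file with zdb: each block pointer carries DVAs whose first field is the vdev index, so blocks on the special mirror-1 show up with vdev 1. A sketch (the object number must be looked up first):

# zdb -dd tank/ssdonly             # list objects; note the object number of a test file
# zdb -ddddd tank/ssdonly <object> # dump its block pointers; DVA[0]=<1:...> means vdev 1, i.e. the special mirror
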
The same tests were repeated with the "hdd" dataset, and it can be verified that the "hdd" data is stored on the regular vdevs.
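
For completeness, the "hdd" counterpart looks like this (a sketch mirroring the commands above rather than a verbatim transcript):

# dd if=/dev/urandom of=/tank/hdd/1g.txt bs=1k count=1M
# zpool list -v   # ALLOC now grows on mirror-0 (the regular disks); the special mirror-1 stays almost unchanged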

Now I create a zvol inside the "ssdonly" dataset and repeat the above tests, expecting the zvol's entire data to be held only on the special device. It can be verified that this is not the case.
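
The zvol test amounts to something like the following (volume name and size are illustrative):

# zfs create -V 10G -o volblocksize=16K tank/ssdonly/vol
# zfs get special_small_blocks tank/ssdonly/vol   # inherited: 128K, well above the 16K volblocksize
# dd if=/dev/urandom of=/dev/zvol/tank/ssdonly/vol bs=16k count=4096 conv=fsync
# zpool list -v   # the new allocation shows up on mirror-0, not on the special mirror-1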

I wonder if this is a bug or a feature.

rincebrain commented 2 months ago

cf. #14876

tcpluess commented 2 months ago

Thanks, this is exactly what I need. I did look for the ZVOL issue, but unfortunately I didn't find this particular one, #14876.