openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zpool create ignores specified ashift when creating mirror on mixed sector size drives #13557

Open notr1ch opened 2 years ago

notr1ch commented 2 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version Bullseye
Kernel Version 5.16.0-0.bpo.4-amd64
Architecture amd64
OpenZFS Version zfs-2.1.4-1~bpo11+1

Describe the problem you're observing

I ran into an issue with root on ZFS where grub-probe could not identify my 4-way mirror boot pool. After some examination, it turned out the pool had been created with ashift=15 despite my specifying ashift=12 on the command line, and GRUB only supports up to ashift=12. Two of the drives report 4K sectors and two report 32K (whether that is correct is a question for another time...), so ZFS likely detected this and overrode the manually specified ashift.

Creating the pool with a single drive and then attaching the additional drives works around the problem (thanks PMT on the #openzfs IRC channel).
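
In shell terms, the workaround looks roughly like this (the pool and device names below are placeholders rather than the actual ones from my system; ashift=12 matches what GRUB supports):

zpool create -o ashift=12 bpool /dev/disk/by-id/drive0-part3
zpool attach -o ashift=12 bpool /dev/disk/by-id/drive0-part3 /dev/disk/by-id/drive1-part3
zpool attach -o ashift=12 bpool /dev/disk/by-id/drive0-part3 /dev/disk/by-id/drive2-part3
zpool attach -o ashift=12 bpool /dev/disk/by-id/drive0-part3 /dev/disk/by-id/drive3-part3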

Describe how to reproduce the problem

truncate -s 128M disk0
truncate -s 128M disk1
losetup -f -P -b 4096 disk0
losetup -f -P -b 512 disk1
zpool create -o ashift=9 test mirror /dev/loop0 /dev/loop1
zdb
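
To confirm that the loop devices actually picked up the requested sector sizes before creating the pool, the logical and physical sizes can also be read back with blockdev:

blockdev --getss /dev/loop0
blockdev --getpbsz /dev/loop0
blockdev --getss /dev/loop1
blockdev --getpbsz /dev/loop1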

Example Output

root:~# zpool create -o ashift=9 test mirror /dev/loop0 /dev/loop1
root:~# zdb
test:
    version: 5000
    name: 'test'
    state: 0
    txg: 4
    pool_guid: 2393696871689285793
    errata: 0
    hostid: 2536109787
    hostname: 'test'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 2393696871689285793
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 9127851438672249768
            metaslab_array: 256
            metaslab_shift: 24
            ashift: 12
            asize: 129499136
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 129
            children[0]:
                type: 'disk'
                id: 0
                guid: 1021110918612612113
                path: '/dev/loop0'
                whole_disk: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 130
            children[1]:
                type: 'disk'
                id: 1
                guid: 10349381568537212099
                path: '/dev/loop1'
                whole_disk: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 131
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

root:~# zpool status test
  pool: test
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop0   ONLINE       0     0     0
            loop1   ONLINE       0     0     0

errors: No known data errors

root:~# zpool destroy test
root:~# zpool create -o ashift=9 test /dev/loop0
root:~# zpool attach -o ashift=9 test /dev/loop0 /dev/loop1
root:~# zdb
test:
    version: 5000
    name: 'test'
    state: 0
    txg: 14
    pool_guid: 9228862698534542279
    errata: 0
    hostid: 2536109787
    hostname: 'test'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 9228862698534542279
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 12311208393131411085
            whole_disk: 0
            metaslab_array: 256
            metaslab_shift: 24
            ashift: 9
            asize: 129499136
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 130
            children[0]:
                type: 'disk'
                id: 0
                guid: 13609141078303940359
                path: '/dev/loop0'
                whole_disk: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 129
            children[1]:
                type: 'disk'
                id: 1
                guid: 4815405119137788479
                path: '/dev/loop1'
                whole_disk: 0
                DTL: 386
                create_txg: 4
                com.delphix:vdev_zap_leaf: 384
                resilver_txg: 11
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

root:~# zpool status test
  pool: test
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
  scan: resilvered 178K in 00:00:00 with 0 errors on Tue Jun 14 20:01:23 2022
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            loop0   ONLINE       0     0     0  block size: 512B configured, 4096B native
            loop1   ONLINE       0     0     0

errors: No known data errors
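
The mismatch is also quick to spot without reading the whole zdb dump, for example by grepping the cached configuration or a device label for ashift:

zdb -C test | grep ashift
zdb -l /dev/loop0 | grep ashift
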
stale[bot] commented 1 year ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

notr1ch commented 1 year ago

Still occurring with 2.1.12, following the reproduction steps in the OP.

# zfs version
zfs-2.1.12-2~bpo12+1
zfs-kmod-2.1.12-2~bpo12+1

# truncate -s 128M disk0
# truncate -s 128M disk1
# losetup -f -P -b 4096 disk0
# losetup -f -P -b 512 disk1
# zpool create -o ashift=9 test mirror /dev/loop0 /dev/loop1
# zdb

test:
    version: 5000
    name: 'test'
    state: 0
    txg: 4
    pool_guid: 981678666251598749
    errata: 0
    hostid: 2536109787
    hostname: 'backupsrv2'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 981678666251598749
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 10351813791893030260
            metaslab_array: 260
            metaslab_shift: 24
            ashift: 12
            asize: 129499136
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 257
            children[0]:
                type: 'disk'
                id: 0
                guid: 8975606871540584722
                path: '/dev/loop0'
                whole_disk: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 258
            children[1]:
                type: 'disk'
                id: 1
                guid: 8897050594311932012
                path: '/dev/loop1'
                whole_disk: 0
                create_txg: 4
                com.delphix:vdev_zap_leaf: 259
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
amotin commented 1 year ago

ZFS cannot set a vdev's ashift smaller than the logical_ashift reported by the vdev; it is simply impossible. It may be lower than physical_ashift, but not logical_ashift. And according to the losetup man page, -b sets the logical block size.

notr1ch commented 1 year ago

How does the second test work if that's the case? Or is that a bug?

amotin commented 1 year ago

Weird. According to "block size: 512B configured, 4096B native", ZFS sees at least a physical_ashift of 12. I wonder what logical_ashift is reported for the loop0 device. A device with a logical sector size of 4K should immediately fail any request that is not aligned to 4K, which would happen with ashift=9. But if the logical sector size is 512 in both cases, then it may work, and I guess it may indeed be a bug in the mirror case; it needs a deeper look.
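
One quick way to see that constraint in action, assuming the loop devices from the reproduction steps are still set up: a 512-byte O_DIRECT read should be rejected by the 4K-logical device but accepted by the 512-byte one.

dd if=/dev/loop0 of=/dev/null bs=512 count=1 iflag=direct
dd if=/dev/loop1 of=/dev/null bs=512 count=1 iflag=direct

The first command should fail with "Invalid argument", the second should succeed.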

notr1ch commented 1 year ago

# cat /sys/class/block/loop0/queue/physical_block_size
4096
# cat /sys/class/block/loop0/queue/logical_block_size
4096
# cat /sys/class/block/loop1/queue/physical_block_size
512
# cat /sys/class/block/loop1/queue/logical_block_size
512

It does seem that the logical block size is indeed 4K. Does ZFS handle this internally somewhere by "emulating" the smaller logical block size in a similar way to how 4K format HDDs function?

amotin commented 1 year ago

ZFS itself does not have such code, AFAIK. On Linux, the page cache may implement such functionality, but I would expect ZFS to bypass it; I don't know for sure.

amotin commented 1 year ago

Speaking of emulation by HDDs: 512e HDDs have a logical_block_size of 512 and a physical_block_size of 4096.
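
For a quick side-by-side check, lsblk can print both sizes; a 512e drive shows LOG-SEC 512 with PHY-SEC 4096, while the 4K loop device from the reproduction steps shows 4096 for both:

lsblk -o NAME,LOG-SEC,PHY-SEC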