openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

400% space waste because -o ashift=9 is not working in ZOL 2.x.x #13696

Open homerl opened 2 years ago

homerl commented 2 years ago

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.9
Kernel Version 3.10.0-1160.49.1.el7_lustre.x86_64
Architecture AMD64
OpenZFS Version 2.0.7

Describe the problem you're observing

Because the Lustre filesystem stores a number of extended attributes, a 4 KiB block wastes too much space (about 400%).
Here is the result after replicating from ashift=12 (test_0) to ashift=9 (test_1):
test_0 has 3.98T allocated, while after replication test_1 uses only 971G.

NAME    SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
test_0  6.94T  3.98T  2.96T         -    80%    57%  1.00x  ONLINE  -
test_1  6.94T   971G  5.99T         -     5%    13%  1.00x  ONLINE  -
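
For reference, a minimal sketch of how such a replication is typically done with zfs send/receive; the dataset and snapshot names below are hypothetical, since the exact commands are not shown here:

# hypothetical dataset and snapshot names
zfs snapshot -r test_0/ost@migrate
zfs send -R test_0/ost@migrate | zfs receive -F test_1/ost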

When I test ZOL 2.0.7, ashift=9 does not work:

dmesg  | grep "ZFS pool version"
[ 4634.117936] ZFS: Loaded module v2.0.7-1, ZFS pool version 5000, ZFS filesystem version 5

zpool create -o ashift=9 tank raidz3 /dev/sd{a..p}

zdb -l /dev/sda1 | grep shi
        metaslab_shift: 34
        ashift: 12

This is the same as openzfs/zfs/issues/13557.
I hope ashift=9 can work in a future version.
Thank you.

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

behlendorf commented 2 years ago

You can improve the space efficiency of the xattrs stored by Lustre by setting the property dnodesize=1k and xattr=sa. This should provide enough space for the xattrs to be co-located with the dnodes on disk which is also good for performance. Unfortunately, due to a bug this isn't currently the default Lustre behavior as it should be, https://jira.whamcloud.com/browse/LU-16017.
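
A minimal sketch of setting those properties, assuming a hypothetical dataset tank/ost0 (both are per-dataset properties, and dnodesize only affects newly created files):

# hypothetical dataset name
zfs set dnodesize=1k tank/ost0
zfs set xattr=sa tank/ost0

# verify the settings
zfs get dnodesize,xattr tank/ost0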

Regarding setting the ashift you'll want to verify that all your disks support a logical sector size of 512. If even one is a native 4k drive then ZFS won't be able to use a smaller ashift. You can check this by reading /sys/block/*/queue/logical_block_size.
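
For example, either of the following shows the reported sector sizes (device names here assume the sd{a..p} members of this pool):

# per-device sysfs values
cat /sys/block/sd{a..p}/queue/logical_block_size

# or the same information as one table
lsblk -o NAME,LOG-SEC,PHY-SEC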

homerl commented 2 years ago

Hi Behlendorf, thank you. All HDDs are 512e:

# cat /sys/block/sd{a..q}/queue/physical_block_size
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096
4096

# cat /sys/block/sd{a..q}/queue/logical_block_size
512
512
512
512
512
512
512
512
512
512
512
512
512
512
512
512
512

# arc_summary | grep ashif
    vdev_file_logical_ashift                          9
    vdev_file_physical_ashift                         9
    zfs_vdev_max_auto_ashift                          16
    zfs_vdev_min_auto_ashift                          9

# for i in {a..q}
> do
> zdb -l /dev/sd${i}1 | grep ashift
> done 
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12

# zpool status 
  pool: tank
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz3-0  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
        sdp     ONLINE       0     0     0

errors: No known data errors

KungFuJesus commented 2 years ago

Isn't 512e basically 4k sectors with firmware absorbing the RMW cycles (poorly)? I think ZFS is doing the right thing here; you really want 512n drives if you have a lot of small writes like this. Anything else is just a really bad fiction.

Now, zpool not listening to ashift=9 even when the device clearly supports it, well maybe that's a bug.

behlendorf commented 2 years ago

Indeed they are, according to the /sys/block output above. The physical_block_size is 4k and the logical_block_size is 512, so I think ZFS is doing the right thing by defaulting to 4k (ashift=12) in order to avoid a lot of nasty performance-killing RMW on the drive. Still, it does look like a bug that you can't explicitly request ashift=9 when the drive does support it.
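
For reference, a quick way to compare the auto-ashift bounds with what a pool actually got, assuming the pool name tank from earlier:

# module parameters bounding the automatically selected ashift
cat /sys/module/zfs/parameters/zfs_vdev_min_auto_ashift
cat /sys/module/zfs/parameters/zfs_vdev_max_auto_ashift

# ashift recorded in the pool's cached configuration
zdb -C tank | grep ashift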

homerl commented 2 years ago

Hi Jesus, Behlendorf, thanks in advance.
By "the bug" I mean that adding "-o ashift=9" has no effect.
As an analogy: when I register a GitHub account, I can either accept or decline the User Agreement; I would like the same kind of choice here.
I understand there is a lot of nasty performance-killing RMW on the drive, and for performance-sensitive cases I will switch to the 4 KiB block.
I have three reasons why I need ashift=9 in my production environment.

Here is a test result from 0.7.13, because I can't switch to ashift=9 in 2.x.x.

Here is the 16x 16TB raidz3 test case:
# zpool create -f tank -o ashift=9 raidz3 /dev/sd{a..p}
# df -h /tank
Filesystem      Size  Used Avail Use% Mounted on
tank            182T     0  182T   0% /tank

# cd /tank
# openssl rand -out 4K.file 4096
# ls -lhs 
total 5.5K
5.5K -rw-r--r-- 1 root root 4.0K Jul 27 09:26 4K.file
  |
  ------------dsize

# zdb -v -O tank 4K.file

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K     4K  5.00K     512     4K  100.00  ZFS plain file
                                               168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 0

------------------------------to test ashift=12----------------------------------

# zpool destroy tank
# zpool create -f tank -o ashift=12 raidz3 /dev/sd{a..p}
# df -h /tank
Filesystem      Size  Used Avail Use% Mounted on
tank            164T  256K  164T   1% /tank

# cd /tank
# openssl rand -out 4K.file 4096
# ls -lhs 
total 12K
12K -rw-r--r-- 1 root root 4.0K Jul 27 09:23 4K.file
  |
  ----------dsize

# zdb -v -O tank test_4K

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         4    1   128K    512  11.5K     512    512  100.00  ZFS plain file
                                               168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 0

KungFuJesus commented 2 years ago

Arguably your point 3 might be better addressed with compression on a dataset, letting zfs burn some CPU cycles decompressing the metadata rather than letting the drive firmware grab the emulated block. You still effectively get an inflated read, but the extra bytes are then in ARC rather than sitting in some more transient buffer on the drive's controller. The same is even more true in the other direction for writes, though you still might encounter a RMW scenario in the event of synchronous transactions.

But your point still stands: if the device is capable of a 512b write and you tell it to do so against your best interest, it probably should let you, with some loud warnings at least.

homerl commented 2 years ago

Hi Jesus, yes, I hope the option could work; with some loud warnings is OK.

homerl commented 2 years ago

The easy way to test ashift=9 in 2.0.7 is to patch module/os/linux/zfs/vdev_disk.c:334, changing MAX to MIN:

- ./module/os/linux/zfs/vdev_disk.c:334:    *physical_ashift = highbit64(MAX(physical_block_size,
+ ./module/os/linux/zfs/vdev_disk.c:334:    *physical_ashift = highbit64(MIN(physical_block_size,

In my ashift=9 test, performance decreased severely with one HDD vendor's drives; another vendor's drives were fine at 512B.
If you want to upgrade to 20+TB HDDs with ashift=9, ZFS 0.7.x is not appropriate; it must be 0.8 or higher.

And if you add a special allocation class to offload metadata, both vendors' drives work well under our workload (see the sketch below).
If you want the highest performance, ashift=12 is the only choice.
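
A minimal sketch of adding such a special class, assuming hypothetical NVMe mirror devices; metadata is placed on the special vdev automatically, and special_small_blocks can additionally redirect small data blocks:

# hypothetical NVMe devices for the special mirror
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1

# optionally also place small data blocks on the special vdev
zfs set special_small_blocks=32K tank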

homerl commented 1 year ago

Here is a directory test under 2.1.11. It appears that ashift=12 also wastes a lot of space, here on directories.
If I move data from the ashift=9 zpool to the ashift=12 zpool, the ashift=12 zpool may not be able to hold all of the data even though the two pools have the same capacity.

[   31.457200] ZFS: Loaded module v2.1.11-1, ZFS pool version 5000, ZFS filesystem version 5

There are 2 x raidz3 (16+3); test_ost_1-4K was created with ashift=12 and test_ost_0 with ashift=9.

zdb -v -O test_ost_1-4K xxx/xxx/xxx/dir_0 | head 

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
      1290    2   128K    16K  12.9M      1K  4.02M  100.00  ZFS directory
                                               176   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 256
    uid     0
    gid     0
    atime   Wed Jun 21 14:38:17 2023
    mtime   Wed Jun 21 12:52:42 2023

zdb -v -O test_ost_0 xxx/xxx/xxx/dir_0 | head 
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
      2182    2   128K    16K  3.37M      1K  4.02M  100.00  ZFS directory
                                               176   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 256
    uid     0
    gid     0
    atime   Wed Jun 21 14:39:12 2023
    mtime   Wed Jun 21 12:52:45 2023

3453 drwxr-xr-x 2 root root 20003 Jun 21 12:52 0 <---ashift=9
13166 drwxr-xr-x 2 root root 20003 Jun 21 12:52 0 <---ashift=12
zpool list
NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_ost_0      207T   556G   207T        -         -     0%     0%  1.00x    ONLINE  -
test_ost_1-4K   207T   698G   207T        -         -     0%     0%  1.00x    ONLINE  -

df -i /test_ost_1-4K /test_ost_0
Filesystem           Inodes    IUsed        IFree IUse% Mounted on
test_ost_1-4K  354295106749 14401498 354280705251    1% /test_ost_1-4K
test_ost_0     373616575248 14401498 373602173750    1% /test_ost_0

df -B 1 /test_ost_1-4K /test_ost_0
Filesystem           1B-blocks         Used       Available Use% Mounted on
test_ost_1-4K  181990511607808 598790635520 181391720972288   1% /test_ost_1-4K
test_ost_0     191787165548544 502852747264 191284312801280   1% /test_ost_0

The zpool create commands:

zpool create test_ost_0 -O canmount=on -O xattr=sa -O acltype=posixacl -O recordsize=256k -o ashift=9 -o multihost=on raidz3 /dev/disk/by-id/scsi-xxxxxxx
zpool create test_ost_1-4K -O canmount=on -O xattr=sa -O acltype=posixacl -O recordsize=256k -o ashift=12 -o multihost=on raidz3 /dev/disk/by-id/scsi-xxxxxxx