Closed gleb-shchavlev closed 1 year ago
That is surprising, the space usage should be close to what's reported for a similar raidz config. Using the 2.1.5 release I wasn't able to reproduce this. Can you double check that all of the drives are the expected size?
$ truncate -s 16T /var/tmp/vdev{1..36}
$ zpool create tank draid2:15d:36c:2s /var/tmp/vdev{1..36}
$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
draid2:15d:36c:2s-0 ONLINE 0 0 0
/var/tmp/vdev1 ONLINE 0 0 0
...
/var/tmp/vdev36 ONLINE 0 0 0
spares
draid2-0-0 AVAIL
draid2-0-1 AVAIL
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 544T 816K 544T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 682K 455T 171K /tank
(16 disks + 2 per parity) * 2 group + 2 spare = 38 disks (!!!) draid_ngroups: 17
With dRAID you're allowed to independently select the parity level, number of data drives, spares, and total children. You don't need to worry about the total number of groups; ZFS will calculate the optimal number to best utilize the capacity. If you're interested in the details there's a nice comment in the code which describes the layout.
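As an illustration of how the group count falls out (my own arithmetic based on the layout description, not the actual ZFS code), the number of groups is the smallest count whose combined width lands on a whole number of passes over the non-spare children. This reproduces the surprising `draid_ngroups: 17` for the `draid2:16d:36c:2s` config:

```python
from math import gcd

def draid_ngroups(d: int, p: int, c: int, s: int) -> int:
    """Smallest group count whose total width (d + p per group)
    covers the non-spare children (c - s) a whole number of times."""
    groupwidth = d + p
    ndisks = c - s
    return ndisks // gcd(groupwidth, ndisks)

# draid2:16d:36c:2s -> 18-wide groups over 34 disks -> 17 groups
print(draid_ngroups(16, 2, 36, 2))  # -> 17
```

For `draid2:15d:36c:2s` the 17-wide group divides the 34 non-spare disks evenly, so only 2 groups are needed, matching the "2 group" intuition above.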
I checked
$ zpool destroy tank
$ zpool list
no pools available
$ zpool create tank draid2:15d:36c:2s /dev/disk/by-id/ata-ST18000NM000J-2TV103_????????
$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
draid2:15d:36c:2s-0 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR5018SP ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR501PPB ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR501VZV ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR501WJ6 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR5021YA ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR503567 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR502F07 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR50FAPC ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR50G6ML ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51TF4F ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51W9J5 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51WQPZ ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51WWZ8 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51XKFL ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51XLC9 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51XY7D ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51YFW6 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR524C4Y ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR5265KZ ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR526J39 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR526L4R ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR526PPS ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527G8X ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527GKT ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527JFB ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527JMM ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527KRA ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR528AQY ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR529BBQ ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52A1CA ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52AB7P ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52B6C6 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52B84L ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52B8WM ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52BFTC ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR53G81R ONLINE 0 0 0
spares
draid2-0-0 AVAIL
draid2-0-1 AVAIL
errors: No known data errors
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 557T 6.57M 557T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 4.12M 349T 1023K /tank
$ df -h | grep tank
tank 349T 1.0M 349T 1% /tank
$ smartctl -a /dev/disk/by-id/ata-ST18000NM000J-2TV103_WR5018SP
smartctl 7.1 2020-08-23 r5080 [x86_64-linux-5.15.47-1.el8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST18000NM000J-2TV103
Serial Number: secret
LU WWN Device Id: secret
Firmware Version: SN02
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Aug 3 23:06:13 2022 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
And on 16 TB files:
$ zpool destroy tank
$ truncate -s 16T /var/tmp/vdev{1..36}
$ zpool create tank draid2:15d:36c:2s /var/tmp/vdev{1..36}
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 544T 842K 544T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 703K 455T 171K /tank
And on 18 TB files:
$ zpool destroy tank
$ truncate -s 18T /var/tmp/vdev{1..36}
$ zpool create tank draid2:15d:36c:2s /var/tmp/vdev{1..36}
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 612T 842K 612T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 703K 511T 171K /tank
And information about the disks:
$ lsblk -b
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 256060514304 0 disk
├─sda1 8:1 0 1073741824 0 part
├─sda2 8:2 0 4294967296 0 part
└─sda3 8:3 0 250690756608 0 part
sdb 8:16 0 256060514304 0 disk
├─sdb1 8:17 0 1073741824 0 part /boot
├─sdb2 8:18 0 4294967296 0 part [SWAP]
├─sdb3 8:19 0 75161927680 0 part /
├─sdb4 8:20 0 1024 0 part
└─sdb5 8:21 0 175527428096 0 part /home
sdc 8:32 0 18000207937536 0 disk
├─sdc1 8:33 0 18000197451776 0 part
└─sdc9 8:41 0 8388608 0 part
sdd 8:48 0 18000207937536 0 disk
├─sdd1 8:49 0 18000197451776 0 part
└─sdd9 8:57 0 8388608 0 part
sde 8:64 0 18000207937536 0 disk
├─sde1 8:65 0 18000197451776 0 part
└─sde9 8:73 0 8388608 0 part
sdf 8:80 0 18000207937536 0 disk
├─sdf1 8:81 0 18000197451776 0 part
└─sdf9 8:89 0 8388608 0 part
sdg 8:96 0 18000207937536 0 disk
├─sdg1 8:97 0 18000197451776 0 part
└─sdg9 8:105 0 8388608 0 part
sdh 8:112 0 18000207937536 0 disk
├─sdh1 8:113 0 18000197451776 0 part
└─sdh9 8:121 0 8388608 0 part
sdi 8:128 0 18000207937536 0 disk
├─sdi1 8:129 0 18000197451776 0 part
└─sdi9 8:137 0 8388608 0 part
sdj 8:144 0 18000207937536 0 disk
├─sdj1 8:145 0 18000197451776 0 part
└─sdj9 8:153 0 8388608 0 part
sdk 8:160 0 18000207937536 0 disk
├─sdk1 8:161 0 18000197451776 0 part
└─sdk9 8:169 0 8388608 0 part
sdl 8:176 0 18000207937536 0 disk
├─sdl1 8:177 0 18000197451776 0 part
└─sdl9 8:185 0 8388608 0 part
sdm 8:192 0 18000207937536 0 disk
├─sdm1 8:193 0 18000197451776 0 part
└─sdm9 8:201 0 8388608 0 part
sdn 8:208 0 18000207937536 0 disk
├─sdn1 8:209 0 18000197451776 0 part
└─sdn9 8:217 0 8388608 0 part
sdo 8:224 0 18000207937536 0 disk
├─sdo1 8:225 0 18000197451776 0 part
└─sdo9 8:233 0 8388608 0 part
sdp 8:240 0 18000207937536 0 disk
├─sdp1 8:241 0 18000197451776 0 part
└─sdp9 8:249 0 8388608 0 part
sdq 65:0 0 18000207937536 0 disk
├─sdq1 65:1 0 18000197451776 0 part
└─sdq9 65:9 0 8388608 0 part
sdr 65:16 0 18000207937536 0 disk
├─sdr1 65:17 0 18000197451776 0 part
└─sdr9 65:25 0 8388608 0 part
sds 65:32 0 18000207937536 0 disk
├─sds1 65:33 0 18000197451776 0 part
└─sds9 65:41 0 8388608 0 part
sdt 65:48 0 18000207937536 0 disk
├─sdt1 65:49 0 18000197451776 0 part
└─sdt9 65:57 0 8388608 0 part
sdu 65:64 0 18000207937536 0 disk
├─sdu1 65:65 0 18000197451776 0 part
└─sdu9 65:73 0 8388608 0 part
sdv 65:80 0 18000207937536 0 disk
├─sdv1 65:81 0 18000197451776 0 part
└─sdv9 65:89 0 8388608 0 part
sdw 65:96 0 18000207937536 0 disk
├─sdw1 65:97 0 18000197451776 0 part
└─sdw9 65:105 0 8388608 0 part
sdx 65:112 0 18000207937536 0 disk
├─sdx1 65:113 0 18000197451776 0 part
└─sdx9 65:121 0 8388608 0 part
sdy 65:128 0 18000207937536 0 disk
├─sdy1 65:129 0 18000197451776 0 part
└─sdy9 65:137 0 8388608 0 part
sdz 65:144 0 18000207937536 0 disk
├─sdz1 65:145 0 18000197451776 0 part
└─sdz9 65:153 0 8388608 0 part
sdaa 65:160 0 18000207937536 0 disk
├─sdaa1 65:161 0 18000197451776 0 part
└─sdaa9 65:169 0 8388608 0 part
sdab 65:176 0 18000207937536 0 disk
├─sdab1 65:177 0 18000197451776 0 part
└─sdab9 65:185 0 8388608 0 part
sdac 65:192 0 18000207937536 0 disk
├─sdac1 65:193 0 18000197451776 0 part
└─sdac9 65:201 0 8388608 0 part
sdad 65:208 0 18000207937536 0 disk
├─sdad1 65:209 0 18000197451776 0 part
└─sdad9 65:217 0 8388608 0 part
sdae 65:224 0 18000207937536 0 disk
├─sdae1 65:225 0 18000197451776 0 part
└─sdae9 65:233 0 8388608 0 part
sdaf 65:240 0 18000207937536 0 disk
├─sdaf1 65:241 0 18000197451776 0 part
└─sdaf9 65:249 0 8388608 0 part
sdag 66:0 0 18000207937536 0 disk
├─sdag1 66:1 0 18000197451776 0 part
└─sdag9 66:9 0 8388608 0 part
sdah 66:16 0 18000207937536 0 disk
├─sdah1 66:17 0 18000197451776 0 part
└─sdah9 66:25 0 8388608 0 part
sdai 66:32 0 18000207937536 0 disk
├─sdai1 66:33 0 18000197451776 0 part
└─sdai9 66:41 0 8388608 0 part
sdaj 66:48 0 18000207937536 0 disk
├─sdaj1 66:49 0 18000197451776 0 part
└─sdaj9 66:57 0 8388608 0 part
sdak 66:64 0 18000207937536 0 disk
├─sdak1 66:65 0 18000197451776 0 part
└─sdak9 66:73 0 8388608 0 part
sdal 66:80 0 18000207937536 0 disk
├─sdal1 66:81 0 18000197451776 0 part
└─sdal9 66:89 0 8388608 0 part
@behlendorf I can hold this hardware for a day if you have some thoughts to test; otherwise I'll unfortunately have to use raidz for now to start production usage.
@gleb-shchavlev I looked in to this a bit and the decrease in reported usable capacity is caused by:
1) the relatively wide RAID stripe width (16d+2p), 2) the 4k sector size (ashift=12), and 3) the lack of variable stripe widths.
With dRAID, variable stripe widths are not supported, which differs from RAIDZ. This means every RAID stripe will be padded out to the full stripe width if needed. For a 16d+2p configuration with 4k sectors that makes the minimum allocation size 16*4k=64K. If the pool is primarily storing large files (>1M) this overhead is minimal, but if you'll be storing small files (<64k) it will be significant. This is the fundamental tradeoff which needed to be made in order to support sequential resilvering for dRAID, and why this feature can't be supported with RAIDZ.
Which vdev configuration is right for you will depend on your expected workload. If you'd like to use dRAID for the faster rebuild times, then using either a narrower stripe width (say 8d+2p) or a smaller 512 byte sector size (ashift=9) will let you reduce the minimum allocation size, and with it increase the reported available capacity.
zpool create tank draid2:8d:36c:2s
or
zpool create -o ashift=9 tank draid2:16d:36c:2s
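The effect of those two alternatives on the minimum allocation size can be sketched directly from the rule above (this is just the `d * sector size` arithmetic, not actual ZFS behavior):

```python
def min_alloc_bytes(d: int, ashift: int) -> int:
    """Minimum dRAID allocation: one sector on each of the d data
    drives (smaller writes are padded up to this size)."""
    return d * (1 << ashift)

for d, ashift in [(16, 12), (8, 12), (16, 9)]:
    kib = min_alloc_bytes(d, ashift) // 1024
    print(f"{d}d ashift={ashift}: {kib}K minimum allocation")
```

So going from 16d/ashift=12 (64K) to either 8d/ashift=12 (32K) or 16d/ashift=9 (8K) shrinks the smallest possible allocation and the padding overhead with it.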
Many thanks for the help!
We will be storing large files (>1M). This is backup S3 storage with minio on top of zfs.
I tried to create all possible pools and evaluate free space.
Command
zpool create -o ashift=9 [type_from_table] /dev/disk/by-id/ata-ST18000NM000J-2TV103_????????
Results
type | zfs list size |
---|---|
draid1:1d:36c:2s | 279T |
draid1:2d:36c:2s | 371T |
draid1:3d:36c:2s | 405T |
draid1:4d:36c:2s | 445T |
draid1:5d:36c:2s | 424T |
draid1:6d:36c:2s | 424T |
draid1:7d:36c:2s | 445T |
draid1:8d:36c:2s | 495T |
draid1:9d:36c:2s | 445T |
draid1:10d:36c:2s | 405T |
draid1:11d:36c:2s | 495T |
draid1:12d:36c:2s | 457T |
draid1:13d:36c:2s | 424T |
draid1:14d:36c:2s | 396T |
draid1:15d:36c:2s | 371T |
draid1:16d:36c:2s | 523T |
draid1:17d:36c:2s | 495T |
draid1:18d:36c:2s | 469T |
draid1:19d:36c:2s | 445T |
draid1:20d:36c:2s | 424T |
draid1:21d:36c:2s | 405T |
draid1:22d:36c:2s | 387T |
draid1:23d:36c:2s | 371T |
draid1:24d:36c:2s | 356T |
draid1:25d:36c:2s | 343T |
draid1:26d:36c:2s | 330T |
draid1:27d:36c:2s | 318T |
draid1:28d:36c:2s | 307T |
draid1:29d:36c:2s | 297T |
draid1:30d:36c:2s | 287T |
draid1:31d:36c:2s | 279T |
draid1:32d:36c:2s | 540T |
draid1:33d:36c:2s | 523T |
draid1 with 34d cannot be created:
requested number of dRAID data disks per group 34 is too high,
at most 33 disks are available for data
type | zfs list size |
---|---|
draid2:1d:36c:2s | 185T |
draid2:2d:36c:2s | 279T |
draid2:3d:36c:2s | 323T |
draid2:4d:36c:2s | 371T |
draid2:5d:36c:2s | 363T |
draid2:6d:36c:2s | 371T |
draid2:7d:36c:2s | 396T |
draid2:8d:36c:2s | 445T |
draid2:9d:36c:2s | 405T |
draid2:10d:36c:2s | 371T |
draid2:11d:36c:2s | 457T |
draid2:12d:36c:2s | 424T |
draid2:13d:36c:2s | 396T |
draid2:14d:36c:2s | 371T |
draid2:15d:36c:2s | 349T |
draid2:16d:36c:2s | 495T |
draid2:17d:36c:2s | 469T |
draid2:18d:36c:2s | 445T |
draid2:19d:36c:2s | 424T |
draid2:20d:36c:2s | 405T |
draid2:21d:36c:2s | 387T |
draid2:22d:36c:2s | 371T |
draid2:23d:36c:2s | 356T |
draid2:24d:36c:2s | 343T |
draid2:25d:36c:2s | 330T |
draid2:26d:36c:2s | 318T |
draid2:27d:36c:2s | 307T |
draid2:28d:36c:2s | 297T |
draid2:29d:36c:2s | 287T |
draid2:30d:36c:2s | 279T |
draid2:31d:36c:2s | 270T |
draid2:32d:36c:2s | 523T |
draid2 with 33d cannot be created:
requested number of dRAID data disks per group 33 is too high,
at most 32 disks are available for data
type | zfs list size |
---|---|
draid2:1d:36c:1s | 191T |
draid2:2d:36c:1s | 287T |
draid2:3d:36c:1s | 333T |
draid2:4d:36c:1s | 382T |
draid2:5d:36c:1s | 374T |
draid2:6d:36c:1s | 382T |
draid2:7d:36c:1s | 408T |
draid2:8d:36c:1s | 458T |
draid2:9d:36c:1s | 417T |
draid2:10d:36c:1s | 382T |
draid2:11d:36c:1s | 470T |
draid2:12d:36c:1s | 437T |
draid2:13d:36c:1s | 408T |
draid2:14d:36c:1s | 382T |
draid2:15d:36c:1s | 360T |
draid2:16d:36c:1s | 510T |
draid2:17d:36c:1s | 483T |
draid2:18d:36c:1s | 458T |
draid2:19d:36c:1s | 437T |
draid2:20d:36c:1s | 417T |
draid2:21d:36c:1s | 399T |
draid2:22d:36c:1s | 382T |
draid2:23d:36c:1s | 366T |
draid2:24d:36c:1s | 353T |
draid2:25d:36c:1s | 339T |
draid2:26d:36c:1s | 327T |
draid2:27d:36c:1s | 316T |
draid2:28d:36c:1s | 306T |
draid2:29d:36c:1s | 296T |
draid2:30d:36c:1s | 287T |
draid2:31d:36c:1s | 278T |
draid2:32d:36c:1s | 539T |
draid2:33d:36c:1s | 524T |
Summary
Maximum free space:
type | zfs list size |
---|---|
draid1:16d:36c:2s | 523T |
draid1:32d:36c:2s | 540T |
draid1:33d:36c:2s | 523T |
draid2:16d:36c:2s | 495T |
draid2:32d:36c:2s | 523T |
draid2:16d:36c:1s | 510T |
draid2:32d:36c:1s | 539T |
draid2:33d:36c:1s | 524T |
Is it correct to create a draid with so many (32d) disks?
Which option is more safe: draid2 with one spare or draid1 with two spares?
draid2:32d:36c:2s and draid2:33d:36c:1s give practically equal space, why?
How many disks can fail to keep the pool running? Does it depend on the "d" option?
I want to thank you again for your help to understand how to calculate free space for draid.
Is it correct to create a draid with so many (32d) disks?
Generally I'd recommend against going larger than about 16 data disks. As you can see from your free space table going wider doesn't directly equate to more capacity, but it will absolutely reduce performance and slow down distributed rebuilds. I find a draid2:8d config strikes a pretty reasonable balance between usable capacity, performance, and rebuild speed.
How many disks can fail to keep the pool running? Does it depend on the "d" option?
With dRAID you can lose up to the number of parity devices all at the same time. It does not depend on the "d" option.
Which option is more safe: draid2 with one spare or draid1 with two spares?
Definitely draid2 with a single spare. With this configuration you can lose any two devices, then after the pool has rebuilt to the distributed spare the pool will be resilient to another failure. Meaning you could lose up to 3 devices depending on exactly when they fail.
draid2:32d:36c:2s and draid2:33d:36c:1s gives practically equal space, why?
When comparing the reported available capacity one thing worth keeping in mind is that it's an estimate based on some reasonable assumptions (expected average recordsize, reserved capacity, dRAID layout, etc). Depending on exactly what you store in the pool your mileage will vary.
Specifically for 32d vs 33d it's because ZFS assumes an average recordsize of 128K, which would require writing one 4K sector to each of the 32 data drives. Which works out nicely on paper. Increasing to 33 data drives doesn't affect things too much since the RAID stripe would only be padded out by a single extra 4k sector. Conversely, dropping to 31 data drives means we'd need to write 8K to each disk since the data no longer fits in a single RAID row, which is why you see the drop in capacity (down to 278T). But again this is an estimate which assumes things like a 128K recordsize, 100% incompressible data, etc, and in reality things don't fall out quite so simply.
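That 31d vs 32d vs 33d reasoning can be checked with a short sketch (my own arithmetic following the explanation above, not ZFS code): a 128K block is 32 four-K sectors, and a stripe must be padded to a whole number of full-width rows.

```python
from math import ceil

SECTOR = 4096                 # 4K sectors (ashift=12)
BLOCK = 128 * 1024            # assumed average recordsize

def stripe_efficiency(d: int) -> float:
    """Fraction of written data sectors that is real data (the rest
    is padding) when one 128K block lands on d data drives."""
    sectors = BLOCK // SECTOR             # 32 data sectors
    rows = ceil(sectors / d)              # full RAID rows needed
    return sectors / (rows * d)

for d in (31, 32, 33):
    print(d, round(stripe_efficiency(d), 3))
```

With 32 data drives the block fits exactly (100%), with 33 it wastes only one padding sector (~97%), but with 31 a second row is needed and efficiency collapses to about 52%, which is why the table drops so sharply below 32d.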
I'm sorry to open this can of worms again, but I am in much the same pickle as described here. I have 24 x 18TB SAS drives which I am trying to configure, and I would like to use draid for them. I also ran into the same issue where you are allowed to use some "funny" options in the draid definitions which in my mind don't make sense. For example: draid2:9d:24c:2s, which I would guess creates two redundancy groups of 9+2 each and then adds a spare to each? This gives me 262T via zfs list. The disks are 16.4TB in lsblk, so that is now cut down to 14.5TB per disk (262/18); that is 33TB gone! I have tried ashift 9 and 12, which doesn't make any difference. I have also tested with something similar, two draid2 vdevs with one spare each, draid2:9d:12c:1s, which also gives me 262TB. A 2 vdev raidz2 with two hot spares gives me 274TB, so somewhat better, yet still 21.2TB gone.
Only when I go all the way down to something like raidz1 with 5 disks do I get 65.3TB, which is 16.3TB per disk and only a loss of 300GB.
I would like to know if this major size difference is due to configuration, or if it is related to zfs 2.1.4 (you mentioned at first that you were unable to replicate this on 2.1.5).
If you have any recommendations to set this up better, they are welcome. Our goal was to use draid with distributed spares because of the faster resilvering on these large disks... but if it comes at the price of an 11% capacity loss it may not seem so great after all. Bear in mind that this loss is after we have set aside 6 disks' worth of capacity already...
It's a good question, and I can understand the concern. What probably needs to be better explained is that zfs list has never really reported the maximum available capacity. Instead it reports an estimated capacity based on some assumptions about how the pool will be used. Specifically, that 1) all data will be stored uncompressed and 2) the average block size will be 128K (based on the default 128K recordsize property). For raidz and mirror vdevs these assumptions result in an estimate close to what you'd expect. As you mentioned above it was only off by 300GB for your raidz1 configuration.
For a dRAID configuration this estimate may be lower than you'd expect because, unlike raidz, dRAID must always write a full stripe using every data drive. This constraint is what makes a fully sequential rebuild to a distributed spare possible, but it does also mean some capacity is lost to padding.
Let's look at your draid2:9d:24c:2s config and where that 262T estimate comes from for a pool with 16.4TB drives and 4k sectors. We need to make some assumptions for the estimate, and zfs list, as mentioned above, presumes an average block size of 128K. That may or may not be the case but it's a reasonable middle ground for an estimate.
Now to store these 128K uncompressed blocks they will be effectively broken into 32 4K-sector pieces and then spread over all the disks, as in the layout below, where A-K are drives, 1-32 are the data sectors, P1/P2 are parity sectors, and the four xx sectors are added padding. That means this block was 32 / 36 = ~89% space efficient. Extrapolating that out to the whole pool, ignoring spare and parity drives, works out to 16.4TB * (24 - 6 drives) * 0.89 = 262.4TB, which is what's reported by zfs list.
A B C D E F G H I J K
---------------------------------
1 5 9 13 17 21 24 27 30 P1 P2
2 6 10 14 18 22 25 28 31 P1 P2
3 7 11 15 19 23 26 29 32 P1 P2
4 8 12 16 20 xx xx xx xx P1 P2
But if we run the same calculation and instead assume an average recordsize=1M, things look a little different. In this case, we'll need to add five 4K sectors of padding, for a space efficiency of 256/261 ~= 98%. In that case zfs list would report an estimate of 16.4TB * (24 - 6 drives) * 0.98 = 289.5TB. This definitely looks better, but it's really just a more optimistic estimate. No actual space was gained or lost in the dRAID config.
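Both estimates follow from the same formula, sketched here with the thread's numbers plugged in as assumptions (16.4TB drives, 18 drives' worth of data after subtracting parity and spares, 9 data drives per group):

```python
from math import ceil

def draid_estimate_tb(drive_tb: float, data_drives: int,
                      d: int, block_sectors: int) -> float:
    """Raw data capacity scaled by the full-stripe space efficiency
    for an assumed average block size (given in 4K sectors)."""
    rows = ceil(block_sectors / d)            # padded rows per block
    efficiency = block_sectors / (rows * d)   # data / data+padding
    return drive_tb * data_drives * efficiency

# draid2:9d:24c:2s, 16.4TB drives, 24 - 6 = 18 drives worth of data
print(draid_estimate_tb(16.4, 18, 9, 32))    # 128K blocks -> ~262.4
print(draid_estimate_tb(16.4, 18, 9, 256))   # 1M blocks   -> ~289.5
```

Changing only the assumed block size moves the estimate from ~262T to ~289T, illustrating that the gap is an artifact of the 128K assumption rather than capacity that physically disappears.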
It's also worth mentioning that it's partly because of this required padding that the man page recommends adding a special mirror vdev to your dRAID pool. Not only is this good for performance, since all the pool metadata will be stored on faster storage, but it's also more space efficient. I hope this helps answer some of your questions.
Thank you very much for the quick response, and the very good explanation, which makes sense. For this particular filesystem, most of the files will be pretty large (1-10 GB) and of course we will have a special metadata device which will hold not only the metadata but also smaller files, on a 3-way mirrored nvme vdev. We are lucky that we have an existing dataset that we have run some of L1's Wendel scripts on, which gave us an idea as to how large this metadata device should be and where to set the size limit for what is stored there. The crazy thing is that we kinda have to take your word for it :-) I was thinking of creating the final pool and then loading it with some large files to see how the free avail value behaves, but I guess that the more free space you have, the more "wrong" this free space calculation is? ;-) One thing I find odd is that I read somewhere that the recordsize has been 1M since 2015, yet zfs list assumes 128K? ;-)
One follow-up question, and I am sorry if this is not directly related.
Once we have created our new pool with a larger recordsize of 1M, we have to replicate data from the old zpool, where I can see that the largest record size is 128K (based on the zdb -Lbbbs command).
Support for 1M record sizes was added way back in 2015, but the default was left at 128K.
As for send/receive it won't increase the original block size, so in this case you'll want to use something like rsync.
It sounds like you already have some scripts to determine how large to size the special devices. Another nice way to do this is with zdb -bbb <pool>, which will generate a histogram of used capacity by block size.
Yes, I can confirm that zfs send was not able to utilize the larger record size on the destination pool... which is sad, because zfs send can move about 1GB/sec while rsync is roughly half that :-( I guess I will also have to configure the backup destinations to accept larger records... will it actually fail, or what exactly will happen? ;-)
Closing. The available space is being reported correctly. That said, I completely agree it's not intuitive why the value may be lower than expected and we should probably consider assuming a 1M block size instead for dRAID pools.
System information
Describe the problem you're observing
We're going to use the draid feature on a server with 36 disks.
First we created two pools with raidz1 and raidz2:
Two raidz1 pools with 17 disks in each:
And we have 523T disk space.
Two raidz2 pools with 17 disks in each:
And we have 457T disk space.
We created some test draid pools with various parameters to understand how much space we can get.
First draid pool
2 x raidz1 analogue.
draid1:16d:36c:2s
(16 disks + 1 per parity) * 2 group + 2 spare = 36 disks
Looks as expected:
Same available space as with 2 x raidz1.
Second draid pool
draid2:15d:36c:2s
(15 disks + 2 per parity) * 2 group + 2 spare = 36 disks
Why only 349T of space?
We expected the same space as 2 x raidz2: 457T.
Third draid pool
draid2:16d:36c:2s
(16 disks + 2 per parity) * 2 group + 2 spare = 38 disks (!!!)
495T is OK, but why can a draid with such strange parameters be created?
Why do we have 495T of space? We expected the same space as 2 x raidz2: 457T. More space is better, but why do we have more?
What does it mean?
Describe how to reproduce the problem
Just create pools as described above.