Closed gleb-shchavlev closed 1 year ago
That is surprising, the space usage should be close to what's reported for a similar raidz config. Using the 2.1.5 release I wasn't able to reproduce this. Can you double check that all of the drives are the expected size?
$ truncate -s 16T /var/tmp/vdev{1..36}
$ zpool create tank draid2:15d:36c:2s /var/tmp/vdev{1..36}
$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
draid2:15d:36c:2s-0 ONLINE 0 0 0
/var/tmp/vdev1 ONLINE 0 0 0
...
/var/tmp/vdev36 ONLINE 0 0 0
spares
draid2-0-0 AVAIL
draid2-0-1 AVAIL
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 544T 816K 544T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 682K 455T 171K /tank
(16 disks + 2 per parity) * 2 group + 2 spare = 38 disks (!!!) draid_ngroups: 17
With dRAID you're allowed to independently select the parity level, number of data drives, spares, and total children. You don't need to worry about the total number of groups; ZFS will calculate the optimal number to best utilize the capacity. If you're interested in the details there's a nice comment in the code which describes the layout.
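As an illustration of how the group count falls out (my own arithmetic based on the layout description, not the actual ZFS code), the number of groups is the smallest count whose combined width lands on a whole number of passes over the non-spare children. This reproduces the surprising `draid_ngroups: 17` for the `draid2:16d:36c:2s` config:

```python
from math import gcd

def draid_ngroups(d: int, p: int, c: int, s: int) -> int:
    """Smallest group count whose total width (d + p per group)
    covers the non-spare children (c - s) a whole number of times."""
    groupwidth = d + p
    ndisks = c - s
    return ndisks // gcd(groupwidth, ndisks)

# draid2:16d:36c:2s -> 18-wide groups over 34 disks -> 17 groups
print(draid_ngroups(16, 2, 36, 2))  # -> 17
```

For `draid2:15d:36c:2s` the 17-wide group divides the 34 non-spare disks evenly, so only 2 groups are needed, matching the "2 group" intuition above.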
I checked
$ zpool destroy tank
$ zpool list
no pools available
$ zpool create tank draid2:15d:36c:2s /dev/disk/by-id/ata-ST18000NM000J-2TV103_????????
$ zpool status tank
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
draid2:15d:36c:2s-0 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR5018SP ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR501PPB ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR501VZV ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR501WJ6 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR5021YA ONLINE 0 0 0
ata-ST18000NM000J-2TV103_WR503567 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR502F07 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR50FAPC ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR50G6ML ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51TF4F ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51W9J5 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51WQPZ ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51WWZ8 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51XKFL ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51XLC9 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51XY7D ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR51YFW6 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR524C4Y ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR5265KZ ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR526J39 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR526L4R ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR526PPS ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527G8X ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527GKT ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527JFB ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527JMM ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR527KRA ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR528AQY ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR529BBQ ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52A1CA ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52AB7P ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52B6C6 ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52B84L ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52B8WM ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR52BFTC ONLINE 0 0 0
ata-ST18000NM000J-2TV103_ZR53G81R ONLINE 0 0 0
spares
draid2-0-0 AVAIL
draid2-0-1 AVAIL
errors: No known data errors
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 557T 6.57M 557T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 4.12M 349T 1023K /tank
$ df -h | grep tank
tank 349T 1.0M 349T 1% /tank
$ smartctl -a /dev/disk/by-id/ata-ST18000NM000J-2TV103_WR5018SP
smartctl 7.1 2020-08-23 r5080 [x86_64-linux-5.15.47-1.el8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST18000NM000J-2TV103
Serial Number: secret
LU WWN Device Id: secret
Firmware Version: SN02
User Capacity: 18,000,207,937,536 bytes [18.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Aug 3 23:06:13 2022 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
And on 16 TB files:
$ zpool destroy tank
$ truncate -s 16T /var/tmp/vdev{1..36}
$ zpool create tank draid2:15d:36c:2s /var/tmp/vdev{1..36}
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 544T 842K 544T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 703K 455T 171K /tank
And on 18 TB files:
$ zpool destroy tank
$ truncate -s 18T /var/tmp/vdev{1..36}
$ zpool create tank draid2:15d:36c:2s /var/tmp/vdev{1..36}
$ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 612T 842K 612T - - 0% 0% 1.00x ONLINE -
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank 703K 511T 171K /tank
And information about the disks:
$ lsblk -b
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 256060514304 0 disk
├─sda1 8:1 0 1073741824 0 part
├─sda2 8:2 0 4294967296 0 part
└─sda3 8:3 0 250690756608 0 part
sdb 8:16 0 256060514304 0 disk
├─sdb1 8:17 0 1073741824 0 part /boot
├─sdb2 8:18 0 4294967296 0 part [SWAP]
├─sdb3 8:19 0 75161927680 0 part /
├─sdb4 8:20 0 1024 0 part
└─sdb5 8:21 0 175527428096 0 part /home
sdc 8:32 0 18000207937536 0 disk
├─sdc1 8:33 0 18000197451776 0 part
└─sdc9 8:41 0 8388608 0 part
sdd 8:48 0 18000207937536 0 disk
├─sdd1 8:49 0 18000197451776 0 part
└─sdd9 8:57 0 8388608 0 part
sde 8:64 0 18000207937536 0 disk
├─sde1 8:65 0 18000197451776 0 part
└─sde9 8:73 0 8388608 0 part
sdf 8:80 0 18000207937536 0 disk
├─sdf1 8:81 0 18000197451776 0 part
└─sdf9 8:89 0 8388608 0 part
sdg 8:96 0 18000207937536 0 disk
├─sdg1 8:97 0 18000197451776 0 part
└─sdg9 8:105 0 8388608 0 part
sdh 8:112 0 18000207937536 0 disk
├─sdh1 8:113 0 18000197451776 0 part
└─sdh9 8:121 0 8388608 0 part
sdi 8:128 0 18000207937536 0 disk
├─sdi1 8:129 0 18000197451776 0 part
└─sdi9 8:137 0 8388608 0 part
sdj 8:144 0 18000207937536 0 disk
├─sdj1 8:145 0 18000197451776 0 part
└─sdj9 8:153 0 8388608 0 part
sdk 8:160 0 18000207937536 0 disk
├─sdk1 8:161 0 18000197451776 0 part
└─sdk9 8:169 0 8388608 0 part
sdl 8:176 0 18000207937536 0 disk
├─sdl1 8:177 0 18000197451776 0 part
└─sdl9 8:185 0 8388608 0 part
sdm 8:192 0 18000207937536 0 disk
├─sdm1 8:193 0 18000197451776 0 part
└─sdm9 8:201 0 8388608 0 part
sdn 8:208 0 18000207937536 0 disk
├─sdn1 8:209 0 18000197451776 0 part
└─sdn9 8:217 0 8388608 0 part
sdo 8:224 0 18000207937536 0 disk
├─sdo1 8:225 0 18000197451776 0 part
└─sdo9 8:233 0 8388608 0 part
sdp 8:240 0 18000207937536 0 disk
├─sdp1 8:241 0 18000197451776 0 part
└─sdp9 8:249 0 8388608 0 part
sdq 65:0 0 18000207937536 0 disk
├─sdq1 65:1 0 18000197451776 0 part
└─sdq9 65:9 0 8388608 0 part
sdr 65:16 0 18000207937536 0 disk
├─sdr1 65:17 0 18000197451776 0 part
└─sdr9 65:25 0 8388608 0 part
sds 65:32 0 18000207937536 0 disk
├─sds1 65:33 0 18000197451776 0 part
└─sds9 65:41 0 8388608 0 part
sdt 65:48 0 18000207937536 0 disk
├─sdt1 65:49 0 18000197451776 0 part
└─sdt9 65:57 0 8388608 0 part
sdu 65:64 0 18000207937536 0 disk
├─sdu1 65:65 0 18000197451776 0 part
└─sdu9 65:73 0 8388608 0 part
sdv 65:80 0 18000207937536 0 disk
├─sdv1 65:81 0 18000197451776 0 part
└─sdv9 65:89 0 8388608 0 part
sdw 65:96 0 18000207937536 0 disk
├─sdw1 65:97 0 18000197451776 0 part
└─sdw9 65:105 0 8388608 0 part
sdx 65:112 0 18000207937536 0 disk
├─sdx1 65:113 0 18000197451776 0 part
└─sdx9 65:121 0 8388608 0 part
sdy 65:128 0 18000207937536 0 disk
├─sdy1 65:129 0 18000197451776 0 part
└─sdy9 65:137 0 8388608 0 part
sdz 65:144 0 18000207937536 0 disk
├─sdz1 65:145 0 18000197451776 0 part
└─sdz9 65:153 0 8388608 0 part
sdaa 65:160 0 18000207937536 0 disk
├─sdaa1 65:161 0 18000197451776 0 part
└─sdaa9 65:169 0 8388608 0 part
sdab 65:176 0 18000207937536 0 disk
├─sdab1 65:177 0 18000197451776 0 part
└─sdab9 65:185 0 8388608 0 part
sdac 65:192 0 18000207937536 0 disk
├─sdac1 65:193 0 18000197451776 0 part
└─sdac9 65:201 0 8388608 0 part
sdad 65:208 0 18000207937536 0 disk
├─sdad1 65:209 0 18000197451776 0 part
└─sdad9 65:217 0 8388608 0 part
sdae 65:224 0 18000207937536 0 disk
├─sdae1 65:225 0 18000197451776 0 part
└─sdae9 65:233 0 8388608 0 part
sdaf 65:240 0 18000207937536 0 disk
├─sdaf1 65:241 0 18000197451776 0 part
└─sdaf9 65:249 0 8388608 0 part
sdag 66:0 0 18000207937536 0 disk
├─sdag1 66:1 0 18000197451776 0 part
└─sdag9 66:9 0 8388608 0 part
sdah 66:16 0 18000207937536 0 disk
├─sdah1 66:17 0 18000197451776 0 part
└─sdah9 66:25 0 8388608 0 part
sdai 66:32 0 18000207937536 0 disk
├─sdai1 66:33 0 18000197451776 0 part
└─sdai9 66:41 0 8388608 0 part
sdaj 66:48 0 18000207937536 0 disk
├─sdaj1 66:49 0 18000197451776 0 part
└─sdaj9 66:57 0 8388608 0 part
sdak 66:64 0 18000207937536 0 disk
├─sdak1 66:65 0 18000197451776 0 part
└─sdak9 66:73 0 8388608 0 part
sdal 66:80 0 18000207937536 0 disk
├─sdal1 66:81 0 18000197451776 0 part
└─sdal9 66:89 0 8388608 0 part
@behlendorf I can hold this hardware for a day if you have some thoughts to test; otherwise I'll unfortunately have to use raidz for now to start production usage.
@gleb-shchavlev I looked in to this a bit and the decrease in reported usable capacity is caused by:
1) the relatively wide RAID stripe width (16d+2p), 2) the 4k sector size (ashift=12), and 3) the lack of variable stripe widths.
With dRAID, variable stripe widths are not supported, which differs from RAIDZ. This means every RAID stripe will be padded out to the full stripe width if needed. For a 16d+2p configuration with 4k sectors that makes the minimum allocation size 16*4k=64K. If the pool is primarily storing large files (>1M) this overhead is minimal, but if you'll be storing small files (<64k) it will be significant. This is the fundamental tradeoff which needed to be made in order to support sequential resilvering for dRAID, and why this feature can't be supported with RAIDZ.
Which vdev configuration is right for you will depend on your expected workload. If you'd like to use dRAID for the faster rebuild times, then using either a narrower stripe width (say 8d+2p) or a smaller 512 byte sector size (ashift=9) will let you reduce the minimum allocation size, and with it increase the reported available capacity.
zpool create tank draid2:8d:36c:2s
or
zpool create -o ashift=9 tank draid2:16d:36c:2s
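The effect of those two alternatives on the minimum allocation size can be sketched directly from the rule above (this is just the `d * sector size` arithmetic, not actual ZFS behavior):

```python
def min_alloc_bytes(d: int, ashift: int) -> int:
    """Minimum dRAID allocation: one sector on each of the d data
    drives (smaller writes are padded up to this size)."""
    return d * (1 << ashift)

for d, ashift in [(16, 12), (8, 12), (16, 9)]:
    kib = min_alloc_bytes(d, ashift) // 1024
    print(f"{d}d ashift={ashift}: {kib}K minimum allocation")
```

So going from 16d/ashift=12 (64K) to either 8d/ashift=12 (32K) or 16d/ashift=9 (8K) shrinks the smallest possible allocation and the padding overhead with it.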
Many thanks for the help!
We will be storing large files (>1M). This is backup S3 storage with minio on top of zfs.
I tried to create all possible pools and evaluate free space.
Command
zpool create -o ashift=9 [type_from_table] /dev/disk/by-id/ata-ST18000NM000J-2TV103_????????
Results
type | zfs list size |
---|---|
draid1:1d:36c:2s | 279T |
draid1:2d:36c:2s | 371T |
draid1:3d:36c:2s | 405T |
draid1:4d:36c:2s | 445T |
draid1:5d:36c:2s | 424T |
draid1:6d:36c:2s | 424T |
draid1:7d:36c:2s | 445T |
draid1:8d:36c:2s | 495T |
draid1:9d:36c:2s | 445T |
draid1:10d:36c:2s | 405T |
draid1:11d:36c:2s | 495T |
draid1:12d:36c:2s | 457T |
draid1:13d:36c:2s | 424T |
draid1:14d:36c:2s | 396T |
draid1:15d:36c:2s | 371T |
draid1:16d:36c:2s | 523T |
draid1:17d:36c:2s | 495T |
draid1:18d:36c:2s | 469T |
draid1:19d:36c:2s | 445T |
draid1:20d:36c:2s | 424T |
draid1:21d:36c:2s | 405T |
draid1:22d:36c:2s | 387T |
draid1:23d:36c:2s | 371T |
draid1:24d:36c:2s | 356T |
draid1:25d:36c:2s | 343T |
draid1:26d:36c:2s | 330T |
draid1:27d:36c:2s | 318T |
draid1:28d:36c:2s | 307T |
draid1:29d:36c:2s | 297T |
draid1:30d:36c:2s | 287T |
draid1:31d:36c:2s | 279T |
draid1:32d:36c:2s | 540T |
draid1:33d:36c:2s | 523T |
draid1 with 34d cannot be created:
requested number of dRAID data disks per group 34 is too high,
at most 33 disks are available for data
type | zfs list size |
---|---|
draid2:1d:36c:2s | 185T |
draid2:2d:36c:2s | 279T |
draid2:3d:36c:2s | 323T |
draid2:4d:36c:2s | 371T |
draid2:5d:36c:2s | 363T |
draid2:6d:36c:2s | 371T |
draid2:7d:36c:2s | 396T |
draid2:8d:36c:2s | 445T |
draid2:9d:36c:2s | 405T |
draid2:10d:36c:2s | 371T |
draid2:11d:36c:2s | 457T |
draid2:12d:36c:2s | 424T |
draid2:13d:36c:2s | 396T |
draid2:14d:36c:2s | 371T |
draid2:15d:36c:2s | 349T |
draid2:16d:36c:2s | 495T |
draid2:17d:36c:2s | 469T |
draid2:18d:36c:2s | 445T |
draid2:19d:36c:2s | 424T |
draid2:20d:36c:2s | 405T |
draid2:21d:36c:2s | 387T |
draid2:22d:36c:2s | 371T |
draid2:23d:36c:2s | 356T |
draid2:24d:36c:2s | 343T |
draid2:25d:36c:2s | 330T |
draid2:26d:36c:2s | 318T |
draid2:27d:36c:2s | 307T |
draid2:28d:36c:2s | 297T |
draid2:29d:36c:2s | 287T |
draid2:30d:36c:2s | 279T |
draid2:31d:36c:2s | 270T |
draid2:32d:36c:2s | 523T |
draid2 with 33d cannot be created:
requested number of dRAID data disks per group 33 is too high,
at most 32 disks are available for data
type | zfs list size |
---|---|
draid2:1d:36c:1s | 191T |
draid2:2d:36c:1s | 287T |
draid2:3d:36c:1s | 333T |
draid2:4d:36c:1s | 382T |
draid2:5d:36c:1s | 374T |
draid2:6d:36c:1s | 382T |
draid2:7d:36c:1s | 408T |
draid2:8d:36c:1s | 458T |
draid2:9d:36c:1s | 417T |
draid2:10d:36c:1s | 382T |
draid2:11d:36c:1s | 470T |
draid2:12d:36c:1s | 437T |
draid2:13d:36c:1s | 408T |
draid2:14d:36c:1s | 382T |
draid2:15d:36c:1s | 360T |
draid2:16d:36c:1s | 510T |
draid2:17d:36c:1s | 483T |
draid2:18d:36c:1s | 458T |
draid2:19d:36c:1s | 437T |
draid2:20d:36c:1s | 417T |
draid2:21d:36c:1s | 399T |
draid2:22d:36c:1s | 382T |
draid2:23d:36c:1s | 366T |
draid2:24d:36c:1s | 353T |
draid2:25d:36c:1s | 339T |
draid2:26d:36c:1s | 327T |
draid2:27d:36c:1s | 316T |
draid2:28d:36c:1s | 306T |
draid2:29d:36c:1s | 296T |
draid2:30d:36c:1s | 287T |
draid2:31d:36c:1s | 278T |
draid2:32d:36c:1s | 539T |
draid2:33d:36c:1s | 524T |
Summary
Maximum free space:
type | zfs list size |
---|---|
draid1:16d:36c:2s | 523T |
draid1:32d:36c:2s | 540T |
draid1:33d:36c:2s | 523T |
draid2:16d:36c:2s | 495T |
draid2:32d:36c:2s | 523T |
draid2:16d:36c:1s | 510T |
draid2:32d:36c:1s | 539T |
draid2:33d:36c:1s | 524T |
Is it correct to create a draid with so many (32d) disks?
Which option is more safe: draid2 with one spare or draid1 with two spares?
draid2:32d:36c:2s and draid2:33d:36c:1s give practically equal space, why?
How many disks can fail to keep the pool running? Does it depend on the "d" option?
I want to thank you again for your help to understand how to calculate free space for draid.
Is it correct to create a draid with so many (32d) disks?
Generally I'd recommend against going larger than about 16 data disks. As you can see from your free space table going wider doesn't directly equate to more capacity, but it will absolutely reduce performance and slow down distributed rebuilds. I find a draid2:8d config strikes a pretty reasonable balance between usable capacity, performance, and rebuild speed.
How many disks can fail to keep the pool running? Does it depend on the "d" option?
With dRAID you can lose up to the number of parity devices all at the same time. It does not depend on the "d" option.
Which option is more safe: draid2 with one spare or draid1 with two spares?
Definitely draid2 with a single spare. With this configuration you can lose any two devices, then after the pool has rebuilt to the distributed spare the pool will be resilient to another failure. Meaning you could lose up to 3 devices depending on exactly when they fail.
draid2:32d:36c:2s and draid2:33d:36c:1s gives practically equal space, why?
When comparing the reported available capacity one thing worth keeping in mind is that it's an estimate based on some reasonable assumptions (expected average recordsize, reserved capacity, dRAID layout, etc). Depending on exactly what you store in the pool your mileage will vary.
Specifically for 32d vs 33d it's because ZFS assumes an average recordsize of 128K, which would require writing one 4K sector to each of the 32 data drives. Which works out nicely on paper. Increasing to 33 data drives doesn't affect things too much since the RAID stripe would only be padded out by a single extra 4k sector. Conversely, dropping to 31 data drives means we'd need to write 8K to each disk since the data no longer fits in a single RAID row, which is why you see the drop in capacity (down to 278T). But again this is an estimate which assumes things like a 128K recordsize, 100% incompressible data, etc, and in reality things don't fall out quite so simply.
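That 31d vs 32d vs 33d reasoning can be checked with a short sketch (my own arithmetic following the explanation above, not ZFS code): a 128K block is 32 four-K sectors, and a stripe must be padded to a whole number of full-width rows.

```python
from math import ceil

SECTOR = 4096                 # 4K sectors (ashift=12)
BLOCK = 128 * 1024            # assumed average recordsize

def stripe_efficiency(d: int) -> float:
    """Fraction of written data sectors that is real data (the rest
    is padding) when one 128K block lands on d data drives."""
    sectors = BLOCK // SECTOR             # 32 data sectors
    rows = ceil(sectors / d)              # full RAID rows needed
    return sectors / (rows * d)

for d in (31, 32, 33):
    print(d, round(stripe_efficiency(d), 3))
```

With 32 data drives the block fits exactly (100%), with 33 it wastes only one padding sector (~97%), but with 31 a second row is needed and efficiency collapses to about 52%, which is why the table drops so sharply below 32d.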
I'm sorry to open this can of worms again, but I am in much the same pickle as described here. I have 24 x 18TB SAS drives which I am trying to configure, and I would like to use draid for them. I also ran into the same issue where you are allowed to use some "funny" options in the draid definitions which in my mind don't make sense. For example: draid2:9d:24c:2s, which I would guess creates two redundancy groups of 9+2 each and then adds a spare to each? This gives me 262T via zfs list. The disks are 16.4TB in lsblk, so that is now cut down to 14.5TB per disk (262/18); that is 33TB gone! I have tried ashift 9 and 12, which doesn't make any difference. I have also tested with something similar, two draid2 vdevs with one spare each, draid2:9d:12c:1s, which also gives me 262TB. A 2 vdev raidz2 with two hot spares gives me 274TB, so somewhat better, yet still 21.2TB gone.
Only when I go all the way down to something like raidz1 with 5 disks do I get 65.3TB, which is 16.3TB per disk and only a loss of 300GB.
I would like to know if this major size difference is due to configuration, or if it is related to zfs 2.1.4 (you mentioned at first that you were unable to replicate this on 2.1.5).
If you have any recommendations to set this up better, they are welcome. Our goal was to use draid with distributed spares because of the faster resilvering on these large disks... but if it comes at the price of an 11% capacity loss it may not seem so great after all. Bear in mind that this loss is after we have set aside 6 disks' worth of capacity already...
It's a good question, and I can understand the concern. What probably needs to be better explained is that zfs list has never really reported the maximum available capacity. Instead it reports an estimated capacity based on some assumptions about how the pool will be used. Specifically, that 1) all data will be stored uncompressed and 2) the average block size will be 128K (based on the default 128K recordsize property). For raidz and mirror vdevs these assumptions result in an estimate close to what you'd expect. As you mentioned above it was only off by 300GB for your raidz1 configuration.
For a dRAID configuration this estimate may be lower than you'd expect because, unlike raidz, dRAID must always write a full stripe using every data drive. This constraint is what makes a fully sequential rebuild to a distributed spare possible, but it does also mean some capacity is lost to padding.
Let's look at your draid2:9d:24c:2s config and where that 262T estimate comes from for a pool with 16.4TB drives and 4k sectors. We need to make some assumptions for the estimate, and zfs list, as mentioned above, presumes an average block size of 128K. That may or may not be the case but it's a reasonable middle ground for an estimate.
Now to store these 128K uncompressed blocks they will be effectively broken into 32 4K-sector pieces and then spread over all the disks, as in the layout below, where A-K are drives, 1-32 are the data sectors, P1/P2 are parity sectors, and the four xx sectors are added padding. That means this block was 32 / 36 = ~89% space efficient. Extrapolating that out to the whole pool, ignoring spare and parity drives, works out to 16.4TB * (24 - 6 drives) * 0.89 = 262.4TB, which is what's reported by zfs list.
A B C D E F G H I J K
---------------------------------
1 5 9 13 17 21 24 27 30 P1 P2
2 6 10 14 18 22 25 28 31 P1 P2
3 7 11 15 19 23 26 29 32 P1 P2
4 8 12 16 20 xx xx xx xx P1 P2
But if we run the same calculation and instead assume an average recordsize=1M, things look a little different. In this case, we'll need to add five 4K sectors of padding, for a space efficiency of 256/261 ~= 98%. In that case zfs list would report an estimate of 16.4TB * (24 - 6 drives) * 0.98 = 289.5TB. This definitely looks better, but it's really just a more optimistic estimate. No actual space was gained or lost in the dRAID config.
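Both estimates follow from the same formula, sketched here with the thread's numbers plugged in as assumptions (16.4TB drives, 18 drives' worth of data after subtracting parity and spares, 9 data drives per group):

```python
from math import ceil

def draid_estimate_tb(drive_tb: float, data_drives: int,
                      d: int, block_sectors: int) -> float:
    """Raw data capacity scaled by the full-stripe space efficiency
    for an assumed average block size (given in 4K sectors)."""
    rows = ceil(block_sectors / d)            # padded rows per block
    efficiency = block_sectors / (rows * d)   # data / data+padding
    return drive_tb * data_drives * efficiency

# draid2:9d:24c:2s, 16.4TB drives, 24 - 6 = 18 drives worth of data
print(draid_estimate_tb(16.4, 18, 9, 32))    # 128K blocks -> ~262.4
print(draid_estimate_tb(16.4, 18, 9, 256))   # 1M blocks   -> ~289.5
```

Changing only the assumed block size moves the estimate from ~262T to ~289T, illustrating that the gap is an artifact of the 128K assumption rather than capacity that physically disappears.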
It's also worth mentioning that it's partly because of this required padding that the man page recommends adding a special mirror vdev to your dRAID pool. Not only is this good for performance, since all the pool metadata will be stored on faster storage, but it's also more space efficient. I hope this helps answer some of your questions.
Thank you very much for the quick response, and the very good explanation, which makes sense. For this particular filesystem, most of the files will be pretty large (1-10 GB) and of course we will have a special metadata device which will hold not only the metadata but also smaller files, on a 3-way mirrored nvme vdev. We are lucky that we have an existing dataset that we have run some of L1's Wendel scripts on, which gave us an idea as to how large this metadata device should be and where to set the size limit for what is stored there. The crazy thing is that we kinda have to take your word for it :-) I was thinking of creating the final pool and then loading it with some large files to see how the free avail value behaves, but I guess that the more free space you have, the more "wrong" this free space calculation is? ;-) One thing I find odd is that I read somewhere that the recordsize has been 1M since 2015, yet zfs list assumes 128K? ;-)
One follow-up question, and I am sorry if this is not directly related.
Once we have created our new pool with a larger recordsize of 1M, we have to replicate data from the old zpool, where I can see that the largest record size is 128K (based on the zdb -Lbbbs command).
Support for 1M record sizes was added way back in 2015, but the default was left at 128K.
As for send/receive it won't increase the original block size, so in this case you'll want to use something like rsync.
It sounds like you already have some scripts to determine how large to size the special devices. Another nice way to do this is with zdb -bbb <pool>, which will generate a histogram of used capacity by block size.
Yes, I can confirm that zfs send was not able to utilize the larger record size on the destination pool... which is sad, because zfs send can move about 1GB/sec while rsync is roughly half that :-( I guess I will also have to configure the backup destinations to accept larger records... will it actually fail, or what exactly will happen? ;-)
Closing. The available space is being reported correctly. That said, I completely agree it's not intuitive why the value may be lower than expected and we should probably consider assuming a 1M block size instead for dRAID pools.
System information
Describe the problem you're observing
We're going to use the draid feature on a server with 36 disks.
First we created two pools with raidz1 and raidz2:
Two raidz1 pools with 17 disks in each:
And we have 523T disk space.
Two raidz2 pools with 17 disks in each:
And we have 457T disk space.
We created some test draid pools with various parameters to understand how much space we can get.
First draid pool
2 x raidz1 analogue.
draid1:16d:36c:2s
(16 disks + 1 per parity) * 2 group + 2 spare = 36 disks
Looks as expected:
Same available space as with 2 x raidz1.
Second draid pool
draid2:15d:36c:2s
(15 disks + 2 per parity) * 2 group + 2 spare = 36 disks
Why only 349T of space?
We expected the same space as 2 x raidz2: 457T.
Third draid pool
draid2:16d:36c:2s
(16 disks + 2 per parity) * 2 group + 2 spare = 38 disks (!!!)
495T is OK, but why can a draid with such strange parameters be created?
Why do we have 495T of space? We expected the same space as 2 x raidz2: 457T. More space is better, but why do we have more?
What does it mean?
Describe how to reproduce the problem
Just create pools as described above.