
zfs list used/free/refer up to 10% *smaller* than sendfile size for large recordsize pools/datasets (75% for draid!) #14420

Open malventano opened 1 year ago

malventano commented 1 year ago

edit See next comment for identified cause and proposed solution

System information

Type Version/Name
Kernel Version 5.15.83-1-pve
OpenZFS Version 2.1.7

Describe the problem you're observing

For some raidz/z2/z3 configurations and widths using recordsize > 128k:

[image: table of observed used/refer discrepancies by raidz level, vdev width, and recordsize]

Given the above data, it appears that some dsize math may be attempting to compensate for the padding / overhead of smaller records (only my guess), but that math does not appear to be accurate for those cases, and it errs in the opposite direction for datasets with larger recordsizes.

Choosing one of the more egregious entries above (8-wide raidz3 with 16MB recordsize, storing a single 16M file), here is an output of various commands:

Test file creation: dd if=/dev/urandom of=testfile bs=1M count=16 (16.0M)

ls -l testfile: -rw-r--r-- 1 root root 16777216 Jan 23 13:06 test/testfile (16.0M)

du -b testfile: 16777216 test/testfile (16.0M)

du -B1 testfile: 15315456 test/testfile (14.6M)

zfs list test: (refer=14.8M)

NAME   USED  AVAIL     REFER  MOUNTPOINT
test  15.7M   440G     14.8M  /test/test

zpool list test:

NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test   796G  27.5M   796G        -         -     0%     0%  1.00x    ONLINE  -

zfs snapshot test@1; zfs send -RLPcn test@1 | grep size: size 16821032 (16.1M)

zdb -dbdbdbdbdbdb test/ 2 (dsize=14.6M)

Dataset test [ZPL], ID 54, cr_txg 1, 14.8M, 7 objects, rootbp DVA[0]=<0:100050000:4000> DVA[1]=<0:200050000:4000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=8L/8P fill=7 cksum=10fcf1fa0c:2ec854b9ded1:44b71aa9feff71:4769acdee207900f

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    1   128K    16M  14.6M     512    16M  100.00  ZFS plain file (K=inherit) (Z=inherit=uncompressed)
                                               176   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    /testfile
    uid     0
    gid     0
    atime   Mon Jan 23 13:06:43 2023
    mtime   Mon Jan 23 13:06:43 2023
    ctime   Mon Jan 23 13:06:43 2023
    crtime  Mon Jan 23 13:06:43 2023
    gen 7
    mode    100644
    size    16777216
    parent  34
    links   1
    pflags  840800000004
Indirect blocks:
               0 L0 DVA[0]=<0:500054000:199c000> [L0 ZFS plain file] fletcher4 uncompressed unencrypted LE contiguous unique single size=1000000L/1000000P birth=8L/8P fill=1 cksum=2000977398cc31:ec0a27a616a076e9:518ba24e091e641e:4716324db9dd86fd

        segment [0000000000000000, 0000000001000000) size   16M

testfile as viewed by Windows via SMBD: (14.6M)

[image: Windows file properties showing testfile at 14.6M]

malventano commented 1 year ago

Cause of issue

Thanks to assistance from @allanjude and @rincebrain, this occurs due to the following:

/*
 * Compute the raidz-deflation ratio.  Note, we hard-code
 * in 128k (1 << 17) because it is the "typical" blocksize.
 * Even though SPA_MAXBLOCKSIZE changed, this algorithm can not change,
 * otherwise it would inconsistently account for existing bp's.
 */
static void
vdev_set_deflate_ratio(vdev_t *vd)
{
        if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
                vd->vdev_deflate_ratio = (1 << 17) /
                    (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);
        }
}

Brief summary of the above: Raidz vdevs are assumed to have a 128K blocksize, and this assumption is used to generate a deflate_ratio that is then applied to various stats related to those vdevs.
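For illustration, the standalone sketch below approximates vdev_raidz_asize() and the deflated-size calculation for the 8-wide raidz3 example above, assuming ashift=12 (which matches the 0x199c000 asize field in the DVA shown by zdb). It is a simplified model rather than the actual ZFS code paths, but it lands close to the ~25.6M allocated and ~14.6M dsize observed:

/*
 * Simplified model of raidz allocation and the 128K-based deflate ratio.
 * Approximates vdev_raidz_asize() and the dsize deflation; for illustration
 * only - the real code paths differ in detail.
 */
#include <stdio.h>
#include <stdint.h>

#define SPA_MINBLOCKSHIFT 9

static uint64_t
raidz_asize(uint64_t psize, uint64_t ashift, uint64_t ndisks, uint64_t nparity)
{
        uint64_t ndata = ndisks - nparity;
        uint64_t asize = ((psize - 1) >> ashift) + 1;           /* data sectors */
        asize += nparity * ((asize + ndata - 1) / ndata);       /* parity sectors */
        asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1); /* skip-sector padding */
        return (asize << ashift);
}

int
main(void)
{
        uint64_t ashift = 12, ndisks = 8, nparity = 3;

        /* Deflate ratio as in vdev_set_deflate_ratio(): based on one 128K block. */
        uint64_t ratio = (1ULL << 17) /
            (raidz_asize(1ULL << 17, ashift, ndisks, nparity) >> SPA_MINBLOCKSHIFT);

        /* One 16M record, as in the zdb output above. */
        uint64_t psize = 16ULL << 20;
        uint64_t asize = raidz_asize(psize, ashift, ndisks, nparity);
        uint64_t dsize = (asize >> SPA_MINBLOCKSHIFT) * ratio;  /* reported size */

        printf("asize = %llu bytes (~25.6M), dsize = %llu bytes (~14.6M, %.1f%% of 16M)\n",
            (unsigned long long)asize, (unsigned long long)dsize,
            100.0 * dsize / psize);
        return (0);
}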

In a mailing list thread related to a similar issue, @ahrens had a few things to say about the confusion stemming from the 'odd' observations this issue produces:

I agree this is confusing and it's an area we should try to improve! So I think we might need to do some more brainstorming to come up with something that is a net improvement on the current situation.

...and so here is my attempted crack at it:

Proposed solution:

edit - see revised solution below

(old proposal here) Create and expose a zpool property called `raidzshift` (the expected average block size for raidz vdevs), defaulting to 17 (128K) unless specified during pool creation. This value is then used instead of the hardcoded 128K value in the existing [vdev_set_deflate_ratio](https://github.com/openzfs/zfs/blob/master/module/zfs/vdev.c#L1873) call, enabling more accurate zfs list used/free/refer, du, etc. reporting when the raidz vdev average block size is expected to fall close to the specified `raidzshift` value (see the sketch after this list). This should correct accuracy for both ends of the spectrum:

- Pools expected to store mostly larger records/blocks/files set `raidzshift` > 17 (max = maxrecordsize).
  - Provides more accurate space reporting.
  - Prevents used/free space reports from ever going 'inverted' if set to match the highest record/block size to be stored (which carries the current behavior with hardcoded 128K and default `recordsize=128K` over to higher-recordsize pools).
  - Especially useful for large pools using more recent updates / enhancements such as recordsize=16M and special vdev + `special_small_blocks`, as those pools would see an even higher average block size on their raidz vdevs.
- Pools expected to contain raidz vdevs with recordsizes/zvol block sizes < 128K can set `raidzshift` < 17 (minimum = ashift).
  - The error rises far more sharply below 128K than it does above, especially in cases where the pool geometry offered 'perfect' storage efficiency for 128K blocks but was poorer at smaller sizes. While the benefit here would be more significant, it is tempered by those workloads being better suited for a mirror and therefore less likely to be used with a raidz.
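A minimal sketch of the shape of that change, assuming a hypothetical spa_raidzshift field plumbed through from the new property (the field and its plumbing do not exist in OpenZFS today; names are illustrative only):

/*
 * Hypothetical variant of vdev_set_deflate_ratio() using a pool-level
 * "raidzshift" property instead of the hardcoded 128K (1 << 17) base.
 * spa_raidzshift is an illustrative, non-existent field; the real plumbing
 * (property registration, validation, feature flag) is not shown.
 */
static void
vdev_set_deflate_ratio(vdev_t *vd)
{
        if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
                /* Would default to 17 (128K) to preserve current behavior. */
                uint64_t shift = vd->vdev_spa->spa_raidzshift;

                vd->vdev_deflate_ratio = (1ULL << shift) /
                    (vdev_psize_to_asize(vd, 1ULL << shift) >>
                    SPA_MINBLOCKSHIFT);
        }
}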

Making this value adjustable on the fly would likely be more complex than specifying it at pool creation, as the warning in the code comment above suggests.

allanjude commented 1 year ago

I think there may be some issues with letting users modify this value on the fly; more investigation will be required.

malventano commented 1 year ago

A few additional points and a revised proposal:

For further pathfinding I repeated a portion of the experiment on a draid3 of matching parameters (no spares for simplicity):

[image: table of observed discrepancies for the matching draid configuration]

The higher write amplification for smaller blocks on draid is as expected; however, it appears that draid, too, has a space calculation that assumes a 128K stripe width. The -50% and -75% extreme cases correlate with stripe widths of 256K (2x 128K) and 512K (4x 128K). Raising ashift lowers the width necessary to trigger the effect (e.g. 16d + ashift=14), and since this effect also impacts the reported free space, a 16d ashift=14 draid will have its reported free space cut in half regardless of the intended use case or recordsize employed. That example draid filled with 256K files would report half the expected used/refer values as well (e.g. a zfs send stream would be 2x the size indicated by the refer of its associated snapshot).
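As a rough model of that behavior (dRAID pads each allocation out to whole rows spanning all of a group's data drives; the sketch below approximates vdev_draid_asize() and assumes triple parity as in the experiment above, so the exact figures are approximate):

/*
 * Simplified model of dRAID's fixed stripe width: each block is padded to a
 * whole number of rows spanning every data drive in the group. Approximates
 * vdev_draid_asize(); illustration only.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
draid_asize(uint64_t psize, uint64_t ashift, uint64_t ndata, uint64_t nparity)
{
        uint64_t rows = ((psize - 1) / (ndata << ashift)) + 1;  /* full rows */
        return ((rows * (ndata + nparity)) << ashift);          /* data + parity + pad */
}

int
main(void)
{
        /* 16d group with ashift=14 (16K sectors), triple parity for illustration. */
        uint64_t ashift = 14, ndata = 16, nparity = 3;

        /* 128K-based deflate ratio: a 128K block fills only half of a 256K row. */
        uint64_t ratio = (1ULL << 17) /
            (draid_asize(1ULL << 17, ashift, ndata, nparity) >> 9);

        /* Parity-only ratio that a full 256K row of data would achieve. */
        uint64_t row = ndata << ashift;
        uint64_t ideal = row / (draid_asize(row, ashift, ndata, nparity) >> 9);

        printf("128K-based ratio %llu vs parity-only ratio %llu -> "
            "free space reported at ~%.0f%% of the parity-adjusted capacity\n",
            (unsigned long long)ratio, (unsigned long long)ideal,
            100.0 * ratio / ideal);
        return (0);
}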

Revised proposed solution:

Given the following:

My revised recommendation is to do away with the deflate_ratio or equivalent entirely. Chasing an arbitrary fixed (or even user-defined) value is a fool's errand and only leads to additional confusion. A newly created pool should always display the best-case free space regardless of how its geometry meshes with any arbitrarily chosen block size. The free space calculation should assume that all blocks will fit perfectly and with zero padding. With no more ratio being applied, the space consumed by written blocks will reflect whatever write amplification has taken place due to those particular block sizes (this is already done today). The difference will be that there is no more 'fudge factor' applied to raidz/draid vdevs which throws the free space (and by extension asize/used/refer of >128K records) negative. This will have the following impact on the above list:

This change will place the raidz and draid behaviors in line with other zfs vdev types, as well as with other file systems.

behlendorf commented 1 year ago

This area has certainly been a source of confusion. I agree from the end user perspective it's not at all obvious what factors come into play when estimating usable capacity, particularly for complicated layouts. This has gotten even more involved with dRAID, see https://github.com/openzfs/zfs/issues/13727#issuecomment-1278352959 for an explanation of why. And raidz expansion will complicate it further, as you pointed out.

I think getting rid of the deflate_ratio entirely would be an interesting thing to prototype. It's probably the best way to understand what the full consequences of that change would be, and if in the end it's actually preferable to the current behavior.

malventano commented 1 year ago

My initial 'light touch' and 'even lighter touch' takes, based on the presently observed behavior and some code review, and assuming the deflate_ratio (and its draid equivalent) were to be abolished:

  1. Address existing and new pools:
    1. Existing pools mounted on newer zfs versions would have their deflate_ratio inversely applied for reporting only. This would mostly (minus rounding) correct the existing underestimations of reported dsize/used/free on-the-fly without the need to refactor any existing data on those legacy pools. Internal accounting would not change and they would still internally apply the existing deflate_ratio for consistency.
    2. New pools set deflate_ratio=1. The code changes from above can then be blanketly applied to both old and new pools. These pools would benefit from reduced inaccuracies caused by applying and later inverting the !=1 deflate_ratio.
    3. (optionally) At some point in the future the changes from 1 can be reverted, leaving only very old pools to revert to the former behavior.
  2. Address only new pools:
    1. New pools set deflate_ratio=1. No other changes.
    2. (The raidz side would be an easy fix, but I've not found where draid is performing its equivalent.)

Option 2 probably makes more sense, as most pools today are not so significantly impacted, while newly created pools with larger blocks and ashifts (which are more likely to see the effects) would naturally have the issue rectified by the change. This would slowly mitigate future confusion caused by 'negative' sizes, since they would no longer be possible, leaving only the existing behavior of smaller blocks and padded stripes having their expected amplified effect on consumption, and making usage estimates far more predictable and consistent, given that dsize/used/free/refer would no longer have a progressively less relevant ratio applied to them from pool to pool.

ahrens commented 1 year ago

Sounds like people are confused by (uncompressed) files "using" less space than expected, on raidz, when the recordsize property has been increased from the default of 128k. But we are OK with files "using" more space than expected?

If that's the case, I think a simple solution would be to base the deflate ratio on recordsize=1M (the highest that it can be set without a tunable). Then, by default (recordsize=128k) files will use more space than expected. That might confuse a lot of people, since it's the default. We might consider changing the default recordsize to 1M at the same time that we change the deflate ratio to be based on 1MB blocks.
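To put rough numbers on that trade-off, here is a sketch using the same simplified raidz asize model as above, applied to the 8-wide raidz3 / ashift=12 geometry from the original report (figures are approximate and not from the real code paths):

/*
 * Rough comparison of deflate ratios based on 128K vs 1M blocks for an
 * 8-wide raidz3 with ashift=12, using a simplified vdev_raidz_asize() model.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
raidz_asize(uint64_t psize, uint64_t ashift, uint64_t ndisks, uint64_t nparity)
{
        uint64_t ndata = ndisks - nparity;
        uint64_t asize = ((psize - 1) >> ashift) + 1;
        asize += nparity * ((asize + ndata - 1) / ndata);
        asize = ((asize + nparity) / (nparity + 1)) * (nparity + 1);
        return (asize << ashift);
}

static uint64_t
deflate_ratio(uint64_t base, uint64_t ashift, uint64_t ndisks, uint64_t nparity)
{
        return (base / (raidz_asize(base, ashift, ndisks, nparity) >> 9));
}

int
main(void)
{
        uint64_t ashift = 12, ndisks = 8, nparity = 3;
        uint64_t r128k = deflate_ratio(1ULL << 17, ashift, ndisks, nparity);
        uint64_t r1m = deflate_ratio(1ULL << 20, ashift, ndisks, nparity);
        uint64_t sizes[] = { 1ULL << 17, 1ULL << 20, 1ULL << 24 };  /* 128K, 1M, 16M */

        for (int i = 0; i < 3; i++) {
                uint64_t rs = sizes[i];
                uint64_t sectors = raidz_asize(rs, ashift, ndisks, nparity) >> 9;

                /* Reported (deflated) size as a percentage of the logical size. */
                printf("%5lluK record appears as %.1f%% of its logical size with "
                    "the 128K base, %.1f%% with the 1M base\n",
                    (unsigned long long)(rs >> 10),
                    100.0 * sectors * r128k / rs, 100.0 * sectors * r1m / rs);
        }
        return (0);
}

With this geometry the 1M base makes default 128K records appear roughly 9% larger than their logical size, while 1M and 16M records land near 100%.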

Removing the deflate ratio entirely (setting it to 1) seems like it would be even more confusing, since everything would "use" substantially more space than expected. (not to mention the differing behavior on different pools / software would be very noticeable and confusing)

malventano commented 1 year ago

I think a simple solution would be to base the deflate ratio on recordsize=1M (the highest that it can be set without a tunable).

Sounds like people are confused by (uncompressed) files "using" less space than expected, on raidz, when the recordsize property has been increased from the default of 128k. But we are OK with files "using" more space than expected?

Then, by default (recordsize=128k) files will use more space than expected.

Removing the deflate ratio entirely (setting it to 1) seems like it would be even more confusing, since everything would "use" substantially more space than expected.

(not to mention the differing behavior on different pools / software would be very noticeable and confusing)

  • The 'major' result of this change only applies to those currently impacted by the 'wrong' (negative) values in the above charts.

Some change will need to be made eventually, as the current method will only add more confusion and inconsistencies over time. The question is whether it will be fixed in a way that is future-proof and does not need constant adjustment (which would lead to even further confusion).

Additional point on further thought: setting the base to 1M or 16M already gets you so close to deflate_ratio=1 that you may as well just set it to 1, so that any additional confusion is only added once and the raidz expansion / draid capacity edge cases are fully resolved with a single change.

malventano commented 1 year ago

We might consider changing the default recordsize to 1M at the same time that we change the deflate ratio to be based on 1MB blocks.

1MB may be too aggressive as a default recordsize given how poorly some random workloads can behave. Moving to 1MB there would cut IOPS for that work down to 1/8th of the 128K result, which was already noticeably slow. 128K is probably the safe middle ground as a default, even if it is not ideal for either end of the spectrum.

ahrens commented 1 year ago

I think a simple solution would be to base the deflate ratio on recordsize=1M (the highest that it can be set without a tunable).

https://github.com/openzfs/zfs/pull/13302 has already been merged, so that solution would be insufficient upon release.

Ah, I forgot about that. Then we could base the deflate ratio on recordsize=16M.

malventano commented 1 year ago

Ah, I forgot about that. Then we could base the deflate ratio on recordsize=16M.

That would indeed make things much better, but please consider the following:

...we're 99.85% of the way there!

allanjude commented 1 year ago

Currently recordsize can't go any larger without changing the structure of a block pointer, so it is less of a concern. I think changing the deflate_ratio like this may cause a lot of confusion for anyone using zvols, where the default volblocksize is 16 KB (recently raised from 8 KB). Depending on the RAID-Z layout, this might see significant inflation.

I think more thought is required here; I've not had a chance to try to digest all of the implications. This might be a good topic for tomorrow's ZFS leadership call.

malventano commented 1 year ago

zvols, where the default volblocksize is 16 KB (recently raised from 8 KB). Depending on the RAID-Z layout, this might see significant inflation.

Interesting, but would there not have already been inflation with 8K/16K block sizes with the 128K base? The max percent change moving from 128K to 16M should be 10% (worst case raidz3), and that is only for those geometries that look really bad for 128K specifically. The effect of this change should do nothing beyond what was already seen with a current raidz of 1/2/4/8 data disks (perfect ratio), and 8/16K blocks would have similar parity overhead either way.

For layouts suboptimal for 128K blocks, 8/16K blocks would be less impacted across those different geometries than the larger 128K blocks would, as they are already consuming some multiple of the reported free space. The 'aliasing' of the 128K base derived ratio means that those zvols would see a greater variation in free/consumed space across various geometries than they would with the ratio's base being a larger value.

[image: chart of expected allocation overhead by block size and raidz geometry, without assuming the 128K ratio base]

... and now with the above figures translated to reflect what is happening today with respect to free space reporting 'error':

[image: chart of today's free space reporting 'error' by block size and raidz geometry]

The blue regions above are where file sizes can appear to be smaller than they are, by the percentages indicated. Note that there is minimal impact on the values at 8K and 16K vs. what was happening already in the prior chart. Also note that the first chart here is the result of an exercise in what is expected without assuming the 128K base for the ratio, meaning the current behavior introduces some error both above and below 128K that is not normally anticipated, even by those who have previously attempted this math. Making the proposed change will make the behavior match those prior exercises, while setting the base to 16M will produce a result which still does not precisely fit those exercises, but it is close:

[image: chart of the reporting 'error' with the ratio based on 16M]

ahrens commented 1 year ago

@malventano

Basing deflate_ratio on 16M brings the value so close to 1 on today's configs ... why not simply use this opportunity to set it to 1, mitigating the need to 'keep up' with future max_recordsize changes.

I think there may be a confusion about what it means for the deflate ratio to be "1". To me this would mean that no deflation occurs, i.e. all space reporting is "bytes for data + parity". So if you have RAIDZ1 on 4x 1TB disks, you would have "4TB avail", but you can write at most 3TB of (uncompressed) user data. Writing a 1TB file (uncompressed) would use at least 1.33TB. This seems counterintuitive to me, which is why we implemented the deflate ratio to begin with, ~20 years ago. But this is not consistent with your assertion that "Basing deflate_ratio on 16M brings the value so close to 1".
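For reference, the arithmetic behind that 4x 1TB raidz1 example, as a rough sketch that ignores slop space, metadata, and allocation padding:

/*
 * The numbers behind the 4x 1TB raidz1 example above - rough sketch only.
 */
#include <stdio.h>

int
main(void)
{
        double raw_tb = 4.0;            /* 4x 1TB disks */
        double ndisks = 4.0, nparity = 1.0;
        double ndata = ndisks - nparity;

        /* Deflate ratio of exactly 1: raw bytes (data + parity) are reported. */
        printf("no deflation:   %.2f TB avail, a 1TB file uses ~%.2f TB\n",
            raw_tb, 1.0 * ndisks / ndata);

        /* Parity-aware deflation: only the data share is reported. */
        printf("with deflation: %.2f TB avail, a 1TB file uses ~%.2f TB\n",
            raw_tb * ndata / ndisks, 1.0);
        return (0);
}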

Are you suggesting that we instead base the deflate ratio on an idealized parity ratio, e.g. 75% data given RAIDZ1 on 4 disks? Is your point that we should ignore the specifics of recordsize, ashift, etc. and assume that we'll get the best possible theoretical ratio? E.g. showing "3TB avail" in the above example. The effect of this would be negligibly different from basing the deflate ratio on 16MB blocks, so if this is the user-visible behavior that we want, I think we can leave this as an implementation detail to be worked out by whoever implements it. (i.e. do we have the current deflate code but with 16MB block assumption, do we have deflate code but calculate the ratio slightly differently based on the ideal ratio, do we remove the deflate code and have some different way of achieving the same result)

malventano commented 1 year ago

Parity does not come into play for the issue I am seeking to be corrected here. It is down to how the (current) 128K base value meshes with the number of data disks in the stripe vs. how larger blocks 'fit' into those stripes.

...now if the deflate_ratio is also accounting for parity, then I missed that piece, and please substitute my prior arguments with "whatever math results in a deflate_ratio that accounts for parity but ignores any arbitrary max_recordsize base value" :)

To be extra clear, yes, parity should absolutely be accounted for in the free space calculation!

ahrens commented 1 year ago

Accounting for parity is the main point of the deflate_ratio.

malventano commented 1 year ago

Understood. Thank you for clearing that point for me.

Does this mean that a raidz which began at 75%, but was then expanded, would be stuck with the same ratio?

malventano commented 1 year ago

@allanjude thinking further on your concern about zvols, while increasing the deflate_ratio base would only minimally impact zvols beyond what the 128K value already does, perhaps merging my two proposals would assist you in increasing the free space accuracy of a raidz containing primarily zvols:

[image: chart of reporting accuracy with the deflate_ratio base matched to the zvol block size]

16K blocks on a zpool configured with deflate_ratio based on 2^14 would yield 'perfect' accuracy provided the pool only contained 16K blocks - just as today's pools do with 128K blocks.

nathandragun commented 1 year ago

ZFS version: zfs-2.1.2

Hey all, so I'm commenting here since #13727 is closed.

I am unsure if this relates to the reported draid capacity calculation issues or end-user error. Still, I'm concerned that we only see about 79% available (21% loss) for a brand-new draid array that has been built.

In short, we have an 18TB x 84-disk draid2 array: draid2:14d:84c:4s-0 using a 1M recordsize, ashift=14, and storing files greater than 10 GB each in size.

I am expecting to see approximately 1.089 PiB available after parity, but we are seeing the following reported by zfs list:

NAME     USED   AVAIL  REFER  MOUNTPOINT
XYZ      12.0T  860T   1023K  /XYZ
XYZ/ABC  12.0T  860T   12.0T  /XYZ/ABC

The more significant issue here is that while these numbers seem to be slightly moving as the pool fills, I do not see these numbers adjusting for the underlying lustre filesystem that is mounting ABC as a standard zfs-based mount:

ABC_UUID 865.3T 12.1T 853.2T 2% /mount/point

So my takeaway from this is that these availability calculations, which I hope are just estimation errors, are used by applications as (absolute?) available space. It may take a complete system reboot to update these calculations for the base storage, which is unrealistic when scaling with larger capacity arrays.

malventano commented 1 year ago

I am expecting to see approximately 1.089 PiB available after parity, but we are seeing the following reported by zfs list:

NAME     USED   AVAIL  REFER  MOUNTPOINT
XYZ      12.0T  860T   1023K  /XYZ
XYZ/ABC  12.0T  860T   12.0T  /XYZ/ABC

The current calcs for draid are assuming 128K records, and with your ashift=14 and 14d (and minus 2 for parity), that comes to 192K stripes across each redundancy group. The current behavior (that I'm seeking to correct with this issue) should put you as reporting only 66% of the 'expected' initial free space. If you were only storing 128K records then that would be a correct assumption, but it appears that your intent is to store larger records (is this what you meant when you said "1M stripe")?

The more significant issue here is that while these numbers seem to be slightly moving as the pool fills

It's expected for them to move as data is stored more or less efficiently than the assumed 128K recordsize. I'm seeking to get this corrected such that the movement is only in the downward direction for those records stored less efficiently (e.g. smaller than the maximum possible recordsize).

Another thing you should be seeing for your configuration there is that files with larger recordsizes would be reported as consuming less space than they actually did (e.g. an uncompressed file would show as roughly 1/3rd smaller than its actual/apparent size). That is another factor that I am seeking to correct.

nathandragun commented 1 year ago

I am expecting to see approximately 1.089 PiB available after parity, but we are seeing the following reported by zfs list:

NAME     USED   AVAIL  REFER  MOUNTPOINT
XYZ      12.0T  860T   1023K  /XYZ
XYZ/ABC  12.0T  860T   12.0T  /XYZ/ABC

The current calcs for draid are assuming 128K records, and with your ashift=14 and 14d (and minus 2 for parity), that comes to 192K stripes across each redundancy group. The current behavior (that I'm seeking to correct with this issue) should put you as reporting only 66% of the 'expected' initial free space. If you were only storing 128K records then that would be a correct assumption, but it appears that your intent is to store larger records (is this what you meant when you said "1M stripe")?

  1. My understanding is that "d" represents data disks, not full disk+parity stripes, so in reality, draid2 w/ 14d = a 16-disk stripe. Dividing this out for 80 disks, we get five child groups of 16 disks each.
  2. Yes, that was a typo; the 1M stripe was intended to indicate the recordsize for the vdev. (corrected for clarity in the original message)

The more significant issue here is that while these numbers seem to be slightly moving as the pool fills

It's expected for them to move as data is stored more or less efficiently than the assumed 128K recordsize. I'm seeking to get this corrected such that the movement is only in the downward direction for those records stored less efficiently (e.g. smaller than the maximum possible recordsize).

Another thing you should be seeing for your configuration there is that files with larger recordsizes would be reported as consuming less space than they actually did (e.g. an uncompressed file would show as roughly 1/3rd smaller than its actual/apparent size). That is another factor that I am seeking to correct.

Understood. I can't speak to verifying reported-versus-actual space used, but the problem is that applications using this number for "available" space are negatively impacted the further this availability figure shifts. It doesn't matter much for a non-high-availability array of a few terabytes, but it becomes a big issue when moving into the petabyte and greater range.

malventano commented 1 year ago

  1. My understanding is that "d" represents data disks, not full disk+parity stripes, so in reality, draid2 w/ 14d = a 16-disk stripe. Dividing this out for 80 disks, we get five child groups of 16 disks each.

Aah ok, in that case, it would likely be reporting closer to half of the expected capacity, and your larger recordsize files should appear to be ~57% of their expected size.
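Using the same simplified dRAID model as the earlier sketch, draid2:14d with ashift=14 works out roughly as follows (approximate; the real vdev_draid_asize(), slop space, and metadata change the exact figures):

/*
 * The simplified dRAID model applied to draid2:14d with ashift=14:
 * compare the 128K-based deflate ratio against the parity-only ratio.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
draid_asize(uint64_t psize, uint64_t ashift, uint64_t ndata, uint64_t nparity)
{
        uint64_t rows = ((psize - 1) / (ndata << ashift)) + 1;
        return ((rows * (ndata + nparity)) << ashift);
}

int
main(void)
{
        uint64_t ashift = 14, ndata = 14, nparity = 2;  /* draid2:14d, 16K sectors */

        uint64_t ratio = (1ULL << 17) /
            (draid_asize(1ULL << 17, ashift, ndata, nparity) >> 9);

        uint64_t row = ndata << ashift;                 /* 224K of data per row */
        uint64_t ideal = row / (draid_asize(row, ashift, ndata, nparity) >> 9);

        printf("128K-based ratio %llu vs parity-only ratio %llu: free space and "
            "apparent large-record sizes land at ~%.0f%% of parity-adjusted "
            "expectations\n", (unsigned long long)ratio,
            (unsigned long long)ideal, 100.0 * ratio / ideal);
        return (0);
}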

but the problem is that applications that are using this number for "available" space are negatively impacted the greater this availability shift moves

Those values shift over time regardless of configuration as files of varying recordsizes / compression are stored. What application is having a hard time with this?

nathandragun commented 1 year ago

but the problem is that applications that are using this number for "available" space are negatively impacted the greater this availability shift moves

Those values shift over time regardless of configuration as files of varying recordsizes / compression are stored. What application is having a hard time with this?

As previously stated above:

The more significant issue here is that while these numbers seem to be slightly moving as the pool fills, I do not see these numbers adjusting for the underlying lustre filesystem that is mounting ABC as a standard zfs-based mount:

ABC_UUID 865.3T 12.1T 853.2T 2% /mount/point

nathandragun commented 1 year ago

An update regarding used space being reported as less than actual space.

After moving 40+TB to the draid array, we see discrepancies between the actual space in use and what is being reported. It is hard to get exact numbers, but the reported usage is about 10TB less than expected. So, it is confirmed on this end.

One additional point of clarity: the data being moved already resides in other zfs pools using a 1M recordsize, so compression has already been factored into the equation for expected vs. actual.