I wonder if this might have any impact on the OOMs I was having when working with building and tearing down several large docker containers in sequence, as mentioned in https://github.com/openzfs/zfs/issues/10255#issuecomment-2444917424 .
@satmandu ZFS ZIOs, the massive allocation of which is addressed here, are not accounted towards ARC, so they may increase memory pressure on the system and make the kernel request ARC eviction, which, being ignored due to the evil zfs_arc_shrinker_limit, may cause your OOMs. But this PR mostly addresses cases where either dedup or block cloning is at least partially in use. You may check whether you have those active.
@amotin I do have block cloning enabled.
zpool get feature@block_cloning rpool
NAME PROPERTY VALUE SOURCE
rpool feature@block_cloning active local
> I do have block cloning enabled.
> rpool feature@block_cloning active local
Enabled would not be a sign, but active may be. You may also look at how much you have cloned via zpool get all | grep bclone. It may not mean that the specific files you delete are cloned, but it is possible.
I also have it active, I believe as per
cat /sys/module/zfs/parameters/zfs_bclone_enabled
1
Looking further:
zpool get all rpool | grep bclone
rpool bcloneused 1.60G -
rpool bclonesaved 2.24G -
rpool bcloneratio 2.39x -
(Maybe that's just not enough to warrant this PR having any effect on my system.)
Also, as an aside, looking at https://github.com/moby/moby/blob/master/vendor/golang.org/x/sys/unix/ioctl_linux.go, I see the use of FICLONE and FICLONERANGE, as well as FIDEDUPERANGE if OpenZFS at some point supports that. So Docker can use block cloning functionality on Linux if available.
We usually consider deletes as netfree transactions, since they free more space than they allocate. This, though, does not promise when exactly the space is freed, which I guess in theory may allow pool overflow. Sure, things are already far from predictable in the case of dedup or cloning, but as I said, just a thought.
Yeah, it's a good thought. I agree it's not for this PR (at least, I think "just a thought" means that, heh). I wonder if the answer will be something like: if an allocation (or transaction?) comes in that we can't fulfill, we check the freeing list, and if it's running, we wait, or we force it to free enough right now, or... something.
I was also thinking today that maybe if dnode_sync_free_range_impl()
is freeing indirects, and they're higher than say L2 or L3, we could flag that through to dsl_dataset_block_kill()
so that it only puts "very large" frees on the deadlist.
I think that's an interesting idea in its own right, but maybe it could be wired in even higher up, like, past a certain size it's not marked netfree, and we use the netfree mark as a "now or later" flag. I guess then it might not be assignable, so idk, maybe we have to break it up instead of nuking the entire range in one go?
Maybe the whole answer will be in the ZIO layer though. Maybe we want a single "batch free" op that can take a list of BPs instead. Maybe that allows some batching down at the BRT and DDT layers too. Hmm hmm.
Ehh, this is all complicated, it's just deciding where you put the complexity.
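Just to make that "batch free" thought concrete, a purely hypothetical shape (the name and signature below are made up; nothing like this exists in the tree):

```c
/*
 * Entirely hypothetical: a single zio that carries a whole list of
 * BPs to free, so the BRT/DDT work can be batched and we don't pay
 * ~1280 bytes of zio_t per freed block.
 */
zio_t *zio_free_batch(zio_t *pio, spa_t *spa, uint64_t txg,
    const blkptr_t *bps, uint64_t nbps, zio_flag_t flags);
```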
As far as I can see, zfs_max_async_dedup_frees does not apply to cloned blocks, since cloning is not easily visible from the block pointer, so I wonder if the issue may still be reproducible when the sync thread tries to free all the cloned blocks at once, or at least as many as it can within the few seconds of TXG timeout. I wonder if we could somehow add a limit for cloned blocks there, or whether we should re-enable zfs_async_block_max_blocks.
Untested, but the clone block check might be just:
diff --git module/zfs/dsl_scan.c module/zfs/dsl_scan.c
index 6cd0dbdea..0de1df948 100644
--- module/zfs/dsl_scan.c
+++ module/zfs/dsl_scan.c
@@ -3578,7 +3578,7 @@ dsl_scan_free_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
 	    -bp_get_dsize_sync(scn->scn_dp->dp_spa, bp),
 	    -BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
 	scn->scn_visited_this_txg++;
-	if (BP_GET_DEDUP(bp))
+	if (BP_GET_DEDUP(bp) || brt_maybe_exists(scn->scn_dp->dp_spa, bp))
 		scn->scn_dedup_frees_this_txg++;
 	return (0);
 }
It's a bit tedious, since we did that test to get into this path, and again moments before in zio_free_sync()
. It's not super expensive, but there's still a read lock and a couple of AVL lookups. I have no clear sense of what that means for contention elsewhere, nor for scan free performance.
Looking at the CI results it looks like with the change we're running afoul of some of the accounting.
https://github.com/openzfs/zfs/actions/runs/11714317314/job/32635388639?pr=16722#step:11:2332
[ 3876.039204] ZTS run /usr/share/zfs/zfs-tests/tests/functional/mmp/mmp_hostid
[ 3876.996904] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (0 >= 1024)
[ 3876.999814] PANIC at vdev.c:5133:vdev_space_update()
[ 3877.002032] Showing stack for process 225423
[ 3877.004900] CPU: 1 PID: 225423 Comm: dp_sync_taskq Tainted: P OE 5.10.0-33-amd64 #1 Debian 5.10.226-1
[ 3877.010650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 3877.014083] Call Trace:
[ 3877.015473] dump_stack+0x6b/0x83
[ 3877.017042] spl_panic+0xd4/0xfc [spl]
[ 3877.018735] ? srso_alias_return_thunk+0x5/0x7f
[ 3877.020574] ? srso_alias_return_thunk+0x5/0x7f
[ 3877.024945] ? zio_write+0x65/0x110 [zfs]
[ 3877.026679] ? remove_reference+0x1b0/0x1b0 [zfs]
[ 3877.028553] ? srso_alias_return_thunk+0x5/0x7f
[ 3877.030445] ? zfs_btree_find+0x174/0x320 [zfs]
[ 3877.032328] ? arc_write+0x38c/0x730 [zfs]
[ 3877.034069] ? arc_hdr_alloc_abd+0x250/0x250 [zfs]
[ 3877.036564] vdev_space_update+0x1e3/0x2d0 [zfs]
[ 3877.038642] arc_hdr_l2hdr_destroy+0x67/0xe0 [zfs]
[ 3877.040771] arc_hdr_destroy+0x95/0x3b0 [zfs]
[ 3877.042658] arc_freed+0x64/0x170 [zfs]
[ 3877.044562] zio_free_sync+0x66/0x190 [zfs]
[ 3877.046567] zio_free+0xbc/0x100 [zfs]
[ 3877.048418] dsl_dataset_block_kill+0x745/0xb50 [zfs]
[ 3877.050855] free_blocks+0x19b/0x340 [zfs]
[ 3877.055043] dnode_sync_free_range_impl+0x1c4/0x360 [zfs]
[ 3877.057633] dnode_sync_free_range+0x73/0xf0 [zfs]
[ 3877.059958] ? dnode_sync_free_range_impl+0x360/0x360 [zfs]
[ 3877.062647] range_tree_walk+0x6f/0xb0 [zfs]
[ 3877.064665] dnode_sync+0x42e/0xf70 [zfs]
[ 3877.066538] dmu_objset_sync_dnodes+0x88/0x130 [zfs]
[ 3877.068768] sync_dnodes_task+0x42/0x1e0 [zfs]
[ 3877.070695] taskq_thread+0x285/0x5c0 [spl]
[ 3877.072901] ? wake_up_q+0xa0/0xa0
[ 3877.074402] ? dnode_rele_task+0x70/0x70 [zfs]
[ 3877.076179] ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 3877.078053] kthread+0x11b/0x140
[ 3877.079357] ? __kthread_bind_mask+0x60/0x60
[ 3877.081172] ret_from_fork+0x22/0x30
Hrm. Definitely transient, can't reproduce locally and most of the test runners don't show it. I'll have to do a code read.
It looks like 2 of the test runners hit it. Here's the other one from the ubuntu20 runner.
[ 752.808441] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (72192 >= 131072)
[ 752.812239] PANIC at vdev.c:5133:vdev_space_update()
I could use a hand with this one. @amotin you're probably my best hope here.
AIUI, it's trying to credit vs_alloc
with 1K, but it's already 0.
[ 3876.996904] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (0 >= 1024)
[ 3876.999814] PANIC at vdev.c:5133:vdev_space_update()
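For reference, paraphrasing the check being tripped (this is the shape of it from the panic above, not the literal vdev.c source):

```c
/*
 * Paraphrase only: when space is returned to a vdev, alloc_delta is
 * negative, and the running allocated-space counter must already
 * cover it. Crediting back more than was ever charged trips this.
 */
void
vdev_space_update(vdev_t *vd, int64_t alloc_delta, int64_t defer_delta,
    int64_t space_delta)
{
	/* ... */
	VERIFY(vd->vdev_stat.vs_alloc >= -alloc_delta);
	vd->vdev_stat.vs_alloc += alloc_delta;
	/* ... */
}
```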
By this call stack, this free is not a deferred free; we go straight into zio_free()
and zio_free_sync()
and try to effect the free on the spot:
[ 3877.036564] vdev_space_update+0x1e3/0x2d0 [zfs]
[ 3877.038642] arc_hdr_l2hdr_destroy+0x67/0xe0 [zfs]
[ 3877.040771] arc_hdr_destroy+0x95/0x3b0 [zfs]
[ 3877.042658] arc_freed+0x64/0x170 [zfs]
[ 3877.044562] zio_free_sync+0x66/0x190 [zfs]
[ 3877.046567] zio_free+0xbc/0x100 [zfs]
[ 3877.048418] dsl_dataset_block_kill+0x745/0xb50 [zfs]
Now, I don't really understand arc_freed() very well, but its comment has enough interesting words to suggest relevance. This seems likely to be some sort of ordering change: before, the free came through while some sort of "special" IO (prefetch or dedup, by the comment) was outstanding, so we did nothing, and then that replacement IO handled the free. Now the free can come after, so that special IO does the free (dereference?), and then we arrive in arc_freed() and it's already been done.
By the conditional, there's no L1 header or references (fine). Then in arc_hdr_destroy(), it does have an L2 header, so we go down that road and end up discovering the accounting error.
There are three failing tests above. The logs make it hard to see what, if anything, is triggering it. One time it's running the cache suite, another time mmp_hostid, which does use a cache device.
I don't know enough about the ARC and especially L2ARC to know what to do here. My high-level guesses are one or more of:
- arc_freed() needs a tighter "in use" check.

@robn L2ARC is quite a special beast. It is written solely by l2arc_write_buffers() and freed either by l2arc_evict() just before new writing, or, as shown in the backtrace, when the respective block is evicted from ARC as a result of being freed. L2ARC has no traditional spacemap and so does not use FREE ZIOs. To hit that assertion there must be some bug in the vdev_space_update(dev->l2ad_vdev, ...) calls. I have no clue how it can be related to this PR other than some possible timing changes, but looking at the code now I think there might be a bug in l2arc_write_buffers(). I think it should call vdev_space_update(dev->l2ad_vdev, ...) before zio_nowait(wzio); otherwise I might speculate a scenario where l2arc_write_done() is called in between, completing the L2ARC write, after which the block might be freed, triggering the assertion. Or alternatively, I don't remember right now how block frees are handled for blocks that are in the process of being written to L2ARC. I'm too tired and sleepy now, but it seems like they might be protected by the HDR_L2_WRITING() flag. I guess I could have opened the race there earlier this year with #16040.
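To make the accounting pairing explicit, a simplified sketch (not the real function bodies) of the two sides of the l2ad_vdev accounting:

```c
/*
 * Simplified sketch, not the real code: the L2ARC charges l2ad_vdev
 * when it writes a header out and credits it back when the header is
 * destroyed. The assertion above fires if a credit runs without (or
 * before) its matching charge.
 */
static void
l2arc_write_buffers_sketch(l2arc_dev_t *dev, int64_t asize)
{
	/* charge: the block now occupies asize bytes on the cache device */
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
	/* ... build and issue the actual L2ARC write ... */
}

static void
arc_hdr_l2hdr_destroy_sketch(l2arc_dev_t *dev, int64_t asize)
{
	/* credit: the block no longer occupies space on the cache device */
	vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);
}
```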
> I think it should call vdev_space_update(dev->l2ad_vdev, ...) before zio_nowait(wzio); otherwise I might speculate a scenario where l2arc_write_done() is called in between, completing the L2ARC write, after which the block might be freed, triggering the assertion.
@robn After a second look, the ordering between vdev_space_update() and zio_nowait(wzio); does not matter, since the latter has no done callback. The callback is set for pio, but zio_wait(pio) is not called until the end of the function, when we get out of the loop, so there should be no race. I'll continue looking.
@robn After a third look, I again think it is my fault. I think this should fix it: https://github.com/openzfs/zfs/pull/16743 .
https://github.com/openzfs/zfs/pull/16743 sure does look like it would explain this. @robn @amotin I've opened https://github.com/openzfs/zfs/pull/16744 with both patches applied for the CI to test this since that's where we've been able to reproduce it.
> It's a bit tedious, since we did that test to get into this path, and again moments before in zio_free_sync(). It's not super expensive, but there's still a read lock and a couple of AVL lookups. I have no clear sense of what that means for contention elsewhere, nor for scan free performance.
Yeah, it is not exactly free, despite checking only the number of entries in the AVL tree without traversal, and I think it may give too many false positives; I don't like it. But the worst I saw was lock contention, which should be reduced after https://github.com/openzfs/zfs/pull/16740 , though it is unlikely to go away completely. I don't think we can avoid the second check before and after queuing, since the list is still used for snapshot deletion and we have no way to pass the knowledge there, but if we checked it just above in the stack...
@behlendorf thanks for setting that up; that was my plan too, just couldn't get to it until today. Much appreciated!
I'll wait for your review on #16743 and here. Let me know if you want me to rebase this on top of #16743 if/when you merge it.
Merged. No rebase needed. We'll want to keep our eye on subsequent CI runs to make sure that accounting bug is really resolved.
Thanks all!
@behlendorf yep, will do. Ping me if you see anything!
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Closes: #16697 Closes: #16708 Closes: #6783
Description
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks, however, require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times in a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so just one 2T object requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.
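Spelling out the arithmetic behind those numbers:

$$
\frac{2\,\mathrm{TiB}}{128\,\mathrm{KiB}} = \frac{2^{41}}{2^{17}} = 2^{24} \approx 16.8\mathrm{M\ blocks},
\qquad 2^{24} \times 1280\,\mathrm{B} = 20\,\mathrm{GiB\ of\ zio\_cache}.
$$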
This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.
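As a rough sketch of the shape of that check (the helper names here are illustrative, not the literal change):

```c
/*
 * Sketch only; helper names are stand-ins rather than the literal
 * diff. If freeing this block would put a FREE zio on the pipeline
 * (gang, dedup or cloned), defer it to the pool's freelist so the
 * async destroy machinery trickles it out; otherwise free it on the
 * spot as before.
 */
static boolean_t
free_would_induce_io(spa_t *spa, const blkptr_t *bp)
{
	return (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
	    brt_maybe_exists(spa, bp));
}

/* At the call site, roughly (put_on_pool_freelist() is a made-up name): */
static void
dsl_free_sketch(dsl_pool_t *dp, spa_t *spa, uint64_t txg,
    const blkptr_t *bp, dmu_tx_t *tx)
{
	if (free_would_induce_io(spa, bp)) {
		/* same place destroyed datasets' blocks go */
		put_on_pool_freelist(dp, bp, tx);
	} else {
		zio_free(spa, txg, bp);
	}
}
```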
Implementation notes

I thought about a separate task queue or throttle for this, but using the pool deadlist and the async destroy machinery was much nicer. It's already there, and quiet most of the time. We have talked in the past about using it for unlink() for performance, and this does actually help with that: deleting enormous test files returned much faster, as you'd expect. However, performance is not the purpose of this PR, which is why I've kept it narrow. I'd like to think about it more before we do it for everything, not least how it interacts with a real dataset destroy, and also scrubs etc.

Still, it's cool that you can watch this with zpool wait -t free.

Discussion
There are a few different problems that lead to this happening. This patch isn't really addressing any of them, but rather detecting a known cause and routing around it. I'll write down all the trouble spots that lead to this so we at least have them.
zio_t is too big

1280 bytes is a lot for something we create by the thousands. I have a heap of ideas for how to reduce it and/or change the overheads of the IO pipeline and queues as a whole. They're all hugely invasive and not something I want to do in a hurry.
Bonus: there are many similarly-shaped bugs in the issue tracker that I believe boil down to the same kind of thing: it's easy to load up "too much" IO on the IO taskqs, which then sits there, pinning memory, until the workers at the other end can clear it, which can be an age if the thing at the end is slow. Unfortunately it's not all just zio_t size.
size."free" is not really an IO
I suppose it was probably useful to think about it that way because allocations are requested from within the IO pipeline. Most of the time though it's just metaslab_free(). Obviously the code has to go somewhere, so this is mostly felt in the API: zio_free() doesn't even take a pio, and zio_free_sync() does but might not issue IO, and also has other rules like "same txg". So I think the idea that free is "magic IO" is too magic, and its unpredictability is what gets us here.

I did try to rework just these functions to be "free this sometime, don't care how" or "free it now or error" and then use the error to decide what to do. It didn't work out and I abandoned it.
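For the record, the split I tried looked roughly like this (hypothetical prototypes, purely to illustrate the idea; neither exists in the tree):

```c
/* Free this block eventually; the caller doesn't care how or when. */
void spa_free_deferred(spa_t *spa, uint64_t txg, const blkptr_t *bp);

/*
 * Free this block immediately, without issuing IO. Returns an error
 * (e.g. EAGAIN) if that isn't possible, so the caller can choose a
 * fallback.
 */
int spa_free_now(spa_t *spa, uint64_t txg, const blkptr_t *bp);
```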
"frees that cause IO" isn't obvious
The old gang and dedup tests are fine, for now. brt_maybe_exists() is a little troubling, as it's a feature in use everywhere now, and it might mean that a lot more frees are going onto the pipeline than used to.

But I also suspect it's not the full story. #16037 shows similar behaviour but doesn't appear to involve gang, dedup or cloned blocks. I couldn't find anything else that obviously might generate a ton of IO from a free, but there's still lots of ZFS I'm not familiar with, and it also doesn't cover the future.
dmu_tx assignment isn't granular enough
Deleting a file gets a single call to dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END). In that respect it's doing nothing wrong, but that's clearly too much all at once, and is ultimately what led to this, as the write throttle can only push back on single holds.
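To illustrate the granularity point, compare the current whole-object hold with a hypothetical chunked variant (the loop below is not something this PR does; obj_size, chunk and os are assumed to come from the caller):

```c
/* Today: one hold covering the whole object, accepted or rejected as a unit. */
dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END);

/*
 * Hypothetical alternative, purely for illustration: break the range
 * into chunks, each held by its own tx, so the throttle has something
 * it can actually push back on.
 */
for (uint64_t off = 0; off < obj_size; off += chunk) {
	dmu_tx_t *ctx = dmu_tx_create(os);
	dmu_tx_hold_free(ctx, obj, off, MIN(chunk, obj_size - off));
	/* ... dmu_tx_assign(), free the range, dmu_tx_commit() ... */
}
```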
write throttle doesn't know about resource overheads

Similar vibes to #15511. Most things don't account for CPU/memory/IO overheads, and so can't push back when those are low. There's no single answer of course, but it seems we're going to run into problems like this more and more as everything gets bigger and bigger.
How Has This Been Tested?
The goal is to create a single file with millions of L0 blocks, and then drop them all at once. This worked well enough for me. Needs a spare 2T+ disk.
Watching the zio_cache field through the process is instructive. Without this patch, even if you haven't gone large enough to trigger an OOM, you will see the count blow out a long way. You may see substantial pauses while reclaim kicks in.

With this patch, it runs at a much more leisurely rate. Still gets big, but never ridiculous.
A ZTS run is going locally, though I'm not expecting to see anything much. At least, it should make sure I didn't screw up the accounting.
Update: @serjponomarev has tested this, see https://github.com/openzfs/zfs/issues/16708#issuecomment-2456590569. Seems plausible?