I wonder if this might have any impact on the OOMs I was having when working with building and tearing down several large docker containers in sequence, as mentioned in https://github.com/openzfs/zfs/issues/10255#issuecomment-2444917424 .
@satmandu ZFS ZIOs, the massive allocation of which is addressed here, are not accounted towards ARC, so they may increase memory pressure on the system and make the kernel request ARC eviction, which, being ignored due to the evil zfs_arc_shrinker_limit, may cause your OOMs. But this PR mostly addresses cases where either dedup or block cloning is at least partially in use. You may check whether you have those active.
@amotin I do have block cloning enabled.
zpool get feature@block_cloning rpool
NAME PROPERTY VALUE SOURCE
rpool feature@block_cloning active local
> I do have block cloning enabled.
> rpool feature@block_cloning active local
Enabled would not be a sign, but active may be. You may also look at how much you have cloned via zpool get all | grep bclone. It may not mean that the specific files you delete are cloned, but it is possible.
I also have it active, I believe as per
cat /sys/module/zfs/parameters/zfs_bclone_enabled
1
Looking further:
zpool get all rpool | grep bclone
rpool bcloneused 1.60G -
rpool bclonesaved 2.24G -
rpool bcloneratio 2.39x -
(Maybe that's just not enough to warrant this PR having any effect on my system.)
Also, as an aside, looking at https://github.com/moby/moby/blob/master/vendor/golang.org/x/sys/unix/ioctl_linux.go, I see the use of FICLONE and FICLONERANGE, as well as FIDEDUPERANGE if OpenZFS at some point supports that. So Docker can use block cloning functionality on Linux if available.
We usually consider deletes as netfree transactions, since they free more space than they allocate. This, though, does not promise when exactly the space is freed, which I guess in theory may allow pool overflow. Sure, things are already far from predictable in the case of dedup or cloning, but as I said, just a thought.
Yeah, it's a good thought. I agree it's not for this PR (at least, I think "just a thought" means that, heh). I wonder if the answer will be something like: if an allocation (or transaction?) comes in that we can't fulfill, we check the freeing list, and if it's running, we wait, or we force it to free enough right now, or... something.
I was also thinking today that maybe if dnode_sync_free_range_impl()
is freeing indirects, and they're higher than say L2 or L3, we could flag that through to dsl_dataset_block_kill()
so that it only puts "very large" frees on the deadlist.
I think that's an interesting idea in its own right, but maybe it could be wired in even higher up, like, past a certain size it's not marked netfree, and we use the netfree mark as a "now or later" flag. I guess then it might not be assignable, so idk, maybe we have to break it up instead of nuking the entire range in one go?
Maybe the whole answer will be in the ZIO layer though. Maybe we want a single "batch free" op that can take a list of BPs instead. Maybe that allows some batching down at the BRT and DDT layers too. Hmm hmm.
Ehh, this is all complicated, it's just deciding where you put the complexity.
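Just to make that "batch free" thought concrete, a purely hypothetical shape (the name and signature below are made up; nothing like this exists in the tree):

```c
/*
 * Entirely hypothetical: a single zio that carries a whole list of
 * BPs to free, so the BRT/DDT work can be batched and we don't pay
 * ~1280 bytes of zio_t per freed block.
 */
zio_t *zio_free_batch(zio_t *pio, spa_t *spa, uint64_t txg,
    const blkptr_t *bps, uint64_t nbps, zio_flag_t flags);
```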
As far as I can see, zfs_max_async_dedup_frees does not apply to cloned blocks, since cloning is not easily visible from the block pointer, so I wonder if the issue may still be reproducible when the sync thread tries to free all the cloned blocks at once, or at least as many as it can within the few seconds of TXG timeout. I wonder if we could somehow add a limit for cloned blocks there, or whether we should re-enable zfs_async_block_max_blocks.
Untested, but the clone block check might be just:
diff --git module/zfs/dsl_scan.c module/zfs/dsl_scan.c
index 6cd0dbdea..0de1df948 100644
--- module/zfs/dsl_scan.c
+++ module/zfs/dsl_scan.c
@@ -3578,7 +3578,7 @@ dsl_scan_free_block_cb(void *arg, const blkptr_t *bp, dmu_tx_t *tx)
 	    -bp_get_dsize_sync(scn->scn_dp->dp_spa, bp),
 	    -BP_GET_PSIZE(bp), -BP_GET_UCSIZE(bp), tx);
 	scn->scn_visited_this_txg++;
-	if (BP_GET_DEDUP(bp))
+	if (BP_GET_DEDUP(bp) || brt_maybe_exists(scn->scn_dp->dp_spa, bp))
 		scn->scn_dedup_frees_this_txg++;
 	return (0);
 }
It's a bit tedious, since we did that test to get into this path, and again moments before in zio_free_sync()
. It's not super expensive, but there's still a read lock and a couple of AVL lookups. I have no clear sense of what that means for contention elsewhere, nor for scan free performance.
Looking at the CI results it looks like with the change we're running afoul of some of the accounting.
https://github.com/openzfs/zfs/actions/runs/11714317314/job/32635388639?pr=16722#step:11:2332
[ 3876.039204] ZTS run /usr/share/zfs/zfs-tests/tests/functional/mmp/mmp_hostid
[ 3876.996904] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (0 >= 1024)
[ 3876.999814] PANIC at vdev.c:5133:vdev_space_update()
[ 3877.002032] Showing stack for process 225423
[ 3877.004900] CPU: 1 PID: 225423 Comm: dp_sync_taskq Tainted: P OE 5.10.0-33-amd64 #1 Debian 5.10.226-1
[ 3877.010650] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 3877.014083] Call Trace:
[ 3877.015473] dump_stack+0x6b/0x83
[ 3877.017042] spl_panic+0xd4/0xfc [spl]
[ 3877.018735] ? srso_alias_return_thunk+0x5/0x7f
[ 3877.020574] ? srso_alias_return_thunk+0x5/0x7f
[ 3877.024945] ? zio_write+0x65/0x110 [zfs]
[ 3877.026679] ? remove_reference+0x1b0/0x1b0 [zfs]
[ 3877.028553] ? srso_alias_return_thunk+0x5/0x7f
[ 3877.030445] ? zfs_btree_find+0x174/0x320 [zfs]
[ 3877.032328] ? arc_write+0x38c/0x730 [zfs]
[ 3877.034069] ? arc_hdr_alloc_abd+0x250/0x250 [zfs]
[ 3877.036564] vdev_space_update+0x1e3/0x2d0 [zfs]
[ 3877.038642] arc_hdr_l2hdr_destroy+0x67/0xe0 [zfs]
[ 3877.040771] arc_hdr_destroy+0x95/0x3b0 [zfs]
[ 3877.042658] arc_freed+0x64/0x170 [zfs]
[ 3877.044562] zio_free_sync+0x66/0x190 [zfs]
[ 3877.046567] zio_free+0xbc/0x100 [zfs]
[ 3877.048418] dsl_dataset_block_kill+0x745/0xb50 [zfs]
[ 3877.050855] free_blocks+0x19b/0x340 [zfs]
[ 3877.055043] dnode_sync_free_range_impl+0x1c4/0x360 [zfs]
[ 3877.057633] dnode_sync_free_range+0x73/0xf0 [zfs]
[ 3877.059958] ? dnode_sync_free_range_impl+0x360/0x360 [zfs]
[ 3877.062647] range_tree_walk+0x6f/0xb0 [zfs]
[ 3877.064665] dnode_sync+0x42e/0xf70 [zfs]
[ 3877.066538] dmu_objset_sync_dnodes+0x88/0x130 [zfs]
[ 3877.068768] sync_dnodes_task+0x42/0x1e0 [zfs]
[ 3877.070695] taskq_thread+0x285/0x5c0 [spl]
[ 3877.072901] ? wake_up_q+0xa0/0xa0
[ 3877.074402] ? dnode_rele_task+0x70/0x70 [zfs]
[ 3877.076179] ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 3877.078053] kthread+0x11b/0x140
[ 3877.079357] ? __kthread_bind_mask+0x60/0x60
[ 3877.081172] ret_from_fork+0x22/0x30
Hrm. Definitely transient, can't reproduce locally and most of the test runners don't show it. I'll have to do a code read.
It looks like 2 of the test runners hit it. Here's the other one from the ubuntu20 runner.
[ 752.808441] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (72192 >= 131072)
[ 752.812239] PANIC at vdev.c:5133:vdev_space_update()
I could use a hand with this one. @amotin you're probably my best hope here.
AIUI, it's trying to credit vs_alloc
with 1K, but it's already 0.
[ 3876.996904] VERIFY3(vd->vdev_stat.vs_alloc >= -alloc_delta) failed (0 >= 1024)
[ 3876.999814] PANIC at vdev.c:5133:vdev_space_update()
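For reference, paraphrasing the check being tripped (this is the shape of it from the panic above, not the literal vdev.c source):

```c
/*
 * Paraphrase only: when space is returned to a vdev, alloc_delta is
 * negative, and the running allocated-space counter must already
 * cover it. Crediting back more than was ever charged trips this.
 */
void
vdev_space_update(vdev_t *vd, int64_t alloc_delta, int64_t defer_delta,
    int64_t space_delta)
{
	/* ... */
	VERIFY(vd->vdev_stat.vs_alloc >= -alloc_delta);
	vd->vdev_stat.vs_alloc += alloc_delta;
	/* ... */
}
```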
By this call stack, this free is not a deferred free; we go straight into zio_free()
and zio_free_sync()
and try to effect the free on the spot:
[ 3877.036564] vdev_space_update+0x1e3/0x2d0 [zfs]
[ 3877.038642] arc_hdr_l2hdr_destroy+0x67/0xe0 [zfs]
[ 3877.040771] arc_hdr_destroy+0x95/0x3b0 [zfs]
[ 3877.042658] arc_freed+0x64/0x170 [zfs]
[ 3877.044562] zio_free_sync+0x66/0x190 [zfs]
[ 3877.046567] zio_free+0xbc/0x100 [zfs]
[ 3877.048418] dsl_dataset_block_kill+0x745/0xb50 [zfs]
Now, I don't really understand arc_freed() very well, but its comment has enough interesting words to suggest relevance. This seems likely to be some sort of ordering change: before, the free came through while some sort of "special" IO (prefetch or dedup, by the comment) was outstanding, so we did nothing, and then that replacement IO handled the free. Now the free can come after, so that special IO does the free (dereference?), and then we arrive in arc_freed() and it's already been done.
By the conditional, there's no L1 header or references (fine). Then in arc_hdr_destroy(), it does have an L2 header, so we go down that road and end up discovering the accounting error.
There are three failing tests above. The logs make it hard to see what, if anything, is triggering it. One time it's running the cache suite, another time mmp_hostid, which does use a cache device.
I don't know enough about the ARC and especially L2ARC to know what to do here. My high-level guesses are one or more of:
- arc_freed() needs a tighter "in use" check.

@robn L2ARC is quite a special beast. It is written solely by l2arc_write_buffers() and freed either by l2arc_evict() just before new writing, or, as shown in the backtrace, when the respective block is evicted from ARC as a result of being freed. L2ARC has no traditional spacemap and so does not use FREE ZIOs. To hit that assertion there must be some bug in the vdev_space_update(dev->l2ad_vdev, ...) calls. I have no clue how it can be related to this PR other than some possible timing changes, but looking at the code now I think there might be a bug in l2arc_write_buffers(). I think it should call vdev_space_update(dev->l2ad_vdev, ...) before zio_nowait(wzio); otherwise I might speculate a scenario where l2arc_write_done() is called in between, completing the L2ARC write, after which the block might be freed, triggering the assertion. Or alternatively, I don't remember right now how block frees are handled for blocks that are in the process of being written to L2ARC. I'm too tired and sleepy now, but it seems like they might be protected by the HDR_L2_WRITING() flag. I guess I could have opened the race there earlier this year with #16040.
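To make the accounting pairing explicit, a simplified sketch (not the real function bodies) of the two sides of the l2ad_vdev accounting:

```c
/*
 * Simplified sketch, not the real code: the L2ARC charges l2ad_vdev
 * when it writes a header out and credits it back when the header is
 * destroyed. The assertion above fires if a credit runs without (or
 * before) its matching charge.
 */
static void
l2arc_write_buffers_sketch(l2arc_dev_t *dev, int64_t asize)
{
	/* charge: the block now occupies asize bytes on the cache device */
	vdev_space_update(dev->l2ad_vdev, asize, 0, 0);
	/* ... build and issue the actual L2ARC write ... */
}

static void
arc_hdr_l2hdr_destroy_sketch(l2arc_dev_t *dev, int64_t asize)
{
	/* credit: the block no longer occupies space on the cache device */
	vdev_space_update(dev->l2ad_vdev, -asize, 0, 0);
}
```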
> I think it should call vdev_space_update(dev->l2ad_vdev, ...) before zio_nowait(wzio); otherwise I might speculate a scenario where l2arc_write_done() is called in between, completing the L2ARC write, after which the block might be freed, triggering the assertion.
@robn After a second look, the ordering between vdev_space_update() and zio_nowait(wzio); does not matter, since the latter has no done callback. The callback is set for pio, but zio_wait(pio) is not called until the end of the function, when we get out of the loop, so there should be no race. I'll continue looking.
@robn After a third look, I again think it is my fault. I think this should fix it: https://github.com/openzfs/zfs/pull/16743 .
https://github.com/openzfs/zfs/pull/16743 sure does look like it would explain this. @robn @amotin I've opened https://github.com/openzfs/zfs/pull/16744 with both patches applied for the CI to test this since that's where we've been able to reproduce it.
> It's a bit tedious, since we did that test to get into this path, and again moments before in zio_free_sync(). It's not super expensive, but there's still a read lock and a couple of AVL lookups. I have no clear sense of what that means for contention elsewhere, nor for scan free performance.
Yeah, it is not exactly free, despite checking only the number of entries in the AVL tree without traversal, and I think it may give too many false positives; I don't like it. But the worst I saw was lock contention, which should be reduced after https://github.com/openzfs/zfs/pull/16740 , though it is unlikely to go away completely. I don't think we can avoid the second check before and after queuing, since the list is still used for snapshot deletion and we have no way to pass the knowledge there, but if we checked it just above in the stack...
@behlendorf thanks for setting that up; that was my plan too, just couldn't get to it until today. Much appreciated!
I'll wait for your review on #16743 and here. Let me know if you want me to rebase this on top of #16743 if/when you merge it.
Merged. No rebase needed. We'll want to keep our eye on subsequent CI runs to make sure that accounting bug is really resolved.
Thanks all!
@behlendorf yep, will do. Ping me if you see anything!
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Closes: #16697 Closes: #16708 Closes: #6783
Description
dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline.

Some blocks, however, require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return.

If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of times in a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so just one 2T object requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised.
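Spelling out the arithmetic behind those numbers:

$$
\frac{2\,\mathrm{TiB}}{128\,\mathrm{KiB}} = \frac{2^{41}}{2^{17}} = 2^{24} \approx 16.8\mathrm{M\ blocks},
\qquad 2^{24} \times 1280\,\mathrm{B} = 20\,\mathrm{GiB\ of\ zio\_cache}.
$$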
This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks.

For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal.
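As a rough sketch of the shape of that check (the helper names here are illustrative, not the literal change):

```c
/*
 * Sketch only; helper names are stand-ins rather than the literal
 * diff. If freeing this block would put a FREE zio on the pipeline
 * (gang, dedup or cloned), defer it to the pool's freelist so the
 * async destroy machinery trickles it out; otherwise free it on the
 * spot as before.
 */
static boolean_t
free_would_induce_io(spa_t *spa, const blkptr_t *bp)
{
	return (BP_IS_GANG(bp) || BP_GET_DEDUP(bp) ||
	    brt_maybe_exists(spa, bp));
}

/* At the call site, roughly (put_on_pool_freelist() is a made-up name): */
static void
dsl_free_sketch(dsl_pool_t *dp, spa_t *spa, uint64_t txg,
    const blkptr_t *bp, dmu_tx_t *tx)
{
	if (free_would_induce_io(spa, bp)) {
		/* same place destroyed datasets' blocks go */
		put_on_pool_freelist(dp, bp, tx);
	} else {
		zio_free(spa, txg, bp);
	}
}
```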
Implementation notes

I thought about a separate task queue or throttle for this, but using the pool deadlist and the async destroy machinery was much nicer. It's already there, and quiet most of the time. We have talked in the past about using it for unlink() for performance, and this does actually help with that: deleting enormous test files returned much faster, as you'd expect. However, performance is not the purpose of this PR, which is why I've kept it narrow. I'd like to think about it more before we do it for everything, not least how it interacts with a real dataset destroy, and also scrubs etc.

Still, it's cool that you can watch this with zpool wait -t free.

Discussion
There are a few different problems that lead to this happening. This patch isn't really addressing any of them, but rather detecting a known cause and routing around it. I'll write down all the trouble spots that lead to this so we at least have them.
zio_t is too big

1280 bytes is a lot for something we create by the thousands. I have a heap of ideas for how to reduce it and/or change the overheads of the IO pipeline and queues as a whole. They're all hugely invasive and not something I want to do in a hurry.
Bonus: there are many similarly-shaped bugs in the issue tracker that I believe boil down to the same kind of thing: it's easy to load up "too much" IO on the IO taskqs, which then sits there, pinning memory, until the workers at the other end can clear it, which can be an age if the thing at the end is slow. Unfortunately it's not all just zio_t size.
size."free" is not really an IO
I suppose it was probably useful to think about it that way because allocations are requested from within the IO pipeline. Most of the time though it's just metaslab_free(). Obviously the code has to go somewhere, so this is mostly felt in the API: zio_free() doesn't even take a pio, and zio_free_sync() does but might not issue IO, and also has other rules like "same txg". So I think the idea that free is "magic IO" is too magic, and its unpredictability is what gets us here.

I did try to rework just these functions to be "free this sometime, don't care how" or "free it now or error" and then use the error to decide what to do. It didn't work out and I abandoned it.
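For the record, the split I tried looked roughly like this (hypothetical prototypes, purely to illustrate the idea; neither exists in the tree):

```c
/* Free this block eventually; the caller doesn't care how or when. */
void spa_free_deferred(spa_t *spa, uint64_t txg, const blkptr_t *bp);

/*
 * Free this block immediately, without issuing IO. Returns an error
 * (e.g. EAGAIN) if that isn't possible, so the caller can choose a
 * fallback.
 */
int spa_free_now(spa_t *spa, uint64_t txg, const blkptr_t *bp);
```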
"frees that cause IO" isn't obvious
The old gang and dedup tests are fine, for now. brt_maybe_exists() is a little troubling, as it's a feature in use everywhere now, and it might mean that a lot more frees are going onto the pipeline than used to.

But I also suspect it's not the full story. #16037 shows similar behaviour but doesn't appear to involve gang, dedup or cloned blocks. I couldn't find anything else that obviously might generate a ton of IO from a free, but there's still lots of ZFS I'm not familiar with, and it also doesn't cover the future.
dmu_tx assignment isn't granular enough
Deleting a file gets a single call to dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END). In that respect it's doing nothing wrong, but that's clearly too much all at once, and is ultimately what led to this, as the write throttle can only push back on single holds.
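To illustrate the granularity point, compare the current whole-object hold with a hypothetical chunked variant (the loop below is not something this PR does; obj_size, chunk and os are assumed to come from the caller):

```c
/* Today: one hold covering the whole object, accepted or rejected as a unit. */
dmu_tx_hold_free(tx, obj, 0, DMU_OBJECT_END);

/*
 * Hypothetical alternative, purely for illustration: break the range
 * into chunks, each held by its own tx, so the throttle has something
 * it can actually push back on.
 */
for (uint64_t off = 0; off < obj_size; off += chunk) {
	dmu_tx_t *ctx = dmu_tx_create(os);
	dmu_tx_hold_free(ctx, obj, off, MIN(chunk, obj_size - off));
	/* ... dmu_tx_assign(), free the range, dmu_tx_commit() ... */
}
```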
write throttle doesn't know about resource overheads

Similar vibes to #15511. Most things don't account for CPU/memory/IO overheads, and so can't push back when those are low. There's no single answer of course, but it seems we're going to run into problems like this more and more as everything gets bigger and bigger.
How Has This Been Tested?
The goal is to create a single file with millions of L0 blocks, and then drop them all at once. This worked well enough for me. Needs a spare 2T+ disk.
Watching the zio_cache field through the process is instructive. Without this patch, even if you haven't gone large enough to trigger an OOM, you will see the count blow out a long way. You may see substantial pauses while reclaim kicks in.

With this patch, it runs at a much more leisurely rate. Still gets big, but never ridiculous.
A ZTS run is going locally, though I'm not expecting to see anything much. At least, it should make sure I didn't screw up the accounting.
Update: @serjponomarev has tested this, see https://github.com/openzfs/zfs/issues/16708#issuecomment-2456590569. Seems plausible?