Open serjponomarev opened 2 days ago
I assume you're talking about (at least): #6783 #16037 #16697.
If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).
This specific method can't be done, as unlink() has to appear atomic at the filesystem level - it's all or nothing. That said, the technique of pacing the frees rather than dumping them all at once I suspect is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.
Maybe it's a little bit off-topic, but zfs frees blocks, not files (ddt is per-block too), so you can truncate part of your file (plus iterate over whole file), and only unused blocks would be freed. Maybe it's a workaround, yes.
Hope I didn't miss something.
Yes, you’re absolutely correct.
My approach to finding a solution to this issue went as follows:
arc_summary. In this case, there was no excessive memory consumption.
truncate might help achieve this, tested it, and it worked, providing the same memory-efficient behavior as deleting 1024 smaller files.
That’s why I decided to share this approach with the community, to discuss possible ways to implement such a mechanism within the ZFS codebase.
In searching for a solution to this issue, I reviewed all the issues you referenced. I understand that the problem isn’t specifically limited to deduplication; it’s broader in scope. However, in the case of deduplication, this problem is 100% reproducible and testable.
That’s why I chose a more general title for this issue.
Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.
It also wouldn't solve the problem properly anyway, because the real problem is in the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files on the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.
It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), would it do the same thing? If it didn't, I expect it would be more to do with the locking differences in zvols compared to filesystems, not the underlying block structure.
So yeah, if controlling this from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.
I haven't looked there lately and may misremember, but IIRC we've had a mechanism to throttle deletes by splitting them between transaction groups. I am not sure it would help with a single huge file, but for many smaller ones it would be the proper solution.
@robn Yes, I understand the aspects you’ve mentioned. My intention was simply to contribute to solving a broader issue, as I can indeed address my specific problem from userspace.
I currently have access to the same host described in my experiment, but with smaller NVMe drives. The maximum size for the ZFS pool I can create is approximately 1.09 TB, which would allow me to create a zvol of around 800-900 GB, assuming the pool is filled to 80-90%.
I would be happy to assist in gathering information to tackle this broader problem. Please provide the parameters for the zvol experiment, including the zvol size and block size. I will fill it with random data and then perform a blkdiscard.
Also, please clarify what specific data you are looking to obtain from this experiment. If I understand correctly, you aim to test the hypothesis regarding the sequential discarding of blocks and its impact on memory behavior.
The blkdiscard operation should mimic the behavior observed in my truncate experiment, but at the block level rather than the file level, correct?
Possibly you mean the zfs_unlinked_drain stuff. Not sure; I don't fully understand it myself. Maybe something further down though, there is a lot of back and forth in the file delete path. Whatever is there didn't save #16037 though, which claims "lots of small files" so.
For big objects though, it just ends up adding the entire object length to dn_free_ranges, and then dnode_sync -> dnode_sync_free_range and beyond just blasts out a mass of frees. ("beyond" is a long way, I have notes which I'll write up before long).
Anyway, I think I have a plan now: repurpose async_destroy. I'm working on a simple prototype and test case now, hopefully something to show in a day or two.
@robn I am not sure what exactly I mean, but you may see that dmu_tx_count_free() accounts not only for the blocks that will be modified in the process of deletion, but also, in the form of txh_memory_tohold, for how much memory will be required to hold the indirects. But it obviously does not account for DDT, BRT, ZIO and other stuff. But I have a feeling there was something else, I just don't remember what.
Ahh yeah, that might be it. And I understand why it's not working here.
In zio_free_sync, any free that will create IO (gang, dedup, maybe BRT) will zio_create and put it on the pipeline. A zio_t is 1280 bytes. So if you delete a 2T file of 128K dedup blocks, that'll create 16M zio_t, so ~20G just off the zio_cache slab. (This is exactly the scenario in #16697). And of course nothing in the DMU is able to anticipate that.
I'm currently looking at async_destroy as a way of reusing an existing facility (with a nice side effect of background deletes in general, so very fast unlink() calls).
In the longer term, the whole zio pipeline needs a lot of work. Reducing zio_t size at least, maybe frees shouldn't really be done there (since they're not really IO), but also maybe stuff about generally not allocating space until we need it. I had a similar issue in a customer job a few weeks ago where I loaded up a ton of read IOs on the queue, and OOMed the system because all the ABDs needed to be allocated up front, even though they weren't needed until the IO got to vdev_io_start. There's loads to be done, but I definitely didn't want to just start down this road for this mass-free issue, because it needs real thought and input from more people than just me.
@robn I did some experiments with zvol.
  pool: zpool
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        zpool       ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
errors: No known data errors
zfs create -s -b 16K -V 900G zpool/zvol
NAME         USED  AVAIL  REFER  MOUNTPOINT
zpool        383M  1.06T    96K  /zpool
zpool/zvol    56K  1.06T    56K  -
Filling:
fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/zd0
blkdiscard by default, without specifying a step, discards all data.
Without deduplication:
blkdiscard -v /dev/zd0 - works, memory consumption is almost unchanged.
With deduplication:
blkdiscard -v /dev/zd0 - OOM
blkdiscard -v --step 1G /dev/zd0 - OOM
blkdiscard -v --step 1M /dev/zd0 - works
Describe the feature you would like to see added to OpenZFS
I propose adding an iterative approach for deleting large files in ZFS pools with deduplication enabled. Instead of calling unlink to remove the entire file at once, we can implement a mechanism that reduces the file size from the end, freeing blocks incrementally.
How will this feature improve OpenZFS?
This feature addresses the issue of Out-Of-Memory (OOM) errors that occur when deleting large files. Currently, when unlink is called, ZFS loads all entries from the Deduplication Data Table (DDT) related to the file into memory, which can lead to memory overload, especially on systems with limited RAM. By implementing an iterative file reduction process, we can significantly reduce memory consumption and improve stability.
Additional context
The proposed algorithm includes the following steps:
unlink Call: Once the file is completely truncated, perform a final unlink to remove any remaining metadata.
Benefits:
Experimental Evidence
The following experiment demonstrates the basis for this proposed improvement:
Environment:
recordsize=16K.
Procedure:
rm /zpool/test.io, resulting in an OOM event.
Reboot and delete the file iteratively, reducing its size by 1 GB in each iteration before final deletion:
Observation: Memory consumption can be monitored with watch arc_summary throughout the process.
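The iterative deletion from the procedure can be sketched in userspace roughly as follows. This is a minimal sketch, not the code used in the experiment: shrink_and_unlink is a hypothetical function name, the step size and pause are illustrative parameters, and it assumes GNU coreutils (truncate and stat -c). As discussed in the thread, it is a workaround only and does not give the all-or-nothing semantics of unlink().

```shell
# Sketch: shrink a file from the end in fixed steps, then unlink it.
# shrink_and_unlink and its arguments are hypothetical, not part of ZFS.
# Usage: shrink_and_unlink FILE [STEP_BYTES] [PAUSE_SECONDS]
shrink_and_unlink() {
    local file=$1
    local step=${2:-$((1024 * 1024 * 1024))}   # bytes removed per iteration (default 1 GiB)
    local pause=${3:-0}                        # optional wait between steps, in seconds
    local size

    size=$(stat -c %s -- "$file") || return 1  # current size in bytes (GNU stat)
    while [ "$size" -gt 0 ]; do
        if [ "$size" -gt "$step" ]; then
            size=$((size - step))
        else
            size=0
        fi
        # Only the blocks past the new end of file are freed by this call.
        truncate -s "$size" -- "$file" || return 1
        sleep "$pause"
    done
    rm -f -- "$file"                           # final unlink of the now-empty file
}
```

For the file in the experiment this would be invoked as shrink_and_unlink /zpool/test.io, optionally with a smaller step or a pause between steps if memory usage still climbs.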