openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Optimized Large File Deletion to Prevent OOM #16708

Open serjponomarev opened 2 days ago

serjponomarev commented 2 days ago

Describe the feature you would like to see added to OpenZFS

I propose adding an iterative approach for deleting large files in ZFS pools with deduplication enabled. Instead of calling unlink to remove the entire file at once, we can implement a mechanism that reduces the file size from the end, freeing blocks incrementally.

How will this feature improve OpenZFS?

This feature addresses the out-of-memory (OOM) errors that occur when deleting large files. Currently, when unlink is called, ZFS loads all deduplication table (DDT) entries related to the file into memory, which can exhaust RAM, especially on systems with limited memory. By implementing an iterative file-reduction process, we can significantly reduce memory consumption and improve stability.
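As background, the memory pressure scales with the number of DDT entries the file references, and the table's size can be inspected ahead of time. A minimal sketch, assuming a pool named zpool as in the experiment below:

    # Print DDT statistics, including entry counts and in-core sizes:
    zdb -DD zpool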

Additional context

The proposed algorithm includes the following steps:

  1. Iterative File Truncation: Implement internal logic to incrementally truncate the file from the end, allowing ZFS to load only the necessary metadata associated with the current data size, thus minimizing memory usage.
  2. Final unlink Call: Once the file is completely truncated, perform a final unlink to remove any remaining metadata.

Benefits:

  * Memory usage during deletion stays bounded, regardless of file size
  * Prevents OOM events when removing large files from deduplicated pools

Experimental Evidence

The following experiment demonstrates the basis for this proposed improvement:

Environment:

Procedure:

  1. Populate the pool with a file containing random data to fully utilize the DDT:
    
    fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --fallocate=0 --filename=/zpool/test.io --filesize=1T
  2. Attempt to delete the file using rm /zpool/test.io, resulting in an OOM event.
  3. Reboot and delete the file iteratively, reducing its size by 1 GB in each iteration before final deletion:

    
    # Shrink the file 1 GiB at a time, from its current size down to zero,
    # then remove the now-empty file.
    filename=/zpool/test.io
    size_gb=$(du -BG "$filename" | cut -f1 | tr -d 'G')
    for i in $(seq "$size_gb" -1 0); do
        truncate -s "${i}G" "$filename"
        echo "truncated to ${i} GiB"
    done
    rm -v "$filename"

Observation: Memory consumption can be monitored with watch arc_summary throughout the process.
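A concrete way to watch this, as a sketch (the grep pattern is illustrative and depends on your arc_summary output format):

    # In another terminal, refresh ARC size and system memory every 5 s
    # while the truncate loop runs:
    watch -n 5 "arc_summary | grep -i -A 3 'ARC size'; free -h"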

robn commented 2 days ago

I assume you're talking about (at least): #6783 #16037 #16697.

If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).

This specific method can't be done, as unlink() has to appear atomic at the filesystem level - it's all or nothing. That said, I suspect the technique of pacing the frees rather than dumping them all at once is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.

gmelikov commented 2 days ago

Maybe it's a little bit off-topic, but ZFS frees blocks, not files (the DDT is per-block too), so you can truncate part of your file (iterating over the whole file), and only the unused blocks will be freed. Maybe it's a workaround, yes.

Hope I didn't miss something.

serjponomarev commented 2 days ago

> Maybe it's a little bit off-topic, but ZFS frees blocks, not files (the DDT is per-block too), so you can truncate part of your file (iterating over the whole file), and only the unused blocks will be freed. Maybe it's a workaround, yes.
>
> Hope I didn't miss something.

Yes, you’re absolutely correct.

My approach to finding a solution to this issue went as follows:

  1. I tried various combinations of ZFS module parameters, but this didn’t resolve the problem.
  2. I examined ZFS’s data structures at the block level and confirmed what you described — ZFS indeed frees blocks, not entire files.
  3. To test further, I divided a 1 TB file into 1024 files of 1 GB each, then deleted them sequentially while monitoring memory usage with arc_summary; in this case there was no excessive memory consumption (a sketch of this test appears below).
  4. I concluded that what I needed was a way to delete a large 1 TB file as if it were 1024 separate 1 GB files.
  5. I realized that truncate might help achieve this, tested it, and it worked, providing the same memory-efficient behavior as deleting 1024 smaller files.

That’s why I decided to share this approach with the community — to discuss possible ways to implement such a mechanism within the ZFS codebase.
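A minimal reconstruction of the step-3 test, assuming illustrative paths and a pool mounted at /zpool (split rewrites the data, which on a deduplicated pool should largely dedup against the source):

    # Split the 1 TB file into 1024 x 1 GiB pieces (3-char suffixes cover
    # more than 676 output files), then delete them one at a time:
    split -a 3 -b 1G /zpool/test.io /zpool/part.
    rm -v /zpool/test.io
    for f in /zpool/part.*; do
        rm -v "$f"
    done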

serjponomarev commented 2 days ago

> I assume you're talking about (at least): #6783 #16037 #16697.
>
> If so, the problem isn't dedup as such, but a side effect of how the free pipeline is modified for some kinds of blocks, including dedup blocks, but not only dedup blocks (see #16037 for a non-dedup example).
>
> This specific method can't be done, as unlink() has to appear atomic at the filesystem level - it's all or nothing. That said, I suspect the technique of pacing the frees rather than dumping them all at once is at least part of the solution under the hood, but there are several complications and I haven't thought it all through yet.

In searching for a solution to this issue, I reviewed all the issues you referenced. I understand that the problem isn’t specifically limited to deduplication; it’s broader in scope. However, in the case of deduplication, this problem is 100% reproducible and testable.

That’s why I chose a more general title for this issue.

robn commented 2 days ago

Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.

It also wouldn't solve the problem properly anyway, because the real problem is the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files in the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.

It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know whether, if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), it would do the same thing. If it didn't, I expect that would be more to do with the locking differences in zvols compared to filesystems than with the underlying block structure.

So yeah, if controlling this from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.

amotin commented 2 days ago

I haven't looked there lately and may misremember, but IIRC we've had a mechanism to throttle deletes and split them between transaction groups. I am not sure it would help with a single huge file, but for many smaller ones it would be the proper solution.
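For reference, a few existing module parameters touch on this pacing; a quick way to inspect them (parameter names are from the zfs(4) man page; availability and defaults vary by version):

    # Largest object deleted synchronously, per-txg async free limit, and
    # minimum time per txg spent freeing:
    grep -H . /sys/module/zfs/parameters/zfs_delete_blocks \
              /sys/module/zfs/parameters/zfs_async_block_max_blocks \
              /sys/module/zfs/parameters/zfs_free_min_time_ms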

serjponomarev commented 2 days ago

> Yep, and you can do tricks with ftruncate in userspace, because you understand what the shrinking file means. It's not suitable as an alternate implementation of unlink() though, which by definition has to appear to make the file disappear entirely.
>
> It also wouldn't solve the problem properly anyway, because the real problem is the sheer volume of blocks we're trying to destroy in one go, not that they're from the same file. If you had destroyed your 1024 1GB files in the same transaction (sometimes tricky to arrange), it would have blown up in the same way. Similarly if you had done it with 1M 1MB files.
>
> It's not even theoretically limited to filesystems; any object could do it. I'd be curious to know whether, if one created a 1T zvol, filled it with random data, and then zeroed it in one go (maybe with blkdiscard), it would do the same thing. If it didn't, I expect that would be more to do with the locking differences in zvols compared to filesystems than with the underlying block structure.
>
> So yeah, if controlling this from userspace with ftruncate is something you can do, then you have a good workaround, but that's all.

@robn Yes, I understand the aspects you’ve mentioned. My intention was simply to contribute to solving a broader issue, as I can indeed address my specific problem from userspace.

I currently have access to the same host described in my experiment, but with smaller NVMe drives. The maximum size for the ZFS pool I can create is approximately 1.09 TB, which would allow me to create a zvol of around 800-900 GB, assuming the pool is filled to 80-90%.

I would be happy to assist in gathering information to tackle this broader problem. Please provide the parameters for the zvol experiment, including the zvol size and block size. I will fill it with random data and then perform a blkdiscard.

Also, please clarify what specific data you are looking to obtain from this experiment. If I understand correctly, you aim to test the hypothesis regarding the sequential discarding of blocks and its impact on memory behavior.

The blkdiscard operation should mimic the behavior observed in my truncate experiment, but at the block level rather than the file level, correct?

robn commented 2 days ago

Possibly you mean the zfs_unlinked_drain stuff. Not sure; I don't fully understand it myself. Maybe something further down though; there is a lot of back and forth in the file delete path. Whatever is there didn't save #16037 though, which claims "lots of small files", so it evidently isn't enough.

For big objects though, it just ends up adding the entire object length to dn_free_ranges, and then dnode_sync -> dnode_sync_free_range and beyond just blasts out a mass of frees. ("beyond" is a long way, I have notes which I'll write up before long).

Anyway, I think I have a plan now: repurpose async_destroy. I'm working on a simple prototype and test case; hopefully I'll have something to show in a day or two.

amotin commented 2 days ago

@robn I am not sure exactly what I mean, but you may see that dmu_tx_count_free() accounts not only for the blocks that will be modified in the process of deletion, but also, via txh_memory_tohold, for how much memory it will require to hold the indirects. It obviously does not account for DDT, BRT, ZIO and other stuff, though. And I have a feeling there was something else, I just don't remember what.

robn commented 2 days ago

Ahh yeah, that might be it. And I understand why it's not working here.

In zio_free_sync, any free that will create IO (gang, dedup, maybe BRT) goes through zio_create and is put on the pipeline. A zio_t is 1280 bytes, so if you delete a 2T file of 128K dedup blocks, that creates 16M zio_t, i.e. ~20G just off the zio_cache slab. (This is exactly the scenario in #16697.) And of course nothing in the DMU is able to anticipate that.
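A quick sanity check of those numbers (plain bash arithmetic):

    # 2 TiB of 128 KiB blocks -> number of zio_t and their total size:
    blocks=$(( 2 * 2**40 / (128 * 2**10) ))
    echo "$blocks zio_t"                      # 16777216 (~16M)
    echo "$(( blocks * 1280 / 2**30 )) GiB"   # 20 GiB of zio_cache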

I'm currently looking at async_destroy as a way of reusing an existing facility (with a nice side effect of background deletes in general, so very fast unlink() calls).

In the longer term, the whole zio pipeline needs a lot of work. Reducing zio_t size at least, maybe frees shouldn't really be done there (since they're not really IO), but also maybe stuff about generally not allocating space until we need it. I had a similar issue in a customer job a few weeks ago where I loaded up a ton of read IOs on the queue, and OOMed the system because all the ABDs needed to be allocated up front, even though they weren't needed until the IO got to vdev_io_start. There's loads to be done, but I definitely didn't want to just start down this road for this mass-free issue, because it needs real thought and input from more people than just me.

serjponomarev commented 1 day ago

@robn I did some experiments with zvol.

  pool: zpool
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        zpool       ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0

errors: No known data errors

zfs create -s -b 16K -V 900G zpool/zvol

NAME         USED  AVAIL     REFER  MOUNTPOINT
zpool        383M  1.06T       96K  /zpool
zpool/zvol    56K  1.06T       56K  -

Filling the zvol:

    fio --name=test --numjobs=1 --iodepth=8 --bs=1M --rw=write --ioengine=libaio --direct=1 --group_reporting=1 --filename=/dev/zd0

By default, when no step is specified, blkdiscard discards the entire device in one go.

Without deduplication:

  1. blkdiscard -v /dev/zd0 - works, memory consumption is almost unchanged.

With deduplication:

  1. blkdiscard -v /dev/zd0 - OOM
  2. blkdiscard -v --step 1G /dev/zd0 - OOM
  3. blkdiscard -v --step 1M /dev/zd0 - works
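One way to confirm it is the dedup table being churned during the discard is to watch its entry count from another terminal; a sketch, using the pool name above:

    # zpool status -D prints the DDT histogram; total entries should fall
    # as blocks are freed:
    watch -n 5 'zpool status -D zpool'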