openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Destroying a dataset with many snapshots takes a long time and zpool I/O stalls #14276

Open rkojedzinszky opened 1 year ago

rkojedzinszky commented 1 year ago

System information

Type Version/Name
Distribution Name TrueNAS
Distribution Version 13.0-U3.1
Kernel Version 13.1-RELEASE-p5
Architecture amd64
OpenZFS Version zfs-2.1.7-1 zfs-kmod-v2022120700-zfs_e46630a9b

Describe the problem you're observing

Issuing zfs destroy -r <dataset> on a dataset with many snapshots (~400) takes a long time and stalls other I/O to the pool.

Describe how to reproduce the problem

Use the attached script to create a dataset with data and 10k snapshots. Then destroy the dataset recursively while monitoring pool I/O, and issue write operations during the deletion.

I can delete snapshots one by one without any side effects; however, deleting the dataset causes pool stalls.

prepare-snapshots.py.txt
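
Roughly, the preparation looks like the following sketch (the pool/dataset names, mountpoint, block size and counts below are placeholders, not the exact values from the attached script):

#!/usr/bin/env python3
# Hypothetical sketch of the kind of preparation the attached script does;
# pool/dataset names, mountpoint, sizes and snapshot count are placeholders.
import os
import random
import subprocess

POOL = "tank"                    # placeholder pool name
DATASET = f"{POOL}/snaptest"     # placeholder dataset name
MOUNTPOINT = f"/mnt/{DATASET}"   # TrueNAS-style mountpoint, assumed
FILE_SIZE = 6 * 1024**3          # e.g. twice the RAM of a 3 GB VM
SNAPSHOT_COUNT = 10000
BLOCK = 128 * 1024

def zfs(*args):
    subprocess.run(["zfs", *args], check=True)

zfs("create", DATASET)
data_file = os.path.join(MOUNTPOINT, "data")

# Write the file once so later overwrites leave blocks pinned by snapshots.
with open(data_file, "wb") as f:
    for _ in range(FILE_SIZE // BLOCK):
        f.write(os.urandom(BLOCK))

# Overwrite a random block, then snapshot, so each snapshot pins unique data.
with open(data_file, "r+b") as f:
    for i in range(SNAPSHOT_COUNT):
        f.seek(random.randrange(FILE_SIZE // BLOCK) * BLOCK)
        f.write(os.urandom(BLOCK))
        f.flush()
        os.fsync(f.fileno())
        zfs("snapshot", f"{DATASET}@snap{i:05d}")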

rkojedzinszky commented 1 year ago

Metrics during creating snapshots and during zfs destroy are here: https://grafana.k8s.kojedz.in/dashboard/snapshot/DD1AanWZ56GNliRJc8mpTy2x8hdcBUiC

ryao commented 1 year ago

@freqlabs @ixhamza This might interest you since it was reported from TrueNAS.

ghost commented 1 year ago

Was the previous version fine?

amotin commented 1 year ago

@rkojedzinszky If you attack on multiple fronts, you could at least link all the different tickets, so people would not duplicate efforts. This is the TrueNAS ticket: https://ixsystems.atlassian.net/browse/NAS-119329 .

rkojedzinszky commented 1 year ago

Was the previous version fine?

We have only just started using this kind of workload, so I don't know whether it worked before or not.

rkojedzinszky commented 1 year ago

Was the previous version fine?

I found that the behavior is different with FreeNAS 11.3: the I/O stall lasts only a few seconds rather than minutes. Please see the discussion in the iXsystems ticket https://ixsystems.atlassian.net/browse/NAS-119329.

rkojedzinszky commented 1 year ago

Was the previous version fine?

With a previous TrueNAS version it works fine; to be exact, FreeNAS 11.3 works without issues.

So, again, it is really easy to reproduce. Take a VM with only 3 GB of RAM. Create an empty pool, then run the attached script to prepare a dataset. The script creates a dataset containing a file twice the size of RAM and approximately 10k snapshots. Then just issue zfs destroy -r on the prepared dataset. Within a few minutes, writes to the pool stall, and my VM with 3 GB of RAM has now become unreachable. With TrueNAS 11.3, however, the destroy operation completed fine and writes were halted for only 45 seconds. I don't know which ZFS version was there, but it definitely works, and since they switched to OpenZFS it behaves differently.

rkojedzinszky commented 1 year ago

I think of this operation as similar to rm -rf /directory: the total operation time should depend on the directory's size (entries, file sizes, extents, etc.), but it should not block writes to the filesystem. I expect the same from ZFS: even if a dataset has many snapshots and I destroy it, it should not block writes to the pool. Am I wrong about this?

amotin commented 1 year ago

IIRC ZFS throttles free operations the same way it throttles write operations, since both happen in the background. I would guess the interaction between the two may not be trivial. I have vague memories of somebody touching that area not so long ago...
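
For anyone who wants to poke at the free throttle, here is a minimal sketch for inspecting the module parameters I believe are involved on Linux (parameter availability varies by OpenZFS version, so treat this purely as a probe):

#!/usr/bin/env python3
# Sketch: print the OpenZFS module parameters that throttle background frees
# on Linux. Parameter availability varies by version; treat this as a probe.
from pathlib import Path

PARAMS = [
    "zfs_free_min_time_ms",        # minimum ms per txg spent freeing blocks
    "zfs_async_block_max_blocks",  # cap on blocks freed per txg
]

base = Path("/sys/module/zfs/parameters")
for name in PARAMS:
    p = base / name
    value = p.read_text().strip() if p.exists() else "<not present>"
    print(f"{name} = {value}")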

rkojedzinszky commented 1 year ago

I've prepared a more compact script to demonstrate the issue. A thread is started to monitor pool performance: it issues writes to a file in a dataset and prints statistics about those operations. While this thread is running, a second thread prepares a child dataset with data modifications and snapshots: a loop writes 100 blocks to a file at random positions, then takes a snapshot. This time even ~700 snapshots were enough to demonstrate the pool stall, although it was shorter. When the second thread is done, a simple zfs destroy -r is issued against the child dataset while the monitoring thread keeps running and printing statistics.

I've run this script on a box with 8 GB of RAM. With these settings I observed writes being blocked for around 20 seconds during zfs destroy. Note that the amount of data created is not much (a few gigabytes), and the number of snapshots matches our use case (~700).

If running on a box with more RAM, you may need to tune the script: increase IOS_PER_SNAPSHOT (to put more data into each snapshot) or increase SNAPSHOT_COUNT.

zpool-snapdel-stall-demo.py.txt
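
Its structure is roughly the following (a sketch only; the pool/dataset names, mountpoints and sizes are placeholders, the attached script is the actual test case):

#!/usr/bin/env python3
# Rough sketch of the structure of zpool-snapdel-stall-demo.py; dataset names
# and constants are placeholders, the attached script is the real test case.
import os, random, subprocess, threading, time

POOL = "tank"                 # assumed pool name
PARENT = f"{POOL}/stalltest"  # assumed parent dataset
CHILD = f"{PARENT}/child"     # assumed child dataset to be destroyed
BLOCK = 128 * 1024
IOS_PER_SNAPSHOT = 100        # blocks overwritten before each snapshot
SNAPSHOT_COUNT = 700
FILE_BLOCKS = 8192            # ~1 GiB working file

def zfs(*args):
    subprocess.run(["zfs", *args], check=True)

def monitor(path, stop):
    # Keep writing one block and report how long each write+fsync takes.
    with open(path, "wb") as f:
        while not stop.is_set():
            t0 = time.monotonic()
            f.write(os.urandom(BLOCK))
            f.flush()
            os.fsync(f.fileno())
            f.seek(0)
            print(f"monitor write latency: {time.monotonic() - t0:.3f}s")
            time.sleep(1)

zfs("create", "-p", CHILD)
stop = threading.Event()
t = threading.Thread(target=monitor,
                     args=(f"/mnt/{PARENT}/probe", stop), daemon=True)
t.start()

# Build up snapshots on the child dataset, each pinning some overwritten data.
data = f"/mnt/{CHILD}/data"
with open(data, "wb") as f:
    f.truncate(FILE_BLOCKS * BLOCK)
with open(data, "r+b") as f:
    for i in range(SNAPSHOT_COUNT):
        for _ in range(IOS_PER_SNAPSHOT):
            f.seek(random.randrange(FILE_BLOCKS) * BLOCK)
            f.write(os.urandom(BLOCK))
        f.flush()
        os.fsync(f.fileno())
        zfs("snapshot", f"{CHILD}@snap{i:05d}")

# The interesting part: destroy the child while the monitor keeps printing.
zfs("destroy", "-r", CHILD)
stop.set()
t.join()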

marcinkuk commented 1 year ago

Hi,

Destroying thousands of snapshots in parallel finishes in seconds.

Server load is huge, but everything keeps working well with no decrease in performance. Maybe parallel destruction could be applied inside the ZFS code.

I use something like this: zfs list -H -o name -t snap -r rpool | while read line; do echo zfs destroy $line \&; done >> destroy.sh; ./destroy.sh

amotin commented 1 year ago

@marcinkuk I doubt that is faster than a single command given multiple snapshot names. Either way, deletion is processed by a single TXG commit thread. I am currently working on this issue, and the problem as I see it is caused by cache misses during deadlist handling. On an HDD pool where destroying many snapshots frees data referenced by random other snapshots, that can require a significant number of random reads, which take time. I am adding prescient prefetch to accelerate that.
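
To illustrate the "single command with multiple names" point, snapshots of one dataset can be batched into a single zfs destroy using the comma-separated snapshot syntax instead of one process per snapshot (a sketch; the dataset name and batch size are arbitrary):

#!/usr/bin/env python3
# Sketch: batch snapshot deletion into single `zfs destroy` invocations using
# the comma-separated snapshot syntax (dataset@snap1,snap2,...), instead of
# spawning one process per snapshot. Dataset name and batch size are examples.
import subprocess

DATASET = "rpool/data"   # example dataset
BATCH = 100              # keep the command line a reasonable length

out = subprocess.run(
    ["zfs", "list", "-H", "-o", "name", "-t", "snapshot", "-r", DATASET],
    check=True, capture_output=True, text=True).stdout.split()

# Keep only snapshots of this dataset itself and strip the dataset prefix.
snaps = [name.split("@", 1)[1] for name in out if name.startswith(DATASET + "@")]

for i in range(0, len(snaps), BATCH):
    batch = snaps[i:i + BATCH]
    subprocess.run(["zfs", "destroy", f"{DATASET}@{','.join(batch)}"], check=True)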

rkojedzinszky commented 1 year ago

I would like to say thanks for all the work on this issue. However, I must still point out that with FreeNAS 11.3 the delete operation caused a 45-second write outage on the pool, while with the fixes here it still causes a 6-minute outage. My test VM runs on my desktop with an SSD, so no real HDD is involved. I also suspect my host's ZFS has everything cached already, so disk I/O is not involved at all; disk utilization in the VM confirms this. While the destroy process is running, disk utilization is around 10%.

How could I help further to bring back the destroy behavior of FreeNAS 11.3?

rkojedzinszky commented 1 year ago

Can't we make zfs destroy -r behave like rm -rf? An rm -rf can be interrupted, leaving partial results, and it usually does not block other I/O calls to the filesystem.

rkojedzinszky commented 1 year ago

I observe high CPU utilization during the delete. The bottleneck is the CPU, not I/O.

During a zfs destroy -r <dataset-with-many-snapshots>, perf top shows:

Samples: 2M of event 'cycles', 4000 Hz, Event count (approx.): 172669845701 lost: 0/0 drop: 0/0
Overhead  Shared Object             Symbol
  20.90%  [kernel]                  [k] _raw_spin_lock
  18.45%  [kernel]                  [k] __raw_callee_save___pv_queued_spin_unlock
   6.04%  [kernel]                  [k] syscall_exit_to_user_mode
   5.37%  [kernel]                  [k] igrab
   4.90%  [kernel]                  [k] _atomic_dec_and_lock
   4.78%  [kernel]                  [k] zfs_prune
   3.48%  [kernel]                  [k] mutex_unlock
   3.09%  [kernel]                  [k] entry_SYSCALL_64
   3.05%  [kernel]                  [k] syscall_return_via_sysret
   2.99%  [kernel]                  [k] arc_evict_state
   2.38%  [kernel]                  [k] mutex_lock
   2.12%  [kernel]                  [k] d_prune_aliases
   1.76%  [kernel]                  [k] __list_del_entry_valid
   1.46%  [kernel]                  [k] mutex_trylock
   1.34%  [kernel]                  [k] __list_add_valid
   0.99%  [kernel]                  [k] aggsum_add
   0.93%  [kernel]                  [k] hdr_full_cons

The top process threads are:

    325 root      39  19       0      0      0 R  92.4   0.0   7:33.99 arc_evict                                                                                                
    324 root      20   0       0      0      0 R  87.7   0.0   7:07.45 arc_prune                                                                                                
    657 root      20   0       0      0      0 D   5.0   0.0   2:46.88 txg_sync                                                                                                 
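
To correlate this with ARC activity while the destroy runs, I poll the ARC kstats with something like the sketch below (Linux only; counter names differ between OpenZFS versions, so missing ones are simply skipped):

#!/usr/bin/env python3
# Sketch: poll /proc/spl/kstat/zfs/arcstats (Linux) and print deltas for a few
# counters related to eviction/pruning. Counter names vary across versions;
# any that are not present are skipped.
import time

WATCH = ["arc_prune", "evict_skip", "size", "arc_meta_used"]

def read_arcstats():
    stats = {}
    with open("/proc/spl/kstat/zfs/arcstats") as f:
        for line in f.readlines()[2:]:        # skip the two kstat header lines
            name, _type, value = line.split()
            stats[name] = int(value)
    return stats

prev = read_arcstats()
while True:
    time.sleep(5)
    cur = read_arcstats()
    print({k: cur[k] - prev[k] for k in WATCH if k in cur})
    prev = cur
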
ryao commented 1 year ago

#14603 might help.

stale[bot] commented 5 months ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.