rkojedzinszky opened this issue 1 year ago
Metrics captured while creating snapshots and during zfs destroy
are here: https://grafana.k8s.kojedz.in/dashboard/snapshot/DD1AanWZ56GNliRJc8mpTy2x8hdcBUiC
@freqlabs @ixhamza This might interest you since it was reported from TrueNAS.
Was the previous version fine?
@rkojedzinszky If you attack on multiple fronts, you could at least link all the different tickets, so people would not duplicate efforts. This is the TrueNAS ticket: https://ixsystems.atlassian.net/browse/NAS-119329 .
We have only just started using such workloads, so I don't know whether it worked before or not.
I found that with FreeNAS 11.3 the behavior is different. IO stall can be observed only for a few seconds, not for minutes. Please see discussion in ixsystems ticket https://ixsystems.atlassian.net/browse/NAS-119329.
A previous TrueNAS version works fine; to be exact, FreeNAS 11.3 works without issues.
So, again, it's really easy to reproduce. Take a VM with only 3G of RAM. Create an empty pool, then run the attached script to prepare a dataset. The script creates a dataset containing a file twice the size of RAM and takes approximately 10k snapshots. Then just issue zfs destroy -r on the prepared dataset. That will cause writes to the pool to stall within a few minutes. Also, my VM with 3G of RAM has now become unreachable. But with TrueNAS 11.3 the destroy operation went well, finished, and pool writes were halted for only 45 seconds. I don't know what ZFS version was in there, but it definitely works, and since they've switched to OpenZFS it behaves differently.
I treat this operation as similar to an rm -rf /directory. The total operation time will depend on directory size (entries, file sizes, extents, etc.), but it should not block writes to the filesystem. I expect the same from ZFS: even if I have many snapshots for a dataset and I destroy it, it should not block writes to the pool. Am I wrong about this?
IIRC ZFS throttles free operations the same way it throttles write operations, since both are background tasks. I would guess the interaction of the two may not be trivial. I have vague memories that somebody touched that area not so long ago...
I've prepared a more compact script to demonstrate the issue. A thread is started to monitor pool performance. It does this by issuing writes to a file in a dataset, and prints statistics about those operations.
While this thread is running, a separate one is started to prepare a child dataset with data modifications and snapshots. A loop writes 100 blocks to a file at random positions, then takes a snapshot. This time even ~700 snapshots were enough for me to demonstrate the pool stall, and the setup took less time. When this thread is done, a simple zfs destroy -r is issued against the child dataset. The monitoring thread keeps running, printing statistics.
I've run this script on a box with 8G of RAM. With these settings, I observed writes being blocked for around 20 seconds during zfs destroy. However, the amount of data created is not much (a few gigabytes), and the number of snapshots matches our use case (~700 snapshots).
Perhaps, if running on a box with more RAM, tune the script to use a higher IOS_PER_SNAPSHOT (e.g. to have more data in each snapshot) or increase SNAPSHOT_COUNT.
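As a rough illustration, the reproduction described above could be structured like the following. This is a hedged sketch, not the attached script: the dataset name, file path, file size, and block size are assumptions; only the IOS_PER_SNAPSHOT and SNAPSHOT_COUNT tunables come from the description.

```python
import os
import random
import subprocess
import time
import threading

CHILD = "tank/repro/child"   # assumed child dataset name
SNAPSHOT_COUNT = 700         # matches the ~700 snapshots mentioned above
IOS_PER_SNAPSHOT = 100       # 100 random-position writes per snapshot
BLOCK = 128 * 1024           # assumed write size

def write_random_blocks(path, file_size, count=IOS_PER_SNAPSHOT):
    """Overwrite `count` blocks at random in-bounds offsets, then fsync."""
    with open(path, "r+b") as f:
        for _ in range(count):
            f.seek(random.randrange(0, file_size - BLOCK))
            f.write(os.urandom(BLOCK))
        os.fsync(f.fileno())

def monitor(path, stop):
    """Time small synchronous appends; a pool stall shows up as a long gap."""
    while not stop.is_set():
        t0 = time.monotonic()
        with open(path, "ab") as f:
            f.write(b"x" * 4096)
            os.fsync(f.fileno())
        print(f"write latency: {time.monotonic() - t0:.3f}s")
        time.sleep(1)

def prepare_and_destroy(data_file, file_size):
    """Build up SNAPSHOT_COUNT snapshots, then destroy the child dataset."""
    for i in range(SNAPSHOT_COUNT):
        write_random_blocks(data_file, file_size)
        subprocess.run(["zfs", "snapshot", f"{CHILD}@s{i}"], check=True)
    subprocess.run(["zfs", "destroy", "-r", CHILD], check=True)
```

While `prepare_and_destroy` runs in one thread, `monitor` runs in another against a file in the parent dataset; the printed latencies make any write stall during the final destroy directly visible.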
Hi,
destroying thousands of snapshots in parallel completes in seconds.
Server load is huge, but the system still performs well, with no decreased performance. Maybe parallel destruction could be applied inside the ZFS code.
I use something like this:
zfs list -H -o name -t snap -r rpool | while read line; do echo "zfs destroy $line &"; done > destroy.sh; sh destroy.sh
@marcinkuk I doubt it can be faster than a single command with multiple names. Either way, deletion is processed in a single TXG commit thread. I am currently working on this issue, and the problem as I see it is caused by cache misses during deadlist handling. For an HDD pool with many snapshots freeing data from random other snapshots, it may require a significant amount of random reads, which take time. I am adding prescient prefetch to accelerate that.
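The "single command with multiple names" approach can be sketched as follows; OpenZFS accepts a comma-separated list of snapshot names after `dataset@` in a single `zfs destroy` invocation. The dataset name and batch size here are illustrative, and batching keeps each command line short.

```python
# Hedged sketch: build one `zfs destroy` command per batch of snapshots
# using the comma-separated snapshot syntax (pool/ds@s1,s2,...) instead
# of forking one process per snapshot.

def batch_destroy_cmds(dataset, snaps, per_cmd=100):
    """Yield `zfs destroy` argument lists covering `snaps` in batches."""
    for i in range(0, len(snaps), per_cmd):
        chunk = snaps[i:i + per_cmd]
        yield ["zfs", "destroy", dataset + "@" + ",".join(chunk)]

# Usage (not run here):
#   for cmd in batch_destroy_cmds("rpool/data", snapshot_names):
#       subprocess.run(cmd, check=True)
```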
I would like to say thanks for all the work on this issue. However, I must still highlight that with FreeNAS 11.3 the delete operation causes a 45-second write outage to the pool, while with the fixes here it still causes 6 minutes of outage. My test VM runs on my desktop with an SSD inside, so no real HDD is involved. Also, I suspect my host's ZFS has already cached everything, so disk I/O is not involved at all. Disk utilization in the VM confirms this: while the destroy process is running, disk utilization is around 10%.
How could I help more to bring back the destroy behavior of FreeNAS 11.3?
Can't we make zfs destroy -r behave like an rm -rf? I mean, an rm -rf can be interrupted, leaving partial results. Also, an rm -rf usually doesn't block other I/O calls to the filesystem.
I observe high CPU utilization during the delete. The bottleneck is the CPU, not I/O:
During a zfs destroy -r <dataset-with-many-snapshots>, perf top shows:
Samples: 2M of event 'cycles', 4000 Hz, Event count (approx.): 172669845701 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
20.90% [kernel] [k] _raw_spin_lock
18.45% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
6.04% [kernel] [k] syscall_exit_to_user_mode
5.37% [kernel] [k] igrab
4.90% [kernel] [k] _atomic_dec_and_lock
4.78% [kernel] [k] zfs_prune
3.48% [kernel] [k] mutex_unlock
3.09% [kernel] [k] entry_SYSCALL_64
3.05% [kernel] [k] syscall_return_via_sysret
2.99% [kernel] [k] arc_evict_state
2.38% [kernel] [k] mutex_lock
2.12% [kernel] [k] d_prune_aliases
1.76% [kernel] [k] __list_del_entry_valid
1.46% [kernel] [k] mutex_trylock
1.34% [kernel] [k] __list_add_valid
0.99% [kernel] [k] aggsum_add
0.93% [kernel] [k] hdr_full_cons
The top process threads are:
325 root 39 19 0 0 0 R 92.4 0.0 7:33.99 arc_evict
324 root 20 0 0 0 0 R 87.7 0.0 7:07.45 arc_prune
657 root 20 0 0 0 0 D 5.0 0.0 2:46.88 txg_sync
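To correlate the busy arc_evict/arc_prune threads above with ARC activity, one can sample counters from the Linux kstat file. A hedged sketch: the path is the usual OpenZFS-on-Linux location, and the exact field names available (e.g. `arc_prune`) are an assumption about the platform.

```python
# Hedged sketch: read a named counter from /proc/spl/kstat/zfs/arcstats.
# Sampling it once per second during the destroy shows how fast the
# counter grows while arc_prune/arc_evict dominate the CPU.

def read_kstat(name, path="/proc/spl/kstat/zfs/arcstats"):
    """Return the integer value of kstat `name`, or None if absent."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            # data lines look like: "<name> <type> <value>"
            if len(parts) == 3 and parts[0] == name:
                return int(parts[2])
    return None
```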
System information
Describe the problem you're observing
Issuing zfs destroy -r <dataset> with many snapshots (~400) takes a long time and stalls other I/O to the pool.
Describe how to reproduce the problem
Use the attached script to create a dataset with data and 10k snapshots. Then, destroy the dataset recursively while monitoring pool I/O, and issue write operations during the deletion.
I can delete snapshots one by one without any ill effects; however, deleting the dataset causes pool stalls.
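One way to monitor the pool during reproduction is to flag intervals with zero write operations. A hedged sketch: the pool name is illustrative, and the column layout assumes `zpool iostat -Hp` scripted output (name, alloc, free, read ops, write ops, read bandwidth, write bandwidth).

```python
# Hedged sketch: sample `zpool iostat -Hp <pool> 1` (tab-separated, exact
# numbers) and print any one-second interval with zero write operations.
import subprocess

def stalled(iostat_line):
    """Return True when a `zpool iostat -Hp` line shows zero write ops."""
    # assumed columns: name, alloc, free, read-ops, write-ops, read-bw, write-bw
    fields = iostat_line.split("\t")
    return int(fields[4]) == 0

def watch(pool="tank"):
    """Run the sampler until interrupted, reporting write stalls."""
    proc = subprocess.Popen(["zpool", "iostat", "-Hp", pool, "1"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if stalled(line):
            print("write stall:", line.rstrip())
```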
prepare-snapshots.py.txt