hhhappe opened this issue 8 years ago
1. We don't delete the directory's ZAP entry when the file is deleted; it is kept so the entry can be reused.
2. I noticed that your test dataset has 10.1G of used space, which is different from your test2 dataset. Is there any other difference between them besides the ashift?
Sorry, I mixed up some tests I did on a file-based zpool. I'll recreate the test on 512B-sector disks and a 4K-sector disk.
I still had some of the test file systems around. After many days online, they are still slow on the first listing.
We have to read data from disk on the first listing, so it may take some time if you have a large number of files. It will be much faster if your pool is created on high-speed storage, such as an SSD.
Sure, listing creates a lot of I/O, so an SSD would be better. That's not the point: listing an empty directory should not take that much time.
Here is the result of an ashift=9 pool on a Seagate ST4000NM0023:
# time ls /ashift9
real 1m22.219s
user 0m0.000s
sys 0m0.933s
# zdb -ddd ashift9/ 4
Dataset ashift9 [ZPL], ID 21, cr_txg 1, 114M, 6 objects
Object lvl iblk dblk dsize lsize %full type
4 4 16K 16K 114M 256M 99.96 ZFS directory
I will be back with ashift=12 when it's ready. It seems my initial testing was flawed.
Ironically enough, it seems ext4 suffers the same issue. They have a half-assed solution. See torvalds/linux@1f60fbe7274918adb8db2f616e321890730ab7e3
@DeHackEd yeah, and of course
cond_resched();
which seems to be the solution to everything nowadays, interesting :)
I don't think that was the intention. The patch to ext4 doesn't make it go faster; it just makes it interruptible and keeps it from setting off the "task blocked" alarm. At the end of the day, the directory payload itself isn't shrunk either when the directory contents are pruned.
Here is the result of an ashift=12 pool on a Seagate ST6000NM0034:
# time ls /ashift12
real 1m35.450s
user 0m0.000s
sys 0m0.835s
# zdb -ddd ashift12/ 4
Dataset ashift12 [ZPL], ID 21, cr_txg 1, 136M, 6 objects
Object lvl iblk dblk dsize lsize %full type
4 4 16K 16K 130M 256M 99.96 ZFS directory
I guess there's a bit more overhead than with ashift=9, and then a bit longer to process the empty entries. That adds up.
So the conclusion is that this is just how ZFS is designed? Could it be improved, without breaking anything?
FYI, when doing "ls benchmarks", be aware that ls sorts by default. Thus the time is impacted by sort time, which can vary widely depending on your locale. Please use "ls -U" or similar flags to tell ls not to sort; then we'll get a better idea of the actual I/O.
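For example, using the same /ashift9 mount as above (redirecting to /dev/null keeps terminal rendering out of the measurement):
# time ls /ashift9 > /dev/null
# time ls -U /ashift9 > /dev/null
If the two differ only in user time, the slowness really is in the I/O and not in the sort.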
Be reasonable here. Sort time is revealed in user time, which is obviously small to the point of being immeasurable. Which makes sense, since we're told the directory is empty to the point that rmdir() will succeed.
Interesting topic. Sadly, I have nothing significant to add, other than telling the story that ridiculously slow directory listing times were one of the main reasons I tried to move from Btrfs to ZFS. The directories contained only hundreds of files at a time, but old ones were constantly deleted and new ones created 24/7, so counting the deleted files the numbers added up to considerably higher magnitudes over a few months. I am satisfied so far; at least the directory listing times in this particular case seem to be a lot better. I also had stupidly slow deletion times in these directories with Btrfs, which also improved a lot on ZFS (it's still relatively slow, but a lot better and still "sane" and acceptable/bearable). I use ashift=9 on 512-byte drives (RAID-Z, no SLOG, no L2ARC, metadata-only L1ARC).
However, running online defrag and rebalance consecutively on Btrfs seemed to cure this to some degree and for some time (although a rebalance could take a week, slowing everything down, and then I had only a few weeks of great performance again before the next slowdown and/or rebalance), whereas there seems to be no easy solution for ZFS (when and IF it happens at some point in the future with my setup).
The problem also exists with the "original" ZFS in FreeBSD (at least in 10.2).
The issue is likely generic - ZAPs don't shrink over time and/or can't be converted to a microzap when it would be appropriate. A non-directory example also occurs in #5916
Let me chime in along with not much news: indeed this is the same on FreeBSD, and it stays nicely unchanged when sent over to Linux. A directory with 69 files requires 4 minutes to access (happily spending 800+ms on every getdents syscall) where there used to be a few million files in the past. There doesn't seem to be any way to fix it (apart from recreating, which isn't "fixing" but getting a bigger hammer). Basically this can irreparably break any ZFS filesystem at any time. Not shiny.
Does deleting the directory fix this, or must the whole filesystem be recreated?
As an optimization, it would be possible to support shrinking FatZAPs as entries are removed, including converting them back to microZAPs. This would let you reclaim the original directory performance. If someone's interested in working on this, let me know.
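As a side note for anyone poking at this: bumping zdb's verbosity on the directory object shown earlier (one more -d than in the outputs above) should also dump the ZAP's internal statistics, which makes it easier to see whether the object is still a fat ZAP and roughly how many leaf blocks it is still holding. For example:
# zdb -dddd ashift12/ 4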
As someone affected by this in other ways (see #5916) I'm interested in trying to fix it, but my understanding of the deep internals of ZFS is sorely lacking.
@lnicola I believe deleting the directory would fix it unless it's the FS root, which it is in my case.
I can verify that deleting the directory works; no need to delete the FS unless it is the root.
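For anyone hitting this, a rough sketch of that workaround (hypothetical paths; it assumes the bloated directory is not the filesystem root, that nothing writes to it during the swap, and that any hidden files are moved by hand):
# mkdir /tank/fs/dir.new
# mv /tank/fs/dir/* /tank/fs/dir.new/ 2>/dev/null
# rmdir /tank/fs/dir
# mv /tank/fs/dir.new /tank/fs/dir
The rmdir is what actually frees the oversized directory object; as this whole thread shows, merely emptying it does not.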
I ran into this problem a couple of years ago when we were storing a year's worth of SBD messages (several million) in a ZFS mount. I had to modify the layout so there's a separate directory for each day, or else it would take about a minute or two every time you wanted to list the files, even after they were deleted.
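To illustrate that kind of layout (paths and naming here are hypothetical, not the exact scheme used above):
# day=$(date +%Y-%m-%d)
# mkdir -p /tank/sbd/$day
# mv /tank/sbd/incoming/*.sbd /tank/sbd/$day/
Each day's directory then stays small enough that listing it remains fast, regardless of how many files have come and gone over the year.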
At the time, I did some tests and saw that EXT4 seemed to have the same issue, but the latency was only a couple of seconds. I didn't test with others.
Since we still had some SUNs running, I saw the same issue there under ZFS, and again under BSD.
I have a root-directory case too. A performance drop this severe affects a lot of users, just in a way that is barely noticeable by sight and has no obvious cause. A fix would make a big difference.
Some notes on plans for how this will (eventually) be fixed in ext4, which may or may not be useful for shrinking ZAP blocks:
I commented about this in issue #3967, but I guess it should be tracked as a separate issue.
I've experienced that listing a directory can be slow after all files have been deleted.
The issue can be seen by creating 3 million files (taken from the case that made me discover it) in a directory and then deleting them. A simple touch for creation will do (a reproduction sketch is at the end of this comment). After a clean import of the fs (no caching) I get:
That's with ashift=12. With ashift=9:
So ashift=12 has a huge effect.
I hope this is not how it was meant to be.
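For reference, here is a minimal sketch of that reproduction (pool and dataset names are hypothetical; seq, xargs and find are used because 3 million paths don't fit on a single command line):
# zfs create tank/zaptest
# cd /tank/zaptest && seq 1 3000000 | sed 's,^,f,' | xargs touch
# find /tank/zaptest -type f -delete
# cd / && zpool export tank && zpool import tank
# time ls -U /tank/zaptest
The final listing is of an empty directory, yet it still has to walk all the leftover directory ZAP blocks, which is where the time goes.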