openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Directory listing slow after it has contained many entries #4933

Open hhhappe opened 7 years ago

hhhappe commented 7 years ago

I commented about this in issue #3967, but I guess it should be tracked as a separate issue.

I've experienced that listing a directory can be slow after all files have been deleted.

The issue can be seen by creating 3 million files (taken from the case that made me discover it) in a dir and then deleting them. A simple touch for creation will do (see the sketch at the end of this comment). After a clean import of the fs (no caching) I get:

# time ls /test

real    1m35.507s
user    0m0.000s
sys     0m0.871s

# zdb -ddd test/ 4
Dataset test [ZPL], ID 21, cr_txg 1, 10.1G, 7 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         4    4    16K    16K   130M   256M   99.96  ZFS directory

That's with ashift=12. With ashift=9:

# time ls /test2

real    0m7.467s
user    0m0.000s
sys     0m0.345s
# zdb -ddd test2/ 8
Dataset test2 [ZPL], ID 21, cr_txg 1, 305M, 1761741 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    4    16K    16K  81.7M   256M   99.96  ZFS directory

So ashift=12 has a huge effect.

I hope this is not how it was meant to be.
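
For anyone who wants to reproduce it, here is a minimal sketch of the steps described above (pool name, vdev and file count are placeholders):

# zpool create -o ashift=12 test <vdev>
# (cd /test && seq 1 3000000 | xargs touch)   # ~3 million empty files; subshell so the cwd doesn't pin the mount
# find /test -maxdepth 1 -type f -delete      # delete them all again
# zpool export test && zpool import test      # clean import, so nothing is cached
# time ls /test                               # the directory is now empty, but listing it is still slow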

GeLiXin commented 7 years ago

1. We don't delete directory ZAP entries when a file is deleted; they are kept so they can be reused. 2. I noticed that your test dataset has 10.1G of used space, which differs from your test2 dataset. Is there any other difference between them besides the ashift?

hhhappe commented 7 years ago

Sorry, I mixed up some tests I did on a file-based zpool. I'll recreate the test on 512B-sector disks and a 4k-sector disk.

I still had some of the test file systems around. After many days online they are still slow on the first listing.

GeLiXin commented 7 years ago

We have to read data from disk on the first listing, so it may take some time if you have a large number of files. It will be much faster if your pool is created on high-speed storage, such as an SSD.

hhhappe commented 7 years ago

Sure, listing generates a lot of IO, so an SSD would be faster. That's not the point: listing an empty directory should not take that much time.

Here is the result of an ashift=9 pool on a Seagate ST4000NM0023:

# time ls /ashift9

real    1m22.219s
user    0m0.000s
sys     0m0.933s

# zdb -ddd ashift9/ 4
Dataset ashift9 [ZPL], ID 21, cr_txg 1, 114M, 6 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         4    4    16K    16K   114M   256M   99.96  ZFS directory

I will be back with ashift=12 when it's ready. It seems my initial testing was flawed.

DeHackEd commented 7 years ago

Ironically enough, it seems ext4 suffers the same issue. They have a half-assed solution. See torvalds/linux@1f60fbe7274918adb8db2f616e321890730ab7e3

kernelOfTruth commented 7 years ago

@DeHackEd yeah, and of course

cond_resched();

which seems to be the solution to everything nowadays, interesting :)

DeHackEd commented 7 years ago

I don't think that was the intention. The patch to ext4 doesn't make it go faster, just makes it interruptible and not set off the "task blocked" alarm. At the end of the day, the directory payload itself isn't shrunk either when the directory contents are pruned.

hhhappe commented 7 years ago

Here is the result of an ashift=12 pool on a Seagate ST6000NM0034:

# time ls /ashift12

real    1m35.450s
user    0m0.000s
sys     0m0.835s

# zdb -ddd ashift12/ 4
Dataset ashift12 [ZPL], ID 21, cr_txg 1, 136M, 6 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         4    4    16K    16K   130M   256M   99.96  ZFS directory

I guess ashift=12 has a bit more overhead than ashift=9, and processing the empty entries takes a bit longer. That adds up.

So the conclusion is that this is just how ZFS is designed? Could it be improved, without breaking anything?

richardelling commented 7 years ago

FYI, when doing "ls benchmarks" be aware that ls sorts by default, so the time is impacted by sort time, which can vary widely depending on your locale. Please use "ls -U" or similar flags to tell ls not to sort; then we'll get a better idea of the actual I/O.
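
For example, with GNU ls (-U disables sorting; -f also works and additionally implies -a):

# time ls /test       # default: readdir plus a sort by name
# time ls -U /test    # unsorted, so the timing reflects the actual directory I/O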

DeHackEd commented 7 years ago

Be reasonable here. Sort time shows up in user time, which is obviously small to the point of being immeasurable. That makes sense, since we're told the directory is empty to the point that rmdir() would succeed.

janos666 commented 7 years ago

Interesting topic. Sadly, I have nothing significant to add, other than this story: ridiculously slow directory listing times were one of the main reasons I tried to move from Btrfs to ZFS. The directories in question contained only hundreds of files at a time, but old files were constantly deleted and new ones created 24/7, so counting the deleted files the totals grew to considerably higher magnitudes over a few months. I am satisfied so far; at least the directory listing times in this particular case seem to be a lot better. I also had stupidly slow deletion times in these directories with Btrfs, which also improved a lot on ZFS (it's still relatively slow, but much better and still "sane" and acceptable/bearable). I use ashift=9 on 512-byte-sector drives (RAID-Z, no SLOG, no L2ARC, metadata-only L1ARC).

However, running online defrag and rebalance consecutively on Btrfs seemed to cure this to some degree and for some time (although a rebalance could take a week, slowing everything down, and then I had only a few weeks of good performance before the next slowdown and/or rebalance), whereas there seems to be no easy solution for ZFS (when and IF it happens at some point in the future with my setup).

acachy commented 7 years ago

The problem also exists with the "original" ZFS in FreeBSD (at least as of 10.2).

DeHackEd commented 7 years ago

The issue is likely generic - ZAPs don't shrink over time and/or can't be converted back to a microzap when it would be appropriate. A non-directory example also occurs in #5916
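
For what it's worth, bumping zdb's verbosity on the directory object should show which form its ZAP is in (object numbers as in the earlier runs; the exact output varies by version):

# zdb -dddd test/ 4   # at this verbosity zdb dumps ZAP statistics; a once-huge directory
                      # still reports fat ZAP stats after all of its entries are gone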

grinapo commented 7 years ago

Let me chime in, though without much news: indeed this is the same on FreeBSD, and it stays nicely unchanged when sent over to Linux. A directory with 69 files requires 4 minutes to access (happily spending 800+ ms on every getdents syscall) because there used to be a few million files in it. There doesn't seem to be any way to fix it (apart from recreating the directory, which isn't "fixing" but reaching for a bigger hammer). Basically this can irreparably degrade any ZFS filesystem at any time. Not shiny.
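
The per-call cost is easy to confirm with strace's -T option, which appends the time spent in each syscall (the path below is a placeholder):

# strace -T -e trace=getdents,getdents64 ls -U /slow/dir   # the slow calls stand out in the <...> timings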

lnicola commented 7 years ago

Does deleting the directory fix this, or must the whole filesystem be recreated?

behlendorf commented 7 years ago

As an optimization it would be possible to support shrinking FatZAPs as entries are removed including converting them back to microZAPs. This would let you reclaim the original directory performance. If someone's interested in working on this let me know.

DeHackEd commented 7 years ago

As someone affected by this in other ways (see #5916) I'm interested in trying to fix it, but my understanding of the deep internals of ZFS is sorely lacking.

grinapo commented 7 years ago

@lnicola I believe deleting the directory would fix it unless it's the FS root, which it is in my case.

angstymeat commented 7 years ago

I can verify that deleting the directory works; no need to delete the FS unless it is the root.
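
In other words, something along these lines gets the performance back, because the replacement directory starts out as a compact microzap again (paths are placeholders):

# mkdir /test/dir.new
# mv /test/dir/* /test/dir.new/ 2>/dev/null   # move any surviving entries (mind dotfiles)
# rmdir /test/dir && mv /test/dir.new /test/dir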

I ran into this problem a couple of years ago when we were storing a year's worth of SBD messages (several million) in a ZFS mount. I had to modify the layout so there's a separate directory for each day, or else it would take about a minute or two every time you wanted to list the files, even after they were deleted.
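
A sketch of that kind of layout (the paths and file pattern are made up for illustration); keeping each day's directory small means its ZAP never grows into a fat ZAP in the first place:

# d=$(date +%Y/%m/%d)
# mkdir -p /tank/sbd/$d && mv /spool/incoming/*.sbd /tank/sbd/$d/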

At the time, I did some tests and saw that EXT4 seemed to have the same issue, but the latency was only a couple of seconds. I didn't test with others.

Since we still had some SUNs running, I saw the same issue there under ZFS, and again under BSD.

acachy commented 7 years ago

I have a root-related case too. A performance drop this severe affects a lot of users, just in a way that's hard to notice by eye and whose cause isn't obvious. A fix would make a big difference.

adilger commented 7 years ago

Some notes on plans for how this will (eventually) be fixed on ext4, which may or may not be useful for shrinking ZAP blocks: