slow ls on large directories

dmaziuk commented 9 years ago

On x64 CentOS 7, this seems to have started with the switch to the kmod repo and the update to kmod-zfs-0.6.5.3. With kernel-3.10.0-229.14.1.el7.x86_64 I had overnight scripts actually dying on I/O; after rebooting to 3.10.0-229.11.1.el7.x86_64 the scripts complete, but:

first one today:

web@manta:/website/htdocs$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star2.1/ | wc -l
10503

real    0m21.713s
user    0m0.164s
sys     0m0.370s

After that the numbers on this and other similarly large dirs on the same FS are OK, like real 0m0.340s.

AndCycle commented 9 years ago

@dmaziuk have you tested just copying the entire directory to a new one?

I had a similar issue before: the number of files grew over time, which resulted in fragmentation of the directory node. Copying was the only way to defragment it.

dmaziuk commented 9 years ago

This particular directory is full of symlinks that get wiped out and re-created every week, so no to fragmentation.

shoeper commented 9 years ago

Same here with most recent 0.6.4 version.

dmaziuk commented 9 years ago

An additional data point: someone's tried mget/wget on that directory (vsftpd) and it's taken 5 hours instead of 15 minutes -- this is all local over a gigabit wire, not a network issue.

AndCycle commented 9 years ago

@dmaziuk

Could you provide test data or a test script that creates the problematic file structure?

I think that would be very useful to anyone who wants to dig into this problem; if they can't recreate the issue on their platform, they can't pinpoint what to fix.

dmaziuk commented 9 years ago

There is a "data" directory with about 10K subdirectories, each with a handful of files inside.
There are several "list" directories (at the same level as "data") that are full of symlinks to (some of) the files under "data".

If you want to recreate that, just cat bytes out of /dev/zero or /dev/random, there's nothing special about those files or the directory structure (well, most of the files are ascii).
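A minimal sketch of a script that builds a comparable layout, assuming a POSIX shell. The function name make_tree, the entry names, the file names and the 4 KiB file size are all made up for illustration; /dev/urandom is swapped in for /dev/random only so the script never blocks.

```shell
#!/bin/sh
# Build a "data" directory with COUNT subdirectories and a sibling "list"
# directory holding one symlink per subdirectory.
# All names and sizes below are illustrative assumptions.
make_tree() {
    root=$1
    count=${2:-10000}
    mkdir -p "$root/data" "$root/list"
    i=0
    while [ "$i" -lt "$count" ]; do
        d="$root/data/entry$i"
        mkdir -p "$d"
        # A handful of small files per subdirectory; the contents are arbitrary.
        for name in a.txt b.txt c.txt; do
            head -c 4096 /dev/urandom > "$d/$name"
        done
        # Relative symlink from "list" to one of the files under "data".
        ln -sf "../data/entry$i/a.txt" "$root/list/entry$i.txt"
        i=$((i + 1))
    done
}

# Usage (hypothetical path):
#   make_tree /tank/test 10000
#   time ls -l /tank/test/list | wc -l
```

To see the cold-cache number, the timing would have to be taken after a reboot or a pool export/import, since the second run is served from cache.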

The total is under a TB in a ~2.6 TB pool/filesystem using low-end spinning rust drives. It has compression enabled:

# zpool status tank
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 3h59m with 0 errors on Fri Oct 23 19:22:33 2015
config:

    NAME                                                  STATE     READ WRITE CKSUM
    tank                                                  ONLINE       0     0     0
      mirror-0                                            ONLINE       0     0     0
        ata-ST3000DM001-1CH166_W1F40NQQ                   ONLINE       0     0     0
        ata-ST3000DM001-1E6166_W1F460D1                   ONLINE       0     0     0
    logs
      mirror-1                                            ONLINE       0     0     0
        ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP3954112-part3  ONLINE       0     0     0
        ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP4658203-part3  ONLINE       0     0     0

errors: No known data errors

# zpool get all tank
NAME  PROPERTY                    VALUE                       SOURCE
tank  size                        2.72T                       -
tank  capacity                    38%                         -
tank  altroot                     -                           default
tank  health                      ONLINE                      -
tank  guid                        3105735609222311824         default
tank  version                     -                           default
tank  bootfs                      -                           default
tank  delegation                  on                          default
tank  autoreplace                 off                         default
tank  cachefile                   -                           default
tank  failmode                    wait                        default
tank  listsnapshots               off                         default
tank  autoexpand                  on                          local
tank  dedupditto                  0                           default
tank  dedupratio                  1.00x                       -
tank  free                        1.68T                       -
tank  allocated                   1.04T                       -
tank  readonly                    off                         -
tank  ashift                      0                           default
tank  comment                     -                           default
tank  expandsize                  -                           -
tank  freeing                     0                           default
tank  fragmentation               27%                         -
tank  leaked                      0                           default
tank  feature@async_destroy       enabled                     local
tank  feature@empty_bpobj         active                      local
tank  feature@lz4_compress        active                      local
tank  feature@spacemap_histogram  active                      local
tank  feature@enabled_txg         active                      local
tank  feature@hole_birth          active                      local
tank  feature@extensible_dataset  enabled                     local
tank  feature@embedded_data       active                      local
tank  feature@bookmarks           enabled                     local
tank  feature@filesystem_limits   disabled                    local
tank  feature@large_blocks        disabled                    local

# zfs get all tank/www
NAME      PROPERTY              VALUE                                                                                 SOURCE
tank/www  type                  filesystem                                                                            -
tank/www  creation              Mon Aug 11 17:44 2014                                                                 -
tank/www  used                  974G                                                                                  -
tank/www  available             1.60T                                                                                 -
tank/www  referenced            970G                                                                                  -
tank/www  compressratio         1.34x                                                                                 -
tank/www  mounted               yes                                                                                   -
tank/www  quota                 none                                                                                  default
tank/www  reservation           none                                                                                  default
tank/www  recordsize            128K                                                                                  default
tank/www  mountpoint            /websites/www                                                                         local
tank/www  sharenfs              rw=@144.92.167.128/25,no_root_squash,no_all_squash,insecure,mountpoint=/websites/www  local
tank/www  checksum              on                                                                                    default
tank/www  compression           lzjb                                                                                  local
tank/www  atime                 on                                                                                    default
tank/www  devices               on                                                                                    default
tank/www  exec                  on                                                                                    default
tank/www  setuid                on                                                                                    default
tank/www  readonly              off                                                                                   default
tank/www  zoned                 off                                                                                   default
tank/www  snapdir               hidden                                                                                default
tank/www  aclinherit            restricted                                                                            default
tank/www  canmount              on                                                                                    default
tank/www  xattr                 on                                                                                    default
tank/www  copies                1                                                                                     default
tank/www  version               5                                                                                     -
tank/www  utf8only              off                                                                                   -
tank/www  normalization         none                                                                                  -
tank/www  casesensitivity       sensitive                                                                             -
tank/www  vscan                 off                                                                                   default
tank/www  nbmand                off                                                                                   default
tank/www  sharesmb              off                                                                                   default
tank/www  refquota              none                                                                                  default
tank/www  refreservation        none                                                                                  default
tank/www  primarycache          all                                                                                   default
tank/www  secondarycache        all                                                                                   default
tank/www  usedbysnapshots       3.28G                                                                                 -
tank/www  usedbydataset         970G                                                                                  -
tank/www  usedbychildren        0                                                                                     -
tank/www  usedbyrefreservation  0                                                                                     -
tank/www  logbias               latency                                                                               default
tank/www  dedup                 off                                                                                   default
tank/www  mlslabel              none                                                                                  default
tank/www  sync                  standard
tank/www  refcompressratio      1.34x                                                                                 -
tank/www  written               39.0M                                                                                 -
tank/www  logicalused           1.26T                                                                                 -
tank/www  logicalreferenced     1.25T                                                                                 -
tank/www  filesystem_limit      none                                                                                  default
tank/www  snapshot_limit        none                                                                                  default
tank/www  filesystem_count      none                                                                                  default
tank/www  snapshot_count        none                                                                                  default
tank/www  snapdev               hidden                                                                                default
tank/www  acltype               off                                                                                   default
tank/www  context               none                                                                                  default
tank/www  fscontext             none                                                                                  default
tank/www  defcontext            none                                                                                  default
tank/www  rootcontext           none                                                                                  default
tank/www  relatime              on                                                                                    temporary
tank/www  redundant_metadata    all                                                                                   default
tank/www  overlay               off                                                                                   default

dmaziuk commented 9 years ago

OK, part of it was coincidence: a drive decided to start failing at just the right time. Once I fired up a very large rsync and started monitoring iostat (not zpool iostat), I saw 4-digit r_await times on it. (Of course there are no SMART errors or anything on it.) I replaced it with a "NAS" drive and now I see

$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10477

real    0m11.098s
user    0m0.119s
sys     0m0.569s

which is 10 seconds better than the previous 21s but still not good...
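For anyone wanting to repeat the iostat check mentioned above, here is a rough sketch of a filter that flags devices with a high r_await. It assumes sysstat's `iostat -x` output, where the column position of r_await differs between versions, so the column is located by its header name. The function name flag_slow_reads and the 1000 ms default threshold are made up for illustration.

```shell
#!/bin/sh
# Read `iostat -x` output on stdin and print "device r_await" for every
# device line whose r_await exceeds the threshold (in milliseconds).
flag_slow_reads() {
    threshold=${1:-1000}
    awk -v t="$threshold" '
        # The CPU summary block is not device data; stop matching until
        # the next device header is seen.
        /^avg-cpu/ { col = 0; next }
        # Device header: find which column holds r_await.
        /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "r_await") col = i
            next
        }
        # Device line: compare the r_await value numerically.
        col && NF >= col && $col + 0 > t { print $1, $col }
    '
}

# Usage: sample every 5 seconds while the heavy rsync is running.
#   iostat -x 5 | flag_slow_reads 1000
```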

GregorKopka commented 9 years ago

In case you don't need atime, maybe zfs set atime=off tank/www so your reads will not turn into writes?
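Spelled out as commands, this could look like the sketch below. `zfs get` and `zfs set` are the standard property commands; the wrapper function name disable_atime is made up, and tank/www is just the dataset from this thread.

```shell
#!/bin/sh
# Show the current atime setting of a dataset, then turn it off.
# Needs sufficient privileges on the pool.
disable_atime() {
    dataset=$1
    # -H: no header, -o value: print only the property value.
    zfs get -H -o value atime "$dataset"
    zfs set atime=off "$dataset"
}

# Usage:
#   disable_atime tank/www
```

Anything that relies on access times for this dataset would stop seeing them updated, so this only fits when atime really is not needed.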

dmaziuk commented 9 years ago

Good one, thank you. I normally have that in fstab but so far I was unable to mount zfs filesystems via fstab. :-(

With the new disk, resilvering complete, and atime off, I get 6 seconds on the first ls, which I guess is tolerable.

I do have zfs_arc_max=2147483648 and 8GB RAM on this box, in case that's relevant...

behlendorf commented 8 years ago

For reference, here's something to keep in mind which isn't obvious unless you're already familiar with ZFS internals. Directories in ZFS are stored as either a micro zap or a fat zap. Micro zaps are an optimization designed to improve the common directory case and are what is normally used. However, if certain criteria are met, the micro zap is automatically promoted to a fat zap. A fat zap is designed for scalability and allows for millions of entries in a directory while preserving a constant lookup time (unlike a micro zap).

Micro zaps are promoted to fat zaps when either one of the following occurs:

- A file name exceeds 50 characters, or
- The total directory size exceeds 128K
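A quick way to check the first criterion on an existing directory is to count names longer than the limit. This is only a sketch: the function name count_long_names is made up, the 50-character default is taken from the list above, and it assumes file names without embedded newlines. Depending on the awk implementation and locale, length() may count bytes rather than characters.

```shell
#!/bin/sh
# Print how many entries in DIR have a name longer than LIMIT characters.
count_long_names() {
    dir=$1
    limit=${2:-50}
    # ls -A lists one name per line when piped, including dotfiles.
    ls -A "$dir" | awk -v limit="$limit" '
        length($0) > limit { n++ }
        END { print n + 0 }
    '
}

# Usage:
#   count_long_names /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/
```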

dmaziuk commented 8 years ago

Thank you, added to my "notes to self" wiki. Does it get demoted back to micro zap if those are no longer true? For this particular case,

$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10545

real    0m8.469s
user    0m0.161s
sys     0m0.953s
$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10545

real    0m0.326s
user    0m0.106s
sys     0m0.226s
$ touch _A_silly_zfs_filename_to_trick_it_into_promoting_the_directory_to_fat_zap_must_exceed_fifty_characters_is_this_long_enough_already

which is consistent with previous numbers. After a reboot (upgrading centos to 7.2)

# time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10546

real    0m7.784s
user    0m0.121s
sys     0m0.751s
# time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10546

real    0m0.335s
user    0m0.114s
sys     0m0.226s

I.e. not much difference.

behlendorf commented 8 years ago

It could be downgraded but that's not currently implemented. You can however create a new directory and move the files.
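The create-and-move workaround could be scripted roughly as follows. This is a sketch, not a tested procedure for production data: the function name rebuild_dir is made up, the swap is not atomic (the directory is briefly missing), and the new directory does not inherit the old one's ownership or mode. It relies on `-mindepth`/`-maxdepth`, which GNU and BSD find support but POSIX does not specify.

```shell
#!/bin/sh
# Replace DIR with a freshly created directory holding the same entries.
rebuild_dir() {
    old=${1%/}                 # drop a trailing slash, if any
    new="$old.new.$$"          # temporary sibling directory
    mkdir "$new" || return 1
    # Move every direct child, dotfiles included, into the new directory.
    find "$old" -mindepth 1 -maxdepth 1 -exec mv {} "$new"/ \;
    # The old directory must now be empty; swap the new one into place.
    rmdir "$old" && mv "$new" "$old"
}

# Usage:
#   rebuild_dir /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1
```

Relative symlinks inside the directory keep working because the final path is unchanged.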

hhhappe commented 8 years ago

I've experienced that it can also be slow if all files are deleted. Since @behlendorf mentioned that downgrading is not implemented, I'm wondering whether deallocation of unused entries is implemented.

The issue can be seen by creating 3 million files (taken from the case that made me discover it) in a dir and then deleting them. A simple touch for creation will do. After a clean import of the fs (no caching) I get:

# time ls /test

real    1m35.507s
user    0m0.000s
sys     0m0.871s
# zdb -ddd test/ 4
Dataset test [ZPL], ID 21, cr_txg 1, 10.1G, 7 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         4    4    16K    16K   130M   256M   99.96  ZFS directory

That's with ashift=12. With ashift=9:

# time ls /test2

real    0m7.467s
user    0m0.000s
sys     0m0.345s
# zdb -ddd test2/ 8
Dataset test2 [ZPL], ID 21, cr_txg 1, 305M, 1761741 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    4    16K    16K  81.7M   256M   99.96  ZFS directory

So ashift=12 has a huge effect.

I hope this is not how it was meant to be. On FreeBSD it is the same.
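The create-then-delete reproduction described above can be sketched like this. The function name fill_and_empty and the file naming are made up; the count of 3 million comes from the comment, and a smaller number is enough for a dry run.

```shell
#!/bin/sh
# Create COUNT empty files in DIR, then delete them all, leaving the
# (now empty) directory in place.
fill_and_empty() {
    dir=$1
    count=$2
    mkdir -p "$dir"
    i=0
    while [ "$i" -lt "$count" ]; do
        : > "$dir/f$i"         # same effect as touch on a new file
        i=$((i + 1))
    done
    # Let find batch the deletions so the argument list never overflows.
    find "$dir" -type f -exec rm -f {} +
}

# Usage:
#   fill_and_empty /test 3000000
#   (export and re-import the pool to drop caches, then)
#   time ls /test
```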

TiagoJacobs commented 3 years ago

> In case you don't need atime, maybe zfs set atime=off tank/www so your reads will not turn into writes?

Disabling atime solved a similar issue for me.

openzfs/zfs issue #3967