Closed — dmaziuk closed this issue 5 years ago
@dmaziuk have you tried just copying the entire directory to a new one?
I had a similar issue before: the number of files grew over time, fragmenting the directory node, and copying was the only way to defrag it.
This particular directory is full of symlinks that get wiped out and re-created every week, so no to fragmentation.
Same here with most recent 0.6.4 version.
An additional data point: someone's tried mget/wget on that directory (vsftpd) and it's taken 5 hours instead of 15 minutes -- this is all local over a gigabit wire, not a network issue.
@dmaziuk could you provide test data, or a test script that creates the problematic file structure?
That would be much more useful for anyone who wants to dig into this problem; if they can't recreate the issue on their platform, they can't target what to fix.
If you want to recreate that, just `cat` bytes out of `/dev/zero` or `/dev/random`; there's nothing special about those files or the directory structure (well, most of the files are ASCII).
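A minimal reproduction sketch along those lines (the directory path and file count here are made up for illustration; the thread's real data is ordinary ASCII files):

```shell
#!/bin/sh
# Populate a directory with many small files to exercise
# large-directory listing performance. DIR and COUNT are
# illustrative defaults, not the thread's actual setup.
DIR="${1:-/tmp/zap-test}"
COUNT="${2:-1000}"
mkdir -p "$DIR"
i=1
while [ "$i" -le "$COUNT" ]; do
    # A few KB of zeroes per file; swap in /dev/urandom for
    # incompressible data if you want to defeat lz4/lzjb.
    dd if=/dev/zero of="$DIR/file_$i" bs=1024 count=4 2>/dev/null
    i=$((i + 1))
done
echo "created $COUNT files in $DIR"
```

Running `time ls -l "$DIR" | wc -l` against a freshly imported pool then gives cold-cache numbers comparable to the ones quoted below.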
The total is under a TB in a ~2.6 TB pool/filesystem using low-end spinning rust drives. It has compression enabled:
```
# zpool status tank
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 3h59m with 0 errors on Fri Oct 23 19:22:33 2015
config:

        NAME                                                  STATE     READ WRITE CKSUM
        tank                                                  ONLINE       0     0     0
          mirror-0                                            ONLINE       0     0     0
            ata-ST3000DM001-1CH166_W1F40NQQ                   ONLINE       0     0     0
            ata-ST3000DM001-1E6166_W1F460D1                   ONLINE       0     0     0
        logs
          mirror-1                                            ONLINE       0     0     0
            ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP3954112-part3  ONLINE       0     0     0
            ata-WDC_WD5003ABYX-01WERA1_WD-WMAYP4658203-part3  ONLINE       0     0     0

errors: No known data errors
```
```
# zpool get all tank
NAME  PROPERTY                    VALUE                SOURCE
tank  size                        2.72T                -
tank  capacity                    38%                  -
tank  altroot                     -                    default
tank  health                      ONLINE               -
tank  guid                        3105735609222311824  default
tank  version                     -                    default
tank  bootfs                      -                    default
tank  delegation                  on                   default
tank  autoreplace                 off                  default
tank  cachefile                   -                    default
tank  failmode                    wait                 default
tank  listsnapshots               off                  default
tank  autoexpand                  on                   local
tank  dedupditto                  0                    default
tank  dedupratio                  1.00x                -
tank  free                        1.68T                -
tank  allocated                   1.04T                -
tank  readonly                    off                  -
tank  ashift                      0                    default
tank  comment                     -                    default
tank  expandsize                  -                    -
tank  freeing                     0                    default
tank  fragmentation               27%                  -
tank  leaked                      0                    default
tank  feature@async_destroy       enabled              local
tank  feature@empty_bpobj         active               local
tank  feature@lz4_compress        active               local
tank  feature@spacemap_histogram  active               local
tank  feature@enabled_txg         active               local
tank  feature@hole_birth          active               local
tank  feature@extensible_dataset  enabled              local
tank  feature@embedded_data       active               local
tank  feature@bookmarks           enabled              local
tank  feature@filesystem_limits   disabled             local
tank  feature@large_blocks        disabled             local
```
```
# zfs get all tank/www
NAME      PROPERTY              VALUE                  SOURCE
tank/www  type                  filesystem             -
tank/www  creation              Mon Aug 11 17:44 2014  -
tank/www  used                  974G                   -
tank/www  available             1.60T                  -
tank/www  referenced            970G                   -
tank/www  compressratio         1.34x                  -
tank/www  mounted               yes                    -
tank/www  quota                 none                   default
tank/www  reservation           none                   default
tank/www  recordsize            128K                   default
tank/www  mountpoint            /websites/www          local
tank/www  sharenfs              rw=@144.92.167.128/25,no_root_squash,no_all_squash,insecure,mountpoint=/websites/www  local
tank/www  checksum              on                     default
tank/www  compression           lzjb                   local
tank/www  atime                 on                     default
tank/www  devices               on                     default
tank/www  exec                  on                     default
tank/www  setuid                on                     default
tank/www  readonly              off                    default
tank/www  zoned                 off                    default
tank/www  snapdir               hidden                 default
tank/www  aclinherit            restricted             default
tank/www  canmount              on                     default
tank/www  xattr                 on                     default
tank/www  copies                1                      default
tank/www  version               5                      -
tank/www  utf8only              off                    -
tank/www  normalization         none                   -
tank/www  casesensitivity       sensitive              -
tank/www  vscan                 off                    default
tank/www  nbmand                off                    default
tank/www  sharesmb              off                    default
tank/www  refquota              none                   default
tank/www  refreservation        none                   default
tank/www  primarycache          all                    default
tank/www  secondarycache        all                    default
tank/www  usedbysnapshots       3.28G                  -
tank/www  usedbydataset         970G                   -
tank/www  usedbychildren        0                      -
tank/www  usedbyrefreservation  0                      -
tank/www  logbias               latency                default
tank/www  dedup                 off                    default
tank/www  mlslabel              none                   default
tank/www  sync                  standard
tank/www  refcompressratio      1.34x                  -
tank/www  written               39.0M                  -
tank/www  logicalused           1.26T                  -
tank/www  logicalreferenced     1.25T                  -
tank/www  filesystem_limit      none                   default
tank/www  snapshot_limit        none                   default
tank/www  filesystem_count      none                   default
tank/www  snapshot_count        none                   default
tank/www  snapdev               hidden                 default
tank/www  acltype               off                    default
tank/www  context               none                   default
tank/www  fscontext             none                   default
tank/www  defcontext            none                   default
tank/www  rootcontext           none                   default
tank/www  relatime              on                     temporary
tank/www  redundant_metadata    all                    default
tank/www  overlay               off                    default
```
OK, part of it was coincidence: a drive decided to start failing at just the right time. Once I fired up a very large rsync and started monitoring `iostat` (not `zpool iostat`), I saw 4-digit `r_await` times on it. (Of course there are no SMART errors or anything on it.) Replaced it with a "NAS" drive and now I see
```
$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10477

real    0m11.098s
user    0m0.119s
sys     0m0.569s
```
which is 10 seconds better than the previous 21s but still not good...
In case you don't need atime, maybe `zfs set atime=off tank/www` so your reads will not turn into writes?
Good one, thank you. I normally have that in `fstab`, but so far I've been unable to mount ZFS filesystems via `fstab`. :-(
With the new disk, resilvering complete, and atime off, I get 6 seconds on the first `ls`, which I guess is tolerable.
I do have `zfs_arc_max=2147483648` and 8 GB RAM on this box, in case that's relevant...
For reference here's something to keep in mind which isn't obvious unless you're already familiar with ZFS internals. Directories in ZFS are stored as either a micro zap or a fat zap. Micro zaps are an optimization designed to improve the common directory case and what is normally used. However, if certain criteria are met the micro zap is automatically promoted to a fat zap. A fat zap is designed for scalability and allows for millions of entries in a directory while preserving a constant lookup time (unlike a micro zap).
Micro zaps are promoted to fat zaps when either of the following occurs:

- an entry with a name longer than 50 characters is added to the directory, or
- the directory grows beyond what fits in a single block (with the default 128K block, roughly 2047 entries).
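As a back-of-the-envelope check on those limits (assuming 64-byte micro zap entries and one 64-byte header slot per 128K block, which is my understanding of the on-disk format):

```shell
# Each micro zap entry occupies 64 bytes, and the first 64-byte
# slot of the 128K block is the header, so one block holds:
echo $(( 128 * 1024 / 64 - 1 ))   # → 2047 entries

# The other promotion trigger is an entry name longer than 50
# characters; a name like the one touched later in this thread
# is well past that:
name="_A_silly_zfs_filename_to_trick_it_into_promoting_the_directory"
echo ${#name}                     # → 62
```

Either condition alone is enough to turn the whole directory into a fat zap.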
Thank you, added to my "notes to self" wiki. Does it get demoted back to micro zap if those are no longer true? For this particular case,
```
$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10545

real    0m8.469s
user    0m0.161s
sys     0m0.953s
$ time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10545

real    0m0.326s
user    0m0.106s
sys     0m0.226s
$ touch _A_silly_zfs_filename_to_trick_it_into_promoting_the_directory_to_fat_zap_must_exceed_fifty_characters_is_this_long_enough_already
```
which is consistent with the previous numbers. After a reboot (upgrading CentOS to 7.2):
```
# time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10546

real    0m7.784s
user    0m0.121s
sys     0m0.751s
# time ls -l /websites/www/ftp/pub/bmrb/entry_lists/nmr-star3.1/ | wc -l
10546

real    0m0.335s
user    0m0.114s
sys     0m0.226s
```
I.e. not much difference.
It could be downgraded, but that's not currently implemented. You can, however, create a new directory and move the files.
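A sketch of that workaround (the path is illustrative, and the demo entries are made up; moving every entry into a fresh directory rebuilds it as a micro zap, assuming nothing in it re-triggers promotion):

```shell
#!/bin/sh
# Rebuild a directory so it sheds its fat zap: move every entry
# into a brand-new directory, then swap the names.
DIR="${1:-/tmp/fat-zap-dir}"   # illustrative path
NEW="$DIR.rebuild"
mkdir -p "$DIR"                # ensure it exists for this demo
touch "$DIR/a" "$DIR/b"        # demo entries
mkdir "$NEW"
# A glob is fine for a demo (note it skips dotfiles); with millions
# of entries use find "$DIR" -mindepth 1 -maxdepth 1 | xargs instead
# to avoid argv limits.
mv "$DIR"/* "$NEW"/
rmdir "$DIR"
mv "$NEW" "$DIR"
ls "$DIR"                      # lists the moved demo entries
```

The rename at the end keeps the original path, so consumers of the directory never see a different name, only a brief window where it's missing.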
I've experienced that it can also be slow if all the files are deleted. Since @behlendorf mentioned that downgrading is not implemented, I'm wondering whether deallocation of unused entries is implemented?
The issue can be seen by creating 3 million files in a dir (taken from the case that made me discover it) and then deleting them. A simple touch for creation will do. After a clean import of the fs (no caching) I get:
```
# time ls /test

real    1m35.507s
user    0m0.000s
sys     0m0.871s
# zdb -ddd test/ 4
Dataset test [ZPL], ID 21, cr_txg 1, 10.1G, 7 objects

    Object  lvl  iblk  dblk  dsize  lsize  %full  type
         4    4   16K   16K   130M   256M  99.96  ZFS directory
```
That's with `ashift=12`. With `ashift=9`:
```
# time ls /test2

real    0m7.467s
user    0m0.000s
sys     0m0.345s
# zdb -ddd test2/ 8
Dataset test2 [ZPL], ID 21, cr_txg 1, 305M, 1761741 objects

    Object  lvl  iblk  dblk  dsize  lsize  %full  type
         8    4   16K   16K  81.7M   256M  99.96  ZFS directory
```
So ashift=12 has a huge effect.
I hope this is not how it was meant to be. On FreeBSD it is the same.
> In case you don't need atime, maybe `zfs set atime=off tank/www` so your reads will not turn into writes?
Disabling atime solved a similar issue for me.
On x64 CentOS 7, this seems to have started with the switch to the `kmod` repo and the update to `kmod-zfs-0.6.5.3`. With `kernel-3.10.0-229.14.1.el7.x86_64` I had overnight scripts actually dying on I/O; after rebooting to `3.10.0-229.11.1.el7.x86_64` the scripts complete, but (first one today):
After that, the numbers on this and other similarly large dirs on the same FS are OK, like `real 0m0.340s`.