openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

arc_adapt left spinning after rsync with lots of small files #3303

Closed: angstymeat closed this issue 9 years ago

angstymeat commented 9 years ago

This is split off from #3235. The symptom is that after a large set of rsyncs containing lots of small files, I'm left with arc_adapt spinning at around 3.3% CPU time.

I've compiled the info that @dweeezil wanted here: https://cloud.passcal.nmt.edu/index.php/s/l52T2UhZ0K7taY9.

The system is a Dell R515 with 16GB of RAM. The pool is a single raidz2 pool made up of 7 2TB SATA drives. Compression is enabled and set to lz4; atime is off.

The OS is Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel.

The machine this happens on is an offsite server that hosts our offsite backups. There are 20+ rsync processes running that send large and small files from our internal systems to this system. The majority of the files are either large data files or the files you would typically find as part of a Linux installation (/usr, /bin, /var, etc.).

Also, about 50 home directories are backed up containing a mix of large and small files.

This takes about 45 minutes. One hour after these jobs are started, the email servers begin their backup (so there is usually a 15-minute delay between the start of one set of backups and another). Also, our large data collector sends its backups at this time. These are large files, and there are a lot of them. It is sometime during this second-stage backup that the issue occurs.

During the backup, I have another process that periodically runs an echo 2 > /proc/sys/vm/drop_caches. I do this because once the ARC gets filled up performance drops drastically. Without doing this, it will take up to 10 hours to perform a backup. With it, it will take less than 2 hours.
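
Roughly, the periodic drop looks like the sketch below (simplified; backup_jobs_running is just a stand-in for however the wrapper checks that the rsyncs are still going, and the interval is only illustrative):

# hedged sketch of the periodic cache drop, not the exact script
while backup_jobs_running; do
    echo 2 > /proc/sys/vm/drop_caches    # 2 = free reclaimable slab objects (dentries and inodes) only
    sleep 300
done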

This happens even if I do not run the periodic drop_caches, but seems to occur less often.

The drop_caches is a relatively new addition to my scripting on this system, as this didn't appear to happen under 0.6.3. I don't have a good idea of when it started, but I'm pretty sure it was sometime around the kmem rework in 0.6.4.

I am unable to roll back and test under 0.6.3 because the new feature flags were enabled while I was testing 0.6.4. This unit is not a critical system, so I have quite a bit of leeway with it as long as I don't have to destroy and recreate the pool. I usually use it to test new versions of ZFS since it gets a lot of disk usage.

dweeezil commented 9 years ago

@angstymeat I've started looking over your debug output. It's not really spinning in arc_adapt_thread(); instead, what's likely happening is that arc_adjust_meta() is running through all 4096 iterations without accomplishing much. The drop_caches is likely only making things worse, which is why I no longer recommend using it as a workaround.

You might try increasing zfs_arc_meta_prune from its default of 10000 to maybe 100000 or even higher. Unfortunately, I didn't ask you to include the arcstats in your debugging output, but I've got a very good idea of what the numbers would look like. Your dnode cache (dnode_t) has over 3 million entries, and in order to beat down the ARC metadata, they've got to be cleaned out a lot faster. I'll be interested to hear whether increasing zfs_arc_meta_prune helps. If it does, it seems we could add a heuristic to progressively multiply it internally if things aren't being freed rapidly enough.

If all else fails and you want to try something which should immediately free some metadata, try setting zfs_arc_meta_limit to a low-ish value.
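
Both tunables should be adjustable at runtime through the ZoL module parameters, without a reboot; for example (the values here are purely illustrative):

echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_meta_limit    # ~1GiB, i.e. a low-ish cap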

Regardless, I think you should avoid using drop_caches if possible.

angstymeat commented 9 years ago

I had been running with zfs_arc_meta_prune set to 100000, but with the same results. When I rebooted this system last, I set it back to the default.

I really want to avoid using drop_caches since it is such a hack. I've been using it because the backup times had become so high.

If 100,000 isn't enough, do you have a recommendation for a higher number? Should I try increasing the order of magnitude each time I test it? Is there an upper limit value that will cause more problems than it solves?

angstymeat commented 9 years ago

Also, does zfs_arc_meta_prune take effect immediately, or do I need to reboot?

angstymeat commented 9 years ago

I just tried setting zfs_arc_meta_limit to 100MB and upped zfs_arc_meta_prune to 1,000,000 without rebooting, but nothing has changed. arc_adapt is still running, and there's no change in the amount of memory in use.

snajpa commented 9 years ago

@dweeezil what's your reasoning behind going as far as to say it would be best to avoid drop_caches completely? Since the VFS dentry cache can't be bypassed, I see no other option than to use this as a workaround. I've set vfs_cache_pressure to a really high number (IIRC 1M), but in the end the dentry cache over time causes the ARC as a whole to shrink. My use-case is OpenVZ containers, where others use KVM. Even if the shrinkers registered with the VM could be prioritized, which is the only sensible solution I can think of, that still wouldn't do much, as there are other dynamic caches in the system.
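
(For reference, that was set with roughly the following; the exact value is from memory:)

sysctl -w vm.vfs_cache_pressure=1000000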

I'm just being curious, what negative side effects does drop_caches have? Besides obviously dropping even dentries for non-ZFS data (though that's not a problem in my scenario).

angstymeat commented 9 years ago

I've set zfs_arc_meta_prune up to 50,000,000 and it's not helping much with performance, but I think I've seen larger amounts of cache cleared in arcstat.py.

My rsyncs are still running so I'm going to wait until it's finished and see if arc_adapt continues to run afterwards.

So far, I've tried this with zfs_arc_meta_prune set to 100,000 and 1,000,000, with the same result: arc_adapt continues to run long after the rsync processes have finished.

Also, I'm seeing zfs_arc_meta_used exceeding zfs_arc_meta_limit by almost 600MB right now.

dweeezil commented 9 years ago

@snajpa Regarding potential issues when using drop-caches, see https://github.com/zfsonlinux/spl/issues/420#issuecomment-66284710.

@angstymeat Your issue certainly appears to be the same one that bc88866 and 2cbb06b were intended to fix. The question is why it's not working in your case. I suppose a good first step would be to watch the value of arc_prune in arcstats and make sure it's increasing.
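
Something along these lines is enough to watch it (arcstats live under the usual ZoL kstat path):

watch -n 1 'grep ^arc_prune /proc/spl/kstat/zfs/arcstats'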

As to increasing zfs_arc_meta_prune: now that I think about it, the default tuning of zfs_arc_meta_prune=10000 ought to be OK given that you've only got about 2 million znodes. With 2048 metadata iterations in arc_adjust_meta(), the final prune count would be 2048 * 10000 = 20.48M, which is much higher than the number of cached objects in your case.

dweeezil commented 9 years ago

I think I found the problem. torvalds/linux@9b17c6238 (which first appeared in kernel 3.12) added a node id argument to nr_cached_objects, and that is causing our autoconf test to fail.

As a hack, you can try:

diff --git a/config/kernel-shrink.m4 b/config/kernel-shrink.m4
index 1c211ed..6e88a7e 100644
--- a/config/kernel-shrink.m4
+++ b/config/kernel-shrink.m4
@@ -72,7 +72,7 @@ AC_DEFUN([ZFS_AC_KERNEL_NR_CACHED_OBJECTS], [
        ZFS_LINUX_TRY_COMPILE([
                #include <linux/fs.h>

-               int nr_cached_objects(struct super_block *sb) { return 0; }
+               int nr_cached_objects(struct super_block *sb, int nid) { return 0; }

                static const struct super_operations
                    sops __attribute__ ((unused)) = {

I'll see if I can work up a proper autoconf test later today (one which will handle the intermediate kernels). We'll need yet another flag to handle < 3.12 kernels, which still have the callback.

dweeezil commented 9 years ago

And to make matters even more fun, torvalds/linux@4101b62 changed it yet again (post 4.0). This issue also applies to the free_cached_objects callback. I'm working on an enhanced autoconf test.

snajpa commented 9 years ago

@dweeezil If I understand this correctly, echoing 1 to drop_caches is OK when we're talking primarily about cached metadata, and that's what I've been doing from the start. You're right that echoing 2 might get the system stuck in reclaim ~forever; I've experienced that. It's a bad idea to do that on a larger system (RHEL6 kernel, 2.6.32 patched heavily).

dweeezil commented 9 years ago

I just posted pull request #3308 to properly enable the per-superblock shrinker callbacks.

I'm wondering, however, whether we ought to actually do something in the free_cached_objects function. It seems like we ought to call arc_do_user_prune() utilizing the count (which, of course, can't be done right now - it's static plus likely a major layering violation). This might obviate the need for the newly-added restart logic in arc_adjust_meta().

dweeezil commented 9 years ago

@snajpa Yes, my comments apply only to the echo 2 or echo 3 cases which I've seen suggested from time to time.

angstymeat commented 9 years ago

My backup has now been running for 15 hours. I'm going to stop it, apply #3308, leave zfs_arc_meta_prune at its default value, and try it again.

dweeezil commented 9 years ago

@angstymeat Please hold off just a bit on applying that patch. I'm trying to get a bit of initial testing done right now and am going to try to do something to zpl_free_cached_objects() as well.

angstymeat commented 9 years ago

I just finished compiling and I'm rebooting, but I'll hold off on doing anything.

dweeezil commented 9 years ago

@angstymeat I'm not going to have a chance to work with this more until later. The latest commit in the branch (6318203) should be safe; however, I'll be surprised if it helps. That said, it certainly shouldn't hurt matters any.

angstymeat commented 9 years ago

I'll try it out just to make sure nothing breaks, then.

dweeezil commented 9 years ago

@angstymeat As I pointed out in a recent comment in #3308, it's very unlikely to make any difference at all and may even make things worse. Have you been able to grab arcstats (arc_prune in particular) during the problem yet?

angstymeat commented 9 years ago

I wasn't expecting anything, and I tried it with the April 17th ZFS commits. I ran the backups and arc_adapt continued running afterwards. It's hard to tell, but it could be running with a little more CPU usage than before: its range before was around 3% to 3.8%, and now it looks like it is between 3.8% and 4.3%.

It's been going for about 6 hours now, and arc_prune is incrementing. It looks like it's incrementing around 3680 per second.
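
(For anyone wanting to reproduce that measurement, sampling the counter twice and dividing by the interval is enough; a rough sketch:)

a=$(awk '$1 == "arc_prune" {print $3}' /proc/spl/kstat/zfs/arcstats)
sleep 30
b=$(awk '$1 == "arc_prune" {print $3}' /proc/spl/kstat/zfs/arcstats)
echo $(( (b - a) / 30 ))    # increments per second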

Here's arcstats:

name                            type data
hits                            4    44003792
misses                          4    35268521
demand_data_hits                4    2409447
demand_data_misses              4    296772
demand_metadata_hits            4    25134591
demand_metadata_misses          4    30445808
prefetch_data_hits              4    3111351
prefetch_data_misses            4    1837834
prefetch_metadata_hits          4    13348403
prefetch_metadata_misses        4    2688107
mru_hits                        4    11009034
mru_ghost_hits                  4    119660
mfu_hits                        4    16535007
mfu_ghost_hits                  4    145394
deleted                         4    34539715
recycle_miss                    4    32133498
mutex_miss                      4    131726
evict_skip                      4    113792369794
evict_l2_cached                 4    0
evict_l2_eligible               4    727929288704
evict_l2_ineligible             4    74060988416
hash_elements                   4    126993
hash_elements_max               4    619388
hash_collisions                 4    2202365
hash_chains                     4    3824
hash_chain_max                  4    6
p                               4    8338591744
c                               4    8338591744
c_min                           4    4194304
c_max                           4    8338591744
size                            4    7512477800
hdr_size                        4    51871896
data_size                       4    0
meta_size                       4    2098882560
other_size                      4    5361723344
anon_size                       4    18759680
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1571022848
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    509100032
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    0
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    963352932
arc_meta_used                   4    7512477800
arc_meta_limit                  4    6253943808
arc_meta_max                    4    7628319744

and again from 30 seconds later:

name                            type data
hits                            4    44003792
misses                          4    35268521
demand_data_hits                4    2409447
demand_data_misses              4    296772
demand_metadata_hits            4    25134591
demand_metadata_misses          4    30445808
prefetch_data_hits              4    3111351
prefetch_data_misses            4    1837834
prefetch_metadata_hits          4    13348403
prefetch_metadata_misses        4    2688107
mru_hits                        4    11009034
mru_ghost_hits                  4    119660
mfu_hits                        4    16535007
mfu_ghost_hits                  4    145394
deleted                         4    34539715
recycle_miss                    4    32133498
mutex_miss                      4    131726
evict_skip                      4    113792369794
evict_l2_cached                 4    0
evict_l2_eligible               4    727929288704
evict_l2_ineligible             4    74060988416
hash_elements                   4    126993
hash_elements_max               4    619388
hash_collisions                 4    2202365
hash_chains                     4    3824
hash_chain_max                  4    6
p                               4    8338591744
c                               4    8338591744
c_min                           4    4194304
c_max                           4    8338591744
size                            4    7512477800
hdr_size                        4    51871896
data_size                       4    0
meta_size                       4    2098882560
other_size                      4    5361723344
anon_size                       4    18759680
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    1571022848
mru_evict_data                  4    0
mru_evict_metadata              4    0
mru_ghost_size                  4    0
mru_ghost_evict_data            4    0
mru_ghost_evict_metadata        4    0
mfu_size                        4    509100032
mfu_evict_data                  4    0
mfu_evict_metadata              4    0
mfu_ghost_size                  4    0
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    0
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_cdata_free_on_write          4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    963412324
arc_meta_used                   4    7512477800
arc_meta_limit                  4    6253943808
arc_meta_max                    4    7628319744

dweeezil commented 9 years ago

@angstymeat Does this system have selinux enabled (and using a policy which supports zfs) and/or are there a lot of xattrs set on the files on the zfs filesystem? Is xattr=sa set?

The problem you're facing is that the new logic in arc_adjust_meta() is not making any progress because the kernel isn't able to free dentries, which effectively pins a whole lot of metadata in memory. My hunch is this may have something to do with dir-style xattrs.

I'm going to modify my test suite to use a dir-style xattr on each file to see whether I can duplicate this behavior.
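
(For anyone following along: dir-style xattrs are what you get with xattr=on; each xattr is stored as a file in a hidden directory attached to the node, which is why they can pin extra dentries and inodes. In a test they can be generated with something along these lines; dataset and file names are hypothetical:)

zfs set xattr=on pool/fs
setfattr -n user.test -v 1 /pool/fs/somefile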

angstymeat commented 9 years ago

No selinux, but xattr=sa is set on all of the filesystems since we're backing up a number of systems that have them.

dweeezil commented 9 years ago

@angstymeat Do you have any non-default settings for primarycache or any module parameters, especially any related to prefetch? I'd also like to confirm that there is only one pool on the system. I'm asking these questions because of how large evict_skip is and also because data_size is zero.

I've not been able to duplicate this problem yet, but it sounds like the key is that you've also likely got a pretty heavy write load going on at the same time as all the filesystem traversal.

angstymeat commented 9 years ago

There is one pool called "storage".

I have atime=off, xattr=sa, relatime=off, and acltype=posixacl. I also have com.sun:auto-snapshot set to true or false on various filesystems.

EDIT: All of my module parameters should be standard. I've removed any modprobe.d settings so they don't interfere with the debugging.

I do see that after the first 25 to 30 minutes of my backups running, I lose all of the data portion of the ARC and I'm only caching metadata.

I use Cacti to graph some of my arc stats like cache hits & misses, size of the cache, arc size, meta_used, etc. Is there anything I can add to my graphs that would help?

spacelama commented 9 years ago

I had arc_adapt spinning at 80% CPU after a week of uptime on 0.6.4.

xattr=on, but not being used AFAIK.

I'll try grabbing stats next time instead of emergency-rebooting.

spacelama commented 9 years ago

arc_adapt shows 3% to 66% usage here (occasionally 100%). NFS serving of 600kB files is incredibly slow (seconds). The 3 devices in the pool have to do 160 tps in iostat -k 1 for the several seconds it takes to load up a single 600kB file. NFS serving is of the tank3 filesystems below:

zpool list
NAME    SIZE   ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   2.72T   2.21T   522G         -    63%    81%  1.00x  ONLINE  -
tank3  5.45T   3.24T  2.22T         -    15%    59%  1.00x  ONLINE  -
tank4  17.5G    960K  17.5G         -     0%     0%  1.00x  ONLINE  -

zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                    2.21T   479G   120K  /tank
tank/backuppc           2.21T   479G  2.20T  legacy
tank/backuppc@20150430  3.94G      -  2.19T  -
tank/backuppc@20150501   900M      -  2.19T  -
tank/backuppc@20150502   222M      -  2.19T  -
tank/backuppc@20150503  94.8M      -  2.20T  -
tank3                   2.16T  1.42T   256K  /tank3
tank3/.mp3.low_qual     7.09G  1.42T  7.09G  /home/tconnors/.mp3.low_qual
tank3/apt_archives      28.9G  1.42T  28.9G  /var/cache/apt/archives
tank3/background         268G  1.42T   268G  /home/tconnors/background
tank3/mp3                159G  1.42T   159G  /home/tconnors/mp3
tank3/images             820G  1.42T   820G  /home/tconnors/images
tank3/photos             194G  1.42T   194G  /home/tconnors/photos
tank3/pvr                365G  1.42T   365G  /home/tconnors/movies/kaffeine
tank3/qBT_dir            165G  1.42T   165G  /home/tconnors/qBT_dir
tank3/raid              26.0G  1.42T  26.0G  -
tank3/scratch            128K  1.42T   128K  /home/tconnors/scratch
tank3/sysadmin          7.72G  1.42T  7.72G  /home/tconnors/sysadmin
tank3/thesis             166G  1.42T   166G  /home/tconnors/thesis
tank4                    852K  17.2G   120K  /tank4

zfs get xattr
NAME                    PROPERTY  VALUE  SOURCE
tank                    xattr     on     default
tank/backuppc           xattr     on     default
tank/backuppc@20150430  xattr     on     default
tank/backuppc@20150501  xattr     on     default
tank/backuppc@20150502  xattr     on     default
tank/backuppc@20150503  xattr     on     default
tank3                   xattr     on     default
tank3/.mp3.low_qual     xattr     on     default
tank3/apt_archives      xattr     on     default
tank3/background        xattr     on     default
tank3/mp3               xattr     on     default
tank3/images            xattr     on     default
tank3/photos            xattr     on     default
tank3/pvr               xattr     on     default
tank3/qBT_dir           xattr     on     default
tank3/raid              xattr     -      -
tank3/scratch           xattr     on     default
tank3/sysadmin          xattr     on     default
tank3/thesis            xattr     on     default
tank4                   xattr     on     default

zfs get primarycache
NAME                    PROPERTY      VALUE  SOURCE
tank                    primarycache  all    default
tank/backuppc           primarycache  all    default
tank/backuppc@20150430  primarycache  all    default
tank/backuppc@20150501  primarycache  all    default
tank/backuppc@20150502  primarycache  all    default
tank/backuppc@20150503  primarycache  all    default
tank3                   primarycache  all    default
tank3/.mp3.low_qual     primarycache  all    default
tank3/apt_archives      primarycache  all    default
tank3/background        primarycache  all    default
tank3/mp3               primarycache  all    default
tank3/images            primarycache  all    default
tank3/photos            primarycache  all    default
tank3/pvr               primarycache  all    default
tank3/qBT_dir           primarycache  all    default
tank3/raid              primarycache  all    default
tank3/scratch           primarycache  all    default
tank3/sysadmin          primarycache  all    default
tank3/thesis            primarycache  all    default
tank4                   primarycache  none   local

(tank4 has nothing on it, tank3/raid is a zvol, tank3 is a 3 disk pool with log and cache, tank is a 1 disk pool with log and cache)

modinfo zfs | grep version
version:        0.6.4-1-2-wheezy
srcversion:     6189308C5B9F92C2EE5B9F0
vermagic:       3.16.0-0.bpo.4-amd64 SMP mod_unload modversions

modinfo spl | grep version
version:        0.6.4-1-wheezy
srcversion:     88A374A9A6ABDC4BD14DF0E
vermagic:       3.16.0-0.bpo.4-amd64 SMP mod_unload modversions

As documented in #3235,

here:

http://rather.puzzling.org/~tconnors/tmp/zfs-stats.tar.gz

2 runs of perf record -ag, perf report --stdio here:

http://rather.puzzling.org/~tconnors/tmp/perf.data.old.gz http://rather.puzzling.org/~tconnors/tmp/perf.data.gz http://rather.puzzling.org/~tconnors/tmp/perf.txt.old.gz http://rather.puzzling.org/~tconnors/tmp/perf.txt.gz

Workload was absolutely nil at the time, except for the 5-minutely munin runs, which most likely didn't correspond to these times.

Uptime is 13 days. The previous reboot was because the same symptoms developed after 7 days. The box is an NFS server, and does full/incremental backups of a network of 3 other machines with up to ~800GB filesystems over rsync; the last backup probably ran 12 hours earlier (certainly not running at this time).

Tim Connors

woffs commented 9 years ago

Same here: v0.6.4-12-0c60cc-wheezy, linux-3.16, two pools, dedup=off, xattr=sa. When arc_meta is full (beyond its own limits: arc_meta_limit=38092588032, arc_meta_max=43158548680), performance is gone. drop_caches=3 helps. The problem is new with 0.6.4; it wasn't there with 0.6.0 ... 0.6.3 (or at least 0.6.2). It reminds me of the old 0.5.x days, when I had to do drop_caches as well; 0.6.0 and later didn't need that.

woffs commented 9 years ago

Still the same problem here. Setting zfs_arc_meta_prune to large values doesn't help. Setting zfs_arc_meta_limit to a rather low value doesn't help. The problem shows up especially when traversing and renaming many entries in large directories. (I still have to use drop_caches=2, but I have already experienced a deadlock there, so this is no real solution.) Is there a chance to circumvent this condition by setting parameters, or to get it fixed inside ZoL? Or by patching Linux or choosing another kernel version? I'm running v0.6.4-16-544f71-wheezy at the moment; the kernel is 3.16.7-ckt9-3~deb8u1~bpo70+1 (Debian Wheezy). Thank you!

behlendorf commented 9 years ago

@woffs there's still some work underway for this, and patches may be available in the next couple of days, although they're targeted primarily at older kernels. Could you check whether HAVE_SHRINK is defined in the zfs_config.h file in the zfs build directory?

$ grep HAVE_SHRINK zfs_config.h
#define HAVE_SHRINK 1

woffs commented 9 years ago

Kernel 3.2 (3.2.68-1+deb7u1), which I'm considering rebooting into, has #define HAVE_SHRINK 1.
Kernel 3.16 (3.16.7-ckt9-3~deb8u1~bpo70+1), which is currently running, has /* #undef HAVE_SHRINK */.

What does that mean for our problem here? Ah, I see: the failed autoconf test under 3.16. HAVE_SHRINK should be 1 here, too, right?

So, would it help me to boot into 3.2 until this is fixed? In other words, does the per-filesystem shrinker help us out of our initial arc_adjust_meta problem?

angstymeat commented 9 years ago

@behlendorf, @dweeezil:

I applied the patch from #3481 this morning and I'm seeing a night-and-day difference. CPU and I/O are smoother, not bursty. I got through my entire backup without the hideous performance problems I had been reporting.

The backup performance seems to be back to where it was in 0.6.3, which is good. Up until this point, I had been trying every nightly update of #3115 with only sporadic improvements. My backups are back to taking about an hour to complete instead of the 8+ hours they had been taking (assuming I didn't get deadlocked somewhere).

I'm running a patch stack that consists of #3344, #3374, #3432, and #3481.

behlendorf commented 9 years ago

@woffs yes, either HAVE_SPLIT_SHRINKER_CALLBACK or HAVE_SHRINK should be defined. If neither is, then you'll fall back to some compatibility code which has a known bug that prevents dentries from being reclaimed. I have a patch to address that, but it still needs a little cleanup and testing.
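
A quick way to check for both at once:

grep -E 'HAVE_SHRINK|HAVE_SPLIT_SHRINKER_CALLBACK' zfs_config.h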

@angstymeat thanks for the feedback. That's very good news. Hopefully we should be able to cut down on your patch stack in the next few days by getting these patches finalized and merged.

woffs commented 9 years ago

@behlendorf, @angstymeat That's great news. Although I am willing to test patches and compile something (I was doing this in the old pre-0.5.x days, but now I'm just using the Debian packages), I have kept my Debian daily zfs version and have just rebooted into Linux 3.2 and will monitor what happens. The first impression is good, but it'll take a few days to be sure.

angstymeat commented 9 years ago

I might have spoken too soon. The first backup script run I did earlier went great. Now, the scheduled one at 2am can't seem to keep up. I have arc_adapt running between 25%-60% CPU and my memory is holding steady at around 12.5GB out of 16GB in use. There are 64 arc_prune threads running at between 1.5%-6% each.

Overall CPU usage is almost 50% greater than it was earlier, and I don't know what the difference is.

arc_meta_used is 600MB over arc_meta_limit. I know that arc_meta_limit is now a soft cap, but it's something that didn't happen during the earlier run.

It still seems a bit better overall than it was, but not near what I was getting under 0.6.3.

I tried bumping zfs_arc_meta_prune up to 100000, 1000000, and even 10000000, but I'm not seeing a difference.

The whole backup has been running for 2.5 hours right now, and I'll check it again in the morning.

dweeezil commented 9 years ago

@angstymeat Just a hunch, could you please install perf and run perf top -ag while the system is under high CPU load and see where all the time is really being spent. I've got a hunch you'll see a lot of time spent in _raw_spin_lock_irq. In fact, after starting perf top, you can type /spin_lock (in the perf curses interface) to restrict the view to only those related functions. The number of interest is listed under the column labeled "Self". If that's not the hog, it would be interesting to know which functions have the highest "Self" values. I suspect it's not the ZFS code, proper.

angstymeat commented 9 years ago

arc_adapt and the arc_prune threads are still running...

I'm showing:

53.51% _raw_spin_lock_irqsave
0.42% _raw_spin_lock
0.25% _raw_spin_lock_irq

woffs commented 9 years ago

arc_adapt and the arc_prune threads are still running...

I'm showing:

53.51% _raw_spin_lock_irqsave
0.42% _raw_spin_lock
0.25% _raw_spin_lock_irq

This looks somehow familiar to me.

kernelOfTruth commented 9 years ago

@angstymeat https://bugzilla.redhat.com/show_bug.cgi?id=879801#c13

Regarding this: isolate_freepages tries to defragment RAM to create contiguous memory space for transparent hugepages. After

echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled

I don't see this problem on my system anymore...could someone double check please?

https://bugzilla.redhat.com/show_bug.cgi?id=888380 https://bugzilla.redhat.com/show_bug.cgi?id=879801

https://bugzilla.redhat.com/show_bug.cgi?id=879801#c14

Yes, as per http://forums.fedoraforum.org/showthread.php?t=285246, the following is a successful work-around for the issue:

echo never > /sys/kernel/mm/transparent_hugepage/defrag

angstymeat commented 9 years ago

At the least, echoing these settings while I'm in this state doesn't appear to make a difference.

dweeezil commented 9 years ago

@angstymeat Seems my hunch was right. The next step is to do the same thing, filter the perf top display down to those functions and then use the "E" command to expand the display and show the places from which the spin locks are being acquired. Obviously, they're somewhere in the threads you mentioned but it would be nice to know what call path is causing all the spinning.

angstymeat commented 9 years ago

-   42.93%    42.93%  [kernel]  [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 70.71% taskq_thread
           kthread
           ret_from_fork
      - 23.04% taskq_dispatch
           arc_prune_async
           arc_adjust
           arc_adapt_thread
           thread_generic_wrapper
           kthread
           ret_from_fork
      - 2.22% __wake_up
         - 50.75% taskq_thread
              kthread
              ret_from_fork
         - 44.74% task_done
              taskq_thread
              kthread
              ret_from_fork
         - 3.33% taskq_dispatch
              arc_prune_async
              arc_adjust
              arc_adapt_thread
              thread_generic_wrapper
              kthread
              ret_from_fork
      - 1.55% try_to_wake_up
           default_wake_function
      - 1.09% arc_prune_async
           arc_adjust
           arc_adapt_thread
           thread_generic_wrapper
           kthread
           ret_from_fork
      - 0.65% add_wait_queue_exclusive
           taskq_thread
           kthread
-    0.62%     0.62%  [kernel]  [k] _raw_spin_lock
   - _raw_spin_lock
        14.18% list_lru_count_node
           super_cache_scan
           zfs_sb_prune
           zpl_prune_sb
           arc_prune_task
           taskq_thread
           kthread
           ret_from_fork
        10.79% zfs_sb_prune
           zpl_prune_sb
           arc_prune_task
           taskq_thread
           kthread
           ret_from_fork
        6.22% zio_done
        6.16% taskq_dispatch
           arc_prune_async
           arc_adjust
           arc_adapt_thread
           thread_generic_wrapper
           kthread
           ret_from_fork
        5.33% super_cache_scan
           zfs_sb_prune
           zpl_prune_sb
           arc_prune_task
           taskq_thread
           kthread
           ret_from_fork
        4.52% list_lru_walk_node
        3.75% get_next_timer_interrupt
           tick_nohz_stop_sched_tick
           __tick_nohz_idle_enter
        3.74% try_to_wake_up
        2.49% dbuf_rele_and_unlock
        2.39% zio_wait_for_children
        2.10% rrw_enter_read
        2.09% put_super
           super_cache_scan
           zfs_sb_prune
           zpl_prune_sb
           arc_prune_task
           taskq_thread
           kthread
           ret_from_fork
        2.00% grab_super_passive
           super_cache_scan
           zfs_sb_prune
           zpl_prune_sb
           arc_prune_task
           taskq_thread
           kthread
           ret_from_fork
-    0.25%     0.25%  [kernel]  [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
      - 92.82% __schedule
         - 89.04% schedule
            - taskq_thread
              kthread
              ret_from_fork
         - 10.79% schedule_preempt_disabled
            - cpu_startup_entry
                 85.16% start_secondary
               - 14.84% rest_init
                    start_kernel
                    x86_64_start_reservations
                    x86_64_start_kernel
      - 2.82% schedule_preempt_disabled
         - cpu_startup_entry
              90.05% start_secondary
            - 9.95% rest_init
                 start_kernel
                 x86_64_start_reservations
                 x86_64_start_kernel
      - 2.06% schedule
         - 94.29% taskq_thread
              kthread
              ret_from_fork
         - 3.18% schedule_timeout
            - 51.62% __cv_timedwait_common
                 __cv_timedwait_interruptible
                 arc_user_evicts_thread
                 thread_generic_wrapper
                 kthread
                 ret_from_fork
            - 48.38% rcu_gp_kthread
                 kthread
                 ret_from_fork
           1.68% schedule_hrtimeout_range_clock
              schedule_hrtimeout_range
              poll_schedule_timeout
           0.86% rcu_gp_kthread
              kthread
              ret_from_fork
      - 0.50% __do_softirq

angstymeat commented 9 years ago

@kernelOfTruth , disabling transparent huge pages doesn't appear to have made a difference.

angstymeat commented 9 years ago

I was compiling ZFS & SPL with today's commits, and I just noticed that I'm booting with slub_nomerge on my grub command line. It looks like I added it last September when @dweeezil was helping me with #2725.

I don't know if it's causing any problems or not, but I just removed it.

dweeezil commented 9 years ago

@angstymeat How many filesystems are mounted? I have a feeling you're getting burned by concurrent pruning of a lot of different filesystems and, to a lesser degree, because we may hold arc_prune_mtx for a long time if there are a lot of filesystems. This thread is getting a lot of activity and it's become difficult for me to keep track of your specific details. You mentioned using almost 15GiB of ARC above but your original posting suggests the system has only 16GiB of RAM. Have you increased the max limits? Also, does the system really have 64 threads available (which would likely imply a 4-node NUMA system with an 8/16 CPU in each node but only 16GiB of RAM total)? I'm a bit concerned too many threads are being created for the system, especially if it really has that balance between CPU power and RAM.

angstymeat commented 9 years ago

I've got 1 pool with 35 filesystems.

My ARC is using 7.7GB with arc_max at its default of almost 8G. The system has 2 Opteron 4122 processors (2 CPUs with 4 cores each). I haven't touched any of the defaults, except zfs_arc_meta_prune, which is currently at 100000.

I do have another 16GB for it that I was waiting to put in until this was debugged; I have to send it to the remote site and have the sysadmin up there install it.

My main concern is that I wasn't seeing this happen under 0.6.3, and I don't think I was seeing it on 0.6.4 until after the kmem rework. I would like to go back and run tests under 0.6.3, but I wasn't paying attention early on and upgraded the pools.

I've thought about offloading all of its data, going back to 0.6.3, reloading, and running tests, but it would take quite a while to do, and I'm not sure whether 0.6.3 runs under the newer kernels (3.18+).

dweeezil commented 9 years ago

@angstymeat Could you please post the output of dmesg | grep nr_cpu_ids. It looks like the kernel thinks it's possible to hot-plug up to 64 CPUs, given that you've got 64 arc_prune threads running. If you can tolerate a reboot, adding nr_cpus=8 to the kernel command line might provide some relief. There are other places where max_possible_cpus() is used for scaling, so it's likely your CPU power is being overcommitted pretty badly. Your 35 filesystems isn't quite as high as I expected, but it's still plenty to overwhelm the CPUs when we're going to allow up to 64 threads.
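
(On Fedora, setting that would look roughly like the following; the grub.cfg path assumes a BIOS install and differs on EFI systems:)

# append nr_cpus=8 to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot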

I'll note that ZoL's use of max_ncpus (particularly as it relates to mutexes) is something I'm planning on investigating now that the arc mutex contention patch has been committed. Among other things, it causes trouble on systems where the kernel is built with CONFIG_MAXSMP set, particularly in EC2/Xen environments.

I'm going to try to come up with a patch to keep the pruning under better control, particularly on systems with lots of filesystems. With ZFS, it's not uncommon for there to be hundreds or even thousands of filesystems mounted, particularly when considering snapshots.

angstymeat commented 9 years ago

I've got this for dmesg | grep nr_cpu_ids:

[    0.000000] setup_percpu: NR_CPUS:1024 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:2
[    0.000000]  RCU restricting CPUs from NR_CPUS=1024 to nr_cpu_ids=64.
[    0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=64

Rebooting this machine isn't a problem. I'll make the change and try the backups again.

dweeezil commented 9 years ago

@angstymeat At least it was cut down to 64. It looks like your kernel was built with CONFIG_NR_CPUS=1024.

angstymeat commented 9 years ago

It's the stock Fedora 20 kernel. I'm currently trying a script run with nr_cpus set to 8.

angstymeat commented 9 years ago

And I've got it in the same state again with arc_adapt at 17%-30% CPU and 8 arc_prune threads running. It also seems to happen a few minutes into the mail directory backup.

dweeezil commented 9 years ago

@angstymeat I didn't expect it would help all that much, but at least the numbers of threads and mutexes on your system are more sane with the nr_cpus setting.