Closed by angstymeat 9 years ago.
@angstymeat I've started looking over your debug output. It's not really spinning in arc_adapt_thread(); instead, what's likely happening is that arc_adjust_meta() is running through all 4096 iterations without accomplishing much. The drop_caches is likely only making things worse, which is why I no longer recommend it as a workaround.
You might try increasing zfs_arc_meta_prune from its default of 10000 to maybe 100000 or even higher. Unfortunately, I didn't ask you to include the arcstats in your debugging output, but I've got a very good idea of what the numbers would look like. Your dnode cache (dnode_t) has over 3 million entries, and in order to beat down the ARC metadata, they've got to be cleaned out a lot faster.
I'll be interested to hear whether increasing zfs_arc_meta_prune helps. If it does, it seems we could add a heuristic to progressively multiply it internally when things aren't being freed rapidly enough.
If all else fails and you want to try something which should immediately free some metadata, try setting zfs_arc_meta_limit to a low-ish value.
Regardless, I think you should avoid using drop_caches if possible.
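For reference, both parameters mentioned above can be changed at runtime through the module parameter interface; a minimal sketch, assuming the stock ZoL paths under /sys/module/zfs and purely illustrative values:

# bump the per-iteration prune count (example value, not a recommendation)
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
# optionally pin the metadata limit to a low-ish value, e.g. 2 GiB
echo $((2 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_meta_limit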
I had been running with zfs_arc_meta_prune set to 100000, but with the same results. When I last rebooted this system, I set it back to the default.
I really want to avoid using drop_caches since it is such a hack. I've been using it because the backup times had become so high.
If 100,000 isn't enough, do you have a recommendation for a higher number? Should I try increasing it by an order of magnitude each time I test? Is there an upper limit beyond which it causes more problems than it solves?
Also, does zfs_arc_meta_prune take effect immediately, or do I need to reboot?
I just tried setting `zfs_arc_meta_limit` to 100MB and upped `zfs_arc_meta_prune` to 1,000,000 without rebooting, but nothing has changed. arc_adapt is still running, and there's no change in the amount of memory in use.
@dweezil what's your reasoning behind going as far as to say it would be best to avoid drop_caches completely? Since the VFS dentry cache can't be bypassed, I see no other option than to use this as a workaround. I've set vfs_cache_pressure to a really high number (IIRC 1M), but in the end the dentry cache over time causes the ARC as a whole to shrink. My use case is OpenVZ containers, where others use KVM. Even if the shrinkers registered with the VM could be prioritized, which is the only sensible solution I can think of, that still wouldn't do much, as there are other dynamic caches in the system.
I'm just curious: what negative side effects does drop_caches have, besides obviously dropping dentries for non-ZFS data as well (though that's not a problem in my scenario)?
I've set zfs_arc_meta_prune up to 50,000,000 and it's not helping much with performance, but I think I've seen larger amounts of cache cleared in arcstat.py.
My rsyncs are still running so I'm going to wait until it's finished and see if arc_adapt continues to run afterwards.
So far, I've tried this with zfs_arc_meta_prune set to 100,000 and 1,000,000 with the same results of arc_adapt continuing to run long after the rsync processes have finished.
Also, I'm seeing zfs_arc_meta_used exceeding zfs_arc_meta_limit by almost 600MB right now.
@snajpa Regarding potential issues when using drop_caches, see https://github.com/zfsonlinux/spl/issues/420#issuecomment-66284710.
@angstymeat Your issue certainly appears to be the same one that bc88866 and 2cbb06b were intended to fix. The question is why it's not working in your case. I suppose a good first step would be to watch the value of arc_prune in arcstats and make sure it's increasing.
As to increasing zfs_arc_meta_prune: now that I think about it, the default tuning of zfs_arc_meta_prune=10000 ought to be OK given that you've only got about 2 million znodes. With 2048 metadata iterations in arc_adjust_meta(), the cumulative prune count would be 20.48M (10000 × 2048), which is much higher than the number of cached objects in your case.
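A simple way to watch that counter over time (a sketch, assuming the usual ZoL kstat path and GNU awk for strftime; the 10-second interval is arbitrary):

# print a timestamped arc_prune sample every 10 seconds
while sleep 10; do awk '/^arc_prune /{print strftime("%T"), $3}' /proc/spl/kstat/zfs/arcstats; done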
I think I found the problem. torvalds/linux@9b17c6238 (appeared first in kernel 3.12) added a node id argument to nr_cached_objects and is causing our autoconf test to fail.
As a hack, you can try:
diff --git a/config/kernel-shrink.m4 b/config/kernel-shrink.m4
index 1c211ed..6e88a7e 100644
--- a/config/kernel-shrink.m4
+++ b/config/kernel-shrink.m4
@@ -72,7 +72,7 @@ AC_DEFUN([ZFS_AC_KERNEL_NR_CACHED_OBJECTS], [
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
- int nr_cached_objects(struct super_block *sb) { return 0; }
+ int nr_cached_objects(struct super_block *sb, int nid) { return 0; }
static const struct super_operations
sops __attribute__ ((unused)) = {
I'll see if I can work up a proper autoconf test later today (which will handle the intermediate kernels). We'll need yet another flag to handle < 3.12 kernels which still do have the callback.
And to make matters even more fun, torvalds/linux@4101b62 changed it yet again (post 4.0). This issue also applies to the free_cached_objects callback. I'm working on an enhanced autoconf test.
@dweeezil If I understand this correctly, echoing 1 to drop_caches is OK when we're talking primarily about cached metadata - and that's what I've been doing from the start. You're right that echoing 2 might get the system stuck in reclaim ~forever; I've experienced that. It's a bad idea to do that on a larger system (RHEL6 kernel, 2.6.32 patched heavily).
I just posted pull request #3308 to properly enable the per-superblock shrinker callbacks.
I'm wondering, however, whether we ought to actually do something in the free_cached_objects function. It seems like we ought to call arc_do_user_prune() utilizing the count (which, of course, can't be done right now - it's static, plus likely a major layering violation). This might obviate the need for the newly-added restart logic in arc_adjust_meta().
@snajpa Yes, my comments apply only to the echo 2 or echo 3 cases which I've seen suggested from time to time.
My backup has now been running for 15 hours. I'm going to stop it, apply #3308, leave zfs_arc_meta_prune at its default value, and try it again.
@angstymeat Please hold off just a bit on applying that patch. I'm trying to get a bit of initial testing done right now and am going to try to do something to zpl_free_cached_objects() as well.
I just finished compiling and I'm rebooting, but I'll hold off on doing anything.
@angstymeat I'm not going to have a chance to work with this more until later. The latest commit in the branch (6318203) should be safe; however, I'll be surprised if it helps. That said, it certainly shouldn't hurt matters any.
I'll try it out just to make sure nothing breaks, then.
@angstymeat As I pointed out in a recent comment in #3308, it's very unlikely to make any difference at all and may even make things worse. Have you been able to grab arcstats (arc_prune in particular) during the problem yet?
I wasn't expecting anything, and I tried it with the April 17th ZFS commits. I ran the backups and arc_adapt continued running afterwards. It's hard to tell, but it could be running with a little more CPU usage than before. Its range before was around 3% to 3.8%; now it looks like it is between 3.8% and 4.3%.
It's been going for about 6 hours now, and arc_prune is incrementing. It looks like it's incrementing at around 3680 per second.
Here's arcstats:
name type data
hits 4 44003792
misses 4 35268521
demand_data_hits 4 2409447
demand_data_misses 4 296772
demand_metadata_hits 4 25134591
demand_metadata_misses 4 30445808
prefetch_data_hits 4 3111351
prefetch_data_misses 4 1837834
prefetch_metadata_hits 4 13348403
prefetch_metadata_misses 4 2688107
mru_hits 4 11009034
mru_ghost_hits 4 119660
mfu_hits 4 16535007
mfu_ghost_hits 4 145394
deleted 4 34539715
recycle_miss 4 32133498
mutex_miss 4 131726
evict_skip 4 113792369794
evict_l2_cached 4 0
evict_l2_eligible 4 727929288704
evict_l2_ineligible 4 74060988416
hash_elements 4 126993
hash_elements_max 4 619388
hash_collisions 4 2202365
hash_chains 4 3824
hash_chain_max 4 6
p 4 8338591744
c 4 8338591744
c_min 4 4194304
c_max 4 8338591744
size 4 7512477800
hdr_size 4 51871896
data_size 4 0
meta_size 4 2098882560
other_size 4 5361723344
anon_size 4 18759680
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 1571022848
mru_evict_data 4 0
mru_evict_metadata 4 0
mru_ghost_size 4 0
mru_ghost_evict_data 4 0
mru_ghost_evict_metadata 4 0
mfu_size 4 509100032
mfu_evict_data 4 0
mfu_evict_metadata 4 0
mfu_ghost_size 4 0
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 963352932
arc_meta_used 4 7512477800
arc_meta_limit 4 6253943808
arc_meta_max 4 7628319744
and again from 30 seconds later:
name type data
hits 4 44003792
misses 4 35268521
demand_data_hits 4 2409447
demand_data_misses 4 296772
demand_metadata_hits 4 25134591
demand_metadata_misses 4 30445808
prefetch_data_hits 4 3111351
prefetch_data_misses 4 1837834
prefetch_metadata_hits 4 13348403
prefetch_metadata_misses 4 2688107
mru_hits 4 11009034
mru_ghost_hits 4 119660
mfu_hits 4 16535007
mfu_ghost_hits 4 145394
deleted 4 34539715
recycle_miss 4 32133498
mutex_miss 4 131726
evict_skip 4 113792369794
evict_l2_cached 4 0
evict_l2_eligible 4 727929288704
evict_l2_ineligible 4 74060988416
hash_elements 4 126993
hash_elements_max 4 619388
hash_collisions 4 2202365
hash_chains 4 3824
hash_chain_max 4 6
p 4 8338591744
c 4 8338591744
c_min 4 4194304
c_max 4 8338591744
size 4 7512477800
hdr_size 4 51871896
data_size 4 0
meta_size 4 2098882560
other_size 4 5361723344
anon_size 4 18759680
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 1571022848
mru_evict_data 4 0
mru_evict_metadata 4 0
mru_ghost_size 4 0
mru_ghost_evict_data 4 0
mru_ghost_evict_metadata 4 0
mfu_size 4 509100032
mfu_evict_data 4 0
mfu_evict_metadata 4 0
mfu_ghost_size 4 0
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 963412324
arc_meta_used 4 7512477800
arc_meta_limit 4 6253943808
arc_meta_max 4 7628319744
@angstymeat Does this system have selinux enabled (and using a policy which supports zfs), and/or are there a lot of xattrs set on the files on the zfs filesystem? Is xattr=sa set?
The problem you're facing is that the new logic in arc_adjust_meta() is not making any progress because the kernel isn't able to free dentries, which effectively pins a whole lot of metadata in memory. My hunch is this may have something to do with dir-style xattrs.
I'm going to modify my test suite to use a dir-style xattr on each file to see whether I can duplicate this behavior.
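As an aside (not from the thread), one way to inspect this is to check the xattr property on a dataset and dump any xattrs actually set on a file; the dataset and file names below are hypothetical:

# "on" means dir-style xattrs, "sa" means system-attribute xattrs
zfs get xattr storage/example
# list all extended attributes on a sample file
getfattr -d -m - /storage/example/somefile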
No selinux, but xattr=sa is set on all of the filesystems since we're backing up a number of systems that have them.
@angstymeat Do you have any non-default settings for primarycache or any module parameters, especially any related to prefetch? I'd also like to confirm that there is only one pool on the system. I'm asking these questions because of how large evict_skip is and also that data_size is zero.
I've not been able to duplicate this problem yet but it sounds like the key is that you've also likely got a pretty heavy write load going on at some time as well as all the filesystem traversal.
There is one pool called "storage".
I have atime=off, xattr=sa, relatime=off, and acltype=posixacl. I also have com.sun:auto-snapshot set to true or false on various filesystems.
EDIT: All of my module parameters should be standard. I've removed any modprobe.d settings so they don't interfere with the debugging.
I do see that after the first 25 to 30 minutes of my backups running, I lose all of the data portion of the ARC and I'm only caching metadata.
I use Cacti to graph some of my arc stats like cache hits & misses, size of the cache, arc size, meta_used, etc. Is there anything I can add to my graphs that would help?
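As a reference for graphing, the counters discussed in this thread can be pulled straight from arcstats; a sketch (field names as they appear in the output pasted above, the awk one-liner itself is just an illustration):

awk '/^(arc_prune|arc_meta_used|arc_meta_limit|arc_meta_max) /{print $1, $3}' /proc/spl/kstat/zfs/arcstats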
I had arc_adapt spinning at 80% cpu after a week of uptime on 0.6.4
xattr=on, but not being used AFAIK.
I'll try grabbing stats next time instead of emergency-rebooting.
arc_adapt 3% to 66% usage here (occasionally 100%). NFS serving of 600kB files is incredibly slow (seconds). The 3 devices in the pool have to do 160 tps in iostat -k 1 for the several seconds it takes to load up a single 600kB file. NFS serving is of the tank3 filesystems below:
zpool list
NAME    SIZE  ALLOC   FREE  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
tank   2.72T  2.21T   522G         -   63%  81%  1.00x  ONLINE  -
tank3  5.45T  3.24T  2.22T         -   15%  59%  1.00x  ONLINE  -
tank4  17.5G   960K  17.5G         -    0%   0%  1.00x  ONLINE  -
zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                    2.21T   479G   120K  /tank
tank/backuppc           2.21T   479G  2.20T  legacy
tank/backuppc@20150430  3.94G      -  2.19T  -
tank/backuppc@20150501   900M      -  2.19T  -
tank/backuppc@20150502   222M      -  2.19T  -
tank/backuppc@20150503  94.8M      -  2.20T  -
tank3                   2.16T  1.42T   256K  /tank3
tank3/.mp3.low_qual     7.09G  1.42T  7.09G  /home/tconnors/.mp3.low_qual
tank3/apt_archives      28.9G  1.42T  28.9G  /var/cache/apt/archives
tank3/background         268G  1.42T   268G  /home/tconnors/background
tank3/mp3                159G  1.42T   159G  /home/tconnors/mp3
tank3/images             820G  1.42T   820G  /home/tconnors/images
tank3/photos             194G  1.42T   194G  /home/tconnors/photos
tank3/pvr                365G  1.42T   365G  /home/tconnors/movies/kaffeine
tank3/qBT_dir            165G  1.42T   165G  /home/tconnors/qBT_dir
tank3/raid              26.0G  1.42T  26.0G  -
tank3/scratch            128K  1.42T   128K  /home/tconnors/scratch
tank3/sysadmin          7.72G  1.42T  7.72G  /home/tconnors/sysadmin
tank3/thesis             166G  1.42T   166G  /home/tconnors/thesis
tank4                    852K  17.2G   120K  /tank4
zfs get xattr
NAME                    PROPERTY  VALUE  SOURCE
tank                    xattr     on     default
tank/backuppc           xattr     on     default
tank/backuppc@20150430  xattr     on     default
tank/backuppc@20150501  xattr     on     default
tank/backuppc@20150502  xattr     on     default
tank/backuppc@20150503  xattr     on     default
tank3                   xattr     on     default
tank3/.mp3.low_qual     xattr     on     default
tank3/apt_archives      xattr     on     default
tank3/background        xattr     on     default
tank3/mp3               xattr     on     default
tank3/images            xattr     on     default
tank3/photos            xattr     on     default
tank3/pvr               xattr     on     default
tank3/qBT_dir           xattr     on     default
tank3/raid              xattr     -      -
tank3/scratch           xattr     on     default
tank3/sysadmin          xattr     on     default
tank3/thesis            xattr     on     default
tank4                   xattr     on     default
zfs get primarycache
NAME                    PROPERTY      VALUE  SOURCE
tank                    primarycache  all    default
tank/backuppc           primarycache  all    default
tank/backuppc@20150430  primarycache  all    default
tank/backuppc@20150501  primarycache  all    default
tank/backuppc@20150502  primarycache  all    default
tank/backuppc@20150503  primarycache  all    default
tank3                   primarycache  all    default
tank3/.mp3.low_qual     primarycache  all    default
tank3/apt_archives      primarycache  all    default
tank3/background        primarycache  all    default
tank3/mp3               primarycache  all    default
tank3/images            primarycache  all    default
tank3/photos            primarycache  all    default
tank3/pvr               primarycache  all    default
tank3/qBT_dir           primarycache  all    default
tank3/raid              primarycache  all    default
tank3/scratch           primarycache  all    default
tank3/sysadmin          primarycache  all    default
tank3/thesis            primarycache  all    default
tank4                   primarycache  none   local
(tank4 has nothing on it, tank3/raid is a zvol, tank3 is a 3 disk pool with log and cache, tank is a 1 disk pool with log and cache)
modinfo zfs | grep version
version:        0.6.4-1-2-wheezy
srcversion:     6189308C5B9F92C2EE5B9F0
vermagic:       3.16.0-0.bpo.4-amd64 SMP mod_unload modversions
modinfo spl | grep version
version:        0.6.4-1-wheezy
srcversion:     88A374A9A6ABDC4BD14DF0E
vermagic:       3.16.0-0.bpo.4-amd64 SMP mod_unload modversions
As documented in #3235, here:
http://rather.puzzling.org/~tconnors/tmp/zfs-stats.tar.gz
2 runs of perf record -ag, perf report --stdio here:
http://rather.puzzling.org/~tconnors/tmp/perf.data.old.gz
http://rather.puzzling.org/~tconnors/tmp/perf.data.gz
http://rather.puzzling.org/~tconnors/tmp/perf.txt.old.gz
http://rather.puzzling.org/~tconnors/tmp/perf.txt.gz
Workload was absolutely nil at the time, except for the 5-minutely munin runs, which most likely didn't correspond to these times.
Uptime is 13 days. The previous reboot was because the same symptoms developed after 7 days. The box is an NFS server and does full/incremental backups of a network of 3 other machines with up to ~800GB filesystems over rsync; the last backup probably ran 12 hours earlier (certainly not running at this time).
Tim Connors
Same here: v0.6.4-12-0c60cc-wheezy, linux-3.16, two pools, dedup=off, xattr=sa. When arc_meta is full (beyond its own limits: arc_meta_limit=38092588032, arc_meta_max=43158548680), performance is gone. drop_caches=3 helps. The problem is new with 0.6.4; it wasn't there with 0.6.0 ... 0.6.3 (or at least 0.6.2). It reminds me of the old 0.5.x days, when I had to do drop_caches as well. 0.6.0 and later didn't need that.
Still same problem here. Setting zfs_arc_meta_prune to large values doesn't help. Setting zfs_arc_meta_limit to a rather low value doesn't help. The problem shows up especially when traversing and renaming many entries in large directories. (I still have to use drop_caches=2, but I already experienced a deadlock there, so this is no real solution.) Is there a chance to circumvent this condition by setting parameters or to get it fixed inside ZoL? Or by patching linux or choosing another kernel version? I'm running v0.6.4-16-544f71-wheezy at the moment, kernel is 3.16.7-ckt9-3~deb8u1~bpo70+1 (Debian Wheezy). Thank you!
@woffs There's still some work underway for this and patches may be available in the next couple of days, although they're targeted primarily at older kernels. Could you check whether HAVE_SHRINK is defined in the zfs_config.h file in the zfs build directory?
$ grep HAVE_SHRINK zfs_config.h
#define HAVE_SHRINK 1
Kernel 3.2 (3.2.68-1+deb7u1), which I'm considering rebooting into, has #define HAVE_SHRINK 1.
Kernel 3.16 (3.16.7-ckt9-3~deb8u1~bpo70+1), which is currently running, has /* #undef HAVE_SHRINK */.
What does that mean for our problem here? Ah, I see: the failed autoconf test under 3.16. HAVE_SHRINK should be 1 here, too, right?
So, would it help me to boot into 3.2 until this is fixed? In other words, does the per-filesystem shrinker help us out of our initial arc_adjust_meta problem?
@behlendorf, @dweeezil:
I applied the patch from #3481 this morning and I'm seeing a night-and-day difference. CPU & IO are smoother, not bursty. I got through my entire backup without the hideous performance problems I had been reporting.
The backup performance seems to be back to where it was in 0.6.3, which is good. Up until this point, I had been trying every nightly update of #3115 with only sporadic improvements. My backups are back to taking about an hour to complete instead of the 8+ hours they had been at (assuming I didn't get deadlocked somewhere).
I'm running a patch stack that consists of #3344, #3374, #3432, and #3481.
@woffs Yes, either HAVE_SPLIT_SHRINKER_CALLBACK or HAVE_SHRINK should be defined. If neither is, then you'll fall back to some compatibility code which has a known bug that prevents dentries from being reclaimed. I have a patch to address that but it still needs a little cleanup and testing.
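A quick way to check which path a given build took is the same kind of grep shown earlier, extended to both flags (run from the zfs build directory):

grep -E 'HAVE_SHRINK|HAVE_SPLIT_SHRINKER_CALLBACK' zfs_config.h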
@angstymeat thanks for the feedback. That's very good news. Hopefully we should be able to cut down on your patch stack in the next few days by getting these patches finalized and merged.
@behlendorf, @angstymeat That's great news. Although I am willing to test patches and compile something (I was doing this in the old pre-0.5.x days, but now I'm just using the Debian packages), I have kept my Debian daily zfs version and have just rebooted into Linux 3.2, and will monitor what happens. First impressions are good, but it'll take a few days to be sure.
I might have spoken too soon. The first backup script run I did earlier went great. Now, the scheduled one at 2am can't seem to keep up. I have arc_adapt running at between 25%-60% CPU and my memory is holding steady at around 12.5GB out of 16GB in use. There are 64 arc_prune threads running at between 1.5%-6% each.
Overall CPU usage is almost 50% greater than it was earlier, and I don't know what the difference is.
arc_meta_used is 600MB over arc_meta_limit. I know that arc_meta_limit is now a soft cap, but it's something that didn't happen during the earlier run.
It still seems a bit better overall than it was, but not near what I was getting under 0.6.3.
I tried bumping zfs_arc_meta_prune up to 100000, 1000000, and even 10000000, but I'm not seeing a difference.
The whole backup has been running for 2.5 hours right now, and I'll check it again in the morning.
@angstymeat Just a hunch, but could you please install perf and run perf top -ag while the system is under high CPU load and see where all the time is really being spent? I've got a hunch you'll see a lot of time spent in _raw_spin_lock_irq. In fact, after starting perf top, you can type /spin_lock (in the perf curses interface) to restrict the view to only those related functions. The number of interest is listed under the column labeled "Self". If that's not the hog, it would be interesting to know which functions have the highest "Self" values. I suspect it's not the ZFS code, proper.
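Condensed, the steps being asked for look roughly like this (a sketch; perf needs to match the running kernel):

perf top -ag        # sample all CPUs with call graphs
# inside the curses UI: type /spin_lock to filter, then E to expand call chains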
arc_adapt and the arc_prune threads are still running...
I'm showing:
53.51% _raw_spin_lock_irqsave
0.42% _raw_spin_lock
0.25% _raw_spin_lock_irq
This looks somehow familiar to me.
@angstymeat https://bugzilla.redhat.com/show_bug.cgi?id=879801#c13
According to this, isolate_freepages tries to defragment RAM to create contiguous memory space for transparent hugepages. After
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
I don't see this problem on my system anymore... could someone double-check, please?
https://bugzilla.redhat.com/show_bug.cgi?id=888380 https://bugzilla.redhat.com/show_bug.cgi?id=879801
https://bugzilla.redhat.com/show_bug.cgi?id=879801#c14
Yes, as per http://forums.fedoraforum.org/showthread.php?t=285246, the following is a successful work-around for the issue:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
At the least, echoing these settings while I'm in this state doesn't appear to make a difference.
@angstymeat Seems my hunch was right. The next step is to do the same thing, filter the perf top display down to those functions, and then use the "E" command to expand the display and show the places from which the spin locks are being acquired. Obviously, they're somewhere in the threads you mentioned, but it would be nice to know what call path is causing all the spinning.
- 42.93% 42.93% [kernel] [k] _raw_spin_lock_irqsave
- _raw_spin_lock_irqsave
- 70.71% taskq_thread
kthread
ret_from_fork
- 23.04% taskq_dispatch
arc_prune_async
arc_adjust
arc_adapt_thread
thread_generic_wrapper
kthread
ret_from_fork
- 2.22% __wake_up
- 50.75% taskq_thread
kthread
ret_from_fork
- 44.74% task_done
taskq_thread
kthread
ret_from_fork
- 3.33% taskq_dispatch
arc_prune_async
arc_adjust
arc_adapt_thread
thread_generic_wrapper
kthread
ret_from_fork
- 1.55% try_to_wake_up
default_wake_function
- 1.09% arc_prune_async
arc_adjust
arc_adapt_thread
thread_generic_wrapper
kthread
ret_from_fork
- 0.65% add_wait_queue_exclusive
taskq_thread
kthread
- 0.62% 0.62% [kernel] [k] _raw_spin_lock
- _raw_spin_lock
14.18% list_lru_count_node
super_cache_scan
zfs_sb_prune
zpl_prune_sb
arc_prune_task
taskq_thread
kthread
ret_from_fork
10.79% zfs_sb_prune
zpl_prune_sb
arc_prune_task
taskq_thread
kthread
ret_from_fork
6.22% zio_done
6.16% taskq_dispatch
arc_prune_async
arc_adjust
arc_adapt_thread
thread_generic_wrapper
kthread
ret_from_fork
5.33% super_cache_scan
zfs_sb_prune
zpl_prune_sb
arc_prune_task
taskq_thread
kthread
ret_from_fork
4.52% list_lru_walk_node
3.75% get_next_timer_interrupt
tick_nohz_stop_sched_tick
__tick_nohz_idle_enter
3.74% try_to_wake_up
2.49% dbuf_rele_and_unlock
2.39% zio_wait_for_children
2.10% rrw_enter_read
2.09% put_super
super_cache_scan
zfs_sb_prune
zpl_prune_sb
arc_prune_task
taskq_thread
kthread
ret_from_fork
2.00% grab_super_passive
super_cache_scan
zfs_sb_prune
zpl_prune_sb
arc_prune_task
taskq_thread
kthread
ret_from_fork
- 0.25% 0.25% [kernel] [k] _raw_spin_lock_irq
- _raw_spin_lock_irq
- 92.82% __schedule
- 89.04% schedule
- taskq_thread
kthread
ret_from_fork
- 10.79% schedule_preempt_disabled
- cpu_startup_entry
85.16% start_secondary
- 14.84% rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
- 2.82% schedule_preempt_disabled
- cpu_startup_entry
90.05% start_secondary
- 9.95% rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
- 2.06% schedule
- 94.29% taskq_thread
kthread
ret_from_fork
- 3.18% schedule_timeout
- 51.62% __cv_timedwait_common
__cv_timedwait_interruptible
arc_user_evicts_thread
thread_generic_wrapper
kthread
ret_from_fork
- 48.38% rcu_gp_kthread
kthread
ret_from_fork
1.68% schedule_hrtimeout_range_clock
schedule_hrtimeout_range
poll_schedule_timeout
0.86% rcu_gp_kthread
kthread
ret_from_fork
- 0.50% __do_softirq
@kernelOfTruth, disabling transparent huge pages doesn't appear to have made a difference.
I was compiling ZFS & SPL with today's commits, and I just noticed that I'm booting with slub_nomerge
on my grub command line. It looks like I added it September last year when @dweeezil was helping me with #2725.
I don't know if it's causing any problems or not, but I just removed it.
@angstymeat How many filesystems are mounted? I have a feeling you're getting burned by concurrent pruning of a lot of different filesystems and, to a lesser degree, because we may hold arc_prune_mtx for a long time if there are a lot of filesystems. This thread is getting a lot of activity and it's become difficult for me to keep track of your specific details. You mentioned using almost 15GiB of ARC above but your original posting suggests the system has only 16GiB of RAM. Have you increased the max limits? Also, does the system really have 64 threads available (which would likely imply a 4-node NUMA system with an 8/16 CPU in each node but only 16GiB of RAM total)? I'm a bit concerned too many threads are being created for the system, especially if it really has that balance between CPU power and RAM.
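For anyone following along, a quick way to answer the filesystem and thread questions (illustrative commands, not from the thread):

# count mounted ZFS filesystems
zfs list -H -o name -t filesystem | wc -l
# count running arc_prune kernel threads
ps -eLf | grep -c '[a]rc_prune'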
I've got 1 pool with 35 filesystems.
My ARC is using 7.7GB with arc_max at its default of almost 8GB. The system has 2 Opteron 4122 processors (2 CPUs with 4 cores each). I haven't touched any of the defaults, except zfs_arc_meta_prune, which is currently at 100000.
I do have another 16GB for it that I was waiting to put in until this was debugged; I have to send it to the remote site and have the sysadmin up there install it.
My main concern is that I wasn't seeing this happen under 0.6.3, and I don't think I was seeing it on 0.6.4 until after the kmem rework. I would like to go back and run tests under 0.6.3, but I wasn't paying attention early on and upgraded the pools.
I've thought about offloading all of its data, going back to 0.6.3, reloading, and running tests, but it would take quite a while, and I'm not sure whether 0.6.3 runs under the newer kernels (3.18+).
@angstymeat Could you please post the output of dmesg | grep nr_cpu_ids? It looks like the kernel thinks it's possible to hot-plug up to 64 CPUs, given that you've got 64 arc_prune threads running. If you can tolerate a reboot, adding nr_cpus=8 to the kernel command line might provide some relief. There are other places where max_possible_cpus() is used for scaling, so it's likely your CPU power is being overcommitted pretty badly. Your 35 filesystems isn't quite as high as I expected, but it's still plenty to overwhelm the CPUs when we're going to allow up to 64 threads.
I'll note that ZoL's use of max_ncpus (particularly as it relates to mutexes) is something I'm planning on investigating now that the ARC mutex contention patch has been committed. Among other things, it causes trouble on systems where the kernel is built with CONFIG_MAXSMP set, particularly in EC2/Xen environments.
I'm going to try to come up with a patch to keep the pruning under better control, particularly on systems with lots of filesystems. With ZFS, it's not uncommon for there to be hundreds or even thousands of filesystems mounted, particularly when considering snapshots.
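The nr_cpus=8 suggestion above can be made persistent on a Fedora-style system roughly like this (a sketch; the value 8 and the paths are illustrative):

# append nr_cpus=8 to GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config:
grub2-mkconfig -o /boot/grub2/grub.cfg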
I've got this for dmesg | grep nr_cpu_ids:
[ 0.000000] setup_percpu: NR_CPUS:1024 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:2
[ 0.000000] RCU restricting CPUs from NR_CPUS=1024 to nr_cpu_ids=64.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=64
Rebooting this machine isn't a problem. I'll make the change and try the backups again.
@angstymeat At least it was cut down to 64. It looks like your kernel was built with CONFIG_NR_CPUS=1024.
It's the stock Fedora 20 kernel. I'm currently trying a script run with nr_cpus set to 8.
And I've got it in the same state again, with arc_adapt at 17%-30% CPU and 8 arc_prune threads running. It also seems to happen a few minutes into the mail directory backup.
@angstymeat I didn't expect it would help all that much, but at least the number of various threads and mutexes on your system is more sane with the nr_cpus setting.
This is a split off of #3235. The symptom is that after a large set of rsyncs containing lots of small files, I'm left with arc_adapt spinning at around 3.3% CPU time.
I've compiled the info that @dweeezil wanted here: https://cloud.passcal.nmt.edu/index.php/s/l52T2UhZ0K7taY9.
The system is a Dell R515 with 16GB of RAM. The pool is a single raidz2 pool made up of 7 2TB SATA drives. Compression is enabled and set to lz4. atime is off.
The OS is Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel.
The machine this happens on is an offsite server that hosts our offsite backups. There are 20+ rsync processes running that send large and small files from our internal systems to this system. The majority of the files are either large data files or the files you would typically find as part of a linux installation (/usr, /bin, /var, etc.)
Also, about 50 home directories are backed up containing a mix of large and small files.
This takes about 45 minutes. One hour after these jobs are started, the email servers begin their backup (so there is usually a 15-minute delay between the start of one set of backups and the next). Also, our large data collector sends its backups at this time. These are large files, and a lot of them. It is sometime during this second-stage backup that this issue occurs.
During the backup, I have another process that periodically runs an echo 2 > /proc/sys/vm/drop_caches. I do this because once the ARC gets filled up, performance drops drastically. Without doing this, it will take up to 10 hours to perform a backup; with it, it takes less than 2 hours. This happens even if I do not run the periodic drop_caches, but it seems to occur less often.
The drop_caches is a relatively new addition to my scripting on this system as this didn't appear to happen under 0.6.3. I don't have a good idea when it started, but I'm pretty sure it was sometime around the kmem rework on 0.6.4.
I am unable to roll back and test under 0.6.3 because while I was testing 0.6.4, the new feature flags were enabled. This unit is not a critical system, so I have quite a bit of leeway with it, as long as I don't have to destroy and recreate the pool. I usually use it to test new versions of ZFS since it gets a lot of disk usage.