@angstymeat Your `perf` output sure makes it look like one or more of the taskqs are heavily oversubscribed. Is there a chance your system has any snapshots mounted? Bad Things would likely happen if there were hundreds or thousands of mounted filesystems. Could you please do a quick `grep zfs/snapshot /proc/mounts` and see whether any are mounted? There are a bunch of other possibilities that cross my mind, but I'd first like to make sure there isn't a vast number of mounted filesystems. Also, now that you're running the latest code, could you please repost your arcstats when the problem is occurring?
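If it's easier than eyeballing the output, something like this (a quick sketch that only parses /proc/mounts) will give counts:

```sh
# Count mounted ZFS filesystems and any mounted snapshots (.zfs/snapshot paths).
awk '$3 == "zfs" { fs++ }
     $2 ~ /\/\.zfs\/snapshot\// { snaps++ }
     END { printf "zfs filesystems mounted: %d\nsnapshots mounted: %d\n", fs, snaps }' /proc/mounts
```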
No snapshots are mounted, and grep'ing /proc/mounts returns nothing. I have a lot of files, but not a lot of filesystems. Here's `zfs list`:
```
NAME USED AVAIL REFER MOUNTPOINT
storage 7.14T 1.01T 368K /storage
storage/dbserve 29.6G 1.01T 4.62G /storage/dbserve
storage/fs 1.05T 1.01T 320K /storage/fs
storage/fs/TA 1.04T 1.01T 1.04T /storage/fs/TA
storage/fs/office 4.90G 1.01T 4.80G /storage/fs/office
storage/fs/shipping 3.55G 1.01T 3.53G /storage/fs/shipping
storage/home 182G 1.01T 153G /storage/home
storage/local-backup 5.73G 1.01T 1.90G /storage/local-backup
storage/maildir 291G 1.01T 274G /storage/maildir
storage/maildir_old 278G 1.01T 267G /storage/maildir_old
storage/qc-db 49.0G 1.01T 288K /storage/qc-db
storage/qc-db/database 49.0G 1.01T 336K /storage/qc-db/database
storage/qc-db/database/bin-log 288K 1.01T 272K /storage/qc-db/database/bin-log
storage/qc-db/database/data 48.5G 1.01T 48.5G /storage/qc-db/database/data
storage/qc-db/database/innodb 451M 1.01T 371M /storage/qc-db/database/innodb
storage/qc-db/database/logs 1015K 1.01T 823K /storage/qc-db/database/logs
storage/svn 2.12G 1.01T 2.07G /storage/svn
storage/systems 4.65T 1.01T 6.18G /storage/systems
storage/systems/bedrock 5.18G 1.01T 4.81G /storage/systems/bedrock
storage/systems/cloud 476M 1.01T 350M /storage/systems/cloud
storage/systems/durga 3.30T 1.01T 288K /storage/systems/durga
storage/systems/durga/data 3.30T 1.01T 3.07T /storage/systems/durga/data
storage/systems/glisn 1.01T 1.01T 927G /storage/systems/glisn
storage/systems/local-apps 16.4G 1.01T 11.2G /storage/systems/local-apps
storage/systems/mail-dns 39.6G 1.01T 30.1G /storage/systems/mail-dns
storage/systems/matlab 6.29G 1.01T 5.71G /storage/systems/matlab
storage/systems/qandisa 103G 1.01T 102G /storage/systems/qandisa
storage/systems/quartz 760M 1.01T 717M /storage/systems/quartz
storage/systems/support 10.7G 1.01T 9.52G /storage/systems/support
storage/systems/tiamat 121G 1.01T 121G /storage/systems/tiamat
storage/systems/www 10.9G 1.01T 10.3G /storage/systems/www
storage/systems/xeos 33.1G 1.01T 30.9G /storage/systems/xeos
storage/systems/xion 192K 1.01T 192K /storage/systems/xion
storage/vmware 529G 1.01T 529G /storage/vmware
storage/xeos-db 101G 1.01T 39.9G /storage/xeos-db
```
Each filesystem has about 18 snapshots, but they're only used when we need to recover something.
Forgot the arcstats... This should have everything: https://cloud.passcal.nmt.edu/index.php/s/2pRlIzQ8qov4rcO
@angstymeat Thanks. That rules out a huge number of mounted filesystems as the cause. Please try `echo 0 > /sys/module/zfs/parameters/zfs_arc_meta_strategy` and see how it changes things. This should take effect on the fly.
Your problem is narrowed down a bit by `arc_prune 4 5442352736` in your arcstats. That value is incremented every time `arc_prune_task()` is handed off to its taskq. Needless to say, 5.4 billion dispatches are not a Good Thing, and it relates directly to the high CPU load in SPL's taskq handler.
As an aside, I'm going to write some documentation for this and any other new module parameters which were missed as part of the 06358ea commit, as well as a few other related cleanups to function names, etc. in the arc code.
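If it helps with watching this, here's a quick sketch that polls the standard /proc/spl/kstat/zfs/arcstats kstat once a second and prints how fast the arc_prune counter is climbing:

```sh
#!/bin/sh
# Print the per-second growth of the arc_prune kstat (i.e. prune task dispatches).
prev=$(awk '$1 == "arc_prune" { print $3 }' /proc/spl/kstat/zfs/arcstats)
while sleep 1; do
    cur=$(awk '$1 == "arc_prune" { print $3 }' /proc/spl/kstat/zfs/arcstats)
    echo "arc_prune dispatches/sec: $((cur - prev))"
    prev=$cur
done
```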
Well, that stopped `arc_prune` and `arc_adapt` immediately. Do you want me to try running the backup scripts again?
Relevant commit: https://github.com/zfsonlinux/zfs/commit/f6046738365571bd647f804958dfdff8a32fbde4
> Make arc_prune() asynchronous ... This patch additionally adds the zfs_arc_meta_strategy module option, which allows the meta reclaim strategy to be configured. It defaults to a balanced strategy which has been proven to work well under Linux, but the illumos meta-only strategy can be enabled.
I first had to look up what that command does ;)
@dweeezil thanks for mentioning it
@angstymeat: please do.
Now I'm super-slow. There's almost no CPU usage and rsync is reading 1 file every few seconds.
@angstymeat Sure, I'd be interested to hear how it behaves. It would be nice to reboot, too, so that the arcstats start from zero.
@kernelOfTruth My initial port of the arc mutex patch reverted reclaim to the illumos behavior but it was decided later on to add support for the "balanced" mode which was instituted in bc88866 to deal with the many reports of unevictable metadata. Furthermore, the balanced mode was made the default.
In @angstymeat's arcstats, the condition `arc_meta_used > arc_meta_limit` is true and seems to be "stuck" that way. This will cause (by default) 4096 loops in `arc_adjust_meta_balanced()`, half of which will call `arc_prune_async()`, which itself will dispatch as many threads as there are mounted ZFS filesystems to the prune taskq. The problem seems to be that it can't free enough metadata for the process to ever stop, so the system gets into a thrashing state where it's burning most of its CPU time dispatching threads. With the illumos strategy, the per-sb callbacks aren't invoked as frequently but will still be called by the arc shrinker.
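As a quick check of whether that condition is still in force at any given moment, something along these lines (just reading the standard arcstats kstat) will show the overage:

```sh
# Show how far arc_meta_used is above (or below) arc_meta_limit.
awk '$1 == "arc_meta_used"  { used = $3 }
     $1 == "arc_meta_limit" { limit = $3 }
     END { printf "arc_meta_used:  %d\narc_meta_limit: %d\nover limit by:  %d bytes\n",
                  used, limit, used - limit }' /proc/spl/kstat/zfs/arcstats
```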
@angstymeat Could we please see arcstats during the "super-slow" time.
Here's arcstats:
```
name type data
hits 4 70384784
misses 4 15806620
demand_data_hits 4 1608805
demand_data_misses 4 55508
demand_metadata_hits 4 51918024
demand_metadata_misses 4 10586442
prefetch_data_hits 4 111159
prefetch_data_misses 4 856433
prefetch_metadata_hits 4 16746796
prefetch_metadata_misses 4 4308237
mru_hits 4 24851725
mru_ghost_hits 4 334215
mfu_hits 4 28675104
mfu_ghost_hits 4 113754
deleted 4 14523110
mutex_miss 4 21340
evict_skip 4 842402
evict_not_enough 4 177204
evict_l2_cached 4 0
evict_l2_eligible 4 234518568960
evict_l2_ineligible 4 99826913280
evict_l2_skip 4 0
hash_elements 4 255160
hash_elements_max 4 346702
hash_collisions 4 905539
hash_chains 4 14536
hash_chain_max 4 4
p 4 7756597248
c 4 8341868544
c_min 4 33554432
c_max 4 8341868544
size 4 8410224768
hdr_size 4 94918240
data_size 4 0
meta_size 4 2270906368
other_size 4 6044400160
anon_size 4 16924672
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 1771446272
mru_evict_data 4 0
mru_evict_metadata 4 0
mru_ghost_size 4 619285504
mru_ghost_evict_data 4 245149696
mru_ghost_evict_metadata 4 374135808
mfu_size 4 482535424
mfu_evict_data 4 0
mfu_evict_metadata 4 0
mfu_ghost_size 4 716824576
mfu_ghost_evict_data 4 681854464
mfu_ghost_evict_metadata 4 34970112
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 5554960868
arc_meta_used 4 8410224768
arc_meta_limit 4 6256401408
arc_meta_max 4 8410750480
arc_meta_min 4 0
```
@dweeezil so we retained that improvement - great! Thanks
Is there also an issue with the L2ARC, or am I misinterpreting this?
```
evict_l2_cached 4 0
evict_l2_eligible 4 727929288704
evict_l2_ineligible 4 74060988416
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
```
Looking at it again, I have a load average of 7, but my overall CPU usage is 0.5%
@angstymeat Unfortunately, your meta_used is blowing past the meta_limit without balanced mode enabled. I've always been a bit nervous about the default parameters for balanced mode. I think you should try switching back to balanced mode (`echo 1 > /sys/module/zfs/parameters/zfs_arc_meta_strategy`), then lower `zfs_arc_meta_adjust_restarts` to 2 and play with different values of `zfs_arc_meta_prune`, maybe starting at 100000 or so. Larger values shouldn't make a huge difference. The value is ultimately passed into the kernel's generic sb shrinker (`super_cache_scan()`), where it's used to proportion the amount of the inode and dentry caches it attempts to prune.
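Concretely, that amounts to something like the following (the paths are the usual module parameter locations; the values are just the starting points suggested above, not magic numbers):

```sh
# Switch back to the balanced metadata reclaim strategy,
echo 1      > /sys/module/zfs/parameters/zfs_arc_meta_strategy
# limit the number of reclaim restarts,
echo 2      > /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts
# and prune more aggressively on each pass.
echo 100000 > /sys/module/zfs/parameters/zfs_arc_meta_prune
```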
FWIW, I've not been able to reproduce this problem on my test system. I've created 100 filesystems and populated a source filesystem with 3.5M files. RAM has been limited to 16GiB and CPUs to 8. Then I run 10 concurrent rsyncs from the source filesystem into 10 of the 100 I created. It is rather slow, but I've never been able to get the CPU spinning being reported here. It seems there must be some other variable I'm missing. How many rsyncs are running? Do all the files have xattrs (posixacl)? You are still running with `xattr=sa`, right?
Well, that's funny. It's a good way to pause ZFS on my system. :)
I set `zfs_arc_meta_strategy` back to 1 and everything started back up again.
The script initially starts 18 rsyncs. Most of them complete in just a few minutes, leaving about 5 that go for 30 minutes. An hour after those are started (the other jobs are complete by now), another cron runs 4 more -- the big ones that pull our email and data-storage systems.
The first group usually runs OK; I see stalling when the 2nd group runs. Since I applied #3481 and have been running 0.6.4-98_g06358ea, I'm seeing the problems happen sooner, while the initial set of rsyncs is still running. It seems like I hit the memory limit faster with the newer commits than I did with the released 0.6.4.
I'm already running with `zfs_arc_meta_prune` at 100000, and I just set `zfs_arc_meta_adjust_restarts` down to 2. So far, `arc_adapt` isn't running very much and the system looks like it's been able to free some memory.
I turned off `xattr=sa` a week or two ago on all of the filesystems except for one. It turns out we only had one machine that was actually using them.
I don't suppose that my pool could somehow have some kind of metadata damage that is causing problems? I run a scrub every month or so and it hasn't reported any problems.
Performance seems to be a bit slower than it was before I set the meta_strategy to 0, but without `arc_prune` or `arc_adapt` taking up a lot of CPU.
@angstymeat I'd also like to point out the new `zfs_arc_overflow_shift` parameter, which is likely coming into play here. It is documented, BTW, and controls some brand new logic which came along with the `arcs_mtx` restructuring.
Here's a brief outline: It controls whether the ARC is considered to be "overflowing". In your last arcstats, for example, arc_c=8341868544 and arc_size=8410224768 so the ARC is overflowing by 68356224 bytes. The default overflow shift of 8 will set the threshold in this case to 32852440 bytes (8410224768/256) so the overflow condition is in force.
The overflow condition is used in `arc_get_data_buf()`, where buffers are allocated. If the ARC is in the overflowing condition, the reclaim (adapt) thread is woken up and the allocation pauses until the reclaim is finished. This is likely what's causing your slowdown (little CPU utilization and low I/O rates), at least when you aren't using balanced mode.
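For what it's worth, here's a rough sketch for eyeballing that condition from the kstats; it approximates the check as the size exceeding c by more than c right-shifted by zfs_arc_overflow_shift (the in-kernel logic is slightly more involved):

```sh
# Rough approximation of the ARC overflow test: is (size - c) > (c >> overflow_shift)?
shift=$(cat /sys/module/zfs/parameters/zfs_arc_overflow_shift)
awk -v s="$shift" '
    $1 == "size" { size = $3 }
    $1 == "c"    { c = $3 }
    END {
        threshold = int(c / 2^s)
        printf "size - c = %d, threshold = %d, overflowing: %s\n",
               size - c, threshold, (size - c > threshold) ? "yes" : "no"
    }' /proc/spl/kstat/zfs/arcstats
```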
The bottom line, however, is that the problem is caused by not being able to evict enough metadata quickly; the metadata is pinned by the higher-level structures in the kernel. Balanced mode definitely seems to help under Linux, but it seems it can cause real problems under some conditions. The key is likely going to be finding better settings for `zfs_arc_meta_prune` and `zfs_arc_meta_adjust_restarts`. The current scheme effectively calls the generic Linux superblock shrinker with multiples of `zfs_arc_meta_prune` on odd-numbered iterations, where `zfs_arc_meta_adjust_restarts` controls the number of iterations. I had suggested 100000 and 2 for the parameters and that seems to mostly help; maybe something like 100000 and 4, or 100000 and 6, would be better. Since you can reproduce this condition easily, it would be interesting if you could find the best settings. I never cared much for the default iteration count of 4096, but it didn't cause much of a problem in the past. The reason it's a big deal now is that the pruning is done in separate threads and there is a thread for each mounted filesystem. I'm thinking of opening a separate issue specifically to address the taskq design, because it's something that could be changed pretty easily and it seems to be having some unintended consequences.
Thanks for the detailed explanation of when eviction is triggered. I'll try different values for `zfs_arc_meta_adjust_restarts` and see if I can find anything that helps.
Is there a reason it can't catch up after the rsyncs finish? I regularly see my rsync processes terminate, yet `arc_adapt` keeps running continuously afterwards while the rest of the system is idle. I let it go one time and it finally stopped around 20 hours later.
Meanwhile, my machine has been running smoothly for 4 days now and has finished the backup process 4 times, taking about 5½ hours each day, which is quite good. What I did:
I still have zfs-v0.6.4-16-544f71-wheezy running. arc_meta_used is oscillating around arc_meta_limit (=39GB) and has never been more than about 130MB above it.
@angstymeat I may have found the problem. If you've got RAM in each of your Opteron NUMA nodes, the `zfs_sb_prune()` function will only prune from NUMA node 0 when running on >= 3.12 kernels. You can check with `dmesg | grep "On node"` and see whether the kernel is using memory on each node. If you've got a 2-node NUMA system with RAM on each node, please try #3495 and see if it doesn't allow for more metadata to be freed.
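Besides the dmesg grep, sysfs will show the same thing; a minimal sketch for checking how much RAM each node actually has:

```sh
# List per-NUMA-node memory; RAM on more than one node means the
# node-0-only pruning described above could be limiting reclaim.
for node in /sys/devices/system/node/node[0-9]*; do
    echo "=== ${node##*/} ==="
    grep -E 'MemTotal|MemFree' "$node/meminfo"
done
```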
Wow, that's a good catch! I'm trying to now!
So far, so good! It's a big difference. I posted a small comment over at #3495 which pretty much says this seems to be working.
`htop` showed the system hovering around 11GB in use, and it suddenly jumped up to 15.5GB when my mail backups started. `arc_adapt` ran for a second or two along with some `arc_prune` threads, and I regained 1.5GB without impacting the running processes.
This would really explain why not all of our systems are experiencing this. Also, I think our virtual machines don't suffer from this since it appears that no matter how many CPUs they have the machines only see that they have 1 NUMA node.
The jobs are starting to finish up now, and my memory in use has dropped to 11.5GB with a couple still running. For the last 6 months I wouldn't see it go under 14GB at this point.
I'm really thinking this is it.
Also, yes, I'm showing that there's memory in 2 NUMA nodes on this system.
On Sat, 13 Jun 2015, Tim Chase wrote:
> FWIW, I've not been able to reproduce this problem on my test system. I've created 100 filesystems and populated a source filesystem with 3.5M files. RAM has been limited to 16GiB and CPUs to 8. Then I run 10 concurrent rsyncs from the source filesystem into 10 of the 100 I created. It is rather slow, but I've never been able to get the CPU spinning being reported here. It seems there must be some other variable I'm missing. How many rsyncs are running? Do all the files have xattrs (posixacl)? You are still running with `xattr=sa`, right?

Do you run with any uptime in your tests? I always get initial good behaviour from ZFS, but then after a week it all blows up. I assume you create those millions of files in a complex directory structure too, just to make sure you're testing the caching of directory entries vs. files?
I'm closing this because it looks like it's been solved with #3495.
Hmmm, I've recently had this bug pop up, and I'm not running a NUMA machine, so #3495 won't be my solution. I'll have to open a new bug when the next release is out (for now, master was so bad that I had to downgrade back to 0.6.4).
@spacelama You might want to look at #3501. One of the comments there mentions that it solves, for older kernels, an issue similar to the one #3495 addresses.
This is a split-off of #3235. The symptom is that after a large set of rsyncs involving lots of small files, I'm left with `arc_adapt` spinning at around 3.3% CPU time.
I've compiled the info that @dweeezil wanted here: https://cloud.passcal.nmt.edu/index.php/s/l52T2UhZ0K7taY9.
The system is a Dell R515 with 16GB of RAM. The pool is a single raidz2 pool made up of 7 2TB SATA drives. Compression is enabled and set to lz4; atime is off.
The OS is Fedora 20 with the 3.17.8-200.fc20.x86_64 kernel.
The machine this happens on is an offsite server that hosts our offsite backups. There are 20+ rsync processes running that send large and small files from our internal systems to this system. The majority of the files are either large data files or the files you would typically find as part of a linux installation (/usr, /bin, /var, etc.)
Also, about 50 home directories are backed up containing a mix of large and small files.
This takes about 45 minutes. One hour after these jobs are started, the email servers begin their backup (so there's usually about a 15-minute gap between the first set of backups finishing and the next starting). Also, our large data collector sends its backups at this time. These are large files, and there are a lot of them. It is sometime during this 2nd-stage backup that this issue occurs.
During the backup, I have another process that periodically runs `echo 2 > /proc/sys/vm/drop_caches`. I do this because once the ARC gets filled up, performance drops drastically. Without doing this, it will take up to 10 hours to perform a backup; with it, it takes less than 2 hours. The issue happens even if I do not run the periodic drop_caches, but it seems to occur less often.
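The periodic drop is nothing fancy; roughly the idea is something like this (the 15-minute interval is just for illustration):

```sh
# Illustrative sketch: drop dentry/inode caches periodically during the backup window.
# (The interval is arbitrary; tune to taste.)
while true; do
    echo 2 > /proc/sys/vm/drop_caches
    sleep 900
done
```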
The drop_caches is a relatively new addition to my scripting on this system, as this didn't appear to happen under 0.6.3. I don't have a good idea of when it started, but I'm pretty sure it was sometime around the kmem rework in 0.6.4.
I am unable to roll back and test under 0.6.3 because, while I was testing 0.6.4, the new feature flags were enabled. This unit is not a critical system, so I have quite a bit of leeway with it as long as I don't have to destroy and recreate the pool. I usually use it to test new versions of ZFS since it gets a lot of disk usage.