Well, that did not work... I've tried running rsync like this: nocache rsync --lotsofoptions... and nocache -n2 rsync --lotsofoptions...
Still seeing this.... :(
@gbkersey try going lower than half of your RAM for ARC - I suggest testing 40% (around 6 GiB):
# 1GB in hex: 0x40000000 (1073741824, in bytes)
# 1.5GB in hex: 0x60000000 (1610612736, in bytes)
# 2GB in hex: 0x80000000 (2147483648, in bytes)
# 3GB in hex: 0xC0000000 (3221225472, in bytes)
# 4GB in hex: 0x100000000 (4294967296, in bytes)
# 6GB in hex: 0x180000000 (6442450944, in bytes)
# 8GB in hex: 0x200000000 (8589934592, in bytes)
# 10GB in hex: 0x280000000 (10737418240, in bytes)
# 12GB in hex: 0x300000000 (12884901888, in bytes)
Also set zfs_arc_min to at least 1 GiB to prevent the ARC from collapsing, and look at zfs_arc_meta_limit - if things haven't changed, the default for metadata is 1/4 of arc_max; I've set it to 1/3.
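To make that concrete, here's a minimal sketch of applying those values on a 16 GiB box (these are the standard ZoL module-parameter locations; the byte values are just the 6 GiB / 1 GiB / one-third-of-arc_max figures from above):
# persistent, in /etc/modprobe.d/zfs.conf (applied at module load):
options zfs zfs_arc_max=6442450944 zfs_arc_min=1073741824 zfs_arc_meta_limit=2147483648
# or at runtime via sysfs:
echo 6442450944 > /sys/module/zfs/parameters/zfs_arc_max
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_min
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_meta_limit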
Please post output of /proc/spl/kstat/zfs/arcstats when this is happening.
That's roughly 600 MiB of swap used - to raise the efficiency of swap you could try zswap with lz4 compression, enabled via these kernel boot parameters:
zswap.enabled=1 zswap.compressor=lz4
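On Debian/Ubuntu, a sketch of enabling that would be appending the parameters to the kernel command line (assuming GRUB, and that the lz4 compression module is available at boot):
# /etc/default/grub - append to whatever is already on the command line
GRUB_CMDLINE_LINUX_DEFAULT="... zswap.enabled=1 zswap.compressor=lz4"
# regenerate the grub config, then reboot
update-grub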
If ARC collapse is the problem, you might want to try dweeezil/spl@08807f8.
I have made the changes to the zfs_arc parameters: zfs_arc_max=0x199999999 (0.4 * 16G), zfs_arc_min=0x40000000 (1G).
I really don't think I need zswap. The problem is arc_adapt fighting with kswapd - really, any memory being used by the ARC should not be swapped.
I'll see how the backup run goes tonight.
Here's what I'm seeing when the crash starts.....
arc_adapt running @ 100% CPU
The machine has not started hitting swap yet, so I can still access it.....
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
13:22:55 13 0 4 0 4 0 0 0 0 8.7G 6.4G
13:23:00 7.1K 225 3 225 3 0 0 2 29 8.8G 6.4G
13:23:05 4.2K 132 3 132 3 0 0 1 30 8.8G 6.4G
13:23:10 6.6K 210 3 210 3 0 0 4 32 8.8G 6.4G
13:23:15 5.7K 178 3 178 3 0 0 1 23 8.9G 6.4G
13:23:20 7.2K 230 3 230 3 0 0 4 51 8.9G 6.4G
13:23:25 6.2K 194 3 194 3 0 0 2 41 9.0G 6.4G
13:23:30 4.3K 135 3 135 3 0 0 2 42 9.0G 6.4G
13:23:35 5.5K 176 3 176 3 0 0 2 41 9.0G 6.4G
13:23:40 6.0K 189 3 189 3 0 0 1 41 9.1G 6.4G
13:23:45 6.0K 191 3 191 3 0 0 2 41 9.1G 6.4G
13:23:50 6.6K 212 3 212 3 0 0 4 38 9.1G 6.4G
13:23:55 5.4K 172 3 172 3 0 0 2 41 9.2G 6.4G
13:24:00 5.4K 171 3 171 3 0 0 3 45 9.2G 6.4G
13:24:05 6.0K 190 3 190 3 0 0 2 41 9.2G 6.4G
13:24:10 7.3K 230 3 230 3 0 0 2 41 9.3G 6.4G
13:24:15 5.6K 176 3 176 3 0 0 2 41 9.3G 6.4G
13:24:20 970 31 3 31 3 0 0 1 41 9.3G 6.4G
13:24:25 3.5K 106 3 106 3 0 0 3 53 9.3G 6.4G
13:24:30 6.1K 191 3 191 3 0 0 1 23 9.4G 6.4G
13:24:35 4.6K 146 3 146 3 0 0 2 41 9.4G 6.4G
13:24:40 7.1K 225 3 225 3 0 0 3 47 9.4G 6.4G
13:24:45 5.6K 177 3 177 3 0 0 3 32 9.5G 6.4G
13:24:50 5.7K 179 3 179 3 0 0 2 29 9.5G 6.4G
13:24:55 6.5K 204 3 204 3 0 0 3 30 9.6G 6.4G
13:25:00 7.1K 225 3 225 3 0 0 2 35 9.6G 6.4G
13:25:05 5.5K 176 3 176 3 0 0 3 53 9.6G 6.4G
13:25:10 5.5K 176 3 176 3 0 0 3 47 9.7G 6.4G
13:25:15 5.1K 161 3 161 3 0 0 2 29 9.7G 6.4G
13:25:20 5.7K 179 3 179 3 0 0 1 41 9.7G 6.4G
13:25:25 6.3K 200 3 200 3 0 0 2 41 9.8G 6.4G
13:25:30 5.3K 168 3 168 3 0 0 2 35 9.8G 6.4G
13:25:35 5.4K 170 3 170 3 0 0 2 35 9.8G 6.4G
13:25:40 6.5K 206 3 206 3 0 0 2 41 9.9G 6.4G
13:25:45 5.7K 181 3 181 3 0 0 4 48 9.9G 6.4G
13:25:50 6.5K 206 3 206 3 0 0 3 37 10.0G 6.4G
13:25:55 5.1K 161 3 161 3 0 0 1 41 10.0G 6.4G
13:26:00 6.2K 195 3 195 3 0 0 3 47 10G 6.4G
13:26:05 5.7K 181 3 181 3 0 0 2 41 10G 6.4G
13:26:10 5.5K 172 3 172 3 0 0 1 41 10G 6.4G
13:26:15 3.1K 98 3 98 3 0 0 2 41 10G 6.4G
13:26:20 5.5K 169 3 169 3 0 0 1 41 10G 6.4G
13:26:25 5.2K 160 3 160 3 0 0 2 35 10G 6.4G
13:26:30 5.0K 158 3 158 3 0 0 3 44 10G 6.4G
13:26:35 3.2K 101 3 101 3 0 0 1 41 10G 6.4G
13:26:40 5.4K 171 3 171 3 0 0 1 41 10G 6.4G
13:26:45 3.4K 109 3 109 3 0 0 2 41 10G 6.4G
13:26:50 5.4K 169 3 169 3 0 0 3 53 10G 6.4G
13:26:55 5.9K 186 3 186 3 0 0 2 41 10G 6.4G
13:27:01 6.7K 213 3 213 3 0 0 2 29 10G 6.4G
13:27:06 5.2K 165 3 165 3 0 0 2 41 10G 6.4G
/proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 22296906999 44346460030532
name type data
hits 4 54692874
misses 4 2921520
demand_data_hits 4 50057094
demand_data_misses 4 604370
demand_metadata_hits 4 3738501
demand_metadata_misses 4 668711
prefetch_data_hits 4 336750
prefetch_data_misses 4 1486477
prefetch_metadata_hits 4 560529
prefetch_metadata_misses 4 161962
mru_hits 4 4580245
mru_ghost_hits 4 35430
mfu_hits 4 49292225
mfu_ghost_hits 4 174485
deleted 4 993385
recycle_miss 4 282227
mutex_miss 4 58
evict_skip 4 1924942
evict_l2_cached 4 127198536192
evict_l2_eligible 4 167759510016
evict_l2_ineligible 4 7909453824
hash_elements 4 2085408
hash_elements_max 4 2111375
hash_collisions 4 1206520
hash_chains 4 549558
hash_chain_max 4 8
p 4 0
c 4 6871947673
c_min 4 1073741824
c_max 4 6871947673
size 4 10915962400
hdr_size 4 838922520
data_size 4 6388711424
meta_size 4 1222101504
other_size 4 2441626272
anon_size 4 40910848
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 697645056
mru_evict_data 4 0
mru_evict_metadata 4 0
mru_ghost_size 4 79454260736
mru_ghost_evict_data 4 76747343872
mru_ghost_evict_metadata 4 2706916864
mfu_size 4 6872257024
mfu_evict_data 4 6347816960
mfu_evict_metadata 4 0
mfu_ghost_size 4 116707381248
mfu_ghost_evict_data 4 116168853504
mfu_ghost_evict_metadata 4 538527744
l2_hits 4 63503
l2_misses 4 2857972
l2_feeds 4 48390
l2_rw_clash 4 4
l2_read_bytes 4 3696026112
l2_write_bytes 4 77620348928
l2_writes_sent 4 9059
l2_writes_done 4 9059
l2_writes_error 4 0
l2_writes_hdr_miss 4 7
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 5336
l2_cdata_free_on_write 4 497
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 80188120576
l2_asize 4 55961596416
l2_hdr_size 4 24600680
l2_compress_successes 4 643908
l2_compress_zeros 4 0
l2_compress_failures 4 364689
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 88
arc_meta_used 4 4527250976
arc_meta_limit 4 5153960754
arc_meta_max 4 5431007824
Argh... again....
I still have zfs_arc_max set to 6.4GB, yet arcsz is 10GB?
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:26:27 4.7K 965 20 744 18 220 39 736 18 10G 6.4G
23:26:32 6.2K 1.2K 19 377 7 802 80 1.1K 18 10G 6.4G
23:26:38 4.7K 1.2K 25 637 17 565 57 989 23 10G 6.4G
23:26:43 11K 1.2K 10 781 10 375 10 942 8 10G 6.4G
23:26:48 12K 1.2K 9 767 9 409 9 963 7 10G 6.4G
23:26:53 12K 1.0K 8 529 6 496 14 854 7 10G 6.4G
23:26:59 13K 1.2K 8 652 5 557 21 982 8 10G 6.4G
23:27:05 21K 1.3K 6 550 3 766 10 1.2K 5 10G 6.4G
23:27:10 8.5K 1.3K 14 803 10 454 46 1.0K 15 10G 6.4G
23:27:16 9.2K 1.1K 11 676 9 403 22 787 8 10G 6.4G
23:27:22 7.5K 2.1K 27 1.5K 23 537 45 1.6K 23 10G 6.4G
23:27:27 9.9K 2.5K 24 1.8K 21 678 40 2.0K 20 10G 6.4G
23:27:33 13K 1.5K 11 999 9 451 15 1.1K 8 10G 6.4G
23:27:42 13K 2.3K 16 1.7K 17 601 15 1.7K 13 10G 6.4G
23:27:47 16K 2.2K 13 1.5K 12 707 16 1.8K 11 10G 6.4G
23:27:52 21K 3.0K 13 2.5K 21 551 5 2.6K 12 10G 6.4G
23:27:58 7.4K 2.3K 31 1.9K 34 393 22 1.9K 27 10G 6.4G
23:28:07 21K 2.8K 12 2.2K 14 523 8 2.3K 10 10G 6.4G
23:28:15 15K 2.1K 13 1.5K 12 617 20 1.6K 12 10G 6.4G
23:28:37 17K 3.3K 19 2.5K 16 827 33 2.2K 15 10G 6.4G
top - 08:16:22 up 17:09, 6 users, load average: 58.82, 55.78, 50.01
Tasks: 446 total, 6 running, 440 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 37.7 sy, 0.0 ni, 17.3 id, 45.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 16415464 total, 16231988 used, 183476 free, 592 buffers
KiB Swap: 3999740 total, 111652 used, 3888088 free. 1108 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24383 root 20 0 49920 4 4 R 100.0 0.0 541:28.91 ssh
4198 root 20 0 7220 0 0 R 100.0 0.0 27:01.73 updatedb.m+
1236 root 0 -20 0 0 0 R 100.0 0.0 523:03.67 arc_adapt
1249 root 0 -20 0 0 0 R 100.0 0.0 29:20.43 l2arc_feed
118 root 20 0 0 0 0 R 100.0 0.0 507:20.57 kswapd0
119 root 20 0 0 0 0 S 23.2 0.0 13:50.16 kswapd1
@gbkersey You've got a NUMA system. Try cherry-picking 90947b2.
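For anyone wanting to try that, a rough sketch of the cherry-pick workflow, assuming you build the modules from a local git checkout (which tree the commit lives in - spl or zfs - depends on the commit):
cd zfs        # or spl, depending on where the commit belongs
git fetch origin
git cherry-pick 90947b2
# rebuild and reinstall the kernel modules, then reboot or reload them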
@dweeezil I'll take a look at that, thanks.
I just upgraded my system to 3.19 and the latest zfs daily, and I seem to be having a similar problem - about 20 minutes after boot (while tracker is starting up inotify watches on my home directory), arc_adapt will suddenly shoot to 100% CPU usage and my desktop will become unresponsive (though the mouse will still move). Switching to a TTY is nigh impossible, as the shell doesn't appear even after a 30-minute wait following a very slow login.
It doesn't appear to be an out-of-memory situation - I have the ARC constrained to 1GB with 16GB of RAM. This looks like a regression, since I don't recall the same issue happening previously.
Have you tried removing the SSD devices? I also have problems with freezes.
Argh... it died again.... top is showing arc_adapt and kswapd fighting....
top - 04:42:57 up 1 day, 7:36, 3 users, load average: 129.59, 131.21, 127.61
Tasks: 523 total, 4 running, 519 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 19.4 sy, 0.0 ni, 29.3 id, 51.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 16415460 total, 16289720 used, 125740 free, 640 buffers
KiB Swap: 3999740 total, 153912 used, 3845828 free. 7456 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1232 root 0 -20 0 0 0 R 100.0 0.0 206:29.44 arc_adapt
118 root 20 0 0 0 0 R 100.0 0.0 95:00.58 kswapd0
119 root 20 0 0 0 0 R 34.2 0.0 13:47.14 kswapd1
1642 root 38 18 0 0 0 S 3.1 0.0 8:21.33 z_wr_iss/8
1640 root 38 18 0 0 0 S 2.9 0.0 8:16.01 z_wr_iss/6
1638 root 38 18 0 0 0 S 2.9 0.0 8:19.31 z_wr_iss/4
1641 root 38 18 0 0 0 S 2.9 0.0 8:18.43 z_wr_iss/7
arcstat.py showing arcsz > c
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:40:01 13K 917 6 421 3 495 67 277 20 9.9G 6.4G
23:40:06 14K 942 6 519 3 422 53 282 17 10G 6.4G
23:40:11 14K 878 6 458 3 419 67 225 17 10G 6.4G
23:40:16 15K 1.1K 7 636 4 512 64 355 18 10G 6.4G
23:40:21 14K 1.0K 7 599 4 443 68 335 20 10G 6.4G
23:40:26 15K 1.1K 6 554 3 511 70 331 19 10G 6.4G
23:40:31 13K 982 7 543 4 438 55 353 17 10G 6.4G
23:40:36 6.4K 1.4K 21 1.2K 20 212 45 570 20 10G 6.4G
23:40:41 5.4K 1.2K 21 868 17 285 76 554 24 9.8G 6.4G
23:40:46 6.5K 1.3K 19 1.0K 17 238 39 589 20 9.7G 6.4G
23:40:51 6.5K 1.5K 22 1.2K 20 285 53 661 21 9.6G 6.4G
23:40:56 11K 1.3K 11 1.0K 12 256 7 612 6 9.6G 6.4G
23:41:01 9.1K 1.2K 13 950 13 266 11 608 9 9.6G 6.4G
23:41:06 27K 1.2K 4 867 5 328 2 633 2 9.7G 6.4G
23:41:11 5.0K 852 17 585 12 266 62 450 22 9.7G 6.4G
23:41:16 4.9K 1.1K 21 877 19 179 57 482 26 9.7G 6.4G
23:41:21 4.9K 905 18 685 15 219 70 440 29 9.7G 6.4G
23:41:26 4.9K 779 15 642 13 137 47 346 25 9.8G 6.4G
23:41:31 3.8K 629 16 454 12 174 81 307 25 9.8G 6.4G
23:41:36 3.9K 522 13 446 11 75 80 300 31 9.8G 6.4G
23:41:41 5.4K 1.0K 19 905 17 143 78 673 30 9.8G 6.4G
23:41:46 4.3K 689 15 594 14 94 43 390 21 9.9G 6.4G
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:41:51 5.3K 840 15 655 12 185 80 379 26 9.9G 6.4G
23:41:56 5.6K 1.1K 19 961 17 109 53 431 24 10.0G 6.4G
23:42:01 7.1K 805 11 603 8 202 64 278 15 10G 6.4G
23:42:06 13K 575 4 203 1 372 67 95 11 10G 6.4G
23:42:11 14K 674 4 262 1 412 54 132 12 10G 6.4G
23:42:16 14K 711 4 371 2 339 50 121 10 10G 6.4G
23:42:22 12K 583 4 268 2 314 52 126 12 10G 6.4G
23:42:27 14K 688 4 320 2 368 52 137 11 11G 6.4G
23:42:32 14K 807 5 366 2 441 59 223 16 11G 6.4G
23:42:38 16K 822 5 405 2 417 59 220 14 11G 6.4G
23:42:45 17K 1.0K 5 444 2 573 59 276 14 12G 6.4G
23:42:50 13K 608 4 241 1 366 57 135 12 12G 6.4G
23:43:11 6.6K 440 6 82 1 357 52 97 12 12G 6.4G
23:43:38 15K 995 6 349 2 646 61 124 11 11G 6.4G
23:43:43 5.2K 260 4 171 3 89 13 37 17 11G 6.4G
23:43:48 13K 639 4 352 2 287 38 89 8 11G 6.4G
23:43:54 14K 760 5 366 2 393 50 153 11 10G 6.4G
23:43:59 15K 701 4 318 2 383 57 112 10 10G 6.4G
23:44:04 13K 583 4 227 1 356 61 76 8 10.0G 6.4G
23:44:09 14K 690 4 289 2 401 45 150 11 9.9G 6.4G
23:44:14 14K 546 3 162 1 383 55 65 6 9.9G 6.4G
23:44:19 14K 573 4 206 1 366 61 69 8 9.9G 6.4G
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:44:24 14K 696 4 300 2 395 65 131 12 9.9G 6.4G
23:44:29 14K 807 5 298 2 509 58 280 12 9.9G 6.4G
23:44:34 13K 571 4 231 1 340 66 92 11 10G 6.4G
23:44:39 15K 874 5 457 3 417 62 223 17 10G 6.4G
23:44:44 15K 958 6 461 3 497 61 319 14 10G 6.4G
23:44:49 13K 816 5 389 2 427 57 198 14 10G 6.4G
23:44:54 14K 732 5 361 2 371 66 138 13 10G 6.4G
23:44:59 14K 850 5 446 3 403 79 178 14 10G 6.4G
23:45:04 14K 794 5 415 2 379 66 176 15 11G 6.4G
23:45:09 14K 887 6 469 3 417 71 214 17 11G 6.4G
23:45:14 14K 798 5 369 2 429 70 172 15 11G 6.4G
23:45:19 14K 789 5 392 2 397 78 168 15 11G 6.4G
23:45:24 14K 901 6 499 3 402 57 219 15 11G 6.4G
23:45:30 11K 588 4 264 2 323 51 154 14 11G 6.4G
23:45:35 15K 852 5 410 2 442 65 189 14 11G 6.4G
23:45:40 10K 572 5 297 2 275 51 148 15 12G 6.4G
23:45:45 11K 730 6 316 3 413 64 158 13 11G 6.4G
23:45:50 12K 742 5 349 2 393 58 212 16 11G 6.4G
23:45:55 11K 696 5 370 3 325 52 173 15 11G 6.4G
23:46:01 15K 803 5 376 2 427 54 192 13 11G 6.4G
23:46:06 13K 678 5 330 2 347 54 142 12 11G 6.4G
23:46:11 16K 916 5 464 3 451 59 242 15 11G 6.4G
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:46:17 13K 848 6 332 2 515 60 261 16 10G 6.4G
23:46:22 15K 795 5 273 1 521 73 203 14 10G 6.4G
23:46:27 14K 660 4 323 2 336 60 112 10 10G 6.4G
23:46:32 15K 876 5 389 2 486 48 248 15 10G 6.4G
23:46:37 14K 790 5 341 2 449 60 208 16 10G 6.4G
23:46:42 14K 710 4 321 2 389 66 125 12 10G 6.4G
23:46:47 14K 701 4 321 2 380 70 134 12 10G 6.4G
23:46:52 14K 800 5 406 2 393 52 178 14 10G 6.4G
23:46:57 14K 741 4 318 2 423 72 181 12 10G 6.4G
23:47:02 15K 986 6 481 3 505 55 286 16 10G 6.4G
23:47:07 14K 925 6 481 3 444 63 323 22 10G 6.4G
23:47:12 15K 906 5 499 3 407 61 203 15 10G 6.4G
23:47:17 14K 774 5 377 2 396 67 146 12 10G 6.4G
23:47:22 14K 705 4 281 2 423 79 132 13 10G 6.4G
23:47:27 15K 992 6 452 3 540 65 323 14 10G 6.4G
23:47:32 16K 867 5 383 2 483 57 252 13 10G 6.4G
23:47:37 13K 791 5 336 2 454 61 193 16 10G 6.4G
23:47:42 16K 1.1K 6 562 3 530 52 365 17 10G 6.4G
23:47:47 16K 912 5 342 2 569 71 295 15 10G 6.4G
23:47:52 13K 814 5 440 3 373 50 173 12 10G 6.4G
23:47:57 15K 827 5 375 2 452 78 172 12 10G 6.4G
23:48:02 15K 863 5 423 2 439 72 190 15 10G 6.4G
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:48:07 15K 947 6 519 3 427 63 239 15 10G 6.4G
23:48:12 15K 853 5 452 2 400 78 196 16 10G 6.4G
23:48:17 14K 807 5 370 2 437 47 225 18 10G 6.4G
23:48:22 15K 876 5 439 3 437 49 276 16 10G 6.4G
23:48:27 16K 947 5 460 3 487 68 285 17 10G 6.4G
23:48:33 14K 689 4 287 2 402 61 184 16 10G 6.4G
23:48:38 15K 822 5 281 1 541 62 284 15 10G 6.4G
23:48:43 15K 909 5 377 2 531 66 267 16 10G 6.4G
23:48:48 13K 718 5 378 2 340 49 246 19 10G 6.4G
23:48:53 5.2K 964 18 638 13 326 68 649 22 10G 6.4G
23:48:58 20K 1.3K 6 1.1K 10 204 2 887 4 10G 6.4G
23:49:03 6.0K 1.3K 21 1.1K 20 239 35 878 18 10G 6.4G
23:49:08 7.7K 1.5K 19 1.2K 20 298 19 1.3K 21 10G 6.4G
23:49:13 7.4K 1.6K 21 1.3K 21 279 26 1.3K 24 10G 6.4G
23:49:18 8.1K 1.1K 13 721 11 379 19 731 12 10G 6.4G
23:49:23 6.3K 1.4K 22 1.0K 19 353 40 970 23 10G 6.4G
23:49:28 12K 1.2K 10 903 11 318 7 817 8 10G 6.4G
23:49:33 8.2K 1.3K 15 953 15 352 17 845 14 10G 6.4G
23:49:38 5.8K 1.3K 21 976 18 284 48 835 23 10G 6.4G
23:49:43 5.0K 1.2K 23 914 19 251 72 744 24 10G 6.4G
23:49:48 7.2K 1.2K 16 945 16 238 16 771 15 10G 6.4G
23:49:53 5.9K 1.4K 24 1.2K 22 253 54 859 25 10G 6.4G
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:49:58 26K 1.9K 7 1.4K 8 464 4 1.2K 4 10G 6.4G
23:50:03 11K 1.5K 13 1.1K 14 437 11 1.0K 10 10G 6.4G
23:50:08 21K 1.6K 7 1.2K 9 323 3 1.2K 6 10G 6.4G
23:50:13 9.3K 1.2K 13 1.1K 15 185 7 912 13 10G 6.4G
23:50:18 18K 1.3K 7 954 7 366 5 840 5 10G 6.4G
23:50:23 9.4K 1.3K 14 1.0K 13 322 17 831 13 10G 6.4G
23:50:28 11K 1.5K 12 1.1K 10 384 52 658 22 10G 6.4G
23:50:33 8.0K 1.5K 19 1.3K 17 214 57 675 25 10G 6.4G
23:50:38 7.7K 1.5K 19 1.1K 17 390 29 950 18 10G 6.4G
23:50:43 7.9K 1.7K 21 1.3K 20 389 29 1.2K 21 10G 6.4G
23:50:48 7.1K 1.4K 19 881 14 508 42 883 19 10G 6.4G
23:50:53 6.9K 1.6K 22 1.2K 20 336 35 988 22 10G 6.4G
23:50:58 15K 1.6K 10 1.3K 11 306 6 990 7 10G 6.4G
23:51:03 7.7K 1.6K 21 1.3K 19 368 32 1.1K 21 11G 6.4G
23:51:08 17K 1.1K 6 786 7 350 5 649 4 11G 6.4G
23:51:13 6.0K 1.5K 24 1.1K 21 315 40 874 25 11G 6.4G
23:51:18 12K 1.6K 12 1.2K 11 397 14 1.0K 10 11G 6.4G
23:51:23 6.0K 1.4K 23 1.2K 20 258 57 948 28 10G 6.4G
23:51:28 5.7K 1.5K 25 1.2K 22 223 87 1.1K 38 10G 6.4G
23:51:33 8.8K 1.2K 14 847 11 389 25 817 14 10G 6.4G
23:51:38 7.8K 1.8K 23 1.5K 20 335 62 1.2K 29 10G 6.4G
23:51:43 18K 1.3K 7 940 9 342 4 837 5 10G 6.4G
time read miss miss% dmis dm% pmis pm% mmis mm% arcsz c
23:51:49 18K 1.4K 7 989 8 370 5 848 5 11G 6.4G
23:51:54 9.0K 1.4K 15 965 13 431 26 852 13 11G 6.4G
23:52:07 6.5K 832 12 560 10 272 24 520 14 11G 6.4G
23:52:19 12K 1.5K 12 1.1K 10 391 22 819 13 11G 6.4G
23:52:28 10K 1.3K 13 884 11 436 20 800 12 11G 6.4G
23:53:07 8.1K 1.0K 12 702 10 323 27 610 13 11G 6.4G
23:53:18 24K 1.9K 7 1.2K 6 624 12 1.1K 7 11G 6.4G
23:54:12 38K 5.8K 15 4.1K 12 1.7K 33 3.5K 18 11G 6.4G
Nothing much going on with the file system though....
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
export10 7.37T 10.8T 0 7 233 28.6K
raidz1 7.37T 10.8T 0 7 233 28.6K
sdb - - 0 3 222 15.0K
sdc - - 0 3 201 15.1K
sdf - - 0 3 187 15.2K
sdg - - 0 3 208 15.0K
sdd - - 0 3 194 15.0K
logs - - - - - -
mirror 184K 3.72G 0 0 0 0
sde1 - - 0 0 0 0
sdi1 - - 0 0 0 0
cache - - - - - -
sde2 26.0G 82.3M 0 0 201 503
sdi2 26.0G 85.0M 0 0 165 473
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
export10 7.37T 10.8T 0 2 56 8.80K
raidz1 7.37T 10.8T 0 2 56 8.80K
sdb - - 0 1 46 5.71K
sdc - - 0 1 67 5.65K
sdf - - 0 1 46 5.68K
sdg - - 0 1 43 5.59K
sdd - - 0 1 41 5.66K
logs - - - - - -
mirror 184K 3.72G 0 0 0 0
sde1 - - 0 0 0 0
sdi1 - - 0 0 0 0
cache - - - - - -
sde2 26.0G 82.5M 0 0 49 131
sdi2 26.0G 85.3M 0 0 89 157
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
export10 7.37T 10.8T 0 0 2 136
raidz1 7.37T 10.8T 0 0 2 136
sdb - - 0 0 1 99
sdc - - 0 0 3 117
sdf - - 0 0 1 114
sdg - - 0 0 3 103
sdd - - 0 0 1 113
logs - - - - - -
mirror 184K 3.72G 0 0 0 0
sde1 - - 0 0 0 0
sdi1 - - 0 0 0 0
cache - - - - - -
sde2 26.0G 82.5M 0 0 2 595
sdi2 26.0G 85.4M 0 0 4 35
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
export10 7.37T 10.8T 0 0 1 0
raidz1 7.37T 10.8T 0 0 1 0
sdb - - 0 0 5 0
sdc - - 0 0 1 0
sdf - - 0 0 0 0
sdg - - 0 0 2 0
sdd - - 0 0 4 0
logs - - - - - -
mirror 184K 3.72G 0 0 0 0
sde1 - - 0 0 0 0
sdi1 - - 0 0 0 0
cache - - - - - -
sde2 26.0G 82.5M 0 0 0 0
sdi2 26.0G 85.4M 0 0 1 0
---------- ----- ----- ----- ----- ----- -----
I've been having this problem for quite some time, and I believe I have found an easy solution. Hopefully this will save some other folks some time.
Here are the details:
Hardware:
Supermicro H8DCL-iF
16GB ECC RAM
2x AMD Opteron(tm) Processor 4334 CPUs
OS Disk - WDC WD10EZEX-08M2NA0 (1TB) 7200 rpm SATA drive @ 3.0Gb/s, connected to the mobo SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
Software:
Ubuntu 14.04.3 LTS
Linux sequoia 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
ZFS version: 0.6.4.2-1~trusty
Zpool Hardware Info:
2x Marvell Technology Group Ltd. 88SE9480 SAS/SATA 6Gb/s RAID controllers (rev c2)
6x HGST HDN724040ALE640 (4TB) 7200 rpm SATA drives @ 6.0Gb/s, connected to the Marvell controllers
2x SanDisk SDSSDRC032G (32GB) SSDs @ 6.0Gb/s, connected to the Marvell controllers
Zpool:
ZFS tuning:
options zfs zfs_arc_max=8589934592
Benchmark:
running lots of rsync backup jobs
Results:
After a couple of hours of running, arc_adapt starts taking 100% CPU. When this happens, the system runs out of RAM and starts swapping to the OS disk. The swapping is so severe that no other process can access the OS disk, and the system has to be power-cycled to get it running again.
Solution (I hope):
I found a reply to https://github.com/zfsonlinux/zfs/issues/3320 by @kernelOfTruth that mentioned Tobi Oetiker's article on preserving the buffer cache state (http://insights.oetiker.ch/linux/fadvise/), which looked like it would solve the problem. The reply mentioned running rsync with Tobi's fadvise patch (--drop-cache), and that looked great. However, applying that patch on the backup server and using the --drop-cache option would require installing a patched rsync on every system being backed up.
I started looking for references to that patch on the rsync mailing list and came upon this bug entry: https://bugzilla.samba.org/show_bug.cgi?id=9560#C3
It appears that @Feh took Tobi's rsync patch and built a wrapper called nocache (https://github.com/Feh/nocache). This wrapper appears to solve the problem, and I don't have to package and upgrade rsync on all of the hosts being backed up....
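In case it's useful to others, an illustrative invocation (the source/destination paths are placeholders, and the rsync options are just typical backup flags):
# wrap rsync in nocache so the pages it reads/writes are dropped from the page cache
nocache rsync -aH --delete /data/ backupserver:/backups/data/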
YMMV. Comments appreciated... Thanks!