openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Task hang 100% cpu with lots of memory #695

Closed stevecs closed 12 years ago

stevecs commented 12 years ago

Ok, bad description, I know. I've been playing with this system for a while and just upgraded from 48GB to 192GB of RAM. The system is a dual-CPU X5690 box running Ubuntu 10.04 LTS server with the ZFS daily PPA:

```
Linux loki 2.6.32-36-server #79-Ubuntu SMP Tue Nov 8 22:44:38 UTC 2011 x86_64 GNU/Linux

ii  libzfs1     0.6.0.60-0ubuntu1~lucid1  Native ZFS filesystem library for Linux
ii  mountall    2.15.3-zfs1~lucid1        filesystem mounting tool
ii  ubuntu-zfs  6~lucid                   Native ZFS filesystem metapackage for Ubuntu
ii  zfs-dkms    0.6.0.60-0ubuntu1~lucid1  Native ZFS filesystem kernel modules for Lin
ii  zfsutils    0.6.0.60-0ubuntu1~lucid1  Native ZFS management utilities for Linux
```

The system has 112 drives, 108 of which are in a raidz2 zpool (18x 6-disk vdevs), attached across 7 LSI2008 SAS HBAs.

Ran into this while doing a full system backup (bacula) of ~80TB of data. The pool went to >90% used (4TB free), which raises a separate question: how do you modify metaslab_df_free_pct under Linux?
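On the metaslab_df_free_pct question: on ZFS-on-Linux builds that export it, module tunables can usually be changed at runtime through /sys/module/zfs/parameters. This is a hedged sketch - whether this particular tunable is exposed by the 0.6.0.x PPA build is not confirmed, and the value 4 is just an illustration:

```shell
#!/bin/sh
# Sketch: change a ZFS module tunable at runtime, if the running build exports it.
# metaslab_df_free_pct controls when the allocator switches from first-fit to
# best-fit; whether it appears under /sys depends on the zfs module version.
param=/sys/module/zfs/parameters/metaslab_df_free_pct
if [ -w "$param" ]; then
    echo 4 > "$param"
    echo "metaslab_df_free_pct set to $(cat "$param")"
else
    echo "tunable not exposed by this zfs build" >&2
fi
```

On a box without the tunable the script just reports that and exits cleanly, so it is safe to probe with.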

When I got the info messages below for txg_quiesce, all running processes just hung at 100% cpu (a reboot was needed to clear them):

```
  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 1053 ftpadmin 20  0 42416 4552 1944 R  100  0.0  60:17.45 lftp
 1145 root     20  0 42988 2148 1644 R  100  0.0  53:32.45 sshd
 6003 root     20  0  132m  22m 1228 S  100  0.0   1636:53 bacula-sd
 6162 root     20  0 31192 2312 1316 R  100  0.0 175:12.45 zpool
32746 root     20  0 44376 2000 1580 R  100  0.0 174:36.38 exim4
  211 root     20  0     0    0    0 R  100  0.0 200:06.49 kswapd0
  212 root     20  0     0    0    0 R  100  0.0 184:01.87 kswapd1
 3354 root     20  0 11288  664  508 R  100  0.0 175:29.54 irqbalance
 7161 ftpadmin 20  0  4296  576  476 R  100  0.0 175:05.57 xargs
 7162 ftpadmin 20  0 30332 4592 1004 R  100  0.0 175:15.35 ls
```

New sessions worked to a limited extent (I could log in, for example), but anything touching the pool would just hang.

So the system was configured with 160GB of ARC, leaving 32GB free. Previous setups, before adding the memory, worked fine with 24GB system RAM (8GB ARC min/max) and 48GB system RAM (24GB ARC min/max). This time I also added a 16GB swap partition along with the extra RAM (previous setups had no swap) just to see whether it would even be touched; it was, though only slightly (<1MiB used; swap is on an SSD separate from the pool).

Just reporting this in case it helps, though I imagine it may be hard to track down. I'm going to try again now with 128GiB of ARC min/max (leaving 64GB for the OS) and see if that has any ameliorative effect.

ZFS custom settings were:

```
options zfs zfs_arc_min=171798691840
options zfs zfs_arc_max=171798691840
options zfs zfs_top_maxinflight=128
options zfs zfetch_array_rd_sz=4194304
options zfs zfetch_block_cap=2048
options zfs zfetch_max_streams=64
```
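The arc_min/arc_max values above are plain byte counts; 171798691840 is exactly 160 GiB. A quick shell check of the conversion (the 160 here is just this system's chosen ARC size, not a recommendation):

```shell
#!/bin/sh
# Convert GiB to the byte value expected by zfs_arc_min / zfs_arc_max.
gib=160
bytes=$((gib * 1024 * 1024 * 1024))
echo "options zfs zfs_arc_max=$bytes"
# prints "options zfs zfs_arc_max=171798691840"
```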

arcstat shows:

```
root@loki:~# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 77 3696 63118408639 156203082535533
name                            type data
hits                            4    2211563994
misses                          4    526140486
demand_data_hits                4    1606766774
demand_data_misses              4    5647331
demand_metadata_hits            4    544845821
demand_metadata_misses          4    8371719
prefetch_data_hits              4    29701065
prefetch_data_misses            4    510277983
prefetch_metadata_hits          4    30250334
prefetch_metadata_misses        4    1843453
mru_hits                        4    1040951925
mru_ghost_hits                  4    1879469
mfu_hits                        4    1110661165
mfu_ghost_hits                  4    2447821
deleted                         4    968134308
recycle_miss                    4    4024238
mutex_miss                      4    11807438
evict_skip                      4    297904385
evict_l2_cached                 4    0
evict_l2_eligible               4    125924524250112
evict_l2_ineligible             4    386157863424
hash_elements                   4    6746821
hash_elements_max               4    6749013
hash_collisions                 4    672055815
hash_chains                     4    2003605
hash_chain_max                  4    13
p                               4    161060454912
c                               4    171798691840
c_min                           4    171798691840
c_max                           4    171798691840
size                            4    171798633608
hdr_size                        4    1993682080
data_size                       4    165863642624
other_size                      4    3941308904
anon_size                       4    2641920
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    156541091840
mru_evict_data                  4    155760739328
mru_evict_metadata              4    14545920
mru_ghost_size                  4    14701937664
mru_ghost_evict_data            4    3750756352
mru_ghost_evict_metadata        4    10951181312
mfu_size                        4    9319908864
mfu_evict_data                  4    8003387392
mfu_evict_metadata              4    295767552
mfu_ghost_size                  4    157096591872
mfu_ghost_evict_data            4    99655352320
mfu_ghost_evict_metadata        4    57441239552
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_hdr_size                     4    0
memory_throttle_count           4    221055
memory_direct_count             4    738609
memory_indirect_count           4    2962929
arc_no_grow                     4    0
arc_tempreserve                 4    2031616
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    8031885448
arc_meta_limit                  4    42949672960
arc_meta_max                    4    8630690600
```
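For what it's worth, the cumulative ARC hit rate can be derived from the hits/misses counters above (these are totals since module load, so it's a lifetime average, not the rate at the time of the hang):

```shell
#!/bin/sh
# ARC hit rate from the arcstats counters quoted above.
hits=2211563994
misses=526140486
awk -v h="$hits" -v m="$misses" \
    'BEGIN { printf "ARC hit rate: %.1f%%\n", 100 * h / (h + m) }'
# prints "ARC hit rate: 80.8%"
```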

In dmesg this is what I got out of the blue today:

```
[154493.736696] INFO: task txg_quiesce:5644 blocked for more than 120 seconds.
[154493.736990] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[154493.741890] txg_quiesce   D 0000000000000000     0  5644      2 0x00000000
[154493.741902]  ffff882fcac2dd60 0000000000000046 0000000000015f00 0000000000015f00
[154493.741918]  ffff882fc0941ad0 ffff882fcac2dfd8 0000000000015f00 ffff882fc0941700
[154493.741934]  0000000000015f00 ffff882fcac2dfd8 0000000000015f00 ffff882fc0941ad0
[154493.741951] Call Trace:
[154493.741973]  [] cv_wait_common+0x9c/0x1a0 [spl]
[154493.741984]  [] ? autoremove_wake_function+0x0/0x40
[154493.741996]  [] ? bitmap_weight+0x50/0xb0
[154493.742009]  [] cv_wait+0x13/0x20 [spl]
[154493.742049]  [] txg_quiesce_thread+0x1eb/0x330 [zfs]
[154493.742058]  [] ? set_user_nice+0xe5/0x160
[154493.742087]  [] ? txg_quiesce_thread+0x0/0x330 [zfs]
[154493.742100]  [] thread_generic_wrapper+0x68/0x80 [spl]
[154493.742114]  [] ? thread_generic_wrapper+0x0/0x80 [spl]
[154493.742124]  [] kthread+0x96/0xa0
[154493.742133]  [] child_rip+0xa/0x20
[154493.742142]  [] ? kthread+0x0/0xa0
[154493.742150]  [] ? child_rip+0x0/0x20
```

(The identical txg_quiesce trace then repeats every ~120 seconds, from [154613.587224] through [155572.593616].)

ryao commented 12 years ago

I suggest setting zfs_arc_max to half of your system memory. See pull request #660 for an explanation. I am not certain it would prevent this problem, but I do believe it is a change you should make.

stevecs commented 12 years ago

Thanks. Yes, I saw that, which is why I was going to try decreasing further to 128GB, but I'll go down to 96GB instead. Is there any way to get the slab fragmentation amount out of the module, or the amount of memory it's trying to use as a hard number?

ryao commented 12 years ago

Memory usage can be seen as c_size in /proc/spl/kstat/zfs/arcstats, but as far as I know, the SLUB allocator does not provide modules with the ability to track internal fragmentation. The worst-case upper bound is a factor of 2, so the best you can do is assume it is that large. The allocations are biased toward powers of two, so the theoretical upper bound should never be reached in practice.
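Taking that factor-of-2 bound at face value, the potential slab footprint for the ARC sizes discussed in this thread works out as follows (simple arithmetic on the worst case, not a measurement):

```shell
#!/bin/sh
# Worst-case slab footprint = 2x the ARC target, per the factor-of-2 bound above.
arc_gib=96
worst_gib=$((arc_gib * 2))
echo "ARC target ${arc_gib}GiB -> worst-case slab footprint ${worst_gib}GiB"
# prints "ARC target 96GiB -> worst-case slab footprint 192GiB"
```

Which would mean a 96GiB ARC could, in the theoretical worst case, consume all 192GB of this machine's RAM.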

stevecs commented 12 years ago

Just an update here. I was trying to find where this stopped being a problem by lowering the ARC size until I could do the large copies/backups without a crash, which took forever. Some background: with 48GB of physical RAM, I was able to set ARC min/max to 20GB and that worked without a crash. With no other changes to the system (same pool/size/application mix/etc.) I upped the physical RAM to 192GB, and the system would hang as above until I brought ARC min/max down to 64GiB. So basically 128GB of RAM had to be 'wasted' for this to work, when with a 48GB system only 28GB had to be 'wasted'. That's between 58% and 66% overhead. More overhead with more RAM?

Don't know how this would help track it down, but /something/ is really inefficient here.

ryao commented 12 years ago

I suspect that the use of PF_MEMALLOC in the code might be causing problems for you. The patches in pull request #726 eliminate PF_MEMALLOC from much of the code. Those patches should be merged soon, although you are welcome to test them ahead of that.

stevecs commented 12 years ago

I'll see if I can give them a try after I finish some fulls here. This has interrupted the full backups for a while, and I need to get about 250TB backed up as a fresh set, which will take some time. I'd really like to get up to using 128-160GB of RAM, as that would let me use 32GB+ as dirty cache, which really cuts down on fragmentation in my workloads (and thus the slowdowns with ZFS over time, which are a constant battle).

behlendorf commented 12 years ago

@stevecs One other thing you might try is increasing /proc/sys/vm/min_free_kbytes to perhaps 256MiB. You might then be able to increase your zfs_arc_max. As Richard says, we're working toward resolving this issue, but until we do, that may be a viable workaround for you.
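For reference, min_free_kbytes is expressed in kilobytes, so 256MiB corresponds to 262144. A sketch of the conversion, with the actual write shown in comments since it requires root:

```shell
#!/bin/sh
# 256 MiB expressed in kB, the unit /proc/sys/vm/min_free_kbytes expects.
mib=256
kb=$((mib * 1024))
echo "$kb"
# prints "262144"
#
# To apply (as root):
#   echo $kb > /proc/sys/vm/min_free_kbytes
# or persist across reboots in /etc/sysctl.conf:
#   vm.min_free_kbytes = 262144
```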

stevecs commented 12 years ago

@behlendorf thanks, yes, that's what I'm trying now. I'm back at 96GB for the ARC and have increased min_free_kbytes to 512MiB (it was already at 256MiB). I'm about 100TiB into testing right now (using the daily PPA 0.6.0.63-0); once the data gets back I'll try another backup/export of the full array to make sure there are no issues there. With each test taking 1-2 months, I may just wait until the slub issue is fixed rather than going further.

behlendorf commented 12 years ago

Presumably increasing min_free_kbytes improved things. Regardless, this situation should be improved by the next round of slab improvements.