If of any help -- here is a historical munin overview for today/this week: http://www.onerussian.com/tmp/munin-locdown-20150307.html . For example (if at all relevant -- it could just be an effect of the deadlock), there was a spike of interrupts right around the deadlock point.
Please ignore the above comment about interrupts -- the plot rendering is a bit confusing; they actually just went down.
A fresh dump of the system info (now with stacktraces.txt, thanks @DeHackEd for spotting their omission) is available from http://www.onerussian.com/tmp//zfs_system_details_20150308/ . If I don't hear back I will reboot the beast in a few hours -- I need to get it usable again.
BTW -- this stall feels different from before: unlike previously, I can still perform some actions on the mounted ZFS partitions/datasets (e.g. modify files, git commit) while other actions get stuck (docker now can't do anything even though it isn't directly associated with ZFS, and ZFS snapshots, arc_adapt, l2arc_feed, txg_sync and that lengthy rm -r of a directory all seem to be stuck in D state).
@yarikoptic it appears that #2983 was significantly improved by #3121 but we didn't quite get every possible deadlock. It's clear from the backtraces you provided that the threads are all blocked on the ARC hash lock, but what isn't clear is which task is holding that lock and why it's not giving it up. Normally I'm highly suspicious of the direct reclaim paths, like the one taken by the l2arc_feed thread, but in this case I don't see where it would have taken the lock. I put together a debug patch to try and catch the offending task; can you try running with the patch in #3161?
NOTE: This may be a blocker for 0.6.4. There is clearly a potential deadlock still remaining in the code, although to my knowledge @yarikoptic's workload is the only one which has triggered it.
Is there a description of the workload which triggers this problem? My current testing rigs for the recent illumos ARC work (purposely not referring to those pull requests to minimize non-relevant links) would allow me to test this type of thing pretty easily.
One thing that sticks out to me in the original posting is the other_size of 35627695184 out of an ARC size of 40485186048 (ouch). It would have been interesting to see /proc/slabinfo as well.
The workload was primarily a
tar -cf- hcp-ls.20141020-500subjects-addurl |pv | pigz -9 >| hcp-ls.20141020-500subjects-addurl.tar.gz
command where that directory contains millions of files, symlinks and subdirectories (probably around 40 million entries altogether). Each file in there is very small (it contains just a path). The resultant tar.gz is 16GB. In parallel with, or before, that load I may also have had find/chmod/rm sweeps running over similarly "large" directories. If necessary I could share that generated tree (there was a successful run after all) for your testing so you could simulate a similar load, but it usually took a few days for the deadlock to happen.
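In the meantime, here is a rough sketch of how one could synthesize a comparable tree without my actual data -- the paths and counts below are made up, the real layout differs:

```sh
#!/bin/sh
# Build ~40 million tiny files (each containing just a short path) on a
# ZFS-backed directory, then tar it up the same way as the real workload.
ROOT=/tank/stress-tree        # hypothetical ZFS dataset mountpoint
for d in $(seq 1 4000); do
    dir="$ROOT/dir$d"
    mkdir -p "$dir"
    for f in $(seq 1 10000); do
        echo "some/annexed/object/path/$d/$f" > "$dir/file$f"
    done
done
tar -cf- "$ROOT" | pv | pigz -9 > /tmp/stress-tree.tar.gz
```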
Unfortunately I have rebooted the beast, so I guess slabinfo wouldn't be of interest at the moment, but I have now added it to the http://github.com/yarikoptic/zfs-system-details/blob/HEAD/zfs-system-details script so it will be reported next time.
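The addition itself is nothing fancy -- essentially just one more /proc snapshot next to the others the script already grabs (a sketch, not the exact script contents):

```sh
# capture slab allocator stats alongside the other ZFS diagnostics
OUT=zfs_system_details_$(date +%Y%m%d)    # hypothetical output directory
mkdir -p "$OUT"
cat /proc/slabinfo               > "$OUT/slabinfo"
cat /proc/spl/kstat/zfs/arcstats > "$OUT/arcstats"
cat /proc/meminfo                > "$OUT/meminfo"
```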
@yarikoptic you might also want to pick up #3163. The automated testing uncovered another issue, which looks different from yours, but is still possible.
gotcha, rebuilding with #3163 and will restart/resume the bombardment shortly
Another workload involving the traversal (statting) of million(s) of files, and other_size is way out of control. YABITP (yet another big inode traversal problem). Of course, the deadlock you encountered shouldn't have happened, but with c_max at 67GB, size at 86.8GB and other_size being 35.6GB of that, it's no wonder that Bad Things™ happen.
ain't I a unique precious user exposing myself to those Bad Things, @dweeezil ? ;) (beast rebooted with both patches in place and tar/chmod running)
ok -- found myself unable to ssh into the box today, and attaching to the ipmi console showed that it seems we hit the target:
[133273.994931] BUG: soft lockup - CPU#7 stuck for 23s! [l2arc_feed:1098]
but the problem is that it seems I can't even log in (I should have left an ipmi console session logged in, I guess :-/) -- I will keep trying
ok -- upon an attempt to log in from the physical console -- it managed! http://www.onerussian.com/tmp//zfs_system_details_20150311/ has some logs for you guys, including the messages in kern.log from that debug patch. Looking at the munin reports, oddly enough the L2ARC stats report a 270GB L2ARC size, which exceeds the amount of available L2ARC space (I have a 200GB Intel SSD drive): http://www.onerussian.com/tmp/munin-lockdown-20150311.html#zfs
@yarikoptic Could you please add /proc/slabinfo.
@dweeezil http://www.onerussian.com/tmp//zfs_system_details_20150311-2/ should now contain all the previous information (updated) + slabinfo and other goodies ;) I am about to try flushing the caches, as was suggested on IRC, in an attempt to release the unevictable data
and FWIW http://www.onerussian.com/tmp//zfs_system_details_20150311-3/ should have the same info after I issued echo 3 >/proc/sys/vm/drop_caches, which I believe has not returned yet.
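(For the record, the flush attempt itself is just the standard sequence below; the arcstats grep is only a quick way to see whether anything actually got released:)

```sh
# snapshot the relevant ARC counters, drop caches, then compare
grep -E '^(size|arc_meta_used|other_size) ' /proc/spl/kstat/zfs/arcstats > /tmp/arc.before
sync
echo 3 > /proc/sys/vm/drop_caches   # may block for a very long time under this workload
grep -E '^(size|arc_meta_used|other_size) ' /proc/spl/kstat/zfs/arcstats > /tmp/arc.after
diff -u /tmp/arc.before /tmp/arc.after
```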
Since the box is not reachable via the network, and I wouldn't be able to run down to it for another 24h, I am inclined to reboot it within half an hour. If there is any other information I could provide before then -- please let me know. Upon reboot I will not bring the same load up (since I know that it would lock the box up, killing remote access) unless there is a new patch to check or a need for more information.
Points of interest from zfs_system_details_20150311-2:
zio_buf_512 24203409 24249872 512 8 1 : tunables 54 27 8 : slabdata 3031234 3031234 0
zio_buf_16384 2221818 2221818 16384 1 4 : tunables 8 4 0 : slabdata 2221818 2221818 0
zfs_znode_cache 0x00020 14141956096 11751261120 8192 1136 1726313 1726313 1983777 10357878 10344420 11902662 1 0 0
c_max 4 67685584896
size 4 81237672816
mru_size 4 28278091264
mru_evict_data 4 512
mru_evict_metadata 4 49152
mfu_size 4 8521976320
mfu_evict_data 4 5632
mfu_evict_metadata 4 3311104
arc_meta_used 4 81051423600
arc_meta_limit 4 50764188672
arc_meta_max 4 81997413160
l2_hdr_size 4 6246311920
I'm inclined to believe the workload breaks the ARC, as it seems to consist mainly of 512b buffers (perhaps spill blocks). The interesting thing is how much data is contained in the MRU and MFU but is unevictable, so the system will just spin trying to reclaim without making any progress. These ARC buffers are likely unevictable because the dbuf cache is holding references. Some of this unevictable data is probably blocks for the dnode object (pinned by dnode_t structs), or indirect blocks (pinned by data blocks). This is all speculation, though, just from quickly glancing at the linked files.
Spill blocks generally wind up in kmalloc-512, which is low in this case. The zio_buf_impl_t are likely bonus buffers, which should generally track dnode_t. I've been running my testing lately with a patch which breaks down the components of other_size to help in matters like this. I've also considered putting the bonus buffers in their own cache instead of letting them share the zio cache (mainly to track them better).
We're having problems with ZFS-induced lock-ups really often too; there's at least one lock-up a day, and it's gotten worse than it used to be. Since about ~4 weeks ago, on every lockup I can't even SSH into the box to get the traces (the rootfs isn't on ZFS, it's ext4; ZFS is only for the containers).
It happens mostly when the workload shifts -- we balance things so that all of the RAM is used by applications and the rest goes to the ARC. During the daytime it all works OK, but the problems come in the early morning hours, when the backups run. We back up with an rsync+NFS combo, since we haven't implemented send/recv backups yet. People also do their own backups during the night.
This needs to be resolved soon, as it is now beginning to impact our reputation; we've had so much down-time that people are becoming nervous and are thinking about moving away to another host. Although we're a nonprofit org, so people cut us some slack, it still isn't a good position, and if nothing else it causes a lot of stress to me as an admin.
If anyone has any suggestions on what I can do when the box locks up and SSH connections time out, I'm all game.
@snajpa I know it's not really a solution, but would rate-limiting help as a temporary mitigation (provided, of course, that the backups are not that large and e.g. everything can still be transferred despite that limit)?
rsync --bwlimit=<kb/second> <source> <dest>
I remember having read about a similar issue (rsync- & load-related) with Btrfs where this temporarily helped
@kernelOfTruth I've thought about that, but then we wouldn't manage to back up all the CTs in time before the daily peak load, when the IO capacity is needed for more useful things than doing backups. Also, this doesn't do anything about the custom backups which people do on their own in their containers.
I agree with @prakashsurya, for this workload it definitely looks like for some reason there is a large amount of metadata being pinned in the ARC. This is probably related to the huge number of inodes (10+ million) in the Linux VFS inode hash. Other filesystems on Linux have run into similar kinds of issues when the hash gets this large.
If you are willing to roll a custom kernel, one possible solution would be to add a hard limit in the kernel's VFS layer. Patches to do this have been proposed several times in the past on LKML. Unfortunately, those changes have always been rejected because in theory they should not be needed. In practice there seem to be legitimate reasons why you might want to do this, for example on a low-memory embedded system.
Unfortunately, although I know I've seen these patches posted before, I'm not having any luck turning them up with a Google search. Perhaps someone else might have better luck... Unfortunately, since the VFS manages this cache there's only so much we can do on the ZFS side.
@snajpa I assume you must be seeing a similar issue based on the workload you've described. I know @yarikoptic is testing with master and the ABD patches, but what version of ZoL are you using?
I also want to say I definitely agree we need to get to the bottom of this and get it resolved.
@behlendorf not sure if that's exactly what you're looking for, but a search turned up the following (some interesting results among them):
https://lkml.org/lkml/2011/8/14/110 http://serverfault.com/questions/561350/unusually-high-dentry-cache-usage https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7/2.6.7-mm1/broken-out/vfs-shrinkage-tuning.patch
edit: some more finds on how e.g. other filesystems handle things: https://lists.samba.org/archive/linux-cifs-client/2010-May/006111.html http://www.gossamer-threads.com/lists/linux/kernel/1800236?do=post_view_threaded http://xfs.org/index.php/Improving_inode_Caching http://www.linuxforums.org/forum/kernel/189619-nfs-inode-cache-eats-up-memory-design-flaw.html
edit2: search terms used were:
patch vfs hard limit out of memory inodes inode cache limit patch
via Austrian Google entry (each country's Google index seems to lead to different results)
On Wed, 11 Mar 2015, Brian Behlendorf wrote:
I agree with @prakashsurya , for this workload it definitely looks like for some reason there is a large amount of metadata being pinned in the ARC. This is probably related to the huge number of inodes (10+ million) in the Linux VFS inode hash.
but in theory, why would a "large amount" be of relevance here -- some int overflow deep inside?
Other filesystems on Linux have run into similar kinds of issues when the hash gets this large.
If you are willing to roll a custom kernel one possible solution would be
I could try a custom kernel build. But in the long run I hope for a more sustainable resolution... could anyone help to locate any of the patches Brian had in mind?
Unfortunately, since the VFS manages this cache there's only so much we can do on the ZFS side.
so you think it is just a stock Linux kernel issue and not ZFS to blame? Then it definitely should be brought up with them -- but I would not be able to send a cohesive report myself. FWIW -- I could try replicating the situation on some other drive with a similar file structure/load if necessary (although I still hope not to be asked to do so ;) ). Just let me know if you think this is necessary.
@behlendorf I'm usually on HEAD or very close to it, with AIO function mappings commented out (https://github.com/vpsfreecz/zfs)
I personally have no problem with rolling a custom kernel, as long as it has any chance of smoothing things out. At this point, I'm willing to move a mountain to get the situation stabilized.
I also forgot to mention that I've set up periodic dentry cache flushing by echoing 1 to drop_caches every 5 minutes, and the issues still persist. So I'm not really sure the dcache is (solely) to blame. Any suggestions on how to help you guys chase this madness are more than welcome; this problem is at the absolute top of my priority list.
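(For completeness, the periodic flush is nothing more than a cron entry along these lines -- illustrative, my exact file differs:)

```sh
# install a 5-minute cache flush via cron
cat > /etc/cron.d/drop-caches <<'EOF'
# note: echo 1 frees the pagecache; 2 targets dentries/inodes; 3 does both
*/5 * * * * root /bin/sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
EOF
```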
FWIW, I have inserted a cold spare into that box, formatted it ext4, and am filling it up with the same junk which brings ZFS down... let's see how that single drive holds up... any particular metrics to look out for?
@snajpa @yarikoptic this afternoon I took a careful look at this and I think I've spotted the root cause of the problem. I'd like to verify I'm on the right track tomorrow morning so I can potentially turn around a patch before the weekend. However, to do that it would be very helpful if you could post the exact kernel version you're using.
@behlendorf sounds promising! ;) mine is
$> apt-cache policy linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:
Installed: 3.16.7-ckt7-1
Candidate: 3.16.7-ckt7-1
Version table:
3.16.7-ckt7-1 0
500 http://debian.csail.mit.edu/debian/ jessie/main amd64 Packages
*** 3.16.7-ckt7-1 0
100 /var/lib/dpkg/status
$> uname -a
Linux smaug 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt7-1 (2015-03-01) x86_64 GNU/Linux
update 1: it was re-built manually to facilitate debugging:
--- debian/build/build_amd64_none_amd64/.config 2015-03-09 08:50:36.071047129 -0400
+++ ../../config-3.16.0-4-amd64 2015-03-09 08:47:52.807044961 -0400
CONFIG_SYSFS_SYSCALL=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
-# CONFIG_KALLSYMS_ALL is not set
+CONFIG_KALLSYMS_ALL=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
@@ -6319,14 +6246,14 @@
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
-CONFIG_STRIP_ASM_SYMS=y
+# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
-# CONFIG_FRAME_POINTER is not set
+CONFIG_FRAME_POINTER=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x01b6
We're using OpenVZ RHEL6 kernels, from 2.6.32-042stab093.4 to 2.6.32-042stab104.1. I've just had one node lock up; it was clearly because it ran out of memory, and this dentry/dnode cache problem is highly visible there and clearly to blame.
Take a look at the top of slabtop -o -s c:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
15773040 15773040 100% 0.84K 1752560 9 14020480K dnode_t
628394 628394 100% 16.00K 628394 1 10054304K zio_buf_16384
15353824 15353824 100% 0.50K 1919228 8 7676912K zio_buf_512
619659 619659 100% 8.00K 619659 1 4957272K size-8192
15989831 15985087 99% 0.30K 1229987 13 4919948K dmu_buf_impl_t
18210094 18091691 99% 0.22K 1071182 17 4284728K dentry
6435420 6380551 99% 0.30K 536285 12 2145140K arc_buf_hdr_t
15073837 15073837 100% 0.10K 407401 37 1629604K sa_cache
921927 921923 99% 1.15K 307309 3 1229236K nfs_inode_cache
10053364 6160374 61% 0.06K 170396 59 681584K size-64
4057830 4008979 98% 0.12K 135261 30 541044K size-128
9261168 9146348 98% 0.03K 82689 112 330756K size-32
4941320 4940492 99% 0.04K 53710 92 214840K l2arc_buf_hdr_t
849243 835175 98% 0.20K 44697 19 178788K vm_area_struct
1448240 635047 43% 0.09K 36206 40 144824K arc_buf_t
175530 174755 99% 0.72K 35106 5 140424K proc_inode_cache
898098 750420 83% 0.06K 15222 59 60888K range_seg_cache
102298 100602 98% 0.55K 14614 7 58456K radix_tree_node
1017555 960753 94% 0.05K 13215 77 52860K anon_vma_chain
37962 37797 99% 1.09K 12654 3 50616K ext4_inode_cache
11398 11326 99% 2.75K 5699 2 45592K task_struct
37912 35769 94% 1.00K 9478 4 37912K size-1024
126225 118656 94% 0.25K 8415 15 33660K filp
479325 468327 97% 0.05K 6225 77 24900K anon_vma
The box had about 300MB of free memory and no swap, so it was stuck in reclaim. I have no idea why zio_buf_16384 uses up this much space, but dnode_t seems to be the bomb here :) And AFAIK there's no instrument to tell the dnode cache to get the f*ck off my lawn (RAM :D ), is there?
@yarikoptic can you please see if the patch in #3181 improves things? I was able to reproduce the issue you've described, and the patch helps here. It does still need some polish and probably more work for 2.6.32 vintage kernels, but I'd like to know how it helps your real workload.
Thank you @behlendorf. I will try it out asap. Meanwhile I have been bombarding that single spindle with the ext4 partition on it with a similar load, and even though it now holds more inodes than the ZFS partition it seems to be surviving well (and, unfortunately, delivers almost 10x the throughput on that simple tar'ing test compared to when I did it on ZFS).
hm... probably unrelated to the patch, but upon build/install/reboot the pool wasn't imported as it should have been (and was before)... maybe the additional spindle I inserted is to blame, since I believe I might have initialized the pool using sd? names, and the new drive came up as sdj which, according to zpool import, is now used for one of the system drives... I will give it a fresh look tomorrow and let the beast rest for now
I was just browsing through all the changes since the 0.6.3 tag, because I was wondering why 0.6.3 didn't have this behavior (it wouldn't evict any data in favor of metadata), and I came across https://github.com/zfsonlinux/zfs/commit/da8ccd0ee0d4eed08200e706cb2ca1da3bdfb5cf;
I think this isn't the best approach -- consider my use-case: I have over 80 containers per node, and if anyone decides to create a gazillion files and then rsync them, it evicts all of the useful data of e.g. the MySQL instances, which immediately start to thrash the disks. Since the metadata is hard to get rid of, it sends the machine to IO hell. I don't really like the idea of prioritizing metadata over data just because we can :)
For the time being, I think the best workaround for me is to set a limit for metadata; after walking through all of the servers, I think the best setting in my case is zfs_arc_meta_limit at 1/4 of zfs_arc_max (which itself is set to 1/4 of system RAM on my machines).
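(Concretely, this is the kind of setting I mean -- the byte values below are just an example for a 128GiB box, not a recommendation:)

```sh
# persist across reboots: zfs_arc_max = 32GiB (1/4 of RAM), zfs_arc_meta_limit = 8GiB (1/4 of arc_max)
echo "options zfs zfs_arc_max=34359738368 zfs_arc_meta_limit=8589934592" > /etc/modprobe.d/zfs.conf

# or adjust the live module without reloading it
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
echo 8589934592  > /sys/module/zfs/parameters/zfs_arc_meta_limit
```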
Well, setting zfs_arc_meta_limit doesn't really do much; ZFS does whatever it wants anyway -- it keeps increasing arc_meta_max, and arc_meta_used keeps increasing as well. Edit: I see that arc_meta_max just denotes the maximum size the metadata cache has reached.
Reverting https://github.com/zfsonlinux/zfs/commit/da8ccd0ee0d4eed08200e706cb2ca1da3bdfb5cf seems to do the trick -- arc_meta_used is now bounded by the limit I set, not by whatever ZFS thinks is best.
OK, I take that back; it took a while, but arc_meta_used has overgrown the limit and refuses to go back down.
@kpande I saw that about an hour ago, but my situation isn't a deadlock (luckily). arc_meta_used has just overgrown the limit and there's nothing which would force it to crawl back down. I've tried echoing 2 to drop_caches, but that bash process is now spinning in kernel space at 100% CPU. It did help the MFU gain some space and increase its hit rate, because it dropped some of the dnode caches. Now I'm at 50G of arc_meta_used, down from 70G -- still way over the limit (16G).
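(In case it's useful, this is roughly how I keep an eye on those counters -- plain arcstats, nothing project-specific:)

```sh
# print the metadata-related ARC counters (bytes -> GiB) every 30 seconds
while sleep 30; do
    awk '/^(size|arc_meta_used|arc_meta_limit|arc_meta_max|other_size) / {printf "%-16s %8.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
    echo '---'
done
```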
Just to let you know: I am struggling as well, with kswapd0 at 99% CPU and the kernel telling me about "hung task 120 seconds ... txg_quiesce". This happens just by copying files from a USB disk to a ZFS pool.
Shall I open a ticket, or just hope Debian Jessie and ZoL will one day fall in love based on this ticket?
The only strange thing: 3 of the 5 SMB shares from ZFS fall offline during the lock-up. I don't know why, nor why it is always the same 2 that remain... they are almost identical.
ok -- the beast locked up, but this time it seems a bit different: I do not remember seeing the ZIL mirror drives used at all in previous lock-downs (no applications requiring sync so far), but at the moment those poor SSDs are getting bombarded (not sure if with reads or writes), and I missed the point when the system was still somewhat responsive so I could collect the information -- attempts to log in remotely or locally now seem to go nowhere. Not sure if it contributed or is relevant -- a few hours back I added another drive to a spare bay, just to test the drive, and was running a dd from it to /dev/null.
yeah -- I would need something like that I guess, although those reports hopefully also end up in the syslog/journal files... Looking at the iostat reports in one of the screen sessions where I was still connected: maybe it wasn't the ZIL partitions but the OS partitions, since I see high IO on md0, which is the software RAID for the OS... so probably no ZIL to blame, sorry for the misjudgement (and it was primarily reading at ~20MBps and writing at ~2MBps), but anyways -- in previous lock-downs I didn't observe this high IO on those drives. Another screen had the initial stack traces (I believe from tail -f /var/log/kern.log):
Mar 16 10:42:33 smaug kernel: [174401.109358] sd 0:0:11:0: [sdk] Attached SCSI disk
Mar 16 11:09:02 smaug kernel: [175988.647411] INFO: task txg_sync:5066 blocked for more than 120 seconds.
Mar 16 11:09:02 smaug kernel: [175988.647417] Tainted: P CIO 3.16.0-4-amd64 #1
Mar 16 11:09:02 smaug kernel: [175988.647418] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 11:09:02 smaug kernel: [175988.647420] txg_sync D ffff88107fd52f00 0 5066 2 0x00000000
Mar 16 11:09:02 smaug kernel: [175988.647426] ffff88202c687bd0 0000000000000046 ffff88203ac38390 0000000000012f00
Mar 16 11:09:02 smaug kernel: [175988.647432] ffff88202c687fd8 0000000000012f00 ffff88203b33e350 ffff88107fd537b0
Mar 16 11:09:02 smaug kernel: [175988.647436] ffff8810115c3928 ffff8810115c3970 0000000000000001 0000000000000000
Mar 16 11:09:02 smaug kernel: [175988.647441] Call Trace:
Mar 16 11:09:02 smaug kernel: [175988.647454] [<ffffffff8153570d>] io_schedule+0x9d/0x120
Mar 16 11:09:02 smaug kernel: [175988.647483] [<ffffffffa04c7d05>] cv_wait_common+0xa5/0x170 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647490] [<ffffffff810aca10>] ? prepare_to_wait_event+0x100/0x100
Mar 16 11:09:02 smaug kernel: [175988.647498] [<ffffffffa04c7e28>] __cv_wait_io+0x18/0x20 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647527] [<ffffffffa07cfe2b>] zio_wait+0x12b/0x260 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647530] [<ffffffff815373e2>] ? mutex_lock+0x12/0x2f
Mar 16 11:09:02 smaug kernel: [175988.647547] [<ffffffffa0756691>] dsl_pool_sync+0xb1/0x470 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647565] [<ffffffffa077135d>] spa_sync+0x45d/0xbd0 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647568] [<ffffffff810aca22>] ? autoremove_wake_function+0x12/0x40
Mar 16 11:09:02 smaug kernel: [175988.647573] [<ffffffff8101bad9>] ? read_tsc+0x9/0x20
Mar 16 11:09:02 smaug kernel: [175988.647592] [<ffffffffa07848f4>] txg_sync_thread+0x364/0x600 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647608] [<ffffffffa0784590>] ? txg_init+0x310/0x310 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647616] [<ffffffffa04c3291>] thread_generic_wrapper+0x71/0x80 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647623] [<ffffffffa04c3220>] ? __thread_exit+0x20/0x20 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647629] [<ffffffff8108bd99>] kthread+0xc9/0xe0
Mar 16 11:09:02 smaug kernel: [175988.647634] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
Mar 16 11:09:02 smaug kernel: [175988.647637] [<ffffffff81538c7c>] ret_from_fork+0x7c/0xb0
Mar 16 11:09:02 smaug kernel: [175988.647641] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
Mar 16 11:11:02 smaug kernel: [176108.530163] INFO: task txg_sync:5066 blocked for more than 120 seconds.
Mar 16 11:11:02 smaug kernel: [176108.530168] Tainted: P CIO 3.16.0-4-amd64 #1
Mar 16 11:11:02 smaug kernel: [176108.530169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 11:11:02 smaug kernel: [176108.530170] txg_sync D ffff88107fd52f00 0 5066 2 0x00000000
Mar 16 11:11:02 smaug kernel: [176108.530176] ffff88202c687bd0 0000000000000046 ffff88203ac38390 0000000000012f00
Mar 16 11:11:02 smaug kernel: [176108.530180] ffff88202c687fd8 0000000000012f00 ffff88203b33e350 ffff88107fd537b0
Mar 16 11:11:02 smaug kernel: [176108.530182] ffff8810115c3928 ffff8810115c3970 0000000000000001 0000000000000000
Mar 16 11:11:02 smaug kernel: [176108.530185] Call Trace:
Mar 16 11:11:02 smaug kernel: [176108.530197] [<ffffffff8153570d>] io_schedule+0x9d/0x120
Mar 16 11:11:02 smaug kernel: [176108.530223] [<ffffffffa04c7d05>] cv_wait_common+0xa5/0x170 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530230] [<ffffffff810aca10>] ? prepare_to_wait_event+0x100/0x100
Mar 16 11:11:02 smaug kernel: [176108.530235] [<ffffffffa04c7e28>] __cv_wait_io+0x18/0x20 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530276] [<ffffffffa07cfe2b>] zio_wait+0x12b/0x260 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530281] [<ffffffff815373e2>] ? mutex_lock+0x12/0x2f
Mar 16 11:11:02 smaug kernel: [176108.530304] [<ffffffffa0756691>] dsl_pool_sync+0xb1/0x470 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530325] [<ffffffffa077135d>] spa_sync+0x45d/0xbd0 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530329] [<ffffffff810aca22>] ? autoremove_wake_function+0x12/0x40
Mar 16 11:11:02 smaug kernel: [176108.530336] [<ffffffff8101bad9>] ? read_tsc+0x9/0x20
Mar 16 11:11:02 smaug kernel: [176108.530357] [<ffffffffa07848f4>] txg_sync_thread+0x364/0x600 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530376] [<ffffffffa0784590>] ? txg_init+0x310/0x310 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530384] [<ffffffffa04c3291>] thread_generic_wrapper+0x71/0x80 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530393] [<ffffffffa04c3220>] ? __thread_exit+0x20/0x20 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530400] [<ffffffff8108bd99>] kthread+0xc9/0xe0
Mar 16 11:11:02 smaug kernel: [176108.530406] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
Mar 16 11:11:02 smaug kernel: [176108.530413] [<ffffffff81538c7c>] ret_from_fork+0x7c/0xb0
Mar 16 11:11:02 smaug kernel: [176108.530419] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
which then repeated a few times before the complete lockdown. Not sure how much that helps, though; I am resetting the beast now.
note -- the deadlock was somewhat "incremental" -- those messages started to appear but that tar process was still squirting out some data from time to time (sometimes 0bps, sometimes 20kbps) until it went completely dead. My fault that I hadn't prepared to run my reporting script immediately -- maybe I should automate that to fire upon receiving any worrisome message in kern.log (roughly as sketched below).
for "during crash" I would need to somehow memorize that particular moment I guess since at times I observe the beast only post-mortem after few hours of deadlock
btw -- here is the current munin report, which also shows that problematic point: http://www.onerussian.com/tmp/munin-lockdown-20150316.html#zfs The "ZFS ARC Size" plot shows that size started to grow linearly from some time earlier today, and the "Frequently used cache size" (size - p from arcstats, if I got that right) caught up with it. Maybe there is something else there which would be of interest.
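(For reference, the quantity I call "frequently used cache size" there is computed roughly like this, assuming my reading of arcstats is right:)

```sh
# ARC size minus the MRU target p, i.e. the MFU-ish portion, in GiB
awk '$1=="size"{s=$3} $1=="p"{p=$3} END{printf "size - p = %.1f GiB\n", (s-p)/2^30}' /proc/spl/kstat/zfs/arcstats
```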
@behlendorf There seems to be a lock inversion between l2arc_buflist_mtx and HDR_LOCK. In arc_release, we hold HDR_LOCK and then l2arc_buflist_mtx. However, if we hold l2arc_buflist_mtx -> enter direct reclaim -> arc_buf_remove_buf -> HDR_LOCK, we deadlock. So l2arc_buflist_mtx should be taken under FSTRANS.
Also, we need to find out whether there are other locks that can be held inside an FSTRANS lock and also held outside of FSTRANS; they should also be FSTRANS.
@behlendorf I will wait for your input on #3193 as to whether I should check it out -- in conjunction with, or instead of, the previous fixes?
@yarikoptic there seem to be lots of potential places where deadlocks could occur
Since those patches don't seem mutually exclusive, I'd go for both (as many as possible :+1: ): https://github.com/zfsonlinux/zfs/pull/3193 & https://github.com/dweeezil/zfs/commit/f04f972 (referenced in https://github.com/zfsonlinux/zfs/issues/3183)
let's see what @behlendorf 's input is on this
referencing #2990 & #2934 -- the error messages look similar and related
"rsync, trash cleanup jobs, lots of hard links" "Sometimes seems to occur during moderate continuous IO"
I wonder what the common denominator really is that triggers this (besides rsync, which occasionally seems to lead to it)
The hope was that #2983 would be fixed by #3132, and I was recently happy to report that the system seemed to have survived relatively heavy bombarding, but again after 4 days the beast locked down. Here are the gory details: http://www.onerussian.com/tmp//zfs_system_details_20150307/ Let me know if I should provide more information of any kind.