If of any help -- here is a historical munin overview for today/this week: http://www.onerussian.com/tmp/munin-locdown-20150307.html . For example (if at all relevant -- it could just be an effect of the deadlock), there was a spike of interrupts right around the deadlock point.
Please ignore the above comment about interrupts -- the plot rendering is a bit confusing; they actually just went down.
A fresh dump of the system info (now with stacktraces.txt, thanks @DeHackEd for spotting their omission) is available from http://www.onerussian.com/tmp//zfs_system_details_20150308/ . If I don't hear back I will reboot the beast in a few hours -- I need to get it usable again.
BTW -- this stall feels different from before: unlike previously, I can still perform some actions on the mounted ZFS partitions/datasets (e.g. modify files, git commit) while other actions get stuck (docker now can't do anything even though it isn't directly associated with ZFS, and ZFS snapshots, arc_adapt, l2arc_feed, txg_sync and that lengthy rm -r of a directory all seem to be stuck in D state).
@yarikoptic it appears that #2983 was significantly improved by #3121 but we didn't quite get every possible deadlock. It's clear from the backtraces you provided that the threads are all blocked on the ARC hash lock, but what isn't clear is which task is holding that lock and why it's not giving it up. Normally I'm highly suspicious of the direct reclaim paths, like the one taken by the l2arc_feed thread, but in this case I don't see where it would have taken the lock. I put together a debug patch to try and catch the offending task; can you try running with the patch in #3161?
NOTE: This may be a blocker for 0.6.4. There is clearly a potential deadlock still remaining in the code, although to my knowledge @yarikoptic's workload is the only one which has triggered it.
Is there a description of the workload which triggers this problem? My current testing rigs for the recent illumos ARC work (purposely not referring to those pull requests to minimize non-relevant links) would allow me to test this type of thing pretty easily.
One thing that sticks out to me in the original posting is the other_size of 35627695184 out of an ARC size of 40485186048 (ouch). It would have been interesting to see /proc/slabinfo as well.
The workload was primarily a
tar -cf- hcp-ls.20141020-500subjects-addurl |pv | pigz -9 >| hcp-ls.20141020-500subjects-addurl.tar.gz
command where that directory contains millions of files, symlinks and subdirectories (probably around 40 million entries altogether). Each file in there is very small (it contains just a path). The resultant tar.gz is 16GB. In parallel with, or before, that load I may also have had find/chmod/rm sweeps running over similarly "large" directories. If necessary I could share that generated tree (there was a successful run after all) for your testing so you could simulate a similar load, but it usually took a few days for the deadlock to happen.
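In the meantime, here is a rough sketch of how one could synthesize a comparable tree without my actual data -- the paths and counts below are made up, the real layout differs:

```sh
#!/bin/sh
# Build ~40 million tiny files (each containing just a short path) on a
# ZFS-backed directory, then tar it up the same way as the real workload.
ROOT=/tank/stress-tree        # hypothetical ZFS dataset mountpoint
for d in $(seq 1 4000); do
    dir="$ROOT/dir$d"
    mkdir -p "$dir"
    for f in $(seq 1 10000); do
        echo "some/annexed/object/path/$d/$f" > "$dir/file$f"
    done
done
tar -cf- "$ROOT" | pv | pigz -9 > /tmp/stress-tree.tar.gz
```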
Unfortunately I have rebooted the beast, so I guess slabinfo wouldn't be of interest at the moment, but I have now added it to the http://github.com/yarikoptic/zfs-system-details/blob/HEAD/zfs-system-details script so it will be reported next time.
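The addition itself is nothing fancy -- essentially just one more /proc snapshot next to the others the script already grabs (a sketch, not the exact script contents):

```sh
# capture slab allocator stats alongside the other ZFS diagnostics
OUT=zfs_system_details_$(date +%Y%m%d)    # hypothetical output directory
mkdir -p "$OUT"
cat /proc/slabinfo               > "$OUT/slabinfo"
cat /proc/spl/kstat/zfs/arcstats > "$OUT/arcstats"
cat /proc/meminfo                > "$OUT/meminfo"
```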
@yarikoptic you might also want to pick up #3163. The automated testing uncovered another issue, which looks different from yours, but is still possible.
gotcha, rebuilding with #3163 and will restart/resume the bombardment shortly
Another workload involving the traversal (statting) of million(s) of files, and other_size is way out of control. YABITP (yet another big inode traversal problem). Of course, the deadlock you encountered shouldn't have happened, but with c_max at 67GB, size at 86.8GB and other_size being 35.6GB of that, it's no wonder that Bad Things™ happen.
ain't I a unique precious user exposing myself to those Bad Things, @dweeezil ? ;) (beast rebooted with both patches in place and tar/chmod running)
ok -- found myself unable to ssh into the box today, and attaching to the ipmi console showed that it seems we hit the target:
[133273.994931] BUG: soft lockup - CPU#7 stuck for 23s! [l2arc_feed:1098]
but the problem is that it seems I can't even log in (I should have left an ipmi console session logged in, I guess :-/) -- I will keep trying
ok -- upon an attempt to log in from the physical console -- it managed! http://www.onerussian.com/tmp//zfs_system_details_20150311/ has some logs for you guys, including the messages in kern.log from that debug patch. Looking at the munin reports, oddly enough the L2ARC stats report a 270GB L2ARC size, which exceeds the amount of available L2ARC space (I have a 200GB Intel SSD drive): http://www.onerussian.com/tmp/munin-lockdown-20150311.html#zfs
@yarikoptic Could you please add /proc/slabinfo.
@dweeezil http://www.onerussian.com/tmp//zfs_system_details_20150311-2/ should now contain all the previous information (updated) + slabinfo and other goodies ;) I am about to try flushing the caches, as was suggested on IRC, in an attempt to release the unevictable data
and FWIW http://www.onerussian.com/tmp//zfs_system_details_20150311-3/ should have the same info after I issued echo 3 >/proc/sys/vm/drop_caches, which I believe has not returned yet.
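(For the record, the flush attempt itself is just the standard sequence below; the arcstats grep is only a quick way to see whether anything actually got released:)

```sh
# snapshot the relevant ARC counters, drop caches, then compare
grep -E '^(size|arc_meta_used|other_size) ' /proc/spl/kstat/zfs/arcstats > /tmp/arc.before
sync
echo 3 > /proc/sys/vm/drop_caches   # may block for a very long time under this workload
grep -E '^(size|arc_meta_used|other_size) ' /proc/spl/kstat/zfs/arcstats > /tmp/arc.after
diff -u /tmp/arc.before /tmp/arc.after
```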
Since the box is not reachable via the network, and I wouldn't be able to run down to it for another 24h, I am inclined to reboot it within half an hour. If there is any other information I could provide before then -- please let me know. Upon reboot I will not bring the same load up (since I know that it would lock the box up, killing remote access) unless there is a new patch to check or a need for more information.
Points of interest from zfs_system_details_20150311-2:
zio_buf_512 24203409 24249872 512 8 1 : tunables 54 27 8 : slabdata 3031234 3031234 0
zio_buf_16384 2221818 2221818 16384 1 4 : tunables 8 4 0 : slabdata 2221818 2221818 0
zfs_znode_cache 0x00020 14141956096 11751261120 8192 1136 1726313 1726313 1983777 10357878 10344420 11902662 1 0 0
c_max 4 67685584896
size 4 81237672816
mru_size 4 28278091264
mru_evict_data 4 512
mru_evict_metadata 4 49152
mfu_size 4 8521976320
mfu_evict_data 4 5632
mfu_evict_metadata 4 3311104
arc_meta_used 4 81051423600
arc_meta_limit 4 50764188672
arc_meta_max 4 81997413160
l2_hdr_size 4 6246311920
I'm inclined to believe the workload breaks the ARC, as it seems to consist mainly of 512b buffers (perhaps spill blocks). The interesting thing is how much data is contained in the MRU and MFU but is unevictable, so the system will just spin trying to reclaim without making any progress. These ARC buffers are likely unevictable because the dbuf cache is holding references. Some of this unevictable data is probably blocks for the dnode object (pinned by dnode_t structs), or indirect blocks (pinned by data blocks). This is all speculation, though, just from quickly glancing at the linked files.
Spill blocks generally wind up in kmalloc-512, which is low in this case. The zio_buf_impl_t are likely bonus buffers, which should generally track dnode_t. I've been running my testing lately with a patch which breaks down the components of other_size to help in matters like this. I've also considered putting the bonus buffers in their own cache instead of letting them share the zio cache (mainly to track them better).
We're having problems with ZFS-induced lock-ups really often too; there's at least one lock-up a day, and it's gotten worse than it used to be. Since about ~4 weeks ago, on every lockup I can't even SSH into the box to get the traces (the rootfs isn't on ZFS, it's ext4; ZFS is only for the containers).
It happens mostly when the workload shifts -- we balance things so that all of the RAM is used by applications and the rest goes to the ARC. During the daytime it all works OK, but the problems come in the early morning hours, when the backups run. We back up with an rsync+NFS combo, since we haven't implemented send/recv backups yet. People also do their own backups during the night.
This needs to be resolved soon, as it is now beginning to impact our reputation; we've had so much down-time that people are becoming nervous and are thinking about moving away to another host. Although we're a nonprofit org, so people cut us some slack, it still isn't a good position, and if nothing else it causes a lot of stress to me as an admin.
If anyone has any suggestions on what I can do when the box locks up and SSH connections time out, I'm all game.
@snajpa I know it's not really a solution, but would rate-limiting help as a temporary mitigation (provided, of course, that the backups are not that large and e.g. everything can still be transferred despite that limit)?
rsync --bwlimit=<kb/second> <source> <dest>
I remember having read about a similar issue (rsync- & load-related) with Btrfs where this temporarily helped
@kernelOfTruth I've thought about that, but then we wouldn't manage to back up all the CTs in time before the daily peak load, when the IO capacity is needed for more useful things than doing backups. Also, this doesn't do anything about the custom backups which people do on their own in their containers.
I agree with @prakashsurya, for this workload it definitely looks like for some reason there is a large amount of metadata being pinned in the ARC. This is probably related to the huge number of inodes (10+ million) in the Linux VFS inode hash. Other filesystems on Linux have run into similar kinds of issues when the hash gets this large.
If you are willing to roll a custom kernel, one possible solution would be to add a hard limit in the kernel's VFS layer. Patches to do this have been proposed several times in the past on LKML. Unfortunately, those changes have always been rejected because in theory they should not be needed. In practice there seem to be legitimate reasons why you might want to do this, for example on a low-memory embedded system.
Unfortunately, although I know I've seen these patches posted before, I'm not having any luck turning them up with a Google search. Perhaps someone else might have better luck... Unfortunately, since the VFS manages this cache there's only so much we can do on the ZFS side.
@snajpa I assume you must be seeing a similar issue based on the workload you've described. I know @yarikoptic is testing with master and the ABD patches, but what version of ZoL are you using?
I also want to say I definitely agree we need to get to the bottom of this and get it resolved.
@behlendorf not sure if that's exactly what you're looking for, but a search turned up the following (some interesting results among them):
https://lkml.org/lkml/2011/8/14/110 http://serverfault.com/questions/561350/unusually-high-dentry-cache-usage https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7/2.6.7-mm1/broken-out/vfs-shrinkage-tuning.patch
edit: some more finds on how e.g. other filesystems handle things: https://lists.samba.org/archive/linux-cifs-client/2010-May/006111.html http://www.gossamer-threads.com/lists/linux/kernel/1800236?do=post_view_threaded http://xfs.org/index.php/Improving_inode_Caching http://www.linuxforums.org/forum/kernel/189619-nfs-inode-cache-eats-up-memory-design-flaw.html
edit2: search terms used were:
patch vfs hard limit out of memory inodes inode cache limit patch
via Austrian Google entry (each country's Google index seems to lead to different results)
On Wed, 11 Mar 2015, Brian Behlendorf wrote:
I agree with @prakashsurya , for this workload it definitely looks like for some reason there is a large amount of metadata being pinned in the ARC. This is probably related to the huge number of inodes (10+ million) in the Linux VFS inode hash.
but in theory, why would a "large amount" be of relevance here -- some int overflow deep inside?
Other filesystems on Linux have run into similar kinds of issues when the hash gets this large.
If you are willing to roll a custom kernel one possible solution would be
I could try a custom kernel build. But in the long run I hope for a more sustainable resolution... could anyone help to locate any of the patches Brian had in mind?
Unfortunately, since the VFS manages this cache there's only so much we can do on the ZFS side.
so you think it is just a stock Linux kernel issue and not ZFS to blame? Then it definitely should be brought up with them -- but I would not be able to send a cohesive report myself. FWIW -- I could try replicating the situation on some other drive with a similar file structure/load if necessary (although I still hope not to be asked to do so ;) ). Just let me know if you think this is necessary.
@behlendorf I'm usually on HEAD or very close to it, with AIO function mappings commented out (https://github.com/vpsfreecz/zfs)
I personally have no problem with rolling a custom kernel, as long as it has any chance of smoothing things out. At this point, I'm willing to move a mountain to get the situation stabilized.
I also forgot to mention that I've set up periodic dentry cache flushing by echoing 1 to drop_caches every 5 minutes, and the issues still persist. So I'm not really sure the dcache is (solely) to blame. Any suggestions on how to help you guys chase this madness are more than welcome; this problem is at the absolute top of my priority list.
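(For completeness, the periodic flush is nothing more than a cron entry along these lines -- illustrative, my exact file differs:)

```sh
# install a 5-minute cache flush via cron
cat > /etc/cron.d/drop-caches <<'EOF'
# note: echo 1 frees the pagecache; 2 targets dentries/inodes; 3 does both
*/5 * * * * root /bin/sh -c 'sync; echo 1 > /proc/sys/vm/drop_caches'
EOF
```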
FWIW, I have inserted a cold spare into that box, formatted it ext4, and am filling it up with the same junk which brings ZFS down... let's see how that single drive holds up... any particular metrics to look out for?
@snajpa @yarikoptic this afternoon I took a careful look at this and I think I've spotted the root cause of the problem. I'd like to verify I'm on the right track tomorrow morning so I can potentially turn around a patch before the weekend. However, to do that it would be very helpful if you could post the exact kernel version you're using.
@behlendorf sounds promising! ;) mine is
$> apt-cache policy linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:
Installed: 3.16.7-ckt7-1
Candidate: 3.16.7-ckt7-1
Version table:
3.16.7-ckt7-1 0
500 http://debian.csail.mit.edu/debian/ jessie/main amd64 Packages
*** 3.16.7-ckt7-1 0
100 /var/lib/dpkg/status
$> uname -a
Linux smaug 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt7-1 (2015-03-01) x86_64 GNU/Linux
update 1: it was re-built manually to facilitate debugging:
--- debian/build/build_amd64_none_amd64/.config 2015-03-09 08:50:36.071047129 -0400
+++ ../../config-3.16.0-4-amd64 2015-03-09 08:47:52.807044961 -0400
CONFIG_SYSFS_SYSCALL=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_KALLSYMS=y
-# CONFIG_KALLSYMS_ALL is not set
+CONFIG_KALLSYMS_ALL=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
@@ -6319,14 +6246,14 @@
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
-CONFIG_STRIP_ASM_SYMS=y
+# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
-# CONFIG_FRAME_POINTER is not set
+CONFIG_FRAME_POINTER=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x01b6
We're using OpenVZ RHEL6 kernels, from 2.6.32-042stab093.4 to 2.6.32-042stab104.1. I've just had one node lock up; it was clearly because it ran out of memory, and this dentry/dnode cache problem is highly visible there and clearly to blame.
Take a look at the top of slabtop -o -s c:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
15773040 15773040 100% 0.84K 1752560 9 14020480K dnode_t
628394 628394 100% 16.00K 628394 1 10054304K zio_buf_16384
15353824 15353824 100% 0.50K 1919228 8 7676912K zio_buf_512
619659 619659 100% 8.00K 619659 1 4957272K size-8192
15989831 15985087 99% 0.30K 1229987 13 4919948K dmu_buf_impl_t
18210094 18091691 99% 0.22K 1071182 17 4284728K dentry
6435420 6380551 99% 0.30K 536285 12 2145140K arc_buf_hdr_t
15073837 15073837 100% 0.10K 407401 37 1629604K sa_cache
921927 921923 99% 1.15K 307309 3 1229236K nfs_inode_cache
10053364 6160374 61% 0.06K 170396 59 681584K size-64
4057830 4008979 98% 0.12K 135261 30 541044K size-128
9261168 9146348 98% 0.03K 82689 112 330756K size-32
4941320 4940492 99% 0.04K 53710 92 214840K l2arc_buf_hdr_t
849243 835175 98% 0.20K 44697 19 178788K vm_area_struct
1448240 635047 43% 0.09K 36206 40 144824K arc_buf_t
175530 174755 99% 0.72K 35106 5 140424K proc_inode_cache
898098 750420 83% 0.06K 15222 59 60888K range_seg_cache
102298 100602 98% 0.55K 14614 7 58456K radix_tree_node
1017555 960753 94% 0.05K 13215 77 52860K anon_vma_chain
37962 37797 99% 1.09K 12654 3 50616K ext4_inode_cache
11398 11326 99% 2.75K 5699 2 45592K task_struct
37912 35769 94% 1.00K 9478 4 37912K size-1024
126225 118656 94% 0.25K 8415 15 33660K filp
479325 468327 97% 0.05K 6225 77 24900K anon_vma
The box had about 300MB of free memory and no swap, so it was stuck in reclaim. I have no idea why zio_buf_16384 uses up this much space, but dnode_t seems to be the bomb here :) And AFAIK there's no instrument to tell the dnode cache to get the f*ck off my lawn (RAM :D ), is there?
@yarikoptic can you please see if the patch in #3181 improves things? I was able to reproduce the issue you've described, and the patch helps here. It does still need some polish and probably more work for 2.6.32 vintage kernels, but I'd like to know how it helps your real workload.
Thank you @behlendorf. I will try it out asap. Meanwhile I have been bombarding that single spindle with the ext4 partition on it with a similar load, and even though it now holds more inodes than the ZFS partition it seems to be surviving well (and, unfortunately, delivers almost 10x the throughput on that simple tar'ing test compared to when I did it on ZFS).
hm... probably unrelated to the patch, but upon build/install/reboot the pool wasn't imported as it should have been (and was before)... maybe the additional spindle I inserted is to blame, since I believe I might have initialized the pool using sd? names, and the new drive came up as sdj which, according to zpool import, is now used for one of the system drives... I will give it a fresh look tomorrow and let the beast rest for now
I was just browsing through all the changes since the 0.6.3 tag, because I was wondering why 0.6.3 didn't have this behavior (it wouldn't evict any data in favor of metadata), and I came across https://github.com/zfsonlinux/zfs/commit/da8ccd0ee0d4eed08200e706cb2ca1da3bdfb5cf;
I think this isn't the best approach -- consider my use-case: I have over 80 containers per node, and if anyone decides to create a gazillion files and then rsync them, it evicts all of the useful data of e.g. the MySQL instances, which immediately start to thrash the disks. Since the metadata is hard to get rid of, it sends the machine to IO hell. I don't really like the idea of prioritizing metadata over data just because we can :)
For the time being, I think the best workaround for me is to set a limit for metadata; after walking through all of the servers, I think the best setting in my case is zfs_arc_meta_limit at 1/4 of zfs_arc_max (which itself is set to 1/4 of system RAM on my machines).
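(Concretely, this is the kind of setting I mean -- the byte values below are just an example for a 128GiB box, not a recommendation:)

```sh
# persist across reboots: zfs_arc_max = 32GiB (1/4 of RAM), zfs_arc_meta_limit = 8GiB (1/4 of arc_max)
echo "options zfs zfs_arc_max=34359738368 zfs_arc_meta_limit=8589934592" > /etc/modprobe.d/zfs.conf

# or adjust the live module without reloading it
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
echo 8589934592  > /sys/module/zfs/parameters/zfs_arc_meta_limit
```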
Well, setting zfs_arc_meta_limit doesn't really do much; ZFS does whatever it wants anyway -- it keeps increasing arc_meta_max, and arc_meta_used keeps increasing as well. Edit: I see that arc_meta_max just denotes the maximum size the metadata cache has reached.
Reverting https://github.com/zfsonlinux/zfs/commit/da8ccd0ee0d4eed08200e706cb2ca1da3bdfb5cf seems to do the trick -- arc_meta_used is now bounded by the limit I set, not by whatever ZFS thinks is best.
OK, I take that back; it took a while, but arc_meta_used has overgrown the limit and refuses to go back down.
@kpande I saw that about an hour ago, but my situation isn't a deadlock (luckily). arc_meta_used has just overgrown the limit and there's nothing which would force it to crawl back down. I've tried echoing 2 to drop_caches, but that bash process is now spinning in kernel space at 100% CPU. It did help the MFU gain some space and increase its hit rate, because it dropped some of the dnode caches. Now I'm at 50G of arc_meta_used, down from 70G -- still way over the limit (16G).
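(In case it's useful, this is roughly how I keep an eye on those counters -- plain arcstats, nothing project-specific:)

```sh
# print the metadata-related ARC counters (bytes -> GiB) every 30 seconds
while sleep 30; do
    awk '/^(size|arc_meta_used|arc_meta_limit|arc_meta_max|other_size) / {printf "%-16s %8.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats
    echo '---'
done
```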
Just to let you know: I am struggling as well, with kswapd0 at 99% CPU and the kernel telling me about "hung task 120 seconds ... txg_quiesce". This happens just by copying files from a USB disk to a ZFS pool.
Shall I open a ticket, or just hope Debian Jessie and ZoL will one day fall in love based on this ticket?
The only strange thing: 3 of the 5 SMB shares from ZFS fall offline during the lock-up. I don't know why, nor why it is always the same 2 that remain... they are almost identical.
ok -- the beast locked up, but this time it seems a bit different: I do not remember seeing the ZIL mirror drives used at all in previous lock-downs (no applications requiring sync so far), but at the moment those poor SSDs are getting bombarded (not sure if with reads or writes), and I missed the point when the system was still somewhat responsive so I could collect the information -- attempts to log in remotely or locally now seem to go nowhere. Not sure if it contributed or is relevant -- a few hours back I added another drive to a spare bay, just to test the drive, and was running a dd from it to /dev/null.
yeah -- I would need something like that I guess, although those reports hopefully also end up in the syslog/journal files... Looking at the iostat reports in one of the screen sessions where I was still connected: maybe it wasn't the ZIL partitions but the OS partitions, since I see high IO on md0, which is the software RAID for the OS... so probably no ZIL to blame, sorry for the misjudgement (and it was primarily reading at ~20MBps and writing at ~2MBps), but anyways -- in previous lock-downs I didn't observe this high IO on those drives. Another screen had the initial stack traces (I believe from tail -f /var/log/kern.log):
Mar 16 10:42:33 smaug kernel: [174401.109358] sd 0:0:11:0: [sdk] Attached SCSI disk
Mar 16 11:09:02 smaug kernel: [175988.647411] INFO: task txg_sync:5066 blocked for more than 120 seconds.
Mar 16 11:09:02 smaug kernel: [175988.647417] Tainted: P CIO 3.16.0-4-amd64 #1
Mar 16 11:09:02 smaug kernel: [175988.647418] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 11:09:02 smaug kernel: [175988.647420] txg_sync D ffff88107fd52f00 0 5066 2 0x00000000
Mar 16 11:09:02 smaug kernel: [175988.647426] ffff88202c687bd0 0000000000000046 ffff88203ac38390 0000000000012f00
Mar 16 11:09:02 smaug kernel: [175988.647432] ffff88202c687fd8 0000000000012f00 ffff88203b33e350 ffff88107fd537b0
Mar 16 11:09:02 smaug kernel: [175988.647436] ffff8810115c3928 ffff8810115c3970 0000000000000001 0000000000000000
Mar 16 11:09:02 smaug kernel: [175988.647441] Call Trace:
Mar 16 11:09:02 smaug kernel: [175988.647454] [<ffffffff8153570d>] io_schedule+0x9d/0x120
Mar 16 11:09:02 smaug kernel: [175988.647483] [<ffffffffa04c7d05>] cv_wait_common+0xa5/0x170 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647490] [<ffffffff810aca10>] ? prepare_to_wait_event+0x100/0x100
Mar 16 11:09:02 smaug kernel: [175988.647498] [<ffffffffa04c7e28>] __cv_wait_io+0x18/0x20 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647527] [<ffffffffa07cfe2b>] zio_wait+0x12b/0x260 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647530] [<ffffffff815373e2>] ? mutex_lock+0x12/0x2f
Mar 16 11:09:02 smaug kernel: [175988.647547] [<ffffffffa0756691>] dsl_pool_sync+0xb1/0x470 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647565] [<ffffffffa077135d>] spa_sync+0x45d/0xbd0 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647568] [<ffffffff810aca22>] ? autoremove_wake_function+0x12/0x40
Mar 16 11:09:02 smaug kernel: [175988.647573] [<ffffffff8101bad9>] ? read_tsc+0x9/0x20
Mar 16 11:09:02 smaug kernel: [175988.647592] [<ffffffffa07848f4>] txg_sync_thread+0x364/0x600 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647608] [<ffffffffa0784590>] ? txg_init+0x310/0x310 [zfs]
Mar 16 11:09:02 smaug kernel: [175988.647616] [<ffffffffa04c3291>] thread_generic_wrapper+0x71/0x80 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647623] [<ffffffffa04c3220>] ? __thread_exit+0x20/0x20 [spl]
Mar 16 11:09:02 smaug kernel: [175988.647629] [<ffffffff8108bd99>] kthread+0xc9/0xe0
Mar 16 11:09:02 smaug kernel: [175988.647634] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
Mar 16 11:09:02 smaug kernel: [175988.647637] [<ffffffff81538c7c>] ret_from_fork+0x7c/0xb0
Mar 16 11:09:02 smaug kernel: [175988.647641] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
Mar 16 11:11:02 smaug kernel: [176108.530163] INFO: task txg_sync:5066 blocked for more than 120 seconds.
Mar 16 11:11:02 smaug kernel: [176108.530168] Tainted: P CIO 3.16.0-4-amd64 #1
Mar 16 11:11:02 smaug kernel: [176108.530169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 16 11:11:02 smaug kernel: [176108.530170] txg_sync D ffff88107fd52f00 0 5066 2 0x00000000
Mar 16 11:11:02 smaug kernel: [176108.530176] ffff88202c687bd0 0000000000000046 ffff88203ac38390 0000000000012f00
Mar 16 11:11:02 smaug kernel: [176108.530180] ffff88202c687fd8 0000000000012f00 ffff88203b33e350 ffff88107fd537b0
Mar 16 11:11:02 smaug kernel: [176108.530182] ffff8810115c3928 ffff8810115c3970 0000000000000001 0000000000000000
Mar 16 11:11:02 smaug kernel: [176108.530185] Call Trace:
Mar 16 11:11:02 smaug kernel: [176108.530197] [<ffffffff8153570d>] io_schedule+0x9d/0x120
Mar 16 11:11:02 smaug kernel: [176108.530223] [<ffffffffa04c7d05>] cv_wait_common+0xa5/0x170 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530230] [<ffffffff810aca10>] ? prepare_to_wait_event+0x100/0x100
Mar 16 11:11:02 smaug kernel: [176108.530235] [<ffffffffa04c7e28>] __cv_wait_io+0x18/0x20 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530276] [<ffffffffa07cfe2b>] zio_wait+0x12b/0x260 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530281] [<ffffffff815373e2>] ? mutex_lock+0x12/0x2f
Mar 16 11:11:02 smaug kernel: [176108.530304] [<ffffffffa0756691>] dsl_pool_sync+0xb1/0x470 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530325] [<ffffffffa077135d>] spa_sync+0x45d/0xbd0 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530329] [<ffffffff810aca22>] ? autoremove_wake_function+0x12/0x40
Mar 16 11:11:02 smaug kernel: [176108.530336] [<ffffffff8101bad9>] ? read_tsc+0x9/0x20
Mar 16 11:11:02 smaug kernel: [176108.530357] [<ffffffffa07848f4>] txg_sync_thread+0x364/0x600 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530376] [<ffffffffa0784590>] ? txg_init+0x310/0x310 [zfs]
Mar 16 11:11:02 smaug kernel: [176108.530384] [<ffffffffa04c3291>] thread_generic_wrapper+0x71/0x80 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530393] [<ffffffffa04c3220>] ? __thread_exit+0x20/0x20 [spl]
Mar 16 11:11:02 smaug kernel: [176108.530400] [<ffffffff8108bd99>] kthread+0xc9/0xe0
Mar 16 11:11:02 smaug kernel: [176108.530406] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
Mar 16 11:11:02 smaug kernel: [176108.530413] [<ffffffff81538c7c>] ret_from_fork+0x7c/0xb0
Mar 16 11:11:02 smaug kernel: [176108.530419] [<ffffffff8108bcd0>] ? kthread_create_on_node+0x180/0x180
which then repeated a few times before the complete lockdown. Not sure how much that helps, though; I am resetting the beast now.
note -- the deadlock was somewhat "incremental" -- those messages started to appear but that tar process was still squirting out some data from time to time (sometimes 0bps, sometimes 20kbps) until it went completely dead. My fault that I hadn't prepared to run my reporting script immediately -- maybe I should automate that to fire upon receiving any worrisome message in kern.log (roughly as sketched below).
for "during crash" I would need to somehow memorize that particular moment I guess since at times I observe the beast only post-mortem after few hours of deadlock
btw -- here is the current munin report, which also shows that problematic point: http://www.onerussian.com/tmp/munin-lockdown-20150316.html#zfs The "ZFS ARC Size" plot shows that size started to grow linearly from some time earlier today, and the "Frequently used cache size" (size - p from arcstats, if I got that right) caught up with it. Maybe there is something else there which would be of interest.
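(For reference, the quantity I call "frequently used cache size" there is computed roughly like this, assuming my reading of arcstats is right:)

```sh
# ARC size minus the MRU target p, i.e. the MFU-ish portion, in GiB
awk '$1=="size"{s=$3} $1=="p"{p=$3} END{printf "size - p = %.1f GiB\n", (s-p)/2^30}' /proc/spl/kstat/zfs/arcstats
```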
@behlendorf There seems to be a lock inversion between l2arc_buflist_mtx and HDR_LOCK. In arc_release, we hold HDR_LOCK and then l2arc_buflist_mtx. However, if we hold l2arc_buflist_mtx -> enter direct reclaim -> arc_buf_remove_buf -> HDR_LOCK, we deadlock. So l2arc_buflist_mtx should be taken under FSTRANS.
Also, we need to find out whether there are other locks that can be held inside an FSTRANS lock and also held outside of FSTRANS; they should also be FSTRANS.
@behlendorf I will wait for your input on #3193 as to whether I should check it out -- in conjunction with, or instead of, the previous fixes?
@yarikoptic there seem to be lots of potential places where deadlocks could occur
Since those patches don't seem mutually exclusive, I'd go for both (as many as possible :+1: ): https://github.com/zfsonlinux/zfs/pull/3193 & https://github.com/dweeezil/zfs/commit/f04f972 (referenced in https://github.com/zfsonlinux/zfs/issues/3183)
let's see what @behlendorf 's input is on this
referencing #2990 & #2934 -- the error messages look similar and related
"rsync, trash cleanup jobs, lots of hard links" "Sometimes seems to occur during moderate continuous IO"
I wonder what the common denominator really is that triggers this (besides rsync, which occasionally seems to lead to it)
The hope was that #2983 would be fixed by #3132, and I was recently happy to report that the system seemed to have survived relatively heavy bombarding, but again after 4 days the beast locked down. Here are the gory details: http://www.onerussian.com/tmp//zfs_system_details_20150307/ Let me know if I should provide more information of any kind.