note, I couldn't do anything in between because of issue #1621, I believe.

other symptoms: `ssh` sessions also hung while the machine seemed hung.

@dnozay This is very similar to #1928. Thus far we haven't been able to identify the root cause because it occurs very rarely, but we're aware of the issue and have been investigating.
in case it is relevant, I had `dedup=on` and `compression=on`; as you may have noted there is also `ashift: 9`, but `blockdev --getbsz /dev/sdb` returns 4096.
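To cross-check these settings, something like the following should work (a sketch: the pool name `zfs-data` is my assumption, based on the `/zfs-data/jenkins` mountpoint mentioned below):

```sh
# pool name "zfs-data" is assumed; adjust to your setup
zfs get dedup,compression zfs-data   # confirm the properties in effect
zdb -C zfs-data | grep ashift        # ashift=9 -> 512B sectors, ashift=12 -> 4K
blockdev --getbsz /dev/sdb           # block size the kernel reports for the device
blockdev --getpbsz /dev/sdb          # physical sector size of the device
```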
@behlendorf, since I need to get this working, I am more than willing to help debug. Yesterday, the jenkins slaves also died in a deadlock-like fashion. Could you please see https://gist.github.com/dnozay/9011860 and let me know if it is related? thank you.
Happened again last night, but I couldn't trigger SysRq. Upon reboot, I found out it was not even enabled in `/etc/sysctl.conf`; I set `kernel.sysrq = 1` and then proceeded with the reboot. If it happens again in the next few days I will be ready.
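For reference, enabling SysRq without waiting for a reboot is just standard sysctl mechanics (nothing here is specific to this setup):

```sh
# take effect immediately
echo 1 > /proc/sys/kernel/sysrq
# persist across reboots
echo 'kernel.sysrq = 1' >> /etc/sysctl.conf
sysctl -p   # reload /etc/sysctl.conf to verify
```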
`/var/log/messages` portion that survived the reboot: https://gist.github.com/dnozay/9084691#file-messages-txt
Feb 18 14:45:36 jenkins-slave2 kernel: INFO: task spl_kmem_cache/:706 blocked for more than 120 seconds.
Feb 18 14:45:36 jenkins-slave2 kernel: Tainted: P --------------- 2.6.32-431.3.1.el6.x86_64 #1
Feb 18 14:45:36 jenkins-slave2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 18 14:45:36 jenkins-slave2 kernel: spl_kmem_cach D 0000000000000000 0 706 2 0x00000000
Feb 18 14:45:36 jenkins-slave2 kernel: ffff88013c9992b0 0000000000000046 0000000000000000 ffffffffa02faa40
Feb 18 14:45:36 jenkins-slave2 kernel: ffff8800bad5d6d0 00000002627e8358 0000000000000004 00000000a03344e3
Feb 18 14:45:36 jenkins-slave2 kernel: ffff8801385bdab8 ffff88013c999fd8 000000000000fbc8 ffff8801385bdab8
Feb 18 14:45:36 jenkins-slave2 kernel: Call Trace:
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa02faa40>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff815280b3>] io_schedule+0x73/0xc0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01cd41c>] cv_wait_common+0xac/0x1c0 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01cd548>] __cv_wait_io+0x18/0x20 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa0335c3b>] zio_wait+0xfb/0x1b0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa029c58d>] dbuf_read+0x3fd/0x750 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa02a3fc8>] dmu_buf_hold+0x108/0x1d0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa030060f>] zap_get_leaf_byblk+0x4f/0x2b0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa03008da>] zap_deref_leaf+0x6a/0x80 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa03010c7>] fzap_remove+0x37/0xb0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa030437d>] ? zap_name_alloc+0x8d/0xf0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa0306113>] zap_remove_norm+0x153/0x1d0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa03061a3>] zap_remove+0x13/0x20 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa02ff551>] zap_remove_int+0x61/0x90 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa030efa7>] zfs_rmnode+0x1c7/0x340 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa032c765>] zfs_zinactive+0xa5/0x110 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa032b75f>] zfs_inactive+0x7f/0x220 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa033e11e>] zpl_clear_inode+0xe/0x10 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811a5bec>] clear_inode+0xac/0x140
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811a5cc0>] dispose_list+0x40/0x120
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811a6014>] shrink_icache_memory+0x274/0x2e0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81138aea>] shrink_slab+0x12a/0x1a0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8113aec7>] do_try_to_free_pages+0x3f7/0x610
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8112d2fc>] ? get_page_from_freelist+0x15c/0x870
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8113b2b2>] try_to_free_pages+0x92/0x120
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811411c0>] ? next_zone+0x30/0x40
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8112f718>] __alloc_pages_nodemask+0x478/0x8d0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81167aaa>] alloc_pages_current+0xaa/0x110
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8112cf4e>] __get_free_pages+0xe/0x50
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8104ec85>] pte_alloc_one_kernel+0x15/0x20
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8114634b>] __pte_alloc_kernel+0x1b/0xc0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81157579>] vmap_page_range_noflush+0x309/0x370
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81157612>] map_vm_area+0x32/0x50
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81159080>] __vmalloc_area_node+0x100/0x190
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2ee9>] ? kv_alloc+0x59/0x60 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81158f0d>] __vmalloc_node+0xad/0x120
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2ee9>] ? kv_alloc+0x59/0x60 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811592f2>] __vmalloc+0x22/0x30
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2ee9>] kv_alloc+0x59/0x60 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2f29>] spl_cache_grow_work+0x39/0x410 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81058d53>] ? __wake_up+0x53/0x70
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c6628>] taskq_thread+0x218/0x4b0 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81527920>] ? thread_return+0x4e/0x76e
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c6410>] ? taskq_thread+0x0/0x4b0 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8109af06>] kthread+0x96/0xa0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
`/proc/spl/kstat/zfs/arcstats` 5min before the crash (couldn't capture it afterwards as everything was hung): https://gist.github.com/dnozay/9084691#file-arcstats-txt

`ps` output 5min before the crash: https://gist.github.com/dnozay/9084691#file-ps-txt

`/proc/spl/kmem/slab` 5min before the crash: https://gist.github.com/dnozay/9084691#file-slab-txt
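For anyone wanting to gather the same data, a rough sketch of the collection loop (the 5-minute interval and `/var/tmp` destination are my own choices, not from the original report):

```sh
# snapshot ZFS/SPL state periodically so the last capture predates a hang
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    cat /proc/spl/kstat/zfs/arcstats > /var/tmp/arcstats-$ts.txt
    cat /proc/spl/kmem/slab          > /var/tmp/slab-$ts.txt
    ps aux                           > /var/tmp/ps-$ts.txt
    sleep 300
done
```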
Closing as stale. The issue was originally reported against 0.6.2. If you're able to reproduce this against 0.6.5.x or newer, please let us know.
earlier today:

yes, there's nothing in the messages between 08:02:59 and 16:18:45. I am using CentOS 6.5 with the EPEL packages. The machine is a VM with 4 vCPUs, 4G of memory, and a second 150G disk (`/dev/sdb`); it is used as a jenkins server (the `java` task is what gets hung). Here is the list of packages:

The second disk, as I mentioned, is used for zfs; the OS should use the other filesystem for logs and other stuff. `/zfs-data/jenkins` is used as the home directory for the jenkins server, which stores the workspaces and builds for multiple concurrent jobs. Everything stopped working around 15:40; the webserver, ssh sessions, everything was crawling. After the reboot, I also had some issues mounting the zfs volumes.
here is the output of `zdb` and `zpool status` (after reboot). No `/proc/spl/kstat/zfs/arcstats` or `/proc/spl/kmem/slab` from during the hang.
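For completeness, that post-reboot inspection boils down to (standard ZFS tooling; no pool-specific arguments needed):

```sh
zdb              # with no arguments, dumps the cached pool config (/etc/zfs/zpool.cache)
zpool status -v  # pool health plus per-vdev read/write/checksum error counters
```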