openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

rsync and txg_sync blocked for more than 120 seconds #2611

Closed freakout42 closed 5 years ago

freakout42 commented 9 years ago
SPL: Loaded module v0.6.3-1
ZFS: Loaded module v0.6.3-1, ZFS pool version 5000, ZFS filesystem version 5

Doing nothing but some basic rsyncs of moderate size (4-10 GB) always results in roughly 1 MB/s throughput (very slow) on an up-to-date HP server with 16 GB RAM, running CentOS 6 with OpenVZ and self-compiled ZFS modules. dmesg:

INFO: task rsync:6517 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-042stab092.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
rsync         D ffff8804029bafb0     0  6517   6516    0 0x00000080
 ffff88001e55da18 0000000000000086 0000000000000003 00000003d2dad320
 000000000001ec80 0000000000000001 ffff88001e55dab8 0000000000000082
 ffffc900105a3428 ffff8803fe837d60 ffff8804029bb578 000000000001ec80
Call Trace:
 [<ffffffff810a1dfe>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa019fb35>] cv_wait_common+0x105/0x1c0 [spl]
 [<ffffffff810a1bb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa019fc45>] __cv_wait+0x15/0x20 [spl]
 [<ffffffffa02ae1fb>] txg_wait_open+0x8b/0x110 [zfs]
 [<ffffffffa027194e>] dmu_tx_wait+0x29e/0x2b0 [zfs]
 [<ffffffff81530bfe>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0271a41>] dmu_tx_assign+0x91/0x490 [zfs]
 [<ffffffffa027fab7>] ? dsl_dataset_block_freeable+0x27/0x60 [zfs]
 [<ffffffffa02e8d3e>] zfs_write+0x43e/0xcf0 [zfs]
 [<ffffffff8100bc4e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff811c5e5c>] ? core_sys_select+0x1ec/0x2d0
 [<ffffffffa02fd354>] zpl_write_common+0x54/0xd0 [zfs]
 [<ffffffffa02fd438>] zpl_write+0x68/0xa0 [zfs]
 [<ffffffff811ac798>] vfs_write+0xb8/0x1a0
 [<ffffffff811ad091>] sys_write+0x51/0x90
 [<ffffffff810f4dee>] ? __audit_syscall_exit+0x25e/0x290
 [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b

INFO: task txg_sync:876 blocked for more than 120 seconds.
      Tainted: P           ---------------    2.6.32-042stab092.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync      D ffff880405d28640     0   876      2    0 0x00000000
 ffff8803fe839b70 0000000000000046 0000000000000001 ffff880405101430
 0000000000000000 0000000000000000 ffff8803fe839af0 ffffffff81060032
 ffff8803fe839b40 ffffffff81054939 ffff880405d28c08 000000000001ec80
Call Trace:
 [<ffffffff81060032>] ? default_wake_function+0x12/0x20
 [<ffffffff81054939>] ? __wake_up_common+0x59/0x90
 [<ffffffff8152f833>] io_schedule+0x73/0xc0
 [<ffffffffa019fadc>] cv_wait_common+0xac/0x1c0 [spl]
 [<ffffffffa02f53e0>] ? zio_execute+0x0/0x140 [zfs]
 [<ffffffff810a1bb0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa019fc08>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02f561b>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa02867e3>] dsl_pool_sync+0xb3/0x440 [zfs]
 [<ffffffffa029a67b>] spa_sync+0x40b/0xae0 [zfs]
 [<ffffffffa02aebb4>] txg_sync_thread+0x384/0x5e0 [zfs]
 [<ffffffff8105b309>] ? set_user_nice+0xc9/0x130
 [<ffffffffa02ae830>] ? txg_sync_thread+0x0/0x5e0 [zfs]
 [<ffffffffa01978e8>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa0197880>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff810a1596>] kthread+0x96/0xa0
 [<ffffffff8100c34a>] child_rip+0xa/0x20
 [<ffffffff810a1500>] ? kthread+0x0/0xa0
 [<ffffffff8100c340>] ? child_rip+0x0/0x20
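
For reference, one way to confirm the ~1 MB/s pool-side write rate while such an rsync is running (a sketch; "tank" matches the pool name shown later in this thread):

    # per-vdev bandwidth and IOPS, refreshed every 5 seconds
    zpool iostat -v tank 5
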
wankdanker commented 9 years ago

I hit something similar but maybe different last night. I believe it occurred while KVM was copying disk blocks from another server to this one. This ended up causing actual corruption on at least one of the zvols (as seen by the VM).

[381360.908047] INFO: task zvol/0:331 blocked for more than 120 seconds.
[381360.908147]       Tainted: PF          O 3.13.0-32-generic #57-Ubuntu
[381360.908240] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[381360.908352] zvol/0          D ffff88062fc54440     0   331      2 0x00000000
[381360.908357]  ffff88060796dbe0 0000000000000002 ffff880609aa17f0 ffff88060796dfd8
[381360.908360]  0000000000014440 0000000000014440 ffff880609aa17f0 ffff880609a5eb30
[381360.908362]  ffff880609a5e9f8 ffff880609a5eb38 0000000000000000 0000000000000002
[381360.908365] Call Trace:
[381360.908374]  [<ffffffff817200d9>] schedule+0x29/0x70
[381360.908396]  [<ffffffffa00c0485>] cv_wait_common+0x105/0x1a0 [spl]
[381360.908402]  [<ffffffff810aaf00>] ? prepare_to_wait_event+0x100/0x100
[381360.908408]  [<ffffffffa00c0535>] __cv_wait+0x15/0x20 [spl]
[381360.908459]  [<ffffffffa0209abb>] txg_wait_open+0x8b/0x110 [zfs]
[381360.908476]  [<ffffffffa01cd83b>] dmu_tx_wait+0x29b/0x2a0 [zfs]
[381360.908492]  [<ffffffffa01cd8cc>] dmu_tx_assign+0x8c/0x460 [zfs]
[381360.908520]  [<ffffffffa025a8c7>] zvol_write+0xa7/0x480 [zfs]
[381360.908527]  [<ffffffffa00bab27>] taskq_thread+0x237/0x4b0 [spl]
[381360.908530]  [<ffffffff81097508>] ? finish_task_switch+0x128/0x170
[381360.908534]  [<ffffffff8109a800>] ? wake_up_state+0x20/0x20
[381360.908539]  [<ffffffffa00ba8f0>] ? taskq_cancel_id+0x1f0/0x1f0 [spl]
[381360.908543]  [<ffffffff8108b3d2>] kthread+0xd2/0xf0
[381360.908545]  [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0
[381360.908548]  [<ffffffff8172c5bc>] ret_from_fork+0x7c/0xb0
[381360.908550]  [<ffffffff8108b300>] ? kthread_create_on_node+0x1d0/0x1d0

This was repeated for zvol/0 through zvol/9. There are 31 zvols on this system.

[    4.995804] SPL: Loaded module v0.6.3-1~precise
[    5.130074] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

I tried migrating a VM again today and all hell broke loose, but I did not get these errors. The system load was in the hundreds on an 8-core system, with major I/O wait time. I killed the migration but still ended up with at least 3 corrupt zvols.
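
For what it's worth, a sketch of how to tell whether such damage is visible at the pool level or only inside the guest (the pool and zvol names here are placeholders):

    # pool level: scrub, then look for READ/WRITE/CKSUM errors or damaged objects
    zpool scrub tank
    zpool status -v tank

    # guest level: read-only filesystem check against the zvol (only meaningful if
    # the zvol holds a filesystem directly rather than a partition table)
    fsck -n /dev/zvol/tank/vm-disk-1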

freakout42 commented 9 years ago

I do not have any corruption or other problems - the system is stable and running, and the rsync tasks complete 100% correctly - but write performance is very slow at about 1 MB/s, and while a transfer is running the whole system is slow to respond. There are no errors and no corruption; these "blocked for more than 120 seconds" dmesg messages come in pairs, for rsync and txg_sync, once a day.

kernelOfTruth commented 9 years ago

Not the same issue, but a related message:

Aug 29 05:37:06 morpheus kernel: [46185.239554] ata6.00: configured for UDMA/133
Aug 29 05:37:06 morpheus kernel: [46185.239562] ata6: EH complete
Aug 29 05:53:40 morpheus kernel: [47179.890587] INFO: task txg_sync:1462 blocked for more than 180 seconds.
Aug 29 05:53:40 morpheus kernel: [47179.890589]       Tainted: P O 3.16.0_ck1-smtnice6_BFQ_integra_intel #1
Aug 29 05:53:40 morpheus kernel: [47179.890590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 29 05:53:40 morpheus kernel: [47179.890591] txg_sync D 0000000000000006 0 1462 2 0x00000000
Aug 29 05:53:40 morpheus kernel: [47179.890594]  ffff8805fbda9e40 0000000000000046 ffff88020931a9b0 ffff88065b610000
Aug 29 05:53:40 morpheus kernel: [47179.890596]  0000000000000066 ffff8807fb2a0f20 000000000000b020 0000000000013100
Aug 29 05:53:40 morpheus kernel: [47179.890597]  ffff88065b6103a0 00002992b81d47dd ffff880617373fd8 ffff88065b610000
Aug 29 05:53:40 morpheus kernel: [47179.890599] Call Trace:
Aug 29 05:53:40 morpheus kernel: [47179.890605]  [] ? io_schedule+0x88/0xd0
Aug 29 05:53:40 morpheus kernel: [47179.890613]  [] ? cv_timedwait+0x96/0x110 [spl]
Aug 29 05:53:40 morpheus kernel: [47179.890616]  [] ? finish_wait+0x90/0x90
Aug 29 05:53:40 morpheus kernel: [47179.890623]  [] ? zio_wait+0xeb/0x1a0 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890631]  [] ? dsl_pool_sync+0xaa/0x450 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890639]  [] ? spa_sync+0x483/0xb20 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890642]  [] ? default_wake_function+0xd/0x20
Aug 29 05:53:40 morpheus kernel: [47179.890644]  [] ? ktime_get_ts+0x3d/0xe0
Aug 29 05:53:40 morpheus kernel: [47179.890652]  [] ? txg_sync_start+0x6ea/0x900 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890655]  [] ? switch_to+0x2a/0x560
Aug 29 05:53:40 morpheus kernel: [47179.890662]  [] ? txg_sync_start+0x3c0/0x900 [zfs]
Aug 29 05:53:40 morpheus kernel: [47179.890664]  [] ? spl_kmem_fini+0xa5/0xc0 [spl]
Aug 29 05:53:40 morpheus kernel: [47179.890667]  [] ? spl_kmem_fini+0x30/0xc0 [spl]
Aug 29 05:53:40 morpheus kernel: [47179.890669]  [] ? kthread+0xbc/0xe0
Aug 29 05:53:40 morpheus kernel: [47179.890670]  [] ? __ww_mutex_lock_slowpath+0x8c/0x2cc
Aug 29 05:53:40 morpheus kernel: [47179.890672]  [] ? flush_kthread_worker+0x80/0x80
Aug 29 05:53:40 morpheus kernel: [47179.890674]  [] ? ret_from_fork+0x7c/0xb0
Aug 29 05:53:40 morpheus kernel: [47179.890675]  [] ? flush_kthread_worker+0x80/0x80
Aug 29 06:28:01 morpheus kernel: [49241.775276] usb 1-1.1: USB disconnect, device number 3

This seems to occur from time to time with a rather slow USB 3.0-powered 4 TB HDD (Touro Desk 3.0, HGST 5K4000) in an external enclosure, during rsync transfers of large files (several GiB).

kernelOfTruth commented 9 years ago

Clarification for the above:

I had this happen with the above-mentioned drive, and I still don't know what causes it (I suspect it is related either to the chipset in the external hard drive enclosure the drive sits in, or to power-saving features of the XHCI driver and hardware, which I have had issues with in the past).

Another drive showed this behavior: a Seagate ST3000DM001 that had likely suffered a head crash and had reallocated several sectors (<10). It had been placed in an external enclosure by Fantec [DB-F8U3e, with a chipset incompatible with smartctl] that had shown in the past that it has a life and mind of its own: it would occasionally turn off during transfers and cause trouble with other filesystems; on ZFS, however, the files so far seemed fine. The day before yesterday I placed the drive in another external enclosure and it worked well during backups (only transferring several hundred MiB of data per incremental rsync backup job), until I decided to run a scrub and check everything: after several hours the drive again made audible noises of a head crash and/or sector reallocation (I have had those in the past), and the drive could no longer be accessed.

That is where the message posted above occurred again, so at least in the second case (the ST3000DM001) ZFS indirectly showed that something was wrong with the hardware/hard drive.

So when encountering this message, make sure to double- or triple-check that it is not a hardware issue rather than a "software" (ZFS-related) problem.
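
A minimal sketch of the kind of hardware sanity checks meant here (assuming the disk is /dev/sdX and the pool is named tank; adjust to your setup):

    # SMART health plus reallocated/pending sector counters (smartmontools)
    smartctl -H /dev/sdX
    smartctl -A /dev/sdX | egrep 'Reallocated|Pending|Uncorrectable'

    # kernel-side I/O or link errors for the same device
    dmesg | egrep -i 'sdX|ata[0-9]'

    # have ZFS read and verify every block, then report per-device error counters
    zpool scrub tank
    zpool status -v tank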

ryao commented 9 years ago

@wankdanker I think that your issue is separate. It might have been caused by the zvol processing occurring inside an interrupt context. Pull request #2484 might resolve it.

ryao commented 9 years ago

@freakout42 Would you tell us more about your pool configuration? Also, do you have data deduplication enabled on this pool?

freakout42 commented 9 years ago
[root@blood ~]# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 0h53m with 0 errors on Tue Sep  9 12:45:13 2014
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda4    ONLINE       0     0     0
        sdb4    ONLINE       0     0     0

errors: No known data errors

[root@blood ~]# zpool get all
NAME  PROPERTY               VALUE                  SOURCE
tank  size                   824G                   -
tank  capacity               34%                    -
tank  altroot                -                      default
tank  health                 ONLINE                 -
tank  guid                   3198719639486948540    default
tank  version                -                      default
tank  bootfs                 -                      default
tank  delegation             on                     default
tank  autoreplace            off                    default
tank  cachefile              -                      default
tank  failmode               wait                   default
tank  listsnapshots          off                    default
tank  autoexpand             off                    default
tank  dedupditto             0                      default
tank  dedupratio             1.00x                  -
tank  free                   538G                   -
tank  allocated              286G                   -
tank  readonly               off                    -
tank  ashift                 0                      default
tank  comment                -                      default
tank  expandsize             0                      -
tank  freeing                0                      default
tank  feature@async_destroy  enabled                local
tank  feature@empty_bpobj    active                 local
tank  feature@lz4_compress   enabled                local
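
Regarding the dedup question specifically: dedupratio above is 1.00x, and whether dedup is enabled can also be confirmed per dataset (a sketch; this lists the property for every dataset in the pool):

    zfs get -r dedup tank
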
ColdCanuck commented 9 years ago

I had similar failures, which occurred during heavy rsync pulls from a remote machine.

I was able to make it happen very quickly by starting up the remote pull. I did this three times in a row and triggered the fault every time. The symptom was that any userland zfs/zpool commands would hang, but the machine was still responsive to other commands.

I set the parameter spl_kmem_cache_slab_limit=0 (it had been spl_kmem_cache_slab_limit=16384), and the problem seems to be gone, or at least not easily triggered.
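
In case it is useful to others, this is roughly how such an SPL module parameter can be set (a sketch; whether it is writable at runtime depends on the build):

    # persistent: picked up the next time the spl module is loaded
    echo "options spl spl_kmem_cache_slab_limit=0" > /etc/modprobe.d/spl.conf

    # inspect, and (if exported writable) change, the running value
    cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit
    echo 0 > /sys/module/spl/parameters/spl_kmem_cache_slab_limit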

Part of the process which triggers this includes snapshot renaming, but no zvols are involved in this process, although the pool has some.

The pool is a raidz1 pool, and there are no hardware issues on the server.

Sep 23 16:17:51 zephyr kernel: [   27.497382] SPL: Loaded module v0.6.3-1~precise
Sep 23 16:17:51 zephyr kernel: [   27.515752] ZFS: Loaded module v0.6.3-2~precise, ZFS pool version 5000, ZFS filesystem version 5

Sep 26 00:11:23 zephyr kernel: [200583.071397] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 00:11:23 zephyr kernel: [200583.093284] txg_sync        D ffffffff8180cbe0     0  4178      2 0x00000000
Sep 26 00:11:23 zephyr kernel: [200583.093291]  ffff8807d9fd1aa0 0000000000000046 00000000000000ff ffffffffa026c280
Sep 26 00:11:23 zephyr kernel: [200583.093297]  ffff8807d9fd1fd8 ffff8807d9fd1fd8 ffff8807d9fd1fd8 00000000000139c0
Sep 26 00:11:23 zephyr kernel: [200583.093303]  ffff8807f3d6ae20 ffff8807e0189710 00000000ffffffff ffffffffa026c280
Sep 26 00:11:23 zephyr kernel: [200583.093309] Call Trace:
Sep 26 00:11:23 zephyr kernel: [200583.093333]  [<ffffffff8169f099>] schedule+0x29/0x70
Sep 26 00:11:23 zephyr kernel: [200583.093338]  [<ffffffff8169f35e>] schedule_preempt_disabled+0xe/0x10
Sep 26 00:11:23 zephyr kernel: [200583.093343]  [<ffffffff8169df77>] __mutex_lock_slowpath+0xd7/0x150
Sep 26 00:11:23 zephyr kernel: [200583.093350]  [<ffffffff8169db8a>] mutex_lock+0x2a/0x50
Sep 26 00:11:23 zephyr kernel: [200583.093395]  [<ffffffffa0230f19>] zvol_rename_minors+0x79/0x180 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093432]  [<ffffffffa01ad349>] dsl_dataset_rename_snapshot_sync_impl+0x189/0x2c0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093460]  [<ffffffffa01ad52f>] dsl_dataset_rename_snapshot_sync+0xaf/0x190 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093491]  [<ffffffffa01bde52>] dsl_sync_task_sync+0xf2/0x100 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093521]  [<ffffffffa01b63b3>] dsl_pool_sync+0x2f3/0x420 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093554]  [<ffffffffa01cc4f4>] spa_sync+0x414/0xb20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093590]  [<ffffffffa01db021>] ? spa_txg_history_set+0x21/0xd0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093625]  [<ffffffffa01de335>] txg_sync_thread+0x385/0x5d0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093659]  [<ffffffffa01ddfb0>] ? txg_init+0x260/0x260 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.093672]  [<ffffffffa00d3fc8>] thread_generic_wrapper+0x78/0x90 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.093682]  [<ffffffffa00d3f50>] ? __thread_create+0x300/0x300 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.093688]  [<ffffffff81077ee3>] kthread+0x93/0xa0
Sep 26 00:11:23 zephyr kernel: [200583.093694]  [<ffffffff816a9b24>] kernel_thread_helper+0x4/0x10
Sep 26 00:11:23 zephyr kernel: [200583.093700]  [<ffffffff81077e50>] ? flush_kthread_worker+0xb0/0xb0
Sep 26 00:11:23 zephyr kernel: [200583.093704]  [<ffffffff816a9b20>] ? gs_change+0x13/0x13
Sep 26 00:11:23 zephyr kernel: [200583.093737] INFO: task zfs:22581 blocked for more than 120 seconds.
Sep 26 00:11:23 zephyr kernel: [200583.104867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 26 00:11:23 zephyr kernel: [200583.126712] zfs             D ffffffff8180cbe0     0 22581  22253 0x00000000
Sep 26 00:11:23 zephyr kernel: [200583.126717]  ffff880182f2f958 0000000000000082 ffff8807eee9c530 ffff8807deda2550
Sep 26 00:11:23 zephyr kernel: [200583.126722]  ffff880182f2ffd8 ffff880182f2ffd8 ffff880182f2ffd8 00000000000139c0
Sep 26 00:11:23 zephyr kernel: [200583.126728]  ffff8807f3f04530 ffff8807eee9c530 0000000000000292 ffff8807deda2550
Sep 26 00:11:23 zephyr kernel: [200583.126733] Call Trace:
Sep 26 00:11:23 zephyr kernel: [200583.126740]  [<ffffffff8169f099>] schedule+0x29/0x70
Sep 26 00:11:23 zephyr kernel: [200583.126753]  [<ffffffffa00dbd8d>] cv_wait_common+0xfd/0x1b0 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.126768]  [<ffffffff810787c0>] ? add_wait_queue+0x60/0x60
Sep 26 00:11:23 zephyr kernel: [200583.126779]  [<ffffffffa00dbe95>] __cv_wait+0x15/0x20 [spl]
Sep 26 00:11:23 zephyr kernel: [200583.126813]  [<ffffffffa01c4aec>] rrw_enter_read+0x3c/0x130 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126849]  [<ffffffffa01c4c70>] rrw_enter+0x20/0x30 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126878]  [<ffffffffa01b51cd>] dsl_pool_config_enter+0x1d/0x20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126915]  [<ffffffffa01b709a>] dsl_pool_hold+0x4a/0x60 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126927]  [<ffffffffa019957b>] dmu_objset_hold+0x2b/0xb0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126929]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.126942]  [<ffffffffa01b8a6f>] dsl_prop_get+0x3f/0x90 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126944]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.126956]  [<ffffffffa01b8ade>] dsl_prop_get_integer+0x1e/0x20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126970]  [<ffffffffa022ef6f>] __zvol_create_minor+0xbf/0x630 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126985]  [<ffffffffa0230bc7>] zvol_create_minor+0x27/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.126998]  [<ffffffffa0230bee>] zvol_create_minors_cb+0xe/0x20 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127010]  [<ffffffffa0197e6e>] dmu_objset_find_impl+0x37e/0x3f0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127023]  [<ffffffffa0230be0>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127034]  [<ffffffffa0197cae>] dmu_objset_find_impl+0x1be/0x3f0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127048]  [<ffffffffa0230be0>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127061]  [<ffffffffa0230be0>] ? zvol_create_minor+0x40/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127072]  [<ffffffffa0197f32>] dmu_objset_find+0x52/0x80 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127074]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.127088]  [<ffffffffa0230d53>] zvol_create_minors+0x33/0x40 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127102]  [<ffffffffa0202a69>] zfs_ioc_snapshot+0x259/0x2a0 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127116]  [<ffffffffa0206530>] zfsdev_ioctl+0x180/0x500 [zfs]
Sep 26 00:11:23 zephyr kernel: [200583.127118]  [<ffffffff816a85d6>] ? ftrace_call+0x5/0x2b
Sep 26 00:11:23 zephyr kernel: [200583.127131]  [<ffffffff8119ae9a>] do_vfs_ioctl+0x8a/0x340
Sep 26 00:11:23 zephyr kernel: [200583.127135]  [<ffffffff8119b1e1>] sys_ioctl+0x91/0xa0
Sep 26 00:11:23 zephyr kernel: [200583.127141]  [<ffffffff816a8829>] system_call_fastpath+0x16/0x1b
dweeezil commented 9 years ago

@ColdCanuck Your comments regarding spl_kmem_cache_slab_limit piqued my interest in this issue, which until now I have not had time to follow. In fact, a quick skim of the whole issue makes me wonder how related the individual problem reports are and whether there is too much issue creep.

Back to the point at hand: I'm posting this follow-up because there has been a disturbing number of seemingly unrelated problems sporadically caused by using the Linux slab. Although I've not been able to spend as much time on it as I wanted, I've been knee-deep investigating the series of issues related to POSIX ACLs and SA xattrs and have seen at least one report (#2701) and, more interestingly, #2725, which makes me think there may be a tie-in to our use of the Linux slab for <= 16KiB objects. I don't have any other brilliant observations to offer at the moment other than to raise the concern that there may be problems related to using the Linux slab, and to ask @behlendorf, @ryao, et al. what your thoughts are on this (particularly given the last few comments in #2725).

kernelOfTruth commented 9 years ago

Just posting what comes to mind:

Could scheduling a regular cron job that compacts memory via echo 1 > /proc/sys/vm/compact_memory change things (provided the slab issues and timeouts are related to memory fragmentation)?
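
For illustration, such a job might look like this (a sketch; the 15-minute interval is arbitrary, /etc/cron.d/compact-memory is a hypothetical file name, and compaction requires CONFIG_COMPACTION and root):

    # /etc/cron.d/compact-memory
    */15 * * * * root /bin/sh -c 'echo 1 > /proc/sys/vm/compact_memory'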

ioquatix commented 9 years ago

I'm consistently hitting this issue when trying to rsync, using the latest ZFS from Arch: zfs-git 0.6.3_r170_gd958324f_3.18.2_2-1

[46181.967521] perf interrupt took too long (2506 > 2495), lowering kernel.perf_event_max_sample_rate to 50100
[79468.027144] INFO: task txg_sync:583 blocked for more than 120 seconds.
[79468.027287]       Tainted: P           O   3.18.2-2-ARCH #1
[79468.027363] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79468.027468] txg_sync        D 0000000000000000     0   583      2 0x00000000
[79468.027476]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[79468.027483]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[79468.027489]  ffff8801591955b8 ffffffff00000000 ffff880159195570 00000000025e92cf
[79468.027494] Call Trace:
[79468.027509]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[79468.027518]  [<ffffffff81550b59>] schedule+0x29/0x70
[79468.027524]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[79468.027547]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[79468.027554]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[79468.027565]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[79468.027588]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[79468.027616]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[79468.027625]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[79468.027656]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[79468.027663]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[79468.027694]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[79468.027724]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[79468.027734]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[79468.027742]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[79468.027749]  [<ffffffff81090e0a>] kthread+0xea/0x100
[79468.027755]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[79468.027761]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[79468.027767]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[82589.120097] INFO: task txg_sync:583 blocked for more than 120 seconds.
[82589.120252]       Tainted: P           O   3.18.2-2-ARCH #1
[82589.120347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[82589.120479] txg_sync        D 0000000000000001     0   583      2 0x00000000
[82589.120489]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[82589.120497]  ffff88021ce03fd8 0000000000013640 ffff8802240ceeb0 ffff88021cd5eeb0
[82589.120505]  ffff88022363f7b0 ffff88022363f798 0000000000000000 0000000000000003
[82589.120512] Call Trace:
[82589.120529]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[82589.120540]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[82589.120549]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[82589.120558]  [<ffffffff81550b59>] schedule+0x29/0x70
[82589.120564]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[82589.120591]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[82589.120599]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[82589.120613]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[82589.120641]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[82589.120676]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[82589.120689]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[82589.120727]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[82589.120735]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[82589.120774]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[82589.120811]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[82589.120823]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[82589.120834]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[82589.120842]  [<ffffffff81090e0a>] kthread+0xea/0x100
[82589.120849]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[82589.120858]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[82589.120864]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89311.468460] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89311.468644]       Tainted: P           O   3.18.2-2-ARCH #1
[89311.468741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89311.468872] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89311.468882]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89311.468890]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89311.468897]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89311.468905] Call Trace:
[89311.468922]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89311.468933]  [<ffffffff81550b59>] schedule+0x29/0x70
[89311.468940]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89311.468968]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89311.468976]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89311.468989]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89311.469017]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89311.469052]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89311.469065]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89311.469103]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89311.469112]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89311.469150]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89311.469187]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89311.469199]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89311.469211]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89311.469218]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89311.469226]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89311.469234]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89311.469241]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
hinoki% free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        4.3G        533M        1.1M        3.0G        702M
Swap:          4.0G          0B        4.0G
ioquatix commented 9 years ago

Still trying to run rsync:

[89431.510429] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89431.510571]       Tainted: P           O   3.18.2-2-ARCH #1
[89431.510647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89431.510753] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89431.510762]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89431.510769]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89431.510774]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89431.510780] Call Trace:
[89431.510796]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89431.510805]  [<ffffffff81550b59>] schedule+0x29/0x70
[89431.510811]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89431.510836]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89431.510842]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89431.510853]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89431.510877]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89431.510906]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89431.510915]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89431.510946]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89431.510953]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89431.510984]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89431.511014]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89431.511024]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89431.511033]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89431.511039]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89431.511045]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89431.511052]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89431.511058]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89551.552404] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89551.552565]       Tainted: P           O   3.18.2-2-ARCH #1
[89551.552660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89551.552792] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89551.552803]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89551.552812]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89551.552819]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89551.552826] Call Trace:
[89551.552844]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89551.552854]  [<ffffffff81550b59>] schedule+0x29/0x70
[89551.552862]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89551.552890]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89551.552898]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89551.552911]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89551.552940]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89551.552975]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89551.552987]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89551.553025]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89551.553034]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89551.553073]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89551.553110]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89551.553122]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89551.553133]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89551.553140]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89551.553148]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89551.553156]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89551.553163]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89671.594371] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89671.594507]       Tainted: P           O   3.18.2-2-ARCH #1
[89671.594605] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89671.594711] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89671.594720]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89671.594727]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89671.594733]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89671.594738] Call Trace:
[89671.594753]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89671.594762]  [<ffffffff81550b59>] schedule+0x29/0x70
[89671.594768]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89671.594792]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89671.594798]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89671.594809]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89671.594832]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89671.594860]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89671.594870]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89671.594901]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89671.594908]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89671.594939]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89671.594969]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89671.594979]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89671.594988]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89671.594994]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89671.595000]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89671.595007]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89671.595013]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89791.636364] INFO: task txg_sync:583 blocked for more than 120 seconds.
[89791.636507]       Tainted: P           O   3.18.2-2-ARCH #1
[89791.636583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[89791.636688] txg_sync        D 0000000000000000     0   583      2 0x00000000
[89791.636697]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[89791.636703]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[89791.636709]  ffff88001afc4498 fffffffe00000000 ffff88001afc4490 00000000025e92cf
[89791.636715] Call Trace:
[89791.636729]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[89791.636738]  [<ffffffff81550b59>] schedule+0x29/0x70
[89791.636744]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[89791.636767]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[89791.636774]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[89791.636785]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[89791.636808]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[89791.636836]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[89791.636852]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[89791.636883]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[89791.636890]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[89791.636921]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[89791.636951]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[89791.636960]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[89791.636969]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[89791.636975]  [<ffffffff81090e0a>] kthread+0xea/0x100
[89791.636981]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[89791.636989]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[89791.636994]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90105.367103] systemd[1]: systemd-journald.service stop-sigabrt timed out. Terminating.
[90105.490377] systemd[1]: Listening on Journal Audit Socket.
[90105.490436] systemd[1]: Starting Journal Service...
[90151.762415] INFO: task kswapd0:31 blocked for more than 120 seconds.
[90151.762569]       Tainted: P           O   3.18.2-2-ARCH #1
[90151.762663] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90151.762795] kswapd0         D 0000000000000000     0    31      2 0x00000000
[90151.762805]  ffff8802235a3908 0000000000000046 ffff880224225a90 0000000000013640
[90151.762814]  ffff8802235a3fd8 0000000000013640 ffffffff81818540 ffff880224225a90
[90151.762821]  ffff88022fc93640 ffff88021cd5e4a0 ffff88021cd5ec02 0000000000000000
[90151.762828] Call Trace:
[90151.762845]  [<ffffffff8109e6e7>] ? try_to_wake_up+0x1e7/0x380
[90151.762856]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[90151.762865]  [<ffffffff81550b59>] schedule+0x29/0x70
[90151.762893]  [<ffffffffa01cbb8d>] __cv_broadcast+0x12d/0x160 [spl]
[90151.762902]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[90151.762915]  [<ffffffffa01cbbd5>] __cv_wait+0x15/0x20 [spl]
[90151.762956]  [<ffffffffa02b6d03>] txg_wait_open+0x73/0xb0 [zfs]
[90151.762984]  [<ffffffffa027514a>] dmu_tx_wait+0x33a/0x350 [zfs]
[90151.763011]  [<ffffffffa02751f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[90151.763040]  [<ffffffffa02f2a73>] zfs_inactive+0x163/0x200 [zfs]
[90151.763049]  [<ffffffff810b2e70>] ? autoremove_wake_function+0x40/0x40
[90151.763075]  [<ffffffffa030af18>] zpl_vap_init+0x838/0xa10 [zfs]
[90151.763083]  [<ffffffff811eab68>] evict+0xb8/0x1b0
[90151.763090]  [<ffffffff811eaca1>] dispose_list+0x41/0x50
[90151.763097]  [<ffffffff811ebce6>] prune_icache_sb+0x56/0x80
[90151.763106]  [<ffffffff811d2b25>] super_cache_scan+0x115/0x180
[90151.763115]  [<ffffffff81169e89>] shrink_slab_node+0x129/0x2f0
[90151.763123]  [<ffffffff8116abcb>] shrink_slab+0x8b/0x160
[90151.763131]  [<ffffffff8116e0e9>] kswapd_shrink_zone+0x129/0x1d0
[90151.763138]  [<ffffffff8116eb7a>] kswapd+0x54a/0x8f0
[90151.763147]  [<ffffffff8116e630>] ? mem_cgroup_shrink_node_zone+0x1c0/0x1c0
[90151.763155]  [<ffffffff81090e0a>] kthread+0xea/0x100
[90151.763162]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90151.763171]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[90151.763178]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90151.763190] INFO: task systemd-journal:138 blocked for more than 120 seconds.
[90151.763340]       Tainted: P           O   3.18.2-2-ARCH #1
[90151.763433] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90151.763563] systemd-journal D 0000000000000000     0   138      1 0x00000004
[90151.763571]  ffff88022310b128 0000000000000082 ffff880222c25080 0000000000013640
[90151.763577]  ffff88022310bfd8 0000000000013640 ffff88021cd5e4a0 ffff880222c25080
[90151.763583]  ffff88022fc93640 ffff88021cd5e4a0 ffff88021cd5ec02 0000000000000000
[90151.763590] Call Trace:
[90151.763599]  [<ffffffff8109e6e7>] ? try_to_wake_up+0x1e7/0x380
[90151.763607]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[90151.763614]  [<ffffffff81550b59>] schedule+0x29/0x70
[90151.763627]  [<ffffffffa01cbb8d>] __cv_broadcast+0x12d/0x160 [spl]
[90151.763635]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[90151.763647]  [<ffffffffa01cbbd5>] __cv_wait+0x15/0x20 [spl]
[90151.763683]  [<ffffffffa02b6d03>] txg_wait_open+0x73/0xb0 [zfs]
[90151.763710]  [<ffffffffa027514a>] dmu_tx_wait+0x33a/0x350 [zfs]
[90151.763737]  [<ffffffffa02751f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[90151.763765]  [<ffffffffa02f2a73>] zfs_inactive+0x163/0x200 [zfs]
[90151.763774]  [<ffffffff810b2e70>] ? autoremove_wake_function+0x40/0x40
[90151.763799]  [<ffffffffa030af18>] zpl_vap_init+0x838/0xa10 [zfs]
[90151.763805]  [<ffffffff811eab68>] evict+0xb8/0x1b0
[90151.763812]  [<ffffffff811eaca1>] dispose_list+0x41/0x50
[90151.763819]  [<ffffffff811ebce6>] prune_icache_sb+0x56/0x80
[90151.763827]  [<ffffffff811d2b25>] super_cache_scan+0x115/0x180
[90151.763834]  [<ffffffff81169e89>] shrink_slab_node+0x129/0x2f0
[90151.763842]  [<ffffffff811c1523>] ? mem_cgroup_iter+0x2f3/0x4d0
[90151.763850]  [<ffffffff8116abcb>] shrink_slab+0x8b/0x160
[90151.763858]  [<ffffffff8116da45>] do_try_to_free_pages+0x365/0x4e0
[90151.763866]  [<ffffffff8116dc71>] try_to_free_pages+0xb1/0x1a0
[90151.763873]  [<ffffffff81160ca7>] __alloc_pages_nodemask+0x697/0xb50
[90151.763884]  [<ffffffff811a6e9c>] alloc_pages_current+0x9c/0x120
[90151.763891]  [<ffffffff811afba5>] new_slab+0x305/0x370
[90151.763899]  [<ffffffff811b2175>] __slab_alloc.isra.51+0x545/0x650
[90151.763907]  [<ffffffff81445af9>] ? __alloc_skb+0x89/0x210
[90151.763914]  [<ffffffff814380f4>] ? raw_pci_write+0x24/0x50
[90151.763923]  [<ffffffff812e8be6>] ? pci_bus_write_config_word+0x66/0x80
[90151.763949]  [<ffffffffa004ed7f>] ? ata_bmdma_start+0x2f/0x40 [libata]
[90151.763960]  [<ffffffffa01121ad>] ? atiixp_bmdma_start+0x9d/0xe0 [pata_atiixp]
[90151.763969]  [<ffffffff811b5445>] __kmalloc_node_track_caller+0xa5/0x240
[90151.763976]  [<ffffffff81445af9>] ? __alloc_skb+0x89/0x210
[90151.763984]  [<ffffffff81445a11>] __kmalloc_reserve.isra.38+0x31/0x90
[90151.763990]  [<ffffffff81445acb>] ? __alloc_skb+0x5b/0x210
[90151.763997]  [<ffffffff81445af9>] __alloc_skb+0x89/0x210
[90151.764004]  [<ffffffff81445de4>] alloc_skb_with_frags+0x64/0x1e0
[90151.764011]  [<ffffffff8143efb9>] sock_alloc_send_pskb+0x219/0x290
[90151.764020]  [<ffffffff810136fb>] ? __switch_to+0x1fb/0x600
[90151.764029]  [<ffffffff814f886d>] unix_dgram_sendmsg+0x18d/0x690
[90151.764037]  [<ffffffff8143ac39>] sock_sendmsg+0x79/0xb0
[90151.764045]  [<ffffffff810986da>] ? finish_task_switch+0x4a/0xf0
[90151.764051]  [<ffffffff815504c8>] ? __schedule+0x3e8/0xa50
[90151.764059]  [<ffffffff8143c76c>] ? move_addr_to_kernel+0x2c/0x50
[90151.764066]  [<ffffffff8144a3c7>] ? verify_iovec+0x47/0xd0
[90151.764074]  [<ffffffff8143b928>] ___sys_sendmsg+0x408/0x420
[90151.764083]  [<ffffffff812149d0>] ? ep_read_events_proc+0xe0/0xe0
[90151.764089]  [<ffffffff81440030>] ? sk_prot_alloc.isra.33+0x30/0x130
[90151.764097]  [<ffffffff811b291a>] ? kmem_cache_alloc+0x16a/0x170
[90151.764104]  [<ffffffff811d100c>] ? get_empty_filp+0x5c/0x1c0
[90151.764112]  [<ffffffff8126e496>] ? security_file_alloc+0x16/0x20
[90151.764118]  [<ffffffff811d1084>] ? get_empty_filp+0xd4/0x1c0
[90151.764126]  [<ffffffff811d118f>] ? alloc_file+0x1f/0xb0
[90151.764134]  [<ffffffff8143d651>] __sys_sendmsg+0x51/0x90
[90151.764142]  [<ffffffff8143d6a2>] SyS_sendmsg+0x12/0x20
[90151.764149]  [<ffffffff81554ca9>] system_call_fastpath+0x12/0x17
[90151.764186] INFO: task txg_sync:583 blocked for more than 120 seconds.
[90151.764325]       Tainted: P           O   3.18.2-2-ARCH #1
[90151.764418] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90151.764547] txg_sync        D 0000000000000000     0   583      2 0x00000000
[90151.764554]  ffff88021ce03b18 0000000000000046 ffff88021cd5eeb0 0000000000013640
[90151.764560]  ffff88021ce03fd8 0000000000013640 ffff88021f601420 ffff88021cd5eeb0
[90151.764566]  ffff88018de384b8 ffffffff00000000 ffff88018de38490 00000000025e92cf
[90151.764572] Call Trace:
[90151.764581]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[90151.764588]  [<ffffffff81550b59>] schedule+0x29/0x70
[90151.764594]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[90151.764607]  [<ffffffffa01cbb45>] __cv_broadcast+0xe5/0x160 [spl]
[90151.764615]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[90151.764627]  [<ffffffffa01cbc18>] __cv_wait_io+0x18/0x180 [spl]
[90151.764651]  [<ffffffffa030367b>] zio_wait+0x11b/0x200 [zfs]
[90151.764686]  [<ffffffffa028ac71>] dsl_pool_sync+0xc1/0x480 [zfs]
[90151.764698]  [<ffffffffa01c60af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[90151.764735]  [<ffffffffa02a5b60>] spa_sync+0x480/0xbd0 [zfs]
[90151.764744]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[90151.764781]  [<ffffffffa02b730c>] txg_delay+0x4ec/0xa60 [zfs]
[90151.764818]  [<ffffffffa02b6f90>] ? txg_delay+0x170/0xa60 [zfs]
[90151.764830]  [<ffffffffa01c707a>] __thread_exit+0x9a/0xb0 [spl]
[90151.764841]  [<ffffffffa01c7000>] ? __thread_exit+0x20/0xb0 [spl]
[90151.764847]  [<ffffffff81090e0a>] kthread+0xea/0x100
[90151.764855]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90151.764862]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[90151.764869]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[90195.648689] systemd[1]: systemd-journald.service stop-sigterm timed out. Killing.
[90195.649942] systemd[1]: Starting Journal Service...
behlendorf commented 9 years ago

This might be caused by #2523. Can you verify that you have the SPL fix applied: zfsonlinux/spl@a3c1eb77721a0d511b4fe7111bb2314686570c4b?
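
One way to check, for those building SPL from a git checkout (a sketch; the source path is whatever your tree lives in):

    cd /usr/src/spl
    git merge-base --is-ancestor a3c1eb77721a0d511b4fe7111bb2314686570c4b HEAD \
      && echo "fix is included" || echo "fix is missing"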

ioquatix commented 9 years ago

More of the same

[ 1320.575152] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1320.575263]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1320.575314] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.575384] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1320.575390]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1320.575395]  ffff88021ba43fd8 0000000000013640 ffff88021dee2840 ffff8800cbfc4670
[ 1320.575399]  ffff880131dcc138 ffffffff00000000 ffff880131dcc130 00000000b3f1aac7
[ 1320.575402] Call Trace:
[ 1320.575414]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1320.575420]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1320.575424]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1320.575442]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1320.575447]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1320.575453]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1320.575470]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1320.575490]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1320.575496]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1320.575517]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1320.575522]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1320.575543]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1320.575563]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1320.575569]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1320.575575]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1320.575579]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1320.575583]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1320.575588]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1320.575592]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.661841] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1560.661995]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1560.662090] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.662221] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1560.662232]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1560.662240]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1560.662247]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1560.662255] Call Trace:
[ 1560.662272]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1560.662283]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1560.662292]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1560.662301]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1560.662307]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1560.662335]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1560.662343]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1560.662355]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1560.662383]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1560.662418]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1560.662430]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1560.662468]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1560.662477]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1560.662516]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1560.662552]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1560.662564]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1560.662575]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1560.662583]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1560.662590]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.662598]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1560.662605]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.662617] INFO: task java:1218 blocked for more than 120 seconds.
[ 1560.662753]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1560.662846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.662975] java            D 0000000000000001     0  1218    655 0x00000000
[ 1560.662983]  ffff8800c17f7ac8 0000000000000086 ffff8800c33e1420 0000000000013640
[ 1560.662990]  ffff8800c17f7fd8 0000000000013640 ffff8802240ceeb0 ffff8800c33e1420
[ 1560.662997]  0000008000000000 ffff88002041b7a0 0000000000000000 ffff88002041b778
[ 1560.663003] Call Trace:
[ 1560.663027]  [<ffffffffa0260398>] ? dbuf_rele_and_unlock+0x2c8/0x4d0 [zfs]
[ 1560.663049]  [<ffffffffa0261eca>] ? dbuf_read+0x8da/0xf20 [zfs]
[ 1560.663061]  [<ffffffffa01c8901>] ? kmem_asprintf+0x51/0x80 [spl]
[ 1560.663068]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1560.663080]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1560.663088]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1560.663099]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1560.663126]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1560.663153]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1560.663161]  [<ffffffff810e1006>] ? getrawmonotonic+0x36/0xd0
[ 1560.663186]  [<ffffffffa0268e1c>] dmu_free_long_range+0x1ac/0x280 [zfs]
[ 1560.663217]  [<ffffffffa02daf1c>] zfs_rmnode+0x6c/0x340 [zfs]
[ 1560.663244]  [<ffffffffa02fe141>] zfs_zinactive+0xc1/0x1d0 [zfs]
[ 1560.663273]  [<ffffffffa02f6974>] zfs_inactive+0x64/0x200 [zfs]
[ 1560.663281]  [<ffffffff810b2e70>] ? autoremove_wake_function+0x40/0x40
[ 1560.663307]  [<ffffffffa030ef18>] zpl_vap_init+0x838/0xa10 [zfs]
[ 1560.663315]  [<ffffffff811eab68>] evict+0xb8/0x1b0
[ 1560.663322]  [<ffffffff811eb405>] iput+0xf5/0x1a0
[ 1560.663330]  [<ffffffff811df7f2>] do_unlinkat+0x1e2/0x350
[ 1560.663337]  [<ffffffff811d4839>] ? SyS_newstat+0x39/0x60
[ 1560.663345]  [<ffffffff811e02c6>] SyS_unlink+0x16/0x20
[ 1560.663353]  [<ffffffff81554ca9>] system_call_fastpath+0x12/0x17
[ 1560.663371] INFO: task sshd:1381 blocked for more than 120 seconds.
[ 1560.663508]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1560.663601] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.663791] sshd            D 0000000000000001     0  1381    764 0x00000004
[ 1560.663798]  ffff880113ad7b38 0000000000000086 ffff8800a7356eb0 0000000000013640
[ 1560.663804]  ffff880113ad7fd8 0000000000013640 ffff8802240ceeb0 ffff8800a7356eb0
[ 1560.663810]  ffff8800c9ccabb8 ffff8800c9ccabb8 0000000000000280 0000000000000001
[ 1560.663816] Call Trace:
[ 1560.663825]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1560.663839]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1560.663847]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1560.663858]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1560.663885]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1560.663912]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1560.663940]  [<ffffffffa02f66d7>] zfs_dirty_inode+0xf7/0x330 [zfs]
[ 1560.663951]  [<ffffffffa01c8956>] ? kmem_free_debug+0x16/0x20 [spl]
[ 1560.663962]  [<ffffffffa01d0b10>] ? crfree+0x170/0x180 [spl]
[ 1560.663972]  [<ffffffffa01d123d>] ? tsd_exit+0x19d/0x1b0 [spl]
[ 1560.663998]  [<ffffffffa02fe288>] ? zfs_tstamp_update_setup+0x38/0x1c0 [zfs]
[ 1560.664026]  [<ffffffffa02f0cfe>] ? zfs_read+0x39e/0x460 [zfs]
[ 1560.664049]  [<ffffffffa030ef2e>] zpl_vap_init+0x84e/0xa10 [zfs]
[ 1560.664056]  [<ffffffff811fb0f8>] __mark_inode_dirty+0x38/0x2d0
[ 1560.664082]  [<ffffffffa02fb3cd>] zfs_mark_inode_dirty+0x4d/0x60 [zfs]
[ 1560.664106]  [<ffffffffa030d626>] zpl_putpage+0x576/0xd50 [zfs]
[ 1560.664114]  [<ffffffff811d0d3c>] __fput+0x9c/0x200
[ 1560.664122]  [<ffffffff811d0eee>] ____fput+0xe/0x10
[ 1560.664128]  [<ffffffff8108f33f>] task_work_run+0x9f/0xe0
[ 1560.664137]  [<ffffffff81014e75>] do_notify_resume+0x95/0xa0
[ 1560.664145]  [<ffffffff81554f20>] int_signal+0x12/0x17
[ 1680.702704] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1680.702864]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1680.702958] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1680.703089] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1680.703100]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1680.703108]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1680.703116]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1680.703123] Call Trace:
[ 1680.703141]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1680.703152]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1680.703161]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1680.703169]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1680.703176]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1680.703203]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1680.703211]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1680.703224]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1680.703252]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1680.703287]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1680.703299]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1680.703338]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1680.703346]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1680.703385]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1680.703422]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1680.703433]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1680.703444]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1680.703452]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1680.703459]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1680.703467]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1680.703474]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.742712] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1800.742869]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1800.742964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.743095] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1800.743106]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1800.743114]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1800.743121]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1800.743129] Call Trace:
[ 1800.743146]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1800.743157]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1800.743165]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1800.743174]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1800.743180]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1800.743206]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1800.743214]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1800.743227]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1800.743256]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1800.743290]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1800.743302]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1800.743341]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1800.743349]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1800.743388]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1800.743425]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1800.743437]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1800.743448]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1800.743455]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1800.743462]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.743471]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1800.743477]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.782195] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 1920.782354]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1920.782449] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.782580] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 1920.782590]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 1920.782599]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 1920.782606]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 1920.782614] Call Trace:
[ 1920.782631]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 1920.782641]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 1920.782650]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 1920.782658]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1920.782665]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 1920.782693]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1920.782701]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1920.782714]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 1920.782742]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 1920.782777]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1920.782789]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1920.782827]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 1920.782836]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 1920.782874]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1920.782911]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1920.782923]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1920.782934]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1920.782942]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 1920.782949]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.782957]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 1920.782964]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.782976] INFO: task java:1218 blocked for more than 120 seconds.
[ 1920.783136]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1920.783236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.783366] java            D 0000000000000000     0  1218    655 0x00000000
[ 1920.783373]  ffff8800c17f7ab8 0000000000000086 ffff8800c33e1420 0000000000013640
[ 1920.783381]  ffff8800c17f7fd8 0000000000013640 ffffffff81818540 ffff8800c33e1420
[ 1920.783387]  ffff8800c17f7a08 ffffffffa025a872 0000000000000000 ffff8800235be4a0
[ 1920.783393] Call Trace:
[ 1920.783416]  [<ffffffffa025a872>] ? arc_buf_eviction_needed+0x82/0xc0 [zfs]
[ 1920.783439]  [<ffffffffa0260398>] ? dbuf_rele_and_unlock+0x2c8/0x4d0 [zfs]
[ 1920.783476]  [<ffffffffa02b634d>] ? bp_get_dsize+0xad/0xf0 [zfs]
[ 1920.783503]  [<ffffffffa02770e4>] ? dmu_tx_callback_register+0x324/0xab0 [zfs]
[ 1920.783512]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1920.783524]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1920.783532]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1920.783544]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1920.783570]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1920.783597]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1920.783627]  [<ffffffffa02810e5>] ? dsl_dataset_block_freeable+0x45/0x1d0 [zfs]
[ 1920.783653]  [<ffffffffa0276fb9>] ? dmu_tx_callback_register+0x1f9/0xab0 [zfs]
[ 1920.783681]  [<ffffffffa02f74a0>] zfs_write+0x3c0/0xbf0 [zfs]
[ 1920.783688]  [<ffffffff810a9dee>] ? enqueue_entity+0x24e/0xaa0
[ 1920.783696]  [<ffffffff8109a1f0>] ? resched_curr+0xd0/0xe0
[ 1920.783705]  [<ffffffff810ec497>] ? wake_futex+0x67/0x90
[ 1920.783711]  [<ffffffff810ef856>] ? do_futex+0x8f6/0xae0
[ 1920.783736]  [<ffffffffa030d2cb>] zpl_putpage+0x21b/0xd50 [zfs]
[ 1920.783744]  [<ffffffff811cf1d7>] vfs_write+0xb7/0x200
[ 1920.783752]  [<ffffffff811cfd29>] SyS_write+0x59/0xd0
[ 1920.783760]  [<ffffffff81554ca9>] system_call_fastpath+0x12/0x17
[ 1920.783778] INFO: task imap:1389 blocked for more than 120 seconds.
[ 1920.783916]       Tainted: P           O   3.18.2-2-ARCH #1
[ 1920.784009] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.784138] imap            D 0000000000000000     0  1389    770 0x00000000
[ 1920.784145]  ffff88000f97fb38 0000000000000082 ffff880222d3bc60 0000000000013640
[ 1920.784151]  ffff88000f97ffd8 0000000000013640 ffff88009d290a10 ffff880222d3bc60
[ 1920.784161]  ffff8800c9ccabb8 ffff8800c9ccabb8 0000000000000fd8 0000000000000001
[ 1920.784168] Call Trace:
[ 1920.784176]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 1920.784190]  [<ffffffffa01cfb8d>] __cv_broadcast+0x12d/0x160 [spl]
[ 1920.784197]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 1920.784209]  [<ffffffffa01cfbd5>] __cv_wait+0x15/0x20 [spl]
[ 1920.784236]  [<ffffffffa0278eab>] dmu_tx_wait+0x9b/0x350 [zfs]
[ 1920.784263]  [<ffffffffa02791f1>] dmu_tx_assign+0x91/0x4b0 [zfs]
[ 1920.784292]  [<ffffffffa02f66d7>] zfs_dirty_inode+0xf7/0x330 [zfs]
[ 1920.784303]  [<ffffffffa01d0b10>] ? crfree+0x170/0x180 [spl]
[ 1920.784314]  [<ffffffffa01d123d>] ? tsd_exit+0x19d/0x1b0 [spl]
[ 1920.784338]  [<ffffffffa030ef2e>] zpl_vap_init+0x84e/0xa10 [zfs]
[ 1920.784345]  [<ffffffff811fb0f8>] __mark_inode_dirty+0x38/0x2d0
[ 1920.784372]  [<ffffffffa02fb3cd>] zfs_mark_inode_dirty+0x4d/0x60 [zfs]
[ 1920.784395]  [<ffffffffa030d626>] zpl_putpage+0x576/0xd50 [zfs]
[ 1920.784403]  [<ffffffff811d0d3c>] __fput+0x9c/0x200
[ 1920.784411]  [<ffffffff811d0eee>] ____fput+0xe/0x10
[ 1920.784417]  [<ffffffff8108f33f>] task_work_run+0x9f/0xe0
[ 1920.784426]  [<ffffffff81014e75>] do_notify_resume+0x95/0xa0
[ 1920.784434]  [<ffffffff81554f20>] int_signal+0x12/0x17
[ 2280.901736] INFO: task txg_sync:583 blocked for more than 120 seconds.
[ 2280.901897]       Tainted: P           O   3.18.2-2-ARCH #1
[ 2280.901992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2280.902124] txg_sync        D 0000000000000000     0   583      2 0x00000000
[ 2280.902134]  ffff88021ba43b18 0000000000000046 ffff8800cbfc4670 0000000000013640
[ 2280.902143]  ffff88021ba43fd8 0000000000013640 ffffffff81818540 ffff8800cbfc4670
[ 2280.902150]  ffff880223335eb0 ffff880223335e98 0000000000000000 0000000000000003
[ 2280.902157] Call Trace:
[ 2280.902175]  [<ffffffff8109e8e2>] ? default_wake_function+0x12/0x20
[ 2280.902186]  [<ffffffff810b2715>] ? __wake_up_common+0x55/0x90
[ 2280.902194]  [<ffffffff810b2978>] ? __wake_up+0x48/0x60
[ 2280.902203]  [<ffffffff81550b59>] schedule+0x29/0x70
[ 2280.902210]  [<ffffffff81550e38>] io_schedule+0x98/0x100
[ 2280.902237]  [<ffffffffa01cfb45>] __cv_broadcast+0xe5/0x160 [spl]
[ 2280.902245]  [<ffffffff810b2e30>] ? __wake_up_sync+0x20/0x20
[ 2280.902258]  [<ffffffffa01cfc18>] __cv_wait_io+0x18/0x180 [spl]
[ 2280.902286]  [<ffffffffa030767b>] zio_wait+0x11b/0x200 [zfs]
[ 2280.902321]  [<ffffffffa028ec71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 2280.902333]  [<ffffffffa01ca0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 2280.902371]  [<ffffffffa02a9b60>] spa_sync+0x480/0xbd0 [zfs]
[ 2280.902380]  [<ffffffff810b2e46>] ? autoremove_wake_function+0x16/0x40
[ 2280.902419]  [<ffffffffa02bb30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 2280.902455]  [<ffffffffa02baf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 2280.902467]  [<ffffffffa01cb07a>] __thread_exit+0x9a/0xb0 [spl]
[ 2280.902478]  [<ffffffffa01cb000>] ? __thread_exit+0x20/0xb0 [spl]
[ 2280.902486]  [<ffffffff81090e0a>] kthread+0xea/0x100
[ 2280.902493]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2280.902502]  [<ffffffff81554bfc>] ret_from_fork+0x7c/0xb0
[ 2280.902508]  [<ffffffff81090d20>] ? kthread_create_on_node+0x1c0/0x1c0
[38275.738573] perf interrupt took too long (2506 > 2495), lowering kernel.perf_event_max_sample_rate to 50100

Running spl versions from Arch:

hinoki% pacman -Q | grep spl
spl-git 0.6.3_r54_g03a7835_3.18.2_2-1
spl-utils-git 0.6.3_r54_g03a7835_3.18.2_2-1
ioquatix commented 9 years ago

Free memory:

hinoki% free -h
              total        used        free      shared  buff/cache   available
Mem:           7.8G        3.6G        691M        1.0M        3.6G        872M
Swap:          4.0G          0B        4.0G
TonyHoyle commented 9 years ago

I'm getting this every couple of days now since upgrading to the latest Debian build (which, according to the linked bug, has the fix in it). The oops messages are more or less the same as those already posted. The zvols lock up and the load average climbs into the hundreds. I'd never had an issue prior to this.

StevenWolfe commented 9 years ago

I'm seeing similar symptoms after running a zfs rollback on an SMB-shared dataset. After watching the load rise quickly, I stopped the SMB service and will give the rollback command some time to see if it recovers.

From dmesg, on Ubuntu Server 14.04.2 LTS, ZoL 0.6.3-5 from the Ubuntu PPA:

[3764100.795021] smbd            D ffff88010b433480     0 34303  60686 0x00000000
[3764100.795025]  ffff8800ae491d68 0000000000000082 ffff8800cf951800 ffff8800ae491fd8
[3764100.795029]  0000000000013480 0000000000013480 ffff8800cf951800 ffff8800e48dc3d0
[3764100.795032]  ffff8800e48dc3a0 ffff8800e48dc3d8 ffff8800e48dc3c8 0000000000000000
[3764100.795037] Call Trace:
[3764100.795052]  [<ffffffff817251a9>] schedule+0x29/0x70
[3764100.795067]  [<ffffffffa006b7b5>] cv_wait_common+0x125/0x1c0 [spl]
[3764100.795073]  [<ffffffff810ab0b0>] ? prepare_to_wait_event+0x100/0x100
[3764100.795091]  [<ffffffffa006b865>] __cv_wait+0x15/0x20 [spl]
[3764100.795130]  [<ffffffffa017019b>] rrw_enter_read+0x3b/0x150 [zfs]
[3764100.795181]  [<ffffffffa01bf65d>] zfs_getattr_fast+0x3d/0x180 [zfs]
[3764100.795230]  [<ffffffffa01d81fd>] zpl_getattr+0x2d/0x50 [zfs]
[3764100.795234]  [<ffffffff811c2829>] vfs_getattr_nosec+0x29/0x40
[3764100.795237]  [<ffffffff811c28fd>] vfs_getattr+0x2d/0x40
[3764100.795240]  [<ffffffff811c29d2>] vfs_fstatat+0x62/0xa0
[3764100.795244]  [<ffffffff811c2e5f>] SYSC_newstat+0x1f/0x40
[3764100.795248]  [<ffffffff811cdc99>] ? putname+0x29/0x40
[3764100.795252]  [<ffffffff811bcfe8>] ? do_sys_open+0x1b8/0x280
[3764100.795256]  [<ffffffff811c30ae>] SyS_newstat+0xe/0x10
[3764100.795260]  [<ffffffff8173186d>] system_call_fastpath+0x1a/0x1f
[3764100.795263] INFO: task smbd:34313 blocked for more than 120 seconds.
stevleibelt commented 9 years ago

I can trigger the same error by running rsync between two different pools. The rsync process hangs and cannot be killed.

If I don't stop the rsync from the CLI, the pool will end up in a faulted state. The fault status is gone after a reboot.
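
(Side note, not part of the original report: when a pool gets suspended the way the log below shows, it can sometimes be resumed without a full reboot, assuming the underlying devices become reachable again:

# zpool clear <my target pool>
# zpool status -v <my target pool>

zpool clear asks ZFS to retry the failed I/O and resume the suspended pool; if the devices are still unreachable it will simply block or fail, and a reboot remains the fallback.)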

May 07 22:43:27 <my host> kernel: INFO: task txg_sync:1408 blocked for more than 120 seconds.
May 07 22:43:27 <my host> kernel:       Tainted: P           O    4.0.1-1-ARCH #1
May 07 22:43:27 <my host> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 07 22:43:27 <my host> kernel: txg_sync        D ffff880028b63a28     0  1408      2 0x00000000
May 07 22:43:27 <my host> kernel:  ffff880028b63a28 ffff88018c5f3cc0 ffff880070a16540 0000000000000000
May 07 22:43:27 <my host> kernel:  ffff880028b63fd8 ffff88021fc93e00 7fffffffffffffff ffff8801b1ccae08
May 07 22:43:27 <my host> kernel:  0000000000000001 ffff880028b63a48 ffffffff8156fa87 ffff88008e7bdcb0
May 07 22:43:27 <my host> kernel: Call Trace:
May 07 22:43:27 <my host> kernel:  [<ffffffff8156fa87>] schedule+0x37/0x90
May 07 22:43:27 <my host> kernel:  [<ffffffff8157246c>] schedule_timeout+0x1bc/0x250
May 07 22:43:27 <my host> kernel:  [<ffffffff8101f599>] ? read_tsc+0x9/0x10
May 07 22:43:27 <my host> kernel:  [<ffffffff810e6757>] ? ktime_get+0x37/0xb0
May 07 22:43:27 <my host> kernel:  [<ffffffff8156ef9a>] io_schedule_timeout+0xaa/0x130
May 07 22:43:27 <my host> kernel:  [<ffffffffa034daa0>] ? zio_taskq_member.isra.6+0x80/0x80 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa0192248>] cv_wait_common+0xb8/0x140 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffff810b6b20>] ? wake_atomic_t_function+0x60/0x60
May 07 22:43:27 <my host> kernel:  [<ffffffffa0192328>] __cv_wait_io+0x18/0x20 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffffa034f943>] zio_wait+0x123/0x210 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02d6be1>] dsl_pool_sync+0xc1/0x480 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02f1f80>] spa_sync+0x480/0xbf0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff810b6b36>] ? autoremove_wake_function+0x16/0x40
May 07 22:43:27 <my host> kernel:  [<ffffffffa0303e06>] txg_sync_thread+0x386/0x630 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff8156fd62>] ? preempt_schedule_common+0x22/0x40
May 07 22:43:27 <my host> kernel:  [<ffffffffa0303a80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa018d561>] thread_generic_wrapper+0x71/0x80 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffffa018d4f0>] ? __thread_exit+0x20/0x20 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffff81093338>] kthread+0xd8/0xf0
May 07 22:43:27 <my host> kernel:  [<ffffffff81093260>] ? kthread_worker_fn+0x170/0x170
May 07 22:43:27 <my host> kernel:  [<ffffffff81573718>] ret_from_fork+0x58/0x90
May 07 22:43:27 <my host> kernel:  [<ffffffff81093260>] ? kthread_worker_fn+0x170/0x170
May 07 22:43:27 <my host> kernel: INFO: task rsync:10064 blocked for more than 120 seconds.
May 07 22:43:27 <my host> kernel:       Tainted: P           O    4.0.1-1-ARCH #1
May 07 22:43:27 <my host> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 07 22:43:27 <my host> kernel: rsync           D ffff8801a4633a58     0 10064      1 0x00000004
May 07 22:43:27 <my host> kernel:  ffff8801a4633a58 ffff880216186f60 ffff880070a56540 ffff8801a4633a68
May 07 22:43:27 <my host> kernel:  ffff8801a4633fd8 ffff8800da2a4a20 ffff8800da2a4ae0 ffff8800da2a4a48
May 07 22:43:27 <my host> kernel:  0000000000000000 ffff8801a4633a78 ffffffff8156fa87 ffff8800da2a4a20
May 07 22:43:27 <my host> kernel: Call Trace:
May 07 22:43:27 <my host> kernel:  [<ffffffff8156fa87>] schedule+0x37/0x90
May 07 22:43:27 <my host> kernel:  [<ffffffffa019229d>] cv_wait_common+0x10d/0x140 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffff810b6b20>] ? wake_atomic_t_function+0x60/0x60
May 07 22:43:27 <my host> kernel:  [<ffffffffa01922e5>] __cv_wait+0x15/0x20 [spl]
May 07 22:43:27 <my host> kernel:  [<ffffffffa030327b>] txg_wait_synced+0x8b/0xd0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02c078c>] dmu_tx_wait+0x25c/0x3a0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02c096e>] dmu_tx_assign+0x9e/0x520 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02c8e10>] ? dsl_dataset_block_freeable+0x20/0x70 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa02be639>] ? dmu_tx_count_dnode+0x59/0xb0 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffffa033f5de>] zfs_write+0x3ce/0xc50 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff81515821>] ? unix_stream_recvmsg+0x701/0x7e0
May 07 22:43:27 <my host> kernel:  [<ffffffffa03558cd>] zpl_write+0xbd/0x130 [zfs]
May 07 22:43:27 <my host> kernel:  [<ffffffff811d56b3>] vfs_write+0xb3/0x200
May 07 22:43:27 <my host> kernel:  [<ffffffff811d624c>] ? vfs_read+0x11c/0x140
May 07 22:43:27 <my host> kernel:  [<ffffffff811d6399>] SyS_write+0x59/0xd0
May 07 22:43:27 <my host> kernel:  [<ffffffff815737c9>] system_call_fastpath+0x12/0x17
May 07 22:44:27 <my host> kernel: WARNING: Pool '<my target pool>' has encountered an uncorrectable I/O failure and has been suspended.
May 07 22:44:27 <my host> zed[12566]: eid=2302 class=data pool=<my target pool>
May 07 22:44:27 <my host> zed[12569]: eid=2303 class=io_failure pool=<my target pool>
# zpool status -v
  pool: <my target pool>
 state: ONLINE
  scan: scrub repaired 0 in 9h15m with 0 errors on Thu May  7 09:01:37 2015
config:

        NAME                    STATE     READ WRITE CKSUM
        <my target pool>          ONLINE       0     0     0
          <my target pool>-crypt  ONLINE       0     0     0

errors: No known data errors

  pool: <my source pool>
 state: ONLINE
  scan: scrub repaired 0 in 6h3m with 0 errors on Thu May  7 05:49:33 2015
config:

        NAME                       STATE     READ WRITE CKSUM
        <my source pool>               ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            crypt-<my source pool>-00  ONLINE       0     0     0
            crypt-<my source pool>-01  ONLINE       0     0     0

errors: No known data errors
#modinfo zfs | head
filename:       /lib/modules/4.0.1-1-ARCH/extra/zfs/zfs.ko.gz
version:        0.6.4.1-1
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     8324F6AEA2A06B2B6F0A0F5
depends:        spl,znvpair,zunicode,zcommon,zavl
vermagic:       4.0.1-1-ARCH SMP preempt mod_unload modversions 
#modinfo spl | head
filename:       /lib/modules/4.0.1-1-ARCH/extra/spl/spl.ko.gz
version:        0.6.4.1-1
license:        GPL
author:         OpenZFS on Linux
description:    Solaris Porting Layer
srcversion:     8907748310B8940C9D0DCD2
depends:        
vermagic:       4.0.1-1-ARCH SMP preempt mod_unload modversions 
#uname -a
Linux <my host> 4.0.1-1-ARCH #1 SMP PREEMPT Wed Apr 29 12:00:26 CEST 2015 x86_64 GNU/Linux

Fingers crossed I've provided good information. I'm running Arch Linux with the demz repo.

melkor217 commented 9 years ago

I got the same issue with a heavily loaded MongoDB on ZFS.

free -m
             total       used       free     shared    buffers     cached
Mem:          7928       7422        506         76        192       1024
-/+ buffers/cache:       6205       1723
Swap:            0          0          0
# modinfo spl|head
filename:       /lib/modules/4.0.0-1-amd64/updates/dkms/spl.ko
version:        0.6.4-2-62e2eb
license:        GPL
author:         OpenZFS on Linux
description:    Solaris Porting Layer
srcversion:     3F1EAF06925B312A0B3F767
depends:        
vermagic:       4.0.0-1-amd64 SMP mod_unload modversions 
parm:           spl_hostid:The system hostid. (ulong)
parm:           spl_hostid_path:The system hostid file (/etc/hostid) (charp)
# modinfo zfs|head
filename:       /lib/modules/4.0.0-1-amd64/updates/dkms/zfs.ko
version:        0.6.4-16-544f71
license:        CDDL
author:         OpenZFS on Linux
description:    ZFS
srcversion:     29BA21B62706579B75D5974
depends:        spl,znvpair,zunicode,zcommon,zavl
vermagic:       4.0.0-1-amd64 SMP mod_unload modversions 
parm:           zvol_inhibit_dev:Do not create zvol device nodes (uint)
parm:           zvol_major:Major number for zvol device (uint)
# uname -a
Linux dan-desktop 4.0.0-1-amd64 #1 SMP Debian 4.0.2-1 (2015-05-11) x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux unstable (sid)
Release:        unstable
Codename:       sid
andreas-p commented 8 years ago

Same for me on Debian Jessie with Linux 3.16.0-4-amd64 and zfs 0.6.5.2-2. It happened on a PostgreSQL server with no load at the time.

Jan 22 07:04:42 db04 kernel: [5056080.684110] INFO: task txg_sync:378 blocked for more than 120 seconds.
Jan 22 07:04:43 db04 kernel: [5056080.684128] Tainted: P O 3.16.0-4-amd64 #1
Jan 22 07:04:43 db04 kernel: [5056080.684134] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 22 07:04:43 db04 kernel: [5056080.684142] txg_sync D ffff8800f96bc7e8 0 378 2 0x00000000
Jan 22 07:04:43 db04 kernel: [5056080.684153] ffff8800f96bc390 0000000000000246 0000000000012f00 ffff8800f707ffd8
Jan 22 07:04:43 db04 kernel: [5056080.684165] 0000000000012f00 ffff8800f96bc390 ffff8800ff3137b0 ffff880048db7570
Jan 22 07:04:43 db04 kernel: [5056080.684176] ffff880048db75b0 0000000000000001 ffff8800f8eb9000 0000000000000000
Jan 22 07:04:43 db04 kernel: [5056080.684187] Call Trace:
Jan 22 07:04:43 db04 kernel: [5056080.684201] [] ? io_schedule+0x99/0x120
Jan 22 07:04:43 db04 kernel: [5056080.684220] [] ? cv_wait_common+0x90/0x100 [spl]
Jan 22 07:04:43 db04 kernel: [5056080.684232] [] ? prepare_to_wait_event+0xf0/0xf0
Jan 22 07:04:43 db04 kernel: [5056080.684264] [] ? zio_wait+0x10b/0x1e0 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684287] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684313] [] ? spa_sync+0x366/0xb30 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684339] [] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684363] [] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jan 22 07:04:43 db04 kernel: [5056080.684374] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jan 22 07:04:43 db04 kernel: [5056080.684388] [] ? __thread_exit+0x20/0x20 [spl]
Jan 22 07:04:43 db04 kernel: [5056080.684398] [] ? kthread+0xbd/0xe0
Jan 22 07:04:43 db04 kernel: [5056080.684406] [] ? kthread_create_on_node+0x180/0x180
Jan 22 07:04:43 db04 kernel: [5056080.684417] [] ? ret_from_fork+0x58/0x90
Jan 22 07:04:43 db04 kernel: [5056080.684428] [] ? kthread_create_on_node+0x180/0x180

kernelOfTruth commented 8 years ago

@andreas-p please update to 0.6.5.4* if available

alternatively: you can build your own latest zfsonlinux packages:

http://zfsonlinux.org/generic-deb.html

andreas-p commented 8 years ago

Unfortunately, the Debian 8 DKMS package is still on 0.6.5.2, with no update in the last three months. Any clue when this gets resolved?

kernelOfTruth commented 8 years ago

@andreas-p sorry, no idea,

but there's always the option of building the packages on your own - it's some effort, but you'll know you can trust them instead of having to rely on third-party repositories, etc.
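
(For anyone going that route, a rough sketch of the generic-deb build from the page linked above, assuming the spl and zfs release tarballs have already been downloaded; the 0.6.5.7 version and the dependency list here are only illustrative, so check the linked instructions for the exact packages your distro needs:

$ sudo apt-get install build-essential autoconf libtool gawk alien fakeroot \
    zlib1g-dev uuid-dev libblkid-dev libattr1-dev linux-headers-$(uname -r)
$ tar xf spl-0.6.5.7.tar.gz && cd spl-0.6.5.7
$ ./configure && make deb && sudo dpkg -i *.deb && cd ..
$ tar xf zfs-0.6.5.7.tar.gz && cd zfs-0.6.5.7
$ ./configure && make deb && sudo dpkg -i *.deb

SPL has to be built and installed first, because the zfs configure step looks for it.)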

andreas-p commented 8 years ago

Got the very same problem with 0.6.5.4 on a different machine running Debian 8, with zfs built from source. The stack trace shows exactly the same positions as the trace from Jan 18. Virtual machine: 6GB RAM, 1.5GB swap, 4 AMD CPU cores.

12:54 Started an rsync from a 2.5TB xfs partition to a 4TB zfs partition; memory rose from 2GB to 5GB within 5 minutes.
13:09 CPU load/1m stepped up from 1 to 10; CPU utilization around 50%, mostly system.
13:11 First txg_sync hung-task timeout appeared; CPU load/1m stepped to 16, committed memory dropped to 0, CPU utilization 85% system.
13:15 CPU utilization 95% system, load 16, reboot needed.

hostingnuggets commented 8 years ago

Same problem here with Debian 8, ZoL 0.6.5.2 from the official package repository.

andreas-p commented 8 years ago

Exactly the same block with 0.6.5.5:

Mar 24 11:13:06 db04 kernel: [4746960.372343] [] ? io_schedule+0x99/0x120
Mar 24 11:13:06 db04 kernel: [4746960.372358] [] ? cv_wait_common+0x92/0x110 [spl]
Mar 24 11:13:06 db04 kernel: [4746960.372370] [] ? prepare_to_wait_event+0xf0/0xf0
Mar 24 11:13:06 db04 kernel: [4746960.372400] [] ? zio_wait+0x10b/0x1e0 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372423] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372447] [] ? spa_sync+0x366/0xb30 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372472] [] ? txg_sync_thread+0x3d1/0x680 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372496] [] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Mar 24 11:13:06 db04 kernel: [4746960.372507] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
Mar 24 11:13:06 db04 kernel: [4746960.372518] [] ? __thread_exit+0x20/0x20 [spl]
Mar 24 11:13:06 db04 kernel: [4746960.372529] [] ? kthread+0xbd/0xe0
Mar 24 11:13:06 db04 kernel: [4746960.372547] [] ? kthread_create_on_node+0x180/0x180
Mar 24 11:13:06 db04 kernel: [4746960.372558] [] ? ret_from_fork+0x58/0x90
Mar 24 11:13:06 db04 kernel: [4746960.372567] [] ? kthread_create_on_node+0x180/0x180

I got two consecutive "txg_sync blocked for more than 120 seconds", then it went back to normal.

narun4sk commented 8 years ago

I'm suffering from a similar problem. rsync reads/writes are extremely slow, around 3MB/s. After struggling like this for ~24h (it's a multi-million-file dataset) the machine gets bricked.

Please advise.


# uname -a
Linux ip-172-30-0-118 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.5 (jessie)
Release:        8.5
Codename:       jessie
# dpkg -l '*zfs*' | grep ii
ii  libzfs2linux   0.6.5.7-1    amd64        OpenZFS filesystem library for Linux
ii  zfs-dkms       0.6.5.7-1    all          OpenZFS filesystem kernel modules for Linux
ii  zfs-zed        0.6.5.7-1    amd64        OpenZFS Event Daemon
ii  zfsutils-linux 0.6.5.7-1    amd64        command-line tools to manage OpenZFS filesystems
# dpkg -l '*spl*' | grep ii
ii  spl            0.6.5.7-1    amd64        Solaris Porting Layer user-space utilities for Linux
ii  spl-dkms       0.6.5.7-1    all          Solaris Porting Layer kernel modules for Linux
# zpool status
  pool: zfs-backup
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfs-backup  ONLINE       0     0     0
          xvdf      ONLINE       0     0     0
        logs
          xvdg      ONLINE       0     0     0

errors: No known data errors
# zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zfs-backup   564G   380G    538      0  1.48M  2.42K
  xvdf       564G   380G    538      0  1.48M  2.11K
logs            -      -      -      -      -      -
  xvdg          0  1.98G      0      0    471    312
----------  -----  -----  -----  -----  -----  -----
# zpool get all
NAME        PROPERTY                    VALUE                       SOURCE
zfs-backup  size                        944G                        -
zfs-backup  capacity                    59%                         -
zfs-backup  altroot                     -                           default
zfs-backup  health                      ONLINE                      -
zfs-backup  guid                        2876612074418704500         default
zfs-backup  version                     -                           default
zfs-backup  bootfs                      -                           default
zfs-backup  delegation                  on                          default
zfs-backup  autoreplace                 off                         default
zfs-backup  cachefile                   -                           default
zfs-backup  failmode                    wait                        default
zfs-backup  listsnapshots               off                         default
zfs-backup  autoexpand                  off                         default
zfs-backup  dedupditto                  0                           default
zfs-backup  dedupratio                  1.00x                       -
zfs-backup  free                        380G                        -
zfs-backup  allocated                   564G                        -
zfs-backup  readonly                    off                         -
zfs-backup  ashift                      0                           default
zfs-backup  comment                     -                           default
zfs-backup  expandsize                  -                           -
zfs-backup  freeing                     0                           default
zfs-backup  fragmentation               44%                         -
zfs-backup  leaked                      0                           default
zfs-backup  feature@async_destroy       enabled                     local
zfs-backup  feature@empty_bpobj         active                      local
zfs-backup  feature@lz4_compress        active                      local
zfs-backup  feature@spacemap_histogram  active                      local
zfs-backup  feature@enabled_txg         active                      local
zfs-backup  feature@hole_birth          active                      local
zfs-backup  feature@extensible_dataset  enabled                     local
zfs-backup  feature@embedded_data       active                      local
zfs-backup  feature@bookmarks           enabled                     local
zfs-backup  feature@filesystem_limits   enabled                     local
zfs-backup  feature@large_blocks        enabled                     local
Jun 15 16:55:40 ip-172-30-0-118 kernel: [113640.618167] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 16:55:41 ip-172-30-0-118 kernel: [113640.621686]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 16:55:43 ip-172-30-0-118 kernel: [113640.625572]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 16:55:43 ip-172-30-0-118 kernel: [113640.629691]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 16:55:44 ip-172-30-0-118 kernel: [113640.633524] Call Trace:
Jun 15 16:55:45 ip-172-30-0-118 kernel: [113640.634740]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 16:55:47 ip-172-30-0-118 kernel: [113640.637492]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 16:55:48 ip-172-30-0-118 kernel: [113640.640409]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 16:55:49 ip-172-30-0-118 kernel: [113640.643375]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 16:55:49 ip-172-30-0-118 kernel: [113640.646118]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 16:55:50 ip-172-30-0-118 kernel: [113640.649130]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 16:55:51 ip-172-30-0-118 kernel: [113640.651910]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 16:55:53 ip-172-30-0-118 kernel: [113640.655111]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 16:55:53 ip-172-30-0-118 kernel: [113640.658303]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 16:55:54 ip-172-30-0-118 kernel: [113640.661609]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 16:55:55 ip-172-30-0-118 kernel: [113640.664528]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 16:55:56 ip-172-30-0-118 kernel: [113640.667024]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:55:58 ip-172-30-0-118 kernel: [113640.670190]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 16:55:58 ip-172-30-0-118 kernel: [113640.672778]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:57:39 ip-172-30-0-118 kernel: [113760.681975] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 16:57:40 ip-172-30-0-118 kernel: [113760.685521]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 16:57:41 ip-172-30-0-118 kernel: [113760.689295]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 16:57:42 ip-172-30-0-118 kernel: [113760.693211]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 16:57:43 ip-172-30-0-118 kernel: [113760.697110] Call Trace:
Jun 15 16:57:44 ip-172-30-0-118 kernel: [113760.698425]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 16:57:45 ip-172-30-0-118 kernel: [113760.701106]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 16:57:47 ip-172-30-0-118 kernel: [113760.704131]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 16:57:48 ip-172-30-0-118 kernel: [113760.707157]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 16:57:49 ip-172-30-0-118 kernel: [113760.709944]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 16:57:51 ip-172-30-0-118 kernel: [113760.713084]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 16:57:52 ip-172-30-0-118 kernel: [113760.715920]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 16:57:54 ip-172-30-0-118 kernel: [113760.719014]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 16:57:55 ip-172-30-0-118 kernel: [113760.722283]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 16:57:57 ip-172-30-0-118 kernel: [113760.725571]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 16:57:57 ip-172-30-0-118 kernel: [113760.728485]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 16:57:58 ip-172-30-0-118 kernel: [113760.730974]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:58:00 ip-172-30-0-118 kernel: [113760.734102]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 16:58:02 ip-172-30-0-118 kernel: [113760.736819]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:59:38 ip-172-30-0-118 kernel: [113880.747394] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 16:59:39 ip-172-30-0-118 kernel: [113880.750796]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 16:59:39 ip-172-30-0-118 kernel: [113880.754658]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 16:59:41 ip-172-30-0-118 kernel: [113880.758412]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 16:59:42 ip-172-30-0-118 kernel: [113880.762234] Call Trace:
Jun 15 16:59:43 ip-172-30-0-118 kernel: [113880.763454]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 16:59:44 ip-172-30-0-118 kernel: [113880.766117]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 16:59:45 ip-172-30-0-118 kernel: [113880.769024]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 16:59:45 ip-172-30-0-118 kernel: [113880.772106]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 16:59:46 ip-172-30-0-118 kernel: [113880.775310]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 16:59:48 ip-172-30-0-118 kernel: [113880.778331]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 16:59:50 ip-172-30-0-118 kernel: [113880.781198]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 16:59:51 ip-172-30-0-118 kernel: [113880.784324]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 16:59:52 ip-172-30-0-118 kernel: [113880.787500]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 16:59:54 ip-172-30-0-118 kernel: [113880.790757]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 16:59:55 ip-172-30-0-118 kernel: [113880.793707]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 16:59:56 ip-172-30-0-118 kernel: [113880.796081]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 16:59:57 ip-172-30-0-118 kernel: [113880.799104]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 16:59:58 ip-172-30-0-118 kernel: [113880.801724]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 17:01:41 ip-172-30-0-118 kernel: [114000.817906] txg_sync        D ffff880079c259c8     0  1361      2 0x00000000
Jun 15 17:01:42 ip-172-30-0-118 kernel: [114000.822256]  ffff880079c25570 0000000000000046 0000000000012f00 ffff88007aaf3fd8
Jun 15 17:01:43 ip-172-30-0-118 kernel: [114000.827952]  0000000000012f00 ffff880079c25570 ffff88007fc137b0 ffff8800070cf050
Jun 15 17:01:45 ip-172-30-0-118 kernel: [114000.833092]  ffff8800070cf090 0000000000000001 ffff8800790a1000 0000000000000000
Jun 15 17:01:45 ip-172-30-0-118 kernel: [114000.837533] Call Trace:
Jun 15 17:01:47 ip-172-30-0-118 kernel: [114000.838994]  [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jun 15 17:01:48 ip-172-30-0-118 kernel: [114000.842069]  [<ffffffffa0152572>] ? cv_wait_common+0x92/0x110 [spl]
Jun 15 17:01:49 ip-172-30-0-118 kernel: [114000.845833]  [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jun 15 17:01:50 ip-172-30-0-118 kernel: [114000.849362]  [<ffffffffa029d12b>] ? zio_wait+0x10b/0x1e0 [zfs]
Jun 15 17:01:51 ip-172-30-0-118 kernel: [114000.852630]  [<ffffffffa022a77a>] ? dsl_pool_sync+0xaa/0x460 [zfs]
Jun 15 17:01:52 ip-172-30-0-118 kernel: [114000.856037]  [<ffffffffa0244766>] ? spa_sync+0x366/0xb30 [zfs]
Jun 15 17:01:53 ip-172-30-0-118 kernel: [114000.859185]  [<ffffffffa0256231>] ? txg_sync_thread+0x3d1/0x680 [zfs]
Jun 15 17:01:54 ip-172-30-0-118 kernel: [114000.862633]  [<ffffffffa0255e60>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
Jun 15 17:01:55 ip-172-30-0-118 kernel: [114000.866301]  [<ffffffffa014dcab>] ? thread_generic_wrapper+0x6b/0x80 [spl]
Jun 15 17:01:56 ip-172-30-0-118 kernel: [114000.870041]  [<ffffffffa014dc40>] ? __thread_exit+0x20/0x20 [spl]
Jun 15 17:01:57 ip-172-30-0-118 kernel: [114000.873530]  [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jun 15 17:01:57 ip-172-30-0-118 kernel: [114000.876425]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jun 15 17:01:59 ip-172-30-0-118 kernel: [114000.879915]  [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jun 15 17:02:00 ip-172-30-0-118 kernel: [114000.882644]  [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
kernelOfTruth commented 8 years ago

@narunask could you please post output of

cat /proc/spl/kstat/zfs/arcstats

or is access to the box denied in the meantime ("bricked")?

Also, please post some hardware and configuration specifics (RAM, processor, mainboard, hard drive type, LVM/cryptsetup/etc.).

thanks
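
(For reference, a command set along these lines collects most of that on a typical ZoL box; lsblk and zpool status -v are additions beyond what was asked for, included to cover the drive layout:

# cat /proc/spl/kstat/zfs/arcstats
# free -m
# cat /proc/cpuinfo
# dmidecode --type memory
# lsblk -o NAME,SIZE,ROTA,TYPE,MOUNTPOINT
# zpool status -v

The ROTA column from lsblk shows whether a device is rotational, which answers the hard drive type question.)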

narun4sk commented 8 years ago

The server is on EC2, currently 1 core, 2GB RAM (t2.small); the HDD is the magnetic type, which I believe is documented here.

The HDD I'm copying from is LVM-based, consisting of 3 PVs. The HDDs are not encrypted.

Also, FYI: rsync is currently calculating deltas and should start moving data again very soon, in case that adds something to the stats.

# cat /proc/spl/kstat/zfs/arcstats
6 1 0x01 91 4368 35041278654 31678952268315
name                            type data
hits                            4    42347862
misses                          4    10285010
demand_data_hits                4    0
demand_data_misses              4    68
demand_metadata_hits            4    34887103
demand_metadata_misses          4    6345542
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    7460759
prefetch_metadata_misses        4    3939400
mru_hits                        4    20830447
mru_ghost_hits                  4    2991105
mfu_hits                        4    14056656
mfu_ghost_hits                  4    1782606
deleted                         4    2167955
mutex_miss                      4    90369
evict_skip                      4    1781974161
evict_not_enough                4    15390660
evict_l2_cached                 4    0
evict_l2_eligible               4    35359622656
evict_l2_ineligible             4    55754657792
evict_l2_skip                   4    0
hash_elements                   4    5452
hash_elements_max               4    48255
hash_collisions                 4    57676
hash_chains                     4    69
hash_chain_max                  4    4
p                               4    7007439
c                               4    34933248
c_min                           4    33554432
c_max                           4    1053282304
size                            4    105425184
hdr_size                        4    1969640
data_size                       4    0
metadata_size                   4    30842880
other_size                      4    72612664
anon_size                       4    753664
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    27006976
mru_evictable_data              4    0
mru_evictable_metadata          4    3653632
mru_ghost_size                  4    6850560
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    6850560
mfu_size                        4    3082240
mfu_evictable_data              4    0
mfu_evictable_metadata          4    32768
mfu_ghost_size                  4    25972224
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    25972224
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_compress_successes           4    0
l2_compress_zeros               4    0
l2_compress_failures            4    0
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    0
memory_direct_count             4    947
memory_indirect_count           4    259028
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    3960
arc_meta_used                   4    105425184
arc_meta_limit                  4    789961728
arc_meta_max                    4    816526072
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    32911360
# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x25
cpu MHz         : 2394.530
cache size      : 30720 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm xsaveopt fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 4789.06
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
# dmidecode --type memory
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.4 present.

Handle 0x1000, DMI type 16, 15 bytes
Physical Memory Array
        Location: Other
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 2 GB
        Error Information Handle: Not Provided
        Number Of Devices: 1

Handle 0x1100, DMI type 17, 21 bytes
Memory Device
        Array Handle: 0x1000
        Error Information Handle: 0x0000
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 2048 MB
        Form Factor: DIMM
        Set: None
        Locator: DIMM 0
        Bank Locator: Not Specified
        Type: RAM
        Type Detail: None

Thanks

andreas-p commented 8 years ago

I hadn't seen the problem for quite a while (it seemed to have gone away since 0.6.5.6), but this morning the very same hung task came up again with 0.6.5.7 (a single occurrence). Sigh...

ioquatix commented 8 years ago

I can confirm that I have not seen this issue for ages on the same hardware; it has not recurred in probably the last 3-6 months.

ioquatix commented 7 years ago

Still happening with heavy rsync backups.

[88234.807775] INFO: task rsync:18699 blocked for more than 120 seconds.
[88234.807929]       Tainted: P           O    4.8.13-1-ARCH #1
[88234.808026] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88234.808159] rsync           D ffff880124bcb9d0     0 18699  18698 0x00000000
[88234.808171]  ffff880124bcb9d0 00ffffff81003016 ffff88021cd00d00 ffff880127fa6800
[88234.808179]  ffff88022fc17ff0 ffff880124bcc000 ffff88021d8eaa10 ffff88021d8eab58
[88234.808185]  ffff88021d8eaa38 0000000000000000 ffff880124bcb9e8 ffffffff815f40ec
[88234.808190] Call Trace:
[88234.808203]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[88234.808217]  [<ffffffffa0171e1f>] cv_wait_common+0x10f/0x130 [spl]
[88234.808225]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[88234.808235]  [<ffffffffa0171e55>] __cv_wait+0x15/0x20 [spl]
[88234.808294]  [<ffffffffa03095f8>] txg_wait_open+0xa8/0xe0 [zfs]
[88234.808338]  [<ffffffffa02c73db>] dmu_tx_wait+0x32b/0x340 [zfs]
[88234.808380]  [<ffffffffa02c747b>] dmu_tx_assign+0x8b/0x490 [zfs]
[88234.808422]  [<ffffffffa0342d09>] zfs_write+0x3f9/0xc80 [zfs]
[88234.808461]  [<ffffffffa034f567>] ? zio_destroy+0xb7/0xc0 [zfs]
[88234.808500]  [<ffffffffa0352c78>] ? zio_wait+0x138/0x1d0 [zfs]
[88234.808507]  [<ffffffffa016ab9a>] ? spl_kmem_free+0x2a/0x40 [spl]
[88234.808551]  [<ffffffffa0337510>] ? zfs_range_unlock+0x1a0/0x2c0 [zfs]
[88234.808590]  [<ffffffffa035826c>] zpl_write_common_iovec+0x8c/0xe0 [zfs]
[88234.808627]  [<ffffffffa0358497>] zpl_write+0x87/0xc0 [zfs]
[88234.808635]  [<ffffffff81208797>] __vfs_write+0x37/0x140
[88234.808642]  [<ffffffff810c7be7>] ? percpu_down_read+0x17/0x50
[88234.808648]  [<ffffffff81209566>] vfs_write+0xb6/0x1a0
[88234.808652]  [<ffffffff8120a9e5>] SyS_write+0x55/0xc0
[88234.808658]  [<ffffffff815f8032>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[88234.808664] INFO: task imap:19376 blocked for more than 120 seconds.
[88234.808803]       Tainted: P           O    4.8.13-1-ARCH #1
[88234.808898] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88234.809029] imap            D ffff8801b875fad8     0 19376   1488 0x00000000
[88234.809036]  ffff8801b875fad8 00ff88021d8eab00 ffffffff81a0d500 ffff8801a9f95b00
[88234.809042]  ffff8801b875fab8 ffff8801b8760000 ffff88021d8eaa10 ffff88021d8eab58
[88234.809048]  ffff88021d8eaa38 0000000000000000 ffff8801b875faf0 ffffffff815f40ec
[88234.809053] Call Trace:
[88234.809059]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[88234.809067]  [<ffffffffa0171e1f>] cv_wait_common+0x10f/0x130 [spl]
[88234.809073]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[88234.809081]  [<ffffffffa0171e55>] __cv_wait+0x15/0x20 [spl]
[88234.809131]  [<ffffffffa03095f8>] txg_wait_open+0xa8/0xe0 [zfs]
[88234.809175]  [<ffffffffa02c73db>] dmu_tx_wait+0x32b/0x340 [zfs]
[88234.809217]  [<ffffffffa02c747b>] dmu_tx_assign+0x8b/0x490 [zfs]
[88234.809259]  [<ffffffffa0342299>] zfs_dirty_inode+0xe9/0x300 [zfs]
[88234.809265]  [<ffffffff8122a3b4>] ? mntput+0x24/0x40
[88234.809272]  [<ffffffff81189adb>] ? release_pages+0x2cb/0x380
[88234.809278]  [<ffffffff811c763e>] ? free_pages_and_swap_cache+0x8e/0xa0
[88234.809315]  [<ffffffffa035a2cc>] zpl_dirty_inode+0x2c/0x40 [zfs]
[88234.809322]  [<ffffffff81237785>] __mark_inode_dirty+0x45/0x400
[88234.809362]  [<ffffffffa0346fe8>] zfs_mark_inode_dirty+0x48/0x50 [zfs]
[88234.809399]  [<ffffffffa0358826>] zpl_release+0x46/0x90 [zfs]
[88234.809404]  [<ffffffff8120b42f>] __fput+0x9f/0x1e0
[88234.809408]  [<ffffffff8120b5ae>] ____fput+0xe/0x10
[88234.809414]  [<ffffffff8109a0d0>] task_work_run+0x80/0xa0
[88234.809420]  [<ffffffff8100366a>] exit_to_usermode_loop+0xba/0xc0
[88234.809425]  [<ffffffff81003b2e>] syscall_return_slowpath+0x4e/0x60
[88234.809430]  [<ffffffff815f80ba>] entry_SYSCALL_64_fastpath+0xa2/0xa4   

Most of the time performance is fine.

ioquatix commented 7 years ago

So, I replaced the drive which might have been causing issues with a new Samsung 840 PRO SSD, partitioned as 50GB for the OS, 4GB swap, and the rest available as L2ARC.

I've set the ARC max size to 4GB on a system with only 8GB of memory.
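
(For context, not part of the original comment: an ARC cap like that is normally applied through the zfs_arc_max module parameter; the 4GB figure above is used here only as the example value, 4294967296 bytes = 4 GiB.

/etc/modprobe.d/zfs.conf:
    options zfs zfs_arc_max=4294967296

Or applied at runtime, without reloading the module:

# echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

The modprobe.d setting takes effect the next time the zfs module is loaded.)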

Tonight the same issue occurred: rsync ran and then everything ground to a halt, so it seems unlikely to be a disk issue. Nominal read speeds are 100MB/s+, so normally things are humming along quite nicely.

ioquatix commented 7 years ago

I can reproduce this issue fairly easily so just let me know what information you'd like me to collect and I'll try to do it.

ioquatix commented 7 years ago
[78889.098553] INFO: task txg_sync:1372 blocked for more than 120 seconds.
[78889.098667]       Tainted: P           O    4.8.13-1-ARCH #1
[78889.098719] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[78889.098790] txg_sync        D ffff88021c893ab8     0  1372      2 0x00000000
[78889.098798]  ffff88021c893ab8 00ffffff810a5d2f ffff8802212b5b00 ffff88021d3ece00
[78889.098802]  0000000000000046 ffff88021c894000 ffff88022fc17f80 7fffffffffffffff
[78889.098806]  ffff880105123b60 0000000000000001 ffff88021c893ad0 ffffffff815f40ec
[78889.098809] Call Trace:
[78889.098818]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[78889.098821]  [<ffffffff815f6fa3>] schedule_timeout+0x243/0x3d0
[78889.098825]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[78889.098828]  [<ffffffff810bfd9d>] ? __wake_up_common+0x4d/0x80
[78889.098832]  [<ffffffff810f2c51>] ? ktime_get+0x41/0xb0
[78889.098835]  [<ffffffff815f38c4>] io_schedule_timeout+0xa4/0x110
[78889.098843]  [<ffffffffa0203dc1>] cv_wait_common+0xb1/0x130 [spl]
[78889.098846]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[78889.098851]  [<ffffffffa0203e98>] __cv_wait_io+0x18/0x20 [spl]
[78889.098886]  [<ffffffffa0343c3d>] zio_wait+0xfd/0x1d0 [zfs]
[78889.098912]  [<ffffffffa02cf258>] dsl_pool_sync+0xb8/0x480 [zfs]
[78889.098939]  [<ffffffffa02ea38f>] spa_sync+0x37f/0xb30 [zfs]
[78889.098942]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[78889.098969]  [<ffffffffa02fac1a>] txg_sync_thread+0x3ba/0x620 [zfs]
[78889.098997]  [<ffffffffa02fa860>] ? txg_delay+0x160/0x160 [zfs]
[78889.099002]  [<ffffffffa01fef61>] thread_generic_wrapper+0x71/0x80 [spl]
[78889.099006]  [<ffffffffa01feef0>] ? __thread_exit+0x20/0x20 [spl]
[78889.099010]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[78889.099013]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[78889.099016]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[78889.099019]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[79380.619090] INFO: task txg_sync:1372 blocked for more than 120 seconds.
[79380.619245]       Tainted: P           O    4.8.13-1-ARCH #1
[79380.619342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[79380.619475] txg_sync        D ffff88021c893ab8     0  1372      2 0x00000000
[79380.619487]  ffff88021c893ab8 00ffffff810a5d2f ffffffff81a0d500 ffff88021d3ece00
[79380.619494]  0000000000000046 ffff88021c894000 ffff88022fc17f80 7fffffffffffffff
[79380.619500]  ffff8801e9c0c850 0000000000000001 ffff88021c893ad0 ffffffff815f40ec
[79380.619506] Call Trace:
[79380.619519]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[79380.619525]  [<ffffffff815f6fa3>] schedule_timeout+0x243/0x3d0
[79380.619531]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[79380.619537]  [<ffffffff810bfd9d>] ? __wake_up_common+0x4d/0x80
[79380.619543]  [<ffffffff810f2c51>] ? ktime_get+0x41/0xb0
[79380.619547]  [<ffffffff815f38c4>] io_schedule_timeout+0xa4/0x110
[79380.619560]  [<ffffffffa0203dc1>] cv_wait_common+0xb1/0x130 [spl]
[79380.619565]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[79380.619574]  [<ffffffffa0203e98>] __cv_wait_io+0x18/0x20 [spl]
[79380.619623]  [<ffffffffa0343c3d>] zio_wait+0xfd/0x1d0 [zfs]
[79380.619672]  [<ffffffffa02cf258>] dsl_pool_sync+0xb8/0x480 [zfs]
[79380.619724]  [<ffffffffa02ea38f>] spa_sync+0x37f/0xb30 [zfs]
[79380.619729]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[79380.619780]  [<ffffffffa02fac1a>] txg_sync_thread+0x3ba/0x620 [zfs]
[79380.619830]  [<ffffffffa02fa860>] ? txg_delay+0x160/0x160 [zfs]
[79380.619839]  [<ffffffffa01fef61>] thread_generic_wrapper+0x71/0x80 [spl]
[79380.619847]  [<ffffffffa01feef0>] ? __thread_exit+0x20/0x20 [spl]
[79380.619854]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[79380.619860]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[79380.619866]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[79380.619872]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[81223.826929] INFO: task txg_sync:1372 blocked for more than 120 seconds.
[81223.827091]       Tainted: P           O    4.8.13-1-ARCH #1
[81223.827226] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[81223.827381] txg_sync        D ffff88021c893ab8     0  1372      2 0x00000000
[81223.827393]  ffff88021c893ab8 00ffffff810a5d2f ffff8802157c3400 ffff88021d3ece00
[81223.827400]  0000000000000046 ffff88021c894000 ffff88022fc17f80 7fffffffffffffff
[81223.827407]  ffff88015f37d0f0 0000000000000001 ffff88021c893ad0 ffffffff815f40ec
[81223.827413] Call Trace:
[81223.827426]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[81223.827432]  [<ffffffff815f6fa3>] schedule_timeout+0x243/0x3d0
[81223.827439]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[81223.827444]  [<ffffffff810bfd9d>] ? __wake_up_common+0x4d/0x80
[81223.827451]  [<ffffffff810f2c51>] ? ktime_get+0x41/0xb0
[81223.827455]  [<ffffffff815f38c4>] io_schedule_timeout+0xa4/0x110
[81223.827468]  [<ffffffffa0203dc1>] cv_wait_common+0xb1/0x130 [spl]
[81223.827473]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[81223.827482]  [<ffffffffa0203e98>] __cv_wait_io+0x18/0x20 [spl]
[81223.827531]  [<ffffffffa0343c3d>] zio_wait+0xfd/0x1d0 [zfs]
[81223.827580]  [<ffffffffa02cf258>] dsl_pool_sync+0xb8/0x480 [zfs]
[81223.827632]  [<ffffffffa02ea38f>] spa_sync+0x37f/0xb30 [zfs]
[81223.827636]  [<ffffffff810a6c12>] ? default_wake_function+0x12/0x20
[81223.827687]  [<ffffffffa02fac1a>] txg_sync_thread+0x3ba/0x620 [zfs]
[81223.827738]  [<ffffffffa02fa860>] ? txg_delay+0x160/0x160 [zfs]
[81223.827747]  [<ffffffffa01fef61>] thread_generic_wrapper+0x71/0x80 [spl]
[81223.827755]  [<ffffffffa01feef0>] ? __thread_exit+0x20/0x20 [spl]
[81223.827762]  [<ffffffff8109be38>] kthread+0xd8/0xf0
[81223.827767]  [<ffffffff8102c782>] ? __switch_to+0x2d2/0x630
[81223.827773]  [<ffffffff815f823f>] ret_from_fork+0x1f/0x40
[81223.827779]  [<ffffffff8109bd60>] ? kthread_worker_fn+0x170/0x170
[94923.528386] CE: hpet increased min_delta_ns to 20115 nsec
[102727.918958] INFO: task rsync:5577 blocked for more than 120 seconds.
[102727.919092]       Tainted: P           O    4.8.13-1-ARCH #1
[102727.919170] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[102727.919277] rsync           D ffff8801dfe9ba10     0  5577   5576 0x00000000
[102727.919287]  ffff8801dfe9ba10 00000000ffffffff ffffffff81a0d500 ffff8802157c3400
[102727.919293]  ffffffffa0340567 ffff8801dfe9c000 ffff88021ed41148 ffff88021ed41180
[102727.919298]  ffff88021ed41170 0000000000000000 ffff8801dfe9ba28 ffffffff815f40ec
[102727.919303] Call Trace:
[102727.919352]  [<ffffffffa0340567>] ? zio_destroy+0xb7/0xc0 [zfs]
[102727.919358]  [<ffffffff815f40ec>] schedule+0x3c/0x90
[102727.919367]  [<ffffffffa0203e1f>] cv_wait_common+0x10f/0x130 [spl]
[102727.919373]  [<ffffffff810c0450>] ? wake_atomic_t_function+0x60/0x60
[102727.919380]  [<ffffffffa0203e55>] __cv_wait+0x15/0x20 [spl]
[102727.919414]  [<ffffffffa02b814b>] dmu_tx_wait+0x9b/0x340 [zfs]
[102727.919448]  [<ffffffffa02b847b>] dmu_tx_assign+0x8b/0x490 [zfs]
[102727.919482]  [<ffffffffa0333d09>] zfs_write+0x3f9/0xc80 [zfs]
[102727.919514]  [<ffffffffa0340567>] ? zio_destroy+0xb7/0xc0 [zfs]
[102727.919544]  [<ffffffffa0343c78>] ? zio_wait+0x138/0x1d0 [zfs]
[102727.919579]  [<ffffffffa0328510>] ? zfs_range_unlock+0x1a0/0x2c0 [zfs]
[102727.919610]  [<ffffffffa034926c>] zpl_write_common_iovec+0x8c/0xe0 [zfs]
[102727.919640]  [<ffffffffa0349497>] zpl_write+0x87/0xc0 [zfs]
[102727.919646]  [<ffffffff81208797>] __vfs_write+0x37/0x140
[102727.919652]  [<ffffffff810c7be7>] ? percpu_down_read+0x17/0x50
[102727.919656]  [<ffffffff81209566>] vfs_write+0xb6/0x1a0
[102727.919660]  [<ffffffff8120a9e5>] SyS_write+0x55/0xc0
[102727.919665]  [<ffffffff815f8032>] entry_SYSCALL_64_fastpath+0x1a/0xa4

Happened last night during backups.

ioquatix commented 7 years ago

Okay, so I've replaced the drive which had the high await time and also increased the memory to 16GB, but I'm still having issues:

[12785.566973] CE: hpet increased min_delta_ns to 20115 nsec
[25560.317294] INFO: task txg_sync:1392 blocked for more than 120 seconds.
[25560.317451]       Tainted: P           O    4.9.6-1-ARCH #1
[25560.317542] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25560.317668] txg_sync        D    0  1392      2 0x00000000
[25560.317678]  ffff88041b855c00 0000000000000000 ffff8804104b2700 ffff88042fc180c0
[25560.317686]  ffffffff81a0e500 ffffc9000828fad8 ffffffff81605cdf ffff880411a4c080
[25560.317692]  00ffffffa0341360 ffff88042fc180c0 0000000000000000 ffff8804104b2700
[25560.317698] Call Trace:
[25560.317713]  [<ffffffff81605cdf>] ? __schedule+0x22f/0x6e0
[25560.317719]  [<ffffffff816061cd>] schedule+0x3d/0x90
[25560.317726]  [<ffffffff81608fd3>] schedule_timeout+0x243/0x3d0
[25560.317781]  [<ffffffffa033fbb1>] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[25560.317821]  [<ffffffffa033fbd2>] ? zio_issue_async+0x12/0x20 [zfs]
[25560.317859]  [<ffffffffa0343569>] ? zio_nowait+0x79/0x110 [zfs]
[25560.317867]  [<ffffffff810f7b81>] ? ktime_get+0x41/0xb0
[25560.317873]  [<ffffffff81605a44>] io_schedule_timeout+0xa4/0x110
[25560.317884]  [<ffffffffa01eccd1>] cv_wait_common+0xb1/0x130 [spl]
[25560.317891]  [<ffffffff810c4200>] ? wake_atomic_t_function+0x60/0x60
[25560.317900]  [<ffffffffa01ecda8>] __cv_wait_io+0x18/0x20 [spl]
[25560.317938]  [<ffffffffa034337c>] zio_wait+0xac/0x130 [zfs]
[25560.317984]  [<ffffffffa02cf408>] dsl_pool_sync+0xb8/0x480 [zfs]
[25560.318035]  [<ffffffffa02e9e1f>] spa_sync+0x37f/0xb30 [zfs]
[25560.318041]  [<ffffffff810aa4a2>] ? default_wake_function+0x12/0x20
[25560.318091]  [<ffffffffa02fa6aa>] txg_sync_thread+0x3ba/0x620 [zfs]
[25560.318096]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25560.318145]  [<ffffffffa02fa2f0>] ? txg_delay+0x160/0x160 [zfs]
[25560.318154]  [<ffffffffa01e7f22>] thread_generic_wrapper+0x72/0x80 [spl]
[25560.318161]  [<ffffffffa01e7eb0>] ? __thread_exit+0x20/0x20 [spl]
[25560.318167]  [<ffffffff8109e8f9>] kthread+0xd9/0xf0
[25560.318171]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25560.318176]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25560.318180]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25560.318184]  [<ffffffff8160a995>] ret_from_fork+0x25/0x30
[25683.204571] INFO: task txg_sync:1392 blocked for more than 120 seconds.
[25683.204722]       Tainted: P           O    4.9.6-1-ARCH #1
[25683.204814] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25683.204940] txg_sync        D    0  1392      2 0x00000000
[25683.204950]  ffff88041b855c00 0000000000000000 ffff8804104b2700 ffff88042fc180c0
[25683.204958]  ffffffff81a0e500 ffffc9000828fad8 ffffffff81605cdf ffff880411a4c080
[25683.204965]  00ffffffa0341360 ffff88042fc180c0 0000000000000000 ffff8804104b2700
[25683.204970] Call Trace:
[25683.204985]  [<ffffffff81605cdf>] ? __schedule+0x22f/0x6e0
[25683.204992]  [<ffffffff816061cd>] schedule+0x3d/0x90
[25683.204999]  [<ffffffff81608fd3>] schedule_timeout+0x243/0x3d0
[25683.205055]  [<ffffffffa033fbb1>] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[25683.205095]  [<ffffffffa033fbd2>] ? zio_issue_async+0x12/0x20 [zfs]
[25683.205134]  [<ffffffffa0343569>] ? zio_nowait+0x79/0x110 [zfs]
[25683.205142]  [<ffffffff810f7b81>] ? ktime_get+0x41/0xb0
[25683.205148]  [<ffffffff81605a44>] io_schedule_timeout+0xa4/0x110
[25683.205159]  [<ffffffffa01eccd1>] cv_wait_common+0xb1/0x130 [spl]
[25683.205166]  [<ffffffff810c4200>] ? wake_atomic_t_function+0x60/0x60
[25683.205175]  [<ffffffffa01ecda8>] __cv_wait_io+0x18/0x20 [spl]
[25683.205213]  [<ffffffffa034337c>] zio_wait+0xac/0x130 [zfs]
[25683.205259]  [<ffffffffa02cf408>] dsl_pool_sync+0xb8/0x480 [zfs]
[25683.205310]  [<ffffffffa02e9e1f>] spa_sync+0x37f/0xb30 [zfs]
[25683.205315]  [<ffffffff810aa4a2>] ? default_wake_function+0x12/0x20
[25683.205365]  [<ffffffffa02fa6aa>] txg_sync_thread+0x3ba/0x620 [zfs]
[25683.205371]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25683.205419]  [<ffffffffa02fa2f0>] ? txg_delay+0x160/0x160 [zfs]
[25683.205428]  [<ffffffffa01e7f22>] thread_generic_wrapper+0x72/0x80 [spl]
[25683.205436]  [<ffffffffa01e7eb0>] ? __thread_exit+0x20/0x20 [spl]
[25683.205441]  [<ffffffff8109e8f9>] kthread+0xd9/0xf0
[25683.205445]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[25683.205450]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25683.205454]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[25683.205458]  [<ffffffff8160a995>] ret_from_fork+0x25/0x30
[26051.866268] INFO: task txg_sync:1392 blocked for more than 120 seconds.
[26051.866424]       Tainted: P           O    4.9.6-1-ARCH #1
[26051.866516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[26051.866642] txg_sync        D    0  1392      2 0x00000000
[26051.866652]  ffff8803e6f2c000 0000000000000000 ffff8804104b2700 ffff88042fc180c0
[26051.866660]  ffff88041b50b400 ffffc9000828fad8 ffffffff81605cdf ffff880411a4c080
[26051.866667]  00ffffffa0341360 ffff88042fc180c0 0000000000000000 ffff8804104b2700
[26051.866673] Call Trace:
[26051.866687]  [<ffffffff81605cdf>] ? __schedule+0x22f/0x6e0
[26051.866694]  [<ffffffff816061cd>] schedule+0x3d/0x90
[26051.866700]  [<ffffffff81608fd3>] schedule_timeout+0x243/0x3d0
[26051.866756]  [<ffffffffa033fbb1>] ? zio_taskq_dispatch+0x91/0xa0 [zfs]
[26051.866796]  [<ffffffffa033fbd2>] ? zio_issue_async+0x12/0x20 [zfs]
[26051.866834]  [<ffffffffa0343569>] ? zio_nowait+0x79/0x110 [zfs]
[26051.866842]  [<ffffffff810f7b81>] ? ktime_get+0x41/0xb0
[26051.866847]  [<ffffffff81605a44>] io_schedule_timeout+0xa4/0x110
[26051.866859]  [<ffffffffa01eccd1>] cv_wait_common+0xb1/0x130 [spl]
[26051.866866]  [<ffffffff810c4200>] ? wake_atomic_t_function+0x60/0x60
[26051.866874]  [<ffffffffa01ecda8>] __cv_wait_io+0x18/0x20 [spl]
[26051.866913]  [<ffffffffa034337c>] zio_wait+0xac/0x130 [zfs]
[26051.866959]  [<ffffffffa02cf408>] dsl_pool_sync+0xb8/0x480 [zfs]
[26051.867009]  [<ffffffffa02e9e1f>] spa_sync+0x37f/0xb30 [zfs]
[26051.867015]  [<ffffffff810aa4a2>] ? default_wake_function+0x12/0x20
[26051.867065]  [<ffffffffa02fa6aa>] txg_sync_thread+0x3ba/0x620 [zfs]
[26051.867070]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[26051.867119]  [<ffffffffa02fa2f0>] ? txg_delay+0x160/0x160 [zfs]
[26051.867128]  [<ffffffffa01e7f22>] thread_generic_wrapper+0x72/0x80 [spl]
[26051.867135]  [<ffffffffa01e7eb0>] ? __thread_exit+0x20/0x20 [spl]
[26051.867140]  [<ffffffff8109e8f9>] kthread+0xd9/0xf0
[26051.867145]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[26051.867149]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[26051.867153]  [<ffffffff8109e820>] ? kthread_park+0x60/0x60
[26051.867157]  [<ffffffff8160a995>] ret_from_fork+0x25/0x30
ioquatix commented 7 years ago

Running

% pacman -Q | egrep "zfs|spl"
spl-linux 0.6.5.9_4.9.6_1-2
spl-utils-linux 0.6.5.9_4.9.6_1-2
zfs-linux 0.6.5.9_4.9.6_1-2
zfs-utils-linux 0.6.5.9_4.9.6_1-2
ioquatix commented 7 years ago

Memory available is okay:

% free -h
              total        used        free      shared  buff/cache   available
Mem:            15G         12G        2.7G        692K        630M        3.0G
Swap:          4.0G          0B        4.0G

iostat seems okay (sde is a Samsung 840 Pro):

% iostat -mx 10
Linux 4.9.6-1-ARCH (hinoki)     06/02/17    _x86_64_    (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.69    0.00    2.26    3.18    0.00   93.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    4.39   13.96     0.19     0.23    47.25     0.06    3.16   11.42    0.57   2.28   4.18
sdb               0.00     0.00    4.39   13.69     0.19     0.22    46.63     0.06    3.13   11.19    0.55   2.26   4.09
sdc               0.00     0.00    4.29   13.58     0.19     0.23    47.97     0.06    3.61   12.91    0.66   2.58   4.62
sdd               0.00     0.00    4.32   13.80     0.18     0.23    46.40     0.06    3.18   11.60    0.55   2.29   4.15
sde               0.00     0.30    1.46   11.25     0.02     0.36    61.82     0.01    0.41    0.78    0.36   0.22   0.28

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          10.43    0.00   47.68   28.80    0.00   13.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   54.20    0.80     3.27     0.01   122.01     0.31    5.70    5.77    0.88   3.10  17.06
sdb               0.00     0.00   54.60    0.80     3.09     0.01   114.60     0.43    7.70    7.80    0.75   3.62  20.07
sdc               0.00     0.00   54.20    1.00     3.15     0.01   117.16     0.39    6.98    7.11    0.00   3.80  21.00
sdd               0.00     0.00   56.30    1.00     3.08     0.01   110.27     0.43    7.57    7.70    0.30   3.87  22.17
sde               0.00     5.60   27.30  159.50     0.08     7.38    81.70     0.06    0.34    0.21    0.36   0.22   4.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.69    0.00   43.65   30.28    0.00   20.38

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   43.30    0.90     2.45     0.01   113.81     0.35    7.86    8.03    0.00   4.38  19.34
sdb               0.00     0.00   40.30    0.90     2.31     0.01   114.99     0.36    8.68    8.87    0.44   4.47  18.40
sdc               0.00     0.00   41.80    0.80     2.39     0.01   115.08     0.34    8.08    8.22    0.38   4.68  19.93
sdd               0.00     0.00   41.20    0.80     2.30     0.01   112.34     0.44   10.39   10.58    0.88   4.56  19.16
sde               0.00     3.10   25.80  136.70     0.07     5.80    74.00     0.05    0.28    0.48    0.25   0.21   3.47

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.42    0.00   38.83   28.72    0.00   29.02

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   26.70    0.90     1.58     0.01   117.86     0.26    9.37    9.63    1.78   5.51  15.20
sdb               0.00     0.00   28.50    0.90     1.53     0.01   107.54     0.25    8.59    8.86    0.00   5.04  14.83
sdc               0.00     0.00   27.40    0.90     1.41     0.01   103.12     0.27    9.34    9.65    0.00   5.49  15.53
sdd               0.00     0.00   28.60    0.90     1.43     0.01   100.18     0.24    8.21    8.47    0.00   4.84  14.27
sde               0.00     0.80   27.30   84.50     0.08     2.67    50.38     0.02    0.21    0.27    0.19   0.19   2.07
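For a ZFS-level view over the same interval (per-vdev throughput and IOPS rather than raw block-device statistics), zpool iostat can be run alongside; a minimal sketch, with the pool name "tank" as a placeholder:

# Per-vdev read/write operations and bandwidth, refreshed every 10 seconds
zpool iostat -v tank 10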
hostingnuggets commented 7 years ago

Just wanted to give a short update to my post in this issue from last year: in the meantime I have upgraded to ZFS 0.6.5.8 from Debian's backports, still on Debian 8. Unfortunately I still get the exact same timeouts in the kernel logs.

gdevenyi commented 7 years ago

@kpande if you're going to assert that, can you please describe what's big enough?

I have roughly 4x the resources of the average system described in this thread, and I also see the same issues, in this case with a combination of zfs send and nfsd load.

ioquatix commented 7 years ago

severely undersized hardware

You may be right.

I have 4x 4TB drives and one 256 GB SSD for OS/cache, 16GB of memory, and a 2-core CPU (1.5 GHz Atom). This system is used almost exclusively for rsync backups. I do feel that this is pretty reasonable for my needs.

The ARC is capped at 12GB, leaving the OS about 4GB. The L2ARC, if enabled, is about 170GB.

I'll have to check how many files it is, but I'm not sure if it's multi-millions or not.
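As a point of reference, the ARC cap and its current size can be read straight from the SPL kstats, and a rough file count can be taken per dataset; a minimal sketch (the /tank/backups path is a placeholder, not taken from this report):

# Current ARC size vs. configured maximum, in bytes
grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats

# Effective zfs_arc_max module parameter (0 means the built-in default is used)
cat /sys/module/zfs/parameters/zfs_arc_max

# Rough file count for one backup dataset (placeholder path)
find /tank/backups -xdev -type f | wc -l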

hostingnuggets commented 7 years ago

@kpande you are right, I forgot to give any relevant info about my underlying hardware. Sorry about that; here is hopefully all the relevant information:

My server is a virtualization server running Xen 4.4 with currently 6 virtual machines, all of which have their logical volumes (LVM) stored on a RAIDZ1 pool of 3x 2TB Seagate SATA enterprise disks (ST32000645NS). The Debian 8 host OS is independent, located on two internal 16 GB SATA SSDs mirrored with Linux MD (RAID1). The CPU is an Intel E5-2620 v3 @ 2.40GHz with 6 cores/12 threads; out of these 6 cores, 4 vCPUs are pinned to the host/hypervisor/dom0 using the dom0_max_vcpus=4 dom0_vcpus_pin kernel options. The server has 64 GB of TruDDR4 ECC memory, of which 6 GB is reserved for the host/hypervisor/dom0 via dom0_mem=6G,max:6G. Finally, out of those 6 GB I have reserved 2 GB for the ARC with the zfs_arc_max=2147483648 module option. I have also disabled ZFS prefetch, if that is of any relevance (zfs_prefetch_disable=1).
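For completeness, module options like the ones described above are normally made persistent in a modprobe configuration fragment, and the dom0 sizing goes on the Xen command line; a minimal sketch for Debian, with the values copied from the description above (not a verified copy of this system's actual config):

# /etc/modprobe.d/zfs.conf -- sketch only
options zfs zfs_arc_max=2147483648 zfs_prefetch_disable=1

# /etc/default/grub -- Xen hypervisor options, then run update-grub:
#   GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=6G,max:6G dom0_max_vcpus=4 dom0_vcpus_pin"
# Run update-initramfs -u as well so the zfs options are picked up if the
# module is loaded from the initramfs.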

Below is the output of an actual ARC summary (the server was rebooted 5 days ago):

ZFS Subsystem Report                Tue Feb 07 15:11:32 2017
ARC Summary: (HEALTHY)
    Memory Throttle Count:          0

ARC Misc:
    Deleted:                3.05m
    Mutex Misses:               7
    Evict Skips:                7

ARC Size:               77.69%  1.55    GiB
    Target Size: (Adaptive)     100.00% 2.00    GiB
    Min Size (Hard Limit):      1.56%   32.00   MiB
    Max Size (High Water):      64:1    2.00    GiB

ARC Size Breakdown:
    Recently Used Cache Size:   93.75%  1.88    GiB
    Frequently Used Cache Size: 6.25%   128.00  MiB

ARC Hash Breakdown:
    Elements Max:               171.83k
    Elements Current:       97.97%  168.34k
    Collisions:             68.60m
    Chain Max:              6
    Chains:                 12.12k

ARC Total accesses:                 102.96m
    Cache Hit Ratio:        48.88%  50.32m
    Cache Miss Ratio:       51.12%  52.63m
    Actual Hit Ratio:       48.88%  50.32m

    Data Demand Efficiency:     0.00%   49.34m

    CACHE HITS BY CACHE LIST:
      Most Recently Used:       63.44%  31.92m
      Most Frequently Used:     36.56%  18.40m
      Most Recently Used Ghost: 0.08%   42.37k
      Most Frequently Used Ghost:   0.00%   0

    CACHE HITS BY DATA TYPE:
      Demand Data:          0.00%   167
      Prefetch Data:        0.00%   0
      Demand Metadata:      100.00% 50.32m
      Prefetch Metadata:        0.00%   0

    CACHE MISSES BY DATA TYPE:
      Demand Data:          93.74%  49.34m
      Prefetch Data:        0.00%   0
      Demand Metadata:      6.26%   3.29m
      Prefetch Metadata:        0.00%   145

ZFS Tunable:
    metaslab_debug_load                               0
    zfs_arc_min_prefetch_lifespan                     0
    zfetch_max_streams                                8
    zfs_nopwrite_enabled                              1
    zfetch_min_sec_reap                               2
    zfs_dbgmsg_enable                                 0
    zfs_dirty_data_max_max_percent                    25
    zfs_arc_p_aggressive_disable                      1
    spa_load_verify_data                              1
    zfs_zevent_cols                                   80
    zfs_dirty_data_max_percent                        10
    zfs_sync_pass_dont_compress                       5
    l2arc_write_max                                   8388608
    zfs_vdev_scrub_max_active                         2
    zfs_vdev_sync_write_min_active                    10
    zvol_prefetch_bytes                               131072
    metaslab_aliquot                                  524288
    zfs_no_scrub_prefetch                             0
    zfs_arc_shrink_shift                              0
    zfetch_block_cap                                  256
    zfs_txg_history                                   0
    zfs_delay_scale                                   500000
    zfs_vdev_async_write_active_min_dirty_percent     30
    metaslab_debug_unload                             0
    zfs_read_history                                  0
    zvol_max_discard_blocks                           16384
    zfs_recover                                       0
    l2arc_headroom                                    2
    zfs_deadman_synctime_ms                           1000000
    zfs_scan_idle                                     50
    zfs_free_min_time_ms                              1000
    zfs_dirty_data_max                                624856268
    zfs_vdev_async_read_min_active                    1
    zfs_mg_noalloc_threshold                          0
    zfs_dedup_prefetch                                0
    zfs_vdev_max_active                               1000
    l2arc_write_boost                                 8388608
    zfs_resilver_min_time_ms                          3000
    zfs_vdev_async_write_max_active                   10
    zil_slog_limit                                    1048576
    zfs_prefetch_disable                              1
    zfs_resilver_delay                                2
    metaslab_lba_weighting_enabled                    1
    zfs_mg_fragmentation_threshold                    85
    l2arc_feed_again                                  1
    zfs_zevent_console                                0
    zfs_immediate_write_sz                            32768
    zfs_dbgmsg_maxsize                                4194304
    zfs_free_leak_on_eio                              0
    zfs_deadman_enabled                               1
    metaslab_bias_enabled                             1
    zfs_arc_p_dampener_disable                        1
    zfs_object_mutex_size                             64
    zfs_metaslab_fragmentation_threshold              70
    zfs_no_scrub_io                                   0
    metaslabs_per_vdev                                200
    zfs_dbuf_state_index                              0
    zfs_vdev_sync_read_min_active                     10
    metaslab_fragmentation_factor_enabled             1
    zvol_inhibit_dev                                  0
    zfs_vdev_async_write_active_max_dirty_percent     60
    zfs_vdev_cache_size                               0
    zfs_vdev_mirror_switch_us                         10000
    zfs_dirty_data_sync                               67108864
    spa_config_path                                   /etc/zfs/zpool.cache
    zfs_dirty_data_max_max                            1562140672
    zfs_arc_lotsfree_percent                          10
    zfs_zevent_len_max                                64
    zfs_scan_min_time_ms                              1000
    zfs_arc_sys_free                                  0
    zfs_arc_meta_strategy                             1
    zfs_vdev_cache_bshift                             16
    zfs_arc_meta_adjust_restarts                      4096
    zfs_max_recordsize                                1048576
    zfs_vdev_scrub_min_active                         1
    zfs_vdev_read_gap_limit                           32768
    zfs_arc_meta_limit                                0
    zfs_vdev_sync_write_max_active                    10
    l2arc_norw                                        0
    zfs_arc_meta_prune                                10000
    metaslab_preload_enabled                          1
    l2arc_nocompress                                  0
    zvol_major                                        230
    zfs_vdev_aggregation_limit                        131072
    zfs_flags                                         0
    spa_asize_inflation                               24
    zfs_admin_snapshot                                0
    l2arc_feed_secs                                   1
    zio_taskq_batch_pct                               75
    zfs_sync_pass_deferred_free                       2
    zfs_disable_dup_eviction                          0
    zfs_arc_grow_retry                                0
    zfs_read_history_hits                             0
    zfs_vdev_async_write_min_active                   1
    zfs_vdev_async_read_max_active                    3
    zfs_scrub_delay                                   4
    zfs_delay_min_dirty_percent                       60
    zfs_free_max_blocks                               100000
    zfs_vdev_cache_max                                16384
    zio_delay_max                                     30000
    zfs_top_maxinflight                               32
    ignore_hole_birth                                 1
    spa_slop_shift                                    5
    zfs_vdev_write_gap_limit                          4096
    spa_load_verify_metadata                          1
    spa_load_verify_maxinflight                       10000
    l2arc_noprefetch                                  1
    zfs_vdev_scheduler                                noop
    zfs_expire_snapshot                               300
    zfs_sync_pass_rewrite                             2
    zil_replay_disable                                0
    zfs_nocacheflush                                  0
    zfs_arc_max                                       2147483648
    zfs_arc_min                                       0
    zfs_read_chunk_size                               1048576
    zfs_txg_timeout                                   5
    zfs_pd_bytes_max                                  52428800
    l2arc_headroom_boost                              200
    zfs_send_corrupt_data                             0
    l2arc_feed_min_ms                                 200
    zfs_arc_meta_min                                  0
    zfs_arc_average_blocksize                         8192
    zfetch_array_rd_sz                                1048576
    zfs_autoimport_disable                            1
    zfs_arc_p_min_shift                               0
    zio_requeue_io_start_cut_in_line                  1
    zfs_vdev_sync_read_max_active                     10
    zfs_mdcomp_disable                                0
    zfs_arc_num_sublists_per_state                    4

Do you need any more information? And what do you think about this setup? Is my hardware undersized?

ioquatix commented 7 years ago

I checked, and the majority of my backups are < 200k files and < 20 GB.

red-scorp commented 6 years ago

I hope it helps:

user@server ~ $ cat /var/log/syslog | grep zfs
Mar 21 04:01:11 server kernel: [204939.534145] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:01:11 server kernel: [204939.534234] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:01:11 server kernel: [204939.534339] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:01:11 server kernel: [204939.534439] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:01:11 server kernel: [204939.534539] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:01:11 server kernel: [204939.534648] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:01:11 server kernel: [204939.534752] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:03:12 server kernel: [205060.371676] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:03:12 server kernel: [205060.371764] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:03:12 server kernel: [205060.371869] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:03:12 server kernel: [205060.371969] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:03:12 server kernel: [205060.372069] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:03:12 server kernel: [205060.372177] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:03:12 server kernel: [205060.372282] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:05:13 server kernel: [205181.209034] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:05:13 server kernel: [205181.209123] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:05:13 server kernel: [205181.209228] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:05:13 server kernel: [205181.209329] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:05:13 server kernel: [205181.209429] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:05:13 server kernel: [205181.209537] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:05:13 server kernel: [205181.209641] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:07:13 server kernel: [205302.045999] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:07:13 server kernel: [205302.046088] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:07:13 server kernel: [205302.046192] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:07:13 server kernel: [205302.046293] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:07:13 server kernel: [205302.046392] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:07:13 server kernel: [205302.046501] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:07:13 server kernel: [205302.046605] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:09:14 server kernel: [205422.882950] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:09:14 server kernel: [205422.883039] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:09:14 server kernel: [205422.883144] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:09:14 server kernel: [205422.883244] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:09:14 server kernel: [205422.883343] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:09:14 server kernel: [205422.883452] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:09:14 server kernel: [205422.883556] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:11:15 server kernel: [205543.719843] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:11:15 server kernel: [205543.719933] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:11:15 server kernel: [205543.720037] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:11:15 server kernel: [205543.720137] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:11:15 server kernel: [205543.720237] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:11:15 server kernel: [205543.720345] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:11:15 server kernel: [205543.720449] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:13:16 server kernel: [205664.556609] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:13:16 server kernel: [205664.556698] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:13:16 server kernel: [205664.556802] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:13:16 server kernel: [205664.556903] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:13:16 server kernel: [205664.557003] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:13:16 server kernel: [205664.557112] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:13:16 server kernel: [205664.557216] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:15:17 server kernel: [205785.393394] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:15:17 server kernel: [205785.393483] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:15:17 server kernel: [205785.393587] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:15:17 server kernel: [205785.393687] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:15:17 server kernel: [205785.393787] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:15:17 server kernel: [205785.393896] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:15:17 server kernel: [205785.394000] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:17:18 server kernel: [205906.229973] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:17:18 server kernel: [205906.230062] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:17:18 server kernel: [205906.230166] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:17:18 server kernel: [205906.230289] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:17:18 server kernel: [205906.230389] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:17:18 server kernel: [205906.230498] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:17:18 server kernel: [205906.230602] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 04:19:18 server kernel: [206027.066643] dmu_tx_wait+0x36f/0x390 [zfs]
Mar 21 04:19:18 server kernel: [206027.066732] dmu_tx_assign+0x83/0x470 [zfs]
Mar 21 04:19:18 server kernel: [206027.066837] zfs_write+0x3f6/0xd80 [zfs]
Mar 21 04:19:18 server kernel: [206027.066937] ? rrw_exit+0x5a/0x150 [zfs]
Mar 21 04:19:18 server kernel: [206027.067037] ? rrm_exit+0x46/0x80 [zfs]
Mar 21 04:19:18 server kernel: [206027.067146] zpl_write_common_iovec+0x8c/0xe0 [zfs]
Mar 21 04:19:18 server kernel: [206027.067250] zpl_iter_write+0xae/0xe0 [zfs]
Mar 21 07:54:33 server systemd[1]: zfs-import-cache.service: Main process exited, code=exited, status=1/FAILURE
Mar 21 07:54:33 server systemd[1]: zfs-import-cache.service: Failed with result 'exit-code'.

user@server ~ $ cat /etc/issue
Ubuntu Bionic Beaver (development branch) \n \l

user@server ~ $ uname -a
Linux server 4.15.0-12-generic #13-Ubuntu SMP Thu Mar 8 06:24:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

user@server ~ $ dpkg -l zfs*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
un  zfs                                      (no description available)
un  zfs-dkms                                 (no description available)
un  zfs-dracut                               (no description available)
un  zfs-fuse                                 (no description available)
un  zfs-initramfs                            (no description available)
un  zfs-modules                              (no description available)
ii  zfs-zed        0.7.5-1ubunt amd64        OpenZFS Event Daemon
un  zfsutils                                 (no description available)
ii  zfsutils-linux 0.7.5-1ubunt amd64        command-line tools to manage Open

user@server ~ $
user@server ~ $ df -h /data
Filesystem      Size  Used Avail Use% Mounted on
data             14T   11T  3.1T  78% /data

This was an mv from one ZFS dataset to another in the same pool (which actually behaves like cp, no faster). The process hangs forever (at least a few hours) and cannot be terminated with a signal. zpool reports no errors. I've now started a zpool scrub; let's see what it reports.
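Since txg_sync is the thread that keeps blocking in these reports, one way to tell whether a transaction group is genuinely wedged (rather than just slow) is to enable txg history and watch the per-pool txg kstat; a minimal sketch, assuming the pool is named "data" as in the df output above:

# Keep a history of the most recent transaction groups (0 disables the kstat)
echo 100 > /sys/module/zfs/parameters/zfs_txg_history

# Per-txg state and timings; a txg stuck in the syncing state for minutes
# matches the hung txg_sync/dmu_tx_wait traces seen in this thread
watch -n 5 cat /proc/spl/kstat/zfs/data/txgs

# Write-throttle settings that determine when writers end up in dmu_tx_wait()
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_txg_timeout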

UPDATE:

user@server ~ $ zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0B in 18h22m with 0 errors on Thu Mar 22 02:32:52 2018
config:

    NAME        STATE     READ WRITE CKSUM
    data        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
        sdm     ONLINE       0     0     0

errors: No known data errors

Let me know if you need any other info. With best regards, Ag