Closed: fantom-x closed this issue 8 years ago
Another one from this morning
Jun 23 07:37:15 server zed: eid=21 class=delay pool=storage
Jun 23 07:37:16 server zed: eid=22 class=delay pool=storage
Jun 23 07:38:34 server kernel: [170513.464556] txg_sync D ffff88040e9dfa88 0 2156 2 0x00000000
Jun 23 07:38:34 server kernel: [170513.464565] Call Trace:
Jun 23 07:38:34 server kernel: [170513.464577] [<ffffffff810abf62>] ? default_wake_function+0x12/0x20
Jun 23 07:38:34 server kernel: [170513.464584] [<ffffffff81821044>] io_schedule_timeout+0xa4/0x110
Jun 23 07:38:34 server kernel: [170513.464598] [<ffffffffc05facc8>] __cv_wait_io+0x18/0x20 [spl]
Jun 23 07:38:34 server kernel: [170513.464657] [<ffffffffc05f54c5>] ? spl_kmem_cache_free+0x135/0x1c0 [spl]
Jun 23 07:38:34 server kernel: [170513.464708] [<ffffffff810abf62>] ? default_wake_function+0x12/0x20
Jun 23 07:38:34 server kernel: [170513.464759] [<ffffffffc05f5e31>] thread_generic_wrapper+0x71/0x80 [spl]
Jun 23 07:38:34 server kernel: [170513.464762] [<ffffffffc05f5dc0>] ? __thread_exit+0x20/0x20 [spl]
Jun 23 07:38:34 server kernel: [170513.464764] [<ffffffff810a06a8>] kthread+0xd8/0xf0
Jun 23 07:38:34 server kernel: [170513.464765] [<ffffffff810a05d0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun 23 07:38:34 server kernel: [170513.464767] [<ffffffff81825f8f>] ret_from_fork+0x3f/0x70
Jun 23 07:38:34 server kernel: [170513.464769] [<ffffffff810a05d0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun 23 07:40:34 server kernel: [170633.459697] Tainted: P OE 4.4.0-24-generic #43-Ubuntu
Jun 23 07:40:34 server kernel: [170633.459699] txg_sync D ffff88040e9dfa88 0 2156 2 0x00000000
Jun 23 07:40:34 server kernel: [170633.459704] ffff88040e9e0000 ffff88042e396d00 7fffffffffffffff ffff8800097d1da8
Jun 23 07:40:34 server kernel: [170633.459707] Call Trace:
Jun 23 07:40:34 server kernel: [170633.459715] [<ffffffff81824c35>] schedule_timeout+0x1b5/0x270
Jun 23 07:40:34 server kernel: [170633.459721] [<ffffffff810c34c2>] ? __wake_up_common+0x52/0x90
Jun 23 07:40:34 server kernel: [170633.459725] [<ffffffff81821044>] io_schedule_timeout+0xa4/0x110
Jun 23 07:40:34 server kernel: [170633.459735] [<ffffffff810c3b80>] ? wake_atomic_t_function+0x60/0x60
Jun 23 07:40:34 server kernel: [170633.459777] [<ffffffffc07a820e>] zio_wait+0x10e/0x1f0 [zfs]
Jun 23 07:40:34 server kernel: [170633.459801] [<ffffffffc05f54c5>] ? spl_kmem_cache_free+0x135/0x1c0 [spl]
Jun 23 07:40:34 server kernel: [170633.459850] [<ffffffffc074d613>] spa_sync+0x3c3/0xb30 [zfs]
Jun 23 07:40:34 server kernel: [170633.459875] [<ffffffffc075ea4a>] txg_sync_thread+0x3ba/0x630 [zfs]
Jun 23 07:40:34 server kernel: [170633.459902] [<ffffffffc05f5e31>] thread_generic_wrapper+0x71/0x80 [spl]
Jun 23 07:40:34 server kernel: [170633.459908] [<ffffffff810a06a8>] kthread+0xd8/0xf0
Jun 23 07:40:34 server kernel: [170633.459911] [<ffffffff81825f8f>] ret_from_fork+0x3f/0x70
Jun 23 07:40:34 server kernel: [170633.459925] INFO: task dconf-service:5445 blocked for more than 120 seconds.
Jun 23 07:40:34 server kernel: [170633.459927] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 07:40:34 server kernel: [170633.459929] ffff8803fdbcbc28 ffff8804167c3270 ffff88041b9b1b80 ffff8804183044c0
Jun 23 07:40:34 server kernel: [170633.459933] 0000000000000001 ffff8803fdbcbc40 ffffffff81821b15 0000000000000000
Jun 23 07:40:34 server kernel: [170633.459936] [<ffffffff81821b15>] schedule+0x35/0x80
Jun 23 07:40:34 server kernel: [170633.459940] [<ffffffff810c3544>] ? __wake_up+0x44/0x50
Jun 23 07:40:34 server kernel: [170633.459970] [<ffffffffc05f68c1>] ? taskq_dispatch_ent+0x51/0x120 [spl]
Jun 23 07:40:34 server kernel: [170633.459997] [<ffffffff810f585c>] ? ktime_get+0x3c/0xb0
Jun 23 07:40:34 server kernel: [170633.460003] [<ffffffffc05fabec>] cv_wait_common+0xbc/0x140 [spl]
Jun 23 07:40:34 server kernel: [170633.460009] [<ffffffffc05facc8>] __cv_wait_io+0x18/0x20 [spl]
Jun 23 07:40:34 server kernel: [170633.460061] [<ffffffffc07a1fb5>] zil_commit.part.11+0x485/0x7e0 [zfs]
Jun 23 07:40:34 server kernel: [170633.460115] [<ffffffffc0797eda>] zfs_fsync+0x7a/0xf0 [zfs]
Jun 23 07:40:34 server kernel: [170633.460145] [<ffffffff81240b3b>] vfs_fsync_range+0x4b/0xb0
Jun 23 07:40:34 server kernel: [170633.460148] [<ffffffff81240e90>] SyS_fsync+0x10/0x20
Jun 23 07:40:34 server kernel: [170633.460207] INFO: task mozStorage #2:9527 blocked for more than 120 seconds.
Jun 23 07:40:34 server kernel: [170633.460210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 23 07:40:34 server kernel: [170633.460214] ffff880040a43d40 ffffea0007eea0f0 ffff88041b9b1b80 ffff880034b3ee00
Jun 23 07:40:34 server kernel: [170633.460217] 0000000000000000 ffff880040a43d58 ffffffff81821b15 ffff8804161ab9c8
Jun 23 07:40:34 server kernel: [170633.460222] [<ffffffff81821b15>] schedule+0x35/0x80
Jun 23 07:40:34 server kernel: [170633.460229] [<ffffffff810c3b80>] ? wake_atomic_t_function+0x60/0x60
Jun 23 07:40:34 server kernel: [170633.460264] [<ffffffffc07a1bb3>] zil_commit.part.11+0x83/0x7e0 [zfs]
Jun 23 07:40:34 server kernel: [170633.460273] [<ffffffffc05fbfd7>] ? tsd_set+0x2a7/0x4e0 [spl]
Jun 23 07:40:34 server kernel: [170633.460303] [<ffffffffc07a2327>] zil_commit+0x17/0x20 [zfs]
Jun 23 07:40:34 server kernel: [170633.460358] [<ffffffffc07adde6>] zpl_fsync+0x66/0x90 [zfs]
Jun 23 07:40:34 server kernel: [170633.460363] [<ffffffff81240bfd>] do_fsync+0x3d/0x70
Jun 23 07:40:34 server kernel: [170633.460366] [<ffffffff81825bf2>] entry_SYSCALL_64_fastpath+0x16/0x71
Jun 23 07:42:41 server zed: eid=23 class=delay pool=storage
Jun 23 07:42:41 server zed: eid=24 class=delay pool=storage
Jun 23 07:42:41 server zed: eid=25 class=delay pool=storage
Any suggestions? I tried the config parameter below and it seems to have eliminated the call traces, but I am still getting "zed" messages. It feels like the call traces are logged as a result of freeing space after some daily snapshots are destroyed.
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
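For reference, a sketch of inspecting that tunable and making a change like this survive a reboot (the /etc/modprobe.d file name is an assumption; any .conf file there works):

```shell
# Inspect the current value of the tunable (runtime only)
cat /sys/module/zfs/parameters/zfs_vdev_async_write_min_active

# Apply the new value to the running system (lost on reboot)
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active

# To persist across reboots, pass it as a module option
# (file name below is just a convention, not required)
echo "options zfs zfs_vdev_async_write_min_active=2" >> /etc/modprobe.d/zfs.conf
```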
Jun 27 07:37:15 server zed: eid=100 class=delay pool=storage
Jun 27 07:37:15 server zed: eid=101 class=delay pool=storage
Jun 27 07:37:15 server zed: eid=102 class=delay pool=storage
Jun 27 07:37:15 server zed: eid=103 class=delay pool=storage
Jun 27 07:37:15 server zed: eid=104 class=delay pool=storage
Jun 27 07:37:15 server zed: eid=105 class=delay pool=storage
Jun 27 07:37:15 server zed: eid=106 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=107 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=108 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=109 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=110 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=111 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=112 class=delay pool=storage
Jun 27 07:37:16 server zed: eid=113 class=delay pool=storage
Jun 27 07:41:22 server zed: eid=114 class=delay pool=storage
Jun 27 07:41:22 server zed: eid=115 class=delay pool=storage
Jun 27 07:41:22 server zed: eid=116 class=delay pool=storage
Jun 27 07:41:22 server zed: eid=117 class=delay pool=storage
Jun 27 07:41:22 server zed: eid=118 class=delay pool=storage
Jun 27 07:41:22 server zed: eid=119 class=delay pool=storage
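To summarize bursts like the above, the delay events can be counted per minute straight from the log (a sketch using awk; the field positions and the /var/log/syslog path are assumptions based on the excerpts shown here):

```shell
# Count zed "delay" events per minute.
# $1/$2 are month/day, $3 is HH:MM:SS in the syslog format above.
grep 'zed: .*class=delay' /var/log/syslog |
  awk '{ split($3, t, ":"); key = $1 " " $2 " " t[1] ":" t[2]; count[key]++ }
       END { for (k in count) print k, count[k] }' |
  sort
```

This makes it easy to see whether the delays cluster right after the snapshot jobs run.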
@fantom-x delay events are usually the result of hardware which is slow to respond for some reason. If you're running the latest master code, I'd suggest running zpool iostat -w; this will generate a latency histogram which should provide more insight.
$ zpool iostat -w
dozer total_wait disk_wait sync_queue async_queue
latency read write read write read write read write scrub
------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
1ns 0 0 0 0 0 0 0 0 0
3ns 0 0 0 0 0 0 0 0 0
7ns 0 0 0 0 0 0 0 0 0
15ns 0 0 0 0 0 0 0 0 0
31ns 0 0 0 0 0 0 0 0 0
63ns 0 0 0 0 0 0 0 0 0
127ns 0 0 0 0 0 0 0 0 0
255ns 0 0 0 0 0 0 0 0 0
511ns 0 0 0 0 0 0 0 0 0
1us 0 0 0 0 25 7 0 9 1
2us 0 0 0 0 24 18 0 34 3
4us 0 0 0 0 30 12 0 5 0
8us 0 0 0 0 0 1 0 0 0
16us 0 0 0 0 0 0 0 0 0
32us 0 0 0 0 0 0 0 0 0
65us 0 0 0 0 0 0 0 0 0
131us 2 0 9 0 6 0 0 0 2
262us 34 4 46 57 7 0 0 4 5
524us 64 45 50 73 7 0 0 9 3
1ms 9 47 5 29 0 0 0 19 0
2ms 2 46 1 8 0 0 0 35 0
4ms 1 27 1 2 0 0 0 18 0
8ms 0 2 0 2 0 0 0 0 0
16ms 0 0 0 0 0 0 0 0 0
33ms 1 0 1 0 0 0 0 0 0
67ms 0 0 0 0 0 0 0 0 0
134ms 0 0 0 0 0 0 0 0 0
268ms 0 0 0 0 0 0 0 0 0
536ms 0 0 0 0 0 0 0 0 0
1s 0 0 0 0 0 0 0 0 0
2s 0 0 0 0 0 0 0 0 0
4s 0 0 0 0 0 0 0 0 0
8s 0 0 0 0 0 0 0 0 0
17s 0 0 0 0 0 0 0 0 0
34s 0 0 0 0 0 0 0 0 0
68s 0 0 0 0 0 0 0 0 0
137s 0 0 0 0 0 0 0 0 0
----------------------------------------------------------------------------
Man page.
-w
Display latency histograms:
total_wait:
Total IO time (queuing + disk IO time).
disk_wait:
Disk IO time (time reading/writing the
disk).
syncq_wait:
Amount of time IO spent in synchronous
priority queues. Does not include disk
time.
asyncq_wait:
Amount of time IO spent in asynchronous
priority queues. Does not include disk
time.
scrub:
Amount of time IO spent in scrub queue.
Does not include disk time.
All histogram buckets are power-of-two sized. The time
labels are the end ranges of the buckets, so for example, a
15ns bucket stores latencies from 8-15ns. The last bucket is
also a catch-all for latencies higher than the maximum.
@behlendorf, thx for the suggestion. Unfortunately the "-w" flag is not available in my version. I will try it when Ubuntu 16.04 updates its zfs version...
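In the meantime, per-device latency can be approximated with the plain iostat from the sysstat package (not ZFS-aware, but its await columns will show whether the underlying disks themselves are slow; a sketch, assuming sysstat is installed):

```shell
# Extended device statistics every 5 seconds.
# The r_await / w_await columns (ms) show average time per I/O
# including queueing; sustained high values point at slow disks.
iostat -x 5
```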
Got a new kernel version yesterday and these messages below are what I got this time. I did notice that the default scheduler for rotational disks changed from cfq to deadline.
...
Jun 28 07:40:24 server kernel: [44038.643907] [<ffffffffc087f257>] ? zio_wait+0x157/0x1f0 [zfs]
Jun 28 07:40:24 server kernel: [44038.643953] [<ffffffffc0835a4a>] txg_sync_thread+0x3ba/0x630 [zfs]
Jun 28 07:40:24 server kernel: [44038.643990] [<ffffffffc062ee31>] thread_generic_wrapper+0x71/0x80 [spl]
Jun 28 07:40:24 server kernel: [44038.643997] [<ffffffff810a0730>] ? kthread_create_on_node+0x1e0/0x1e0
What kernel is that?
I am also getting hangs, but in my case the whole system hangs; that might simply be because I'm using a ZFS root (only /boot is on ext4). I also don't get any useful information in my system logs, again possibly because they can no longer be written. Note that these could be due to unexpected delays; the pool is on a hybrid HDD (SSHD) on the internal SATA3 bus...
I am running Ubuntu 16.04. The root is ext4 on an SSD and a ZFS mirror is on two rotational disks. These freezes only happen within 10-20 minutes after daily snapshots are created/destroyed. After the freezes I see additional space freed, and while the system is in the frozen state the drives make a lot of noise: the heads are moving violently.
4.4.0-28-generic #47-Ubuntu SMP Fri Jun 24 10:09:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
The parameter change below, along with the new kernel (which brought a different default scheduler), seems to be helping in my case, but I need to monitor for longer.
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
From http://docs.basho.com/riak/kv/2.1.4/using/performance/:
Scheduler recommendations:
- The noop scheduler when deploying on iSCSI over HBAs, or any hardware-based RAID.
- The deadline scheduler when using SSD-based storage.
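For comparison with those recommendations, a sketch of checking and switching the scheduler on one of the pool's disks (sda is a placeholder for the actual device; the runtime change is lost on reboot):

```shell
# List available schedulers; the active one is shown in brackets
cat /sys/block/sda/queue/scheduler

# Switch the running system to deadline (runtime only)
echo deadline > /sys/block/sda/queue/scheduler
```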
On Thursday June 30 2016 04:50:57 fantom-x wrote:
This below parameter change along with the new kernel (which brought a different default scheduler) seem to be helping in my case, but need to monitor for longer.
echo 2 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
If I understand the meaning of that parameter correctly it's a bit surprising, theoretically speaking, that it reduces your issue. Forcing the system to wait until there are more async write requests before taking action on them could be expected to increase latency.
Have you considered trying the blk-mq scheduler? (Requires a scsi_mod.use_blk_mq=1 kernel boot parameter)
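On Ubuntu that boot parameter would be set through GRUB, roughly like this (a sketch; the grub edit is done by hand):

```shell
# Append scsi_mod.use_blk_mq=1 to the kernel command line
# in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash scsi_mod.use_blk_mq=1"
# then regenerate the grub config and reboot:
sudo update-grub
sudo reboot
```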
R.
The issue is no longer present; some recent updates must have taken care of it...
Getting these in my syslog daily now, a few minutes after daily snapshots are taken. They last for more than 10 minutes, during which the drives are very noisy and all IO seems to be blocked.