Closed: ltz3317 closed this issue 2 years ago.
Same situation for me too.
Same issue. No reproduction steps identified, seems "random".
It most commonly appears during long-running, disk-intensive tasks (in my case triggered by the BURP backup software), and I'm getting the same blocked stack traces at generic_write_sync with a kworker pegged at 100%.
Here is the output from 'echo l > /proc/sysrq-trigger':
[5490997.296870] sending NMI to all CPUs:
[5490997.297924] NMI backtrace for cpu 0
[5490997.297926] CPU: 0 PID: 27539 Comm: kworker/u8:1 Tainted: P OE ------------ 3.10.0-693.el7.x86_64 #1
[5490997.297928] Hardware name: Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R, BIOS F4 06/29/2007
[5490997.297929] Workqueue: writeback bdi_writeback_workfn (flush-zfs-1)
[5490997.297931] task: ffff8800c7db3f40 ti: ffff880120aac000 task.ti: ffff880120aac000
[5490997.297932] RIP: 0010:[ ] [ ] read_tsc+0xa/0x20
[5490997.297933] RSP: 0018:ffff880120aaf7c8 EFLAGS: 00000246
[5490997.297935] RAX: ffff8800a5cdf0d8 RBX: ffffffff81a0e3c0 RCX: 0000000000000000
[5490997.297936] RDX: 0000000000000058 RSI: ffff8800a5cdf080 RDI: ffffffff81a0e3c0
[5490997.297937] RBP: ffff880120aaf7c8 R08: 000000000053c8f3 R09: 0000000000000000
[5490997.297939] R10: 00000000006db6e0 R11: ffffea0001935000 R12: ffff880120aaf828
[5490997.297940] R13: 000000008aaf12f6 R14: ffff8800a5cdf500 R15: ffff880127d6f3e8
[5490997.297942] FS: 0000000000000000(0000) GS:ffff88012fc00000(0000) knlGS:0000000000000000
[5490997.297943] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[5490997.297944] CR2: 0000000000dde6d0 CR3: 000000005a22f000 CR4: 00000000000007f0
[5490997.297946] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5490997.297947] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[5490997.297948] Stack:
[5490997.297949]  ffff880120aaf7f0 ffffffff810ea6ae 0000000000000005 ffff8800a5cdf080
[5490997.297951]  ffff880127d6f3b8 ffff880120aaf868 ffffffffc0b73a7c ffff8800c7db3f40
[5490997.297952]  ffff8800c7db3f40 000000000000b048 000000000000b049 ffff880120aaf828
[5490997.297954] Call Trace:
[5490997.297955]  [ ] getrawmonotonic64+0x2e/0xc0
[5490997.297956]  [ ] dmu_zfetch+0x39c/0x4c0 [zfs]
[5490997.297957]  [ ] dbuf_read+0x8e2/0x9e0 [zfs]
[5490997.297959]  [ ] ? spl_kmem_free+0x35/0x40 [spl]
[5490997.297960]  [ ] ? dbuf_hold_impl+0xb9/0xd0 [zfs]
[5490997.297961]  [ ] dnode_hold_impl+0xc6/0xc30 [zfs]
[5490997.297963]  [ ] ? avl_add+0x4a/0x80 [zavl]
[5490997.297964]  [ ] ? mutex_lock+0x12/0x2f
[5490997.297965]  [ ] dnode_hold+0x1b/0x20 [zfs]
[5490997.297967]  [ ] dmu_bonus_hold+0x32/0x1d0 [zfs]
[5490997.297968]  [ ] sa_buf_hold+0xe/0x10 [zfs]
[5490997.297969]  [ ] zfs_zget+0x123/0x250 [zfs]
[5490997.297971]  [ ] zfs_get_data+0x57/0x2d0 [zfs]
[5490997.297972]  [ ] zil_commit.part.12+0x41c/0x830 [zfs]
[5490997.297973]  [ ] zil_commit+0x17/0x20 [zfs]
[5490997.297975]  [ ] zpl_writepages+0xd6/0x170 [zfs]
[5490997.297976]  [ ] do_writepages+0x1e/0x40
[5490997.297977]  [ ] __writeback_single_inode+0x40/0x220
[5490997.297979]  [ ] writeback_sb_inodes+0x1c4/0x490
[5490997.297980]  [ ] wb_writeback+0xff/0x2f0
[5490997.297981]  [ ] ? set_worker_desc+0x86/0xb0
[5490997.297983]  [ ] bdi_writeback_workfn+0x115/0x460
[5490997.297984]  [ ] process_one_work+0x17a/0x440
[5490997.297985]  [ ] worker_thread+0x126/0x3c0
[5490997.297987]  [ ] ? manage_workers.isra.24+0x2a0/0x2a0
[5490997.297988]  [ ] kthread+0xcf/0xe0
[5490997.297989]  [ ] ? insert_kthread_work+0x40/0x40
[5490997.297991]  [ ] ret_from_fork+0x58/0x90
[5490997.297992]  [ ] ? insert_kthread_work+0x40/0x40
[5490997.297994] Code: 00 55 8b 05 39 cc ad 00 48 89 e5 5d c3 0f 1f 40 00 55 b8 ed ff ff ff 48 89 e5 5d c3 0f 1f 44 00 00 55 48 89 e5 66 66 90 0f ae e8 <0f> 31 89 c0 48 c1 e2 20 48 09 c2 48 89 d0 5d c3 66 0f 1f 44 00
And another one shortly after:
[5490939.586576] SysRq : Show backtrace of all active CPUs
[5490939.587701] sending NMI to all CPUs:
[5490939.588755] NMI backtrace for cpu 0
[5490939.588757] CPU: 0 PID: 27539 Comm: kworker/u8:1 Tainted: P OE ------------ 3.10.0-693.el7.x86_64 #1
[5490939.588759] Hardware name: Gigabyte Technology Co., Ltd. G33-DS3R/G33-DS3R, BIOS F4 06/29/2007
[5490939.588760] Workqueue: writeback bdi_writeback_workfn (flush-zfs-1)
[5490939.588762] task: ffff8800c7db3f40 ti: ffff880120aac000 task.ti: ffff880120aac000
[5490939.588763] RIP: 0010:[ ] [ ] getrawmonotonic64+0x7a/0xc0
[5490939.588764] RSP: 0018:ffff880120aaf7d8 EFLAGS: 00000207
[5490939.588766] RAX: 00000000006db6e0 RBX: ffffffff81a0e3c0 RCX: 0000000000000018
[5490939.588767] RDX: 0000000021f8aba3 RSI: 00000000000cc2e8 RDI: 00000000001dc6c8
[5490939.588768] RBP: ffff880120aaf7f0 R08: 000000000053c8b9 R09: 0000000000000000
[5490939.588770] R10: 00000000006db6e0 R11: ffffea0001935000 R12: ffff880120aaf828
[5490939.588771] R13: 000000008aad501e R14: ffff8800a5cdf500 R15: ffff880127d6f3e8
[5490939.588772] FS: 0000000000000000(0000) GS:ffff88012fc00000(0000) knlGS:0000000000000000
[5490939.588774] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[5490939.588775] CR2: 00007f4379aae8e0 CR3: 00000000b9a26000 CR4: 00000000000007f0
[5490939.588777] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5490939.588778] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[5490939.588779] Stack:
[5490939.588781]  0000000000000007 0000000000000000 ffff880127d6f3b8 ffff880120aaf868
[5490939.588782]  ffffffffc0b73a7c ffff8800c7db3f40 ffff8800c7db3f40 000000000000b048
[5490939.588784]  000000000000b049 ffff880120aaf828 000000000053c8b9 0000000022056e52
[5490939.588785] Call Trace:
[5490939.588786]  [ ] dmu_zfetch+0x39c/0x4c0 [zfs]
[5490939.588787]  [ ] dbuf_read+0x8e2/0x9e0 [zfs]
[5490939.588789]  [ ] ? spl_kmem_free+0x35/0x40 [spl]
[5490939.588790]  [ ] ? dbuf_hold_impl+0xb9/0xd0 [zfs]
[5490939.588791]  [ ] dnode_hold_impl+0xc6/0xc30 [zfs]
[5490939.588793]  [ ] ? avl_add+0x4a/0x80 [zavl]
[5490939.588794]  [ ] ? mutex_lock+0x12/0x2f
[5490939.588795]  [ ] dnode_hold+0x1b/0x20 [zfs]
[5490939.588797]  [ ] dmu_bonus_hold+0x32/0x1d0 [zfs]
[5490939.588798]  [ ] sa_buf_hold+0xe/0x10 [zfs]
[5490939.588799]  [ ] zfs_zget+0x123/0x250 [zfs]
[5490939.588801]  [ ] zfs_get_data+0x57/0x2d0 [zfs]
[5490939.588802]  [ ] zil_commit.part.12+0x41c/0x830 [zfs]
[5490939.588803]  [ ] zil_commit+0x17/0x20 [zfs]
[5490939.588804]  [ ] zpl_writepages+0xd6/0x170 [zfs]
[5490939.588806]  [ ] do_writepages+0x1e/0x40
[5490939.588807]  [ ] __writeback_single_inode+0x40/0x220
[5490939.588809]  [ ] writeback_sb_inodes+0x1c4/0x490
[5490939.588810]  [ ] wb_writeback+0xff/0x2f0
[5490939.588811]  [ ] ? set_worker_desc+0x86/0xb0
[5490939.588813]  [ ] bdi_writeback_workfn+0x115/0x460
[5490939.588814]  [ ] process_one_work+0x17a/0x440
[5490939.588815]  [ ] worker_thread+0x126/0x3c0
[5490939.588817]  [ ] ? manage_workers.isra.24+0x2a0/0x2a0
[5490939.588818]  [ ] kthread+0xcf/0xe0
[5490939.588819]  [ ] ? insert_kthread_work+0x40/0x40
[5490939.588820]  [ ] ret_from_fork+0x58/0x90
[5490939.588822]  [ ] ? insert_kthread_work+0x40/0x40
[5490939.588824] Code: da 00 48 8b 15 58 b1 da 00 44 3b 2d b9 b0 da 00 75 b0 48 29 f0 31 f6 48 21 c7 44 89 d0 48 0f 49 f7 48 0f af f0 4c 01 ce 48 d3 fe <48> 8d 04 16 48 3d ff c9 9a 3b 76 2e 31 d2 0f 1f 84 00 00 00 00
Same problem here (HP Data Protector backup software; vbda process). While a full backup runs on a snapshot, zfs receive does not make progress. Attached is the output of perf record -a and perf sched. I'm using the deadline scheduler on the block devices to try to work around this; I had the same problem with the noop scheduler. ZFS version: 0.6.5.6-0ubuntu16. Linux kernel: 4.4.0-101-generic #124-Ubuntu (Ubuntu 16.04.3 LTS). perf_record-a2.data.gz perf_shed_record2.data.gz
iostat -xd 20
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdd 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdc 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00
sdf 0,00 0,00 4,25 15,30 406,50 1061,58 150,19 0,03 1,73 1,08 1,91 0,40 0,78
sde 0,00 0,00 4,35 16,60 419,43 752,77 111,90 0,03 1,30 0,83 1,42 0,29 0,60
sdg 0,00 0,00 319,35 0,00 9188,00 0,00 57,54 1,01 3,16 3,16 0,00 1,71 54,56
sdh 0,00 0,00 423,45 0,00 12698,40 0,00 59,98 0,93 2,20 2,20 0,00 1,29 54,44
sdj 0,00 0,00 371,45 0,00 10857,80 0,00 58,46 1,22 3,28 3,28 0,00 1,69 62,82
sdi 0,00 0,00 515,95 0,00 15066,20 0,00 58,40 1,08 2,10 2,10 0,00 1,16 59,78
sdk 0,00 0,00 295,80 0,00 8717,00 0,00 58,94 0,60 2,05 2,05 0,00 1,39 41,16
sdl 0,00 0,00 338,10 0,00 9420,80 0,00 55,73 0,70 2,06 2,06 0,00 1,35 45,60
sdq 0,00 0,00 226,60 0,00 6327,40 0,00 55,85 0,90 3,98 3,98 0,00 2,39 54,10
sdo 0,00 0,00 482,35 0,00 14504,60 0,00 60,14 1,36 2,81 2,81 0,00 1,30 62,94
sdn 0,00 0,00 430,60 0,00 12728,60 0,00 59,12 2,08 4,82 4,82 0,00 1,85 79,58
sdm 0,00 0,00 414,70 0,00 12191,80 0,00 58,80 1,00 2,42 2,42 0,00 1,28 53,06
sdr 0,00 0,00 256,05 0,00 7346,40 0,00 57,38 0,52 2,04 2,04 0,00 1,48 37,96
sdp 0,00 0,00 438,55 0,00 12945,60 0,00 59,04 0,99 2,25 2,25 0,00 1,26 55,08
zpool status
pool: pool_z2_samba
state: ONLINE
scan: scrub canceled on Fri Dec 1 09:06:55 2017
config:
NAME STATE READ WRITE CKSUM
pool_z2_samba ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
sdj ONLINE 0 0 0
sdk ONLINE 0 0 0
sdl ONLINE 0 0 0
raidz2-1 ONLINE 0 0 0
sdm ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
sdq ONLINE 0 0 0
sdr ONLINE 0 0 0
logs
mirror-2 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
cache
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
errors: No known data errors
The problem occurs very often. I'm attaching further data, from echo 'w' > /proc/sysrq-trigger and perf record -ag: echo_w_sysrq-trigger.gz perf.data.gz
Hi all,
I experienced this problem as well (multiple times). Then I found this thread, tried switching from the noop scheduler to the deadline scheduler, and still experienced the problem.
I am running a 4.16.8 Gentoo kernel, with musl libc, and using fairly recent ZFS from master branch (zfs commit 670d74b9cee4dd185a620510bffd4797a2e1066a , spl commit 1149b62d20b7ed9d8ae25d5da7a06213d79b7602).
This tends to happen after doing a lot of compilation. I merged about 1500 packages with the deadline scheduler before this hit. The noop scheduler would manage about 500-1500 packages before hanging. My data set is probably too small to do any comparison, and I didn't bother writing down specifics (the workflow can be complicated to begin with...).
Here's my dmesg just after the hang: https://pastebin.com/MayYqGaV
Note that the system has 56 threads to play with and would spend some fraction of its time with a load average around 60; perhaps that is plenty of opportunity for resource contention and deadlocks?
HTH.
I am also experiencing this issue. I don't have as many helpful logs or diagnostics, but I can also say I am similarly doing a lot of heavy compilation and small IO. I'm also using it as a Docker storage driver, but the lockups will happen with or without Docker. I'm on an older 0.6.5.6-0ubuntu18. A reboot temporarily fixes the issue.
It happened another time (4.4.0-116-generic #140-Ubuntu, ZFS 0.6.5.6-0ubuntu16) while running a full backup: zfsBlock.log.gz
Hit a similar issue; it happened on Gentoo during emerge --emptytree -j4 @world,
which recompiles all packages with lots of parallel access and file locking,
with kworker at 100% CPU usage.
kernel 4.14.39
ZFS 0.7.8
SPL 0.7.8
NMI backtrace https://pastebin.com/rYAhR1ri
As a troubleshooting step, I recommend pulling the drives and doing some read/write sector tests to make sure there are no controller or bad-sector issues. I've hit this bug on a system where the drive seemed fine, apart from the odd ZFS hang, until the drive became completely unresponsive a few months later and failed.
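Concretely, something like the following is what I mean, as a minimal sketch of non-destructive checks (/dev/sdX is a placeholder for each pool disk):

```sh
# Long SMART self-test plus a read-only surface scan; neither touches data.
smartctl -t long /dev/sdX    # start the drive's own extended self-test
smartctl -a /dev/sdX         # later: check the self-test log and attributes
badblocks -sv /dev/sdX       # read-only sector scan (no -w, so nothing is written)
```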
Just chiming in to say 'me too'. Definitely seems to be a bug, disks are healthy.
Another time, sync and other processes locked up during a zfs receive and full backup (4.4.0-135-generic #161-Ubuntu). perf.data.gz kernelStateD.log
Have this issue too.
Not very big IO. If I use Docker and compile something inside a Docker container, the system hangs within ~12 hours (X11 freezes, the mouse cursor doesn't move). The HDD LED blinks a few times immediately afterwards and then not at all.
A hardware reset fixes this, but sometimes a subsequent scrub finds errors, and three times I have lost my pool (the import fails with various kernel stack traces).
Also, maybe it is important: I have root on this ZFS pool.
I tried switching my hardware to an AMD desktop and to an old Intel server without any change, so I think this is a software issue.
I am on Debian:
Linux localhost 4.14.0-3-amd64 #1 SMP Debian 4.14.17-1 (2018-02-14) x86_64 GNU/Linux
zfs-dkms 0.7.9-3
Another time, sync and other processes locked up during a zfs receive and full backup. Attached: output of echo w > /proc/sysrq-trigger and echo l > /proc/sysrq-trigger, plus kern.log
| Type | Version/Name |
| --- | --- |
| Distribution Name | Ubuntu |
| Distribution Version | 16.04.5 LTS |
| Linux Kernel | 4.4.0-141-generic |
| Architecture | x86_64 |
| ZFS Version | 0.6.5.6-0ubuntu26 |
| SPL Version | 0.6.5.6-0ubuntu4 |
It happened another time. Please help me!
@bud4 Currently most developers already have their hands full,
so if you want them to take on this problem first, you have to provide a test script that reproduces it reliably,
so that they can easily reproduce it on their end in an automated test environment.
It would also be preferable to take a spare machine, upgrade it to 0.7.12, and confirm you can still reproduce the issue there.
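For what it's worth, such a script would probably need to mimic the pattern people report here: many parallel writers doing small synchronous writes. Something along these lines might be a starting point (purely a sketch; the dataset path and worker count are made up, and there is no guarantee it triggers the bug):

```sh
#!/bin/sh
# Hypothetical stress workload: parallel small writes plus fsync on a ZFS
# dataset, roughly imitating the parallel-compile / backup patterns above.
DIR=/tank/stress-test   # assumption: a scratch ZFS dataset
WORKERS=16
mkdir -p "$DIR"
for i in $(seq "$WORKERS"); do
  (
    while true; do
      # conv=fsync forces an fsync per file, which goes through
      # zil_commit just like the stack traces in this issue
      dd if=/dev/urandom of="$DIR/worker-$i" bs=8k count=128 conv=fsync 2>/dev/null
    done
  ) &
done
wait
```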
@AndCycle This problem is a real pain in the ass. If one of us could have provided a reproduction case, this issue would have been closed a long time ago, I believe.
This issue may not appear for 2 months and then begin to appear twice per day.
Perhaps we need to turn on some additional logging and read it over RS-232? If so, please tell us how, in simple words: we are not all developers here and do not know how to hunt this bug.
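For the RS-232 idea: capturing the kernel log over a serial console is fairly standard and doesn't require being a developer. A minimal sketch, assuming a Debian/Ubuntu-style GRUB setup and that ttyS0 is the port wired to a second machine (on other distros the bootloader step differs):

```sh
# /etc/default/grub on the affected box: mirror kernel messages to the
# serial port and keep SysRq usable even when userspace is wedged.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 sysrq_always_enabled=1"

# regenerate the bootloader config and reboot
sudo update-grub

# on the second machine attached to the serial (or USB-serial) cable:
screen /dev/ttyUSB0 115200
```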
@AndCycle It is very difficult for me to create a script that reproduces the event, not least because I only have a master and a slave, both in production. The bug often occurs on the slave while the full backup is running and a zfs receive operation is in progress at the same time. The HP Data Protector agent reads the data from the mounted snapshots of the various filesystems, with at most 6 simultaneous reading processes. The bug also occurs on the master while there are about 800 smbd connections (many read operations and few write operations, but some write operations are very intensive). In the zpool there are about 150 filesystems with dedup and compression active. The total data is about 60 TB in raidz2 on two vdevs of 6 disks each. Please at least suggest what information to collect the next time the problem occurs. What would be useful? Do you have any scripts I can run?
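In the meantime, something like the following, run as root right after the next hang, would at least capture the kind of data others have attached in this thread (a rough sketch, not an official tool; adjust paths to your setup):

```sh
#!/bin/sh
# Post-hang collection helper: dump the usual suspects into one directory.
OUT=/var/tmp/zfs-hang-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
echo w > /proc/sysrq-trigger                    # stacks of all blocked tasks -> kernel log
echo l > /proc/sysrq-trigger                    # backtrace of all active CPUs -> kernel log
dmesg                            > "$OUT/dmesg.txt"
zpool status -v                  > "$OUT/zpool-status.txt"
zpool events -v                  > "$OUT/zpool-events.txt"
cat /proc/spl/kstat/zfs/arcstats > "$OUT/arcstats.txt"
ps -eo pid,stat,wchan:32,comm    > "$OUT/ps-wchan.txt"
iostat -xd 5 3                   > "$OUT/iostat.txt"
echo "collected in $OUT"
```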
hopeless kernel.log.gz
I also have this issue, and I am running zfs on root. I also run several dockers.
When the issue occurs, I can always trace it back to a period of high I/O. The type of disk activity doesn't seem to matter because sometimes it will happen during a routine scrub, before which the disks had been fairly idle for hours. Not all periods of high I/O cause the problem though.
Seemingly random problems like this, to me, point to a hardware issue (likely disk or storage controller), but badblocks and SMART tests have been coming up clean, so I'm not sure what to think at this point. I'll keep an eye on this and try to help out with logs or test data where I can.
Same issue for us on multiple servers with high frequency create/destroy/clone.
Using Debian 9.7 and ZOL 0.7.12
Here are some logs:
execve("/usr/sbin/update-initramfs", ["update-initramfs", "-u"], [/* 16 vars */]) = 0
brk(NULL) = 0x56112fbcd000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f17c42f7000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=65446, ...}) = 0
mmap(NULL, 65446, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f17c42e7000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\4\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1689360, ...}) = 0
mmap(NULL, 3795296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f17c3d38000
mprotect(0x7f17c3ecd000, 2097152, PROT_NONE) = 0
mmap(0x7f17c40cd000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7f17c40cd000
mmap(0x7f17c40d3000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f17c40d3000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f17c42e5000
arch_prctl(ARCH_SET_FS, 0x7f17c42e5700) = 0
mprotect(0x7f17c40cd000, 16384, PROT_READ) = 0
mprotect(0x56112f1dd000, 8192, PROT_READ) = 0
mprotect(0x7f17c42fa000, 4096, PROT_READ) = 0
munmap(0x7f17c42e7000, 65446) = 0
getpid() = 8030
rt_sigaction(SIGCHLD, {sa_handler=0x56112efd3ef0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
geteuid() = 0
brk(NULL) = 0x56112fbcd000
brk(0x56112fbee000) = 0x56112fbee000
getppid() = 8028
stat("/root", {st_mode=S_IFDIR|0700, st_size=19, ...}) = 0
stat(".", {st_mode=S_IFDIR|0700, st_size=19, ...}) = 0
open("/usr/sbin/update-initramfs", O_RDONLY) = 3
fcntl(3, F_DUPFD, 10) = 10
close(3) = 0
fcntl(10, F_SETFD, FD_CLOEXEC) = 0
rt_sigaction(SIGINT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGINT, {sa_handler=0x56112efd3ef0, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGTERM, {sa_handler=SIG_DFL, sa_mask=~[RTMIN RT_1], sa_flags=SA_RESTORER, sa_restorer=0x7f17c3d6b060}, NULL, 8) = 0
read(10, "#!/bin/sh\n\nSTATEDIR=/var/lib/ini"..., 8192) = 8192
faccessat(AT_FDCWD, "/etc/initramfs-tools/update-initramfs.conf", R_OK) = 0
open("/etc/initramfs-tools/update-initramfs.conf", O_RDONLY) = 3
fcntl(3, F_DUPFD, 10) = 11
close(3) = 0
fcntl(11, F_SETFD, FD_CLOEXEC) = 0
read(11, "#\n# Configuration file for updat"..., 8192) = 378
read(11, "", 8192) = 0
close(11) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8031
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 8031
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8031, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 8031
read(10, ";\n\tu)\n\t\tupdate\n\t\t;;\nesac\n", 8192) = 25
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8032
close(4) = 0
read(3, "4.15.18-10-pve\n4.15.18-9-pve\n4.1"..., 128) = 128
read(3, ".13.16-3-pve\n4.13.16-2-pve\n4.13."..., 128) = 128
read(3, "3.8-3-pve\n4.13.8-2-pve\n4.13.4-1-"..., 128) = 36
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8032, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 36
read(3, "", 128) = 0
close(3) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8032
faccessat(AT_FDCWD, "/proc/mounts", R_OK) = 0
stat("/usr/local/sbin/ischroot", 0x7ffe69edd4e0) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/ischroot", 0x7ffe69edd4e0) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/ischroot", 0x7ffe69edd4e0) = -1 ENOENT (No such file or directory)
stat("/usr/bin/ischroot", {st_mode=S_IFREG|0755, st_size=10552, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8035
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 8035
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8035, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 8035
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8036
close(4) = 0
read(3, "", 128) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8036, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 0
close(3) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8036
stat("/boot/initrd.img-4.15.18-10-pve", {st_mode=S_IFREG|0644, st_size=36184842, ...}) = 0
stat("/usr/local/sbin/sha1sum", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/sha1sum", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/sha1sum", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/bin/sha1sum", {st_mode=S_IFREG|0755, st_size=47976, ...}) = 0
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8037
close(4) = 0
stat("/usr/local/sbin/diff", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/diff", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/diff", 0x7ffe69edd340) = -1 ENOENT (No such file or directory)
stat("/usr/bin/diff", {st_mode=S_IFREG|0755, st_size=146824, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8038
close(3) = 0
close(-1) = -1 EBADF (Bad file descriptor)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8037
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8037, si_uid=0, si_status=0, si_utime=9, si_stime=1} ---
rt_sigreturn({mask=[]}) = 8037
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8038
faccessat(AT_FDCWD, "/boot/initrd.img-4.15.18-10-pve", R_OK) = 0
faccessat(AT_FDCWD, "/boot/initrd.img-4.15.18-10-pve.dpkg-bak", R_OK) = 0
stat("/usr/local/sbin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/usr/bin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/sbin/rm", 0x7ffe69edd520) = -1 ENOENT (No such file or directory)
stat("/bin/rm", {st_mode=S_IFREG|0755, st_size=64424, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8039
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8039
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8039, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 8039
stat("/usr/local/sbin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/bin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/sbin/ln", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/bin/ln", {st_mode=S_IFREG|0755, st_size=56240, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8040
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8040
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8040, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 8040
write(1, "update-initramfs: Generating /bo"..., 61update-initramfs: Generating /boot/initrd.img-4.15.18-10-pve
) = 61
stat("/usr/local/sbin/mkinitramfs", 0x7ffe69edd580) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/mkinitramfs", 0x7ffe69edd580) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/mkinitramfs", {st_mode=S_IFREG|0755, st_size=10255, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 8041
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 8041
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8041, si_uid=0, si_status=0, si_utime=1, si_stime=3} ---
rt_sigreturn({mask=[]}) = 8041
stat("/usr/local/sbin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/usr/bin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/sbin/mv", 0x7ffe69edd540) = -1 ENOENT (No such file or directory)
stat("/bin/mv", {st_mode=S_IFREG|0755, st_size=126416, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 14543
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 14543
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14543, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]}) = 14543
open("/var/lib/initramfs-tools/4.15.18-10-pve", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_DUPFD, 10) = 11
close(1) = 0
fcntl(11, F_SETFD, FD_CLOEXEC) = 0
dup2(3, 1) = 1
close(3) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 14544
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 14544
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=14544, si_uid=0, si_status=0, si_utime=8, si_stime=1} ---
rt_sigreturn({mask=[]}) = 14544
dup2(11, 1) = 1
close(11) = 0
stat("/usr/local/sbin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/usr/local/bin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/usr/sbin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/usr/bin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/sbin/sync", 0x7ffe69edd560) = -1 ENOENT (No such file or directory)
stat("/bin/sync", {st_mode=S_IFREG|0755, st_size=31496, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f17c42e59d0) = 14545
wait4(-1,
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=74244120k,nr_inodes=18561030,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=14852988k,mode=755)
rpool/ROOT/pve-1 on / type zfs (rw,relatime,xattr,noacl)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=36,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=4157)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
sunrpc on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
rpool on /rpool type zfs (rw,noatime,xattr,noacl)
rpool/ROOT on /rpool/ROOT type zfs (rw,noatime,xattr,noacl)
rpool/data on /rpool/data type zfs (rw,noatime,xattr,noacl)
rpool/data/subvol-102-disk-0 on /rpool/data/subvol-102-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-110-disk-0 on /rpool/data/subvol-110-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-111-disk-1 on /rpool/data/subvol-111-disk-1 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-120-disk-0 on /rpool/data/subvol-120-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-134-disk-1 on /rpool/data/subvol-134-disk-1 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-137-disk-1 on /rpool/data/subvol-137-disk-1 type zfs (rw,noatime,xattr,posixacl)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
dev-nas-1:/mnt/Pool-NAS/Backups on /mnt/dev-nas-1 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.104,mountvers=3,mountport=881,mountproto=udp,local_lock=none,addr=192.168.1.104)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
dev-nas-1:/mnt/Pool-NAS/Proxmox on /mnt/pve/dev-nas-1 type nfs (rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.104,mountvers=3,mountport=881,mountproto=udp,local_lock=none,addr=192.168.1.104)
rpool/data/subvol-104-disk-0 on /rpool/data/subvol-104-disk-0 type zfs (rw,noatime,xattr,posixacl)
rpool/data/subvol-101-disk-0 on /rpool/data/subvol-101-disk-0 type zfs (rw,noatime,xattr,posixacl)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=14852984k,mode=700)
rpool/data/subvol-109-disk-0 on /rpool/data/subvol-109-disk-0 type zfs (rw,noatime,xattr,posixacl)
/dev/fuse on /etc/pve type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other)
tmpfs on /run/user/1006 type tmpfs (rw,nosuid,nodev,relatime,size=14852984k,mode=700,uid=1006,gid=1006)
tmpfs on /run/user/1001 type tmpfs (rw,nosuid,nodev,relatime,size=14852984k,mode=700,uid=1001,gid=1001)
[Mon Feb 4 11:08:25 2019] INFO: task kworker/u49:1:30643 blocked for more than 120 seconds.
[Mon Feb 4 11:08:25 2019] Tainted: P W IO 4.15.18-8-pve #1
[Mon Feb 4 11:08:25 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Feb 4 11:08:25 2019] kworker/u49:1 D 0 30643 2 0x80000000
[Mon Feb 4 11:08:25 2019] Workqueue: writeback wb_workfn (flush-zfs-9)
[Mon Feb 4 11:08:25 2019] Call Trace:
[Mon Feb 4 11:08:25 2019] __schedule+0x3e0/0x870
[Mon Feb 4 11:08:25 2019] schedule+0x36/0x80
[Mon Feb 4 11:08:25 2019] io_schedule+0x16/0x40
[Mon Feb 4 11:08:25 2019] __lock_page+0xff/0x140
[Mon Feb 4 11:08:25 2019] ? page_cache_tree_insert+0xe0/0xe0
[Mon Feb 4 11:08:25 2019] write_cache_pages+0x2e9/0x4b0
[Mon Feb 4 11:08:25 2019] ? zpl_write_common_iovec+0xe0/0xe0 [zfs]
[Mon Feb 4 11:08:25 2019] ? _cond_resched+0x1a/0x50
[Mon Feb 4 11:08:25 2019] ? mutex_lock+0x12/0x40
[Mon Feb 4 11:08:25 2019] ? rrw_exit+0x5a/0x150 [zfs]
[Mon Feb 4 11:08:25 2019] zpl_writepages+0x91/0x170 [zfs]
[Mon Feb 4 11:08:25 2019] do_writepages+0x1f/0x70
[Mon Feb 4 11:08:25 2019] __writeback_single_inode+0x45/0x330
[Mon Feb 4 11:08:25 2019] writeback_sb_inodes+0x266/0x590
[Mon Feb 4 11:08:25 2019] __writeback_inodes_wb+0x92/0xc0
[Mon Feb 4 11:08:25 2019] wb_writeback+0x288/0x320
[Mon Feb 4 11:08:25 2019] wb_workfn+0x1a3/0x440
[Mon Feb 4 11:08:25 2019] ? wb_workfn+0x1a3/0x440
[Mon Feb 4 11:08:25 2019] process_one_work+0x1e0/0x400
[Mon Feb 4 11:08:25 2019] worker_thread+0x4b/0x420
[Mon Feb 4 11:08:25 2019] kthread+0x105/0x140
[Mon Feb 4 11:08:25 2019] ? process_one_work+0x400/0x400
[Mon Feb 4 11:08:25 2019] ? kthread_create_worker_on_cpu+0x70/0x70
[Mon Feb 4 11:08:25 2019] ? do_syscall_64+0x73/0x130
[Mon Feb 4 11:08:25 2019] ? SyS_exit_group+0x14/0x20
[Mon Feb 4 11:08:25 2019] ret_from_fork+0x35/0x40
NAME PROPERTY VALUE SOURCE
rpool/ROOT/pve-1 type filesystem -
rpool/ROOT/pve-1 creation Fri Oct 27 11:46 2017 -
rpool/ROOT/pve-1 used 7.23G -
rpool/ROOT/pve-1 available 1.08T -
rpool/ROOT/pve-1 referenced 7.23G -
rpool/ROOT/pve-1 compressratio 1.60x -
rpool/ROOT/pve-1 mounted yes -
rpool/ROOT/pve-1 quota none default
rpool/ROOT/pve-1 reservation none default
rpool/ROOT/pve-1 recordsize 128K default
rpool/ROOT/pve-1 mountpoint / local
rpool/ROOT/pve-1 sharenfs off default
rpool/ROOT/pve-1 checksum on default
rpool/ROOT/pve-1 compression on inherited from rpool
rpool/ROOT/pve-1 atime off inherited from rpool
rpool/ROOT/pve-1 devices on default
rpool/ROOT/pve-1 exec on default
rpool/ROOT/pve-1 setuid on default
rpool/ROOT/pve-1 readonly off default
rpool/ROOT/pve-1 zoned off default
rpool/ROOT/pve-1 snapdir hidden default
rpool/ROOT/pve-1 aclinherit restricted default
rpool/ROOT/pve-1 createtxg 10 -
rpool/ROOT/pve-1 canmount on default
rpool/ROOT/pve-1 xattr on default
rpool/ROOT/pve-1 copies 1 default
rpool/ROOT/pve-1 version 5 -
rpool/ROOT/pve-1 utf8only off -
rpool/ROOT/pve-1 normalization none -
rpool/ROOT/pve-1 casesensitivity sensitive -
rpool/ROOT/pve-1 vscan off default
rpool/ROOT/pve-1 nbmand off default
rpool/ROOT/pve-1 sharesmb off default
rpool/ROOT/pve-1 refquota none default
rpool/ROOT/pve-1 refreservation none default
rpool/ROOT/pve-1 guid 15527408840965499781 -
rpool/ROOT/pve-1 primarycache all default
rpool/ROOT/pve-1 secondarycache all default
rpool/ROOT/pve-1 usedbysnapshots 0B -
rpool/ROOT/pve-1 usedbydataset 7.23G -
rpool/ROOT/pve-1 usedbychildren 0B -
rpool/ROOT/pve-1 usedbyrefreservation 0B -
rpool/ROOT/pve-1 logbias latency default
rpool/ROOT/pve-1 dedup off default
rpool/ROOT/pve-1 mlslabel none default
rpool/ROOT/pve-1 sync standard inherited from rpool
rpool/ROOT/pve-1 dnodesize legacy default
rpool/ROOT/pve-1 refcompressratio 1.60x -
rpool/ROOT/pve-1 written 7.23G -
rpool/ROOT/pve-1 logicalused 11.3G -
rpool/ROOT/pve-1 logicalreferenced 11.3G -
rpool/ROOT/pve-1 volmode default default
rpool/ROOT/pve-1 filesystem_limit none default
rpool/ROOT/pve-1 snapshot_limit none default
rpool/ROOT/pve-1 filesystem_count none default
rpool/ROOT/pve-1 snapshot_count none default
rpool/ROOT/pve-1 snapdev hidden default
rpool/ROOT/pve-1 acltype off default
rpool/ROOT/pve-1 context none default
rpool/ROOT/pve-1 fscontext none default
rpool/ROOT/pve-1 defcontext none default
rpool/ROOT/pve-1 rootcontext none default
rpool/ROOT/pve-1 relatime on temporary
rpool/ROOT/pve-1 redundant_metadata all default
rpool/ROOT/pve-1 overlay off default
Let me know if you need anything else
Happens to me too, on my Manjaro 4.19 desktop system with ZoL 0.7.12-11. Really annoying: about once every few hours. Since my home directories (not root) are on the pool, a lot starts hanging. Most annoying of all, I can't even reboot cleanly, because the shutdown never manages to unmount everything, so the reboot hangs as well.
[87611.236429] INFO: task txg_sync:1006 blocked for more than 120 seconds.
[87611.236434] Tainted: P OE 4.19.23-1-MANJARO #1
[87611.236435] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87611.236437] txg_sync D 0 1006 2 0x80000080
[87611.236440] Call Trace:
[87611.236448] ? __schedule+0x29b/0x8b0
[87611.236482] ? zio_taskq_dispatch+0x6f/0x90 [zfs]
[87611.236484] schedule+0x32/0x90
[87611.236487] io_schedule+0x12/0x40
[87611.236493] cv_wait_common+0xaa/0x130 [spl]
[87611.236497] ? wait_woken+0x80/0x80
[87611.236524] zio_wait+0x113/0x1b0 [zfs]
[87611.236548] dsl_pool_sync+0xb9/0x400 [zfs]
[87611.236576] spa_sync+0x48b/0xd30 [zfs]
[87611.236604] txg_sync_thread+0x2c6/0x4a0 [zfs]
[87611.236607] ? _raw_spin_unlock_irq+0x1d/0x30
[87611.236633] ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[87611.236636] ? __thread_exit+0x20/0x20 [spl]
[87611.236639] thread_generic_wrapper+0x6f/0x80 [spl]
[87611.236642] kthread+0x112/0x130
[87611.236644] ? kthread_park+0x80/0x80
[87611.236646] ret_from_fork+0x22/0x40
[87611.236779] INFO: task TaskSchedulerFo:9759 blocked for more than 120 seconds.
[87611.236781] Tainted: P OE 4.19.23-1-MANJARO #1
[87611.236782] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87611.236783] TaskSchedulerFo D 0 9759 9346 0x00000080
[87611.236785] Call Trace:
[87611.236787] ? __schedule+0x29b/0x8b0
[87611.236789] schedule+0x32/0x90
[87611.236790] io_schedule+0x12/0x40
[87611.236794] cv_wait_common+0xaa/0x130 [spl]
[87611.236796] ? wait_woken+0x80/0x80
[87611.236823] zio_wait+0x113/0x1b0 [zfs]
[87611.236849] zil_commit.part.8+0x85c/0x940 [zfs]
[87611.236875] zfs_fsync+0x70/0xd0 [zfs]
[87611.236900] zpl_fsync+0x63/0x90 [zfs]
[87611.236903] ? __fget+0x6e/0xa0
[87611.236906] do_fsync+0x38/0x70
[87611.236908] __x64_sys_fdatasync+0x13/0x20
[87611.236910] do_syscall_64+0x65/0x180
[87611.236912] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87611.236914] RIP: 0033:0x7f28924bf73f
[87611.236919] Code: Bad RIP value.
[87611.236920] RSP: 002b:00007f28810473e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87611.236922] RAX: ffffffffffffffda RBX: 0000000000000073 RCX: 00007f28924bf73f
[87611.236923] RDX: 0000000000000000 RSI: 000037ea5e9df880 RDI: 0000000000000073
[87611.236924] RBP: 00007f2881047430 R08: 0000000000000000 R09: 00000000f1b86835
[87611.236925] R10: 00000000f1b86835 R11: 0000000000000293 R12: 00007f28924bf700
[87611.236925] R13: 00007f2896b3a3f0 R14: 0000000000000000 R15: 000037ea5e9df8a0
[87734.114347] INFO: task txg_sync:1006 blocked for more than 120 seconds.
[87734.114351] Tainted: P OE 4.19.23-1-MANJARO #1
[87734.114352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114353] txg_sync D 0 1006 2 0x80000080
[87734.114355] Call Trace:
[87734.114363] ? __schedule+0x29b/0x8b0
[87734.114392] ? zio_taskq_dispatch+0x6f/0x90 [zfs]
[87734.114394] schedule+0x32/0x90
[87734.114396] io_schedule+0x12/0x40
[87734.114402] cv_wait_common+0xaa/0x130 [spl]
[87734.114404] ? wait_woken+0x80/0x80
[87734.114426] zio_wait+0x113/0x1b0 [zfs]
[87734.114446] dsl_pool_sync+0xb9/0x400 [zfs]
[87734.114468] spa_sync+0x48b/0xd30 [zfs]
[87734.114491] txg_sync_thread+0x2c6/0x4a0 [zfs]
[87734.114493] ? _raw_spin_unlock_irq+0x1d/0x30
[87734.114514] ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[87734.114517] ? __thread_exit+0x20/0x20 [spl]
[87734.114519] thread_generic_wrapper+0x6f/0x80 [spl]
[87734.114522] kthread+0x112/0x130
[87734.114523] ? kthread_park+0x80/0x80
[87734.114525] ret_from_fork+0x22/0x40
[87734.114652] INFO: task TaskSchedulerFo:9759 blocked for more than 120 seconds.
[87734.114653] Tainted: P OE 4.19.23-1-MANJARO #1
[87734.114654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114655] TaskSchedulerFo D 0 9759 9346 0x00000080
[87734.114656] Call Trace:
[87734.114658] ? __schedule+0x29b/0x8b0
[87734.114660] schedule+0x32/0x90
[87734.114661] io_schedule+0x12/0x40
[87734.114664] cv_wait_common+0xaa/0x130 [spl]
[87734.114666] ? wait_woken+0x80/0x80
[87734.114687] zio_wait+0x113/0x1b0 [zfs]
[87734.114708] zil_commit.part.8+0x85c/0x940 [zfs]
[87734.114729] zfs_fsync+0x70/0xd0 [zfs]
[87734.114750] zpl_fsync+0x63/0x90 [zfs]
[87734.114753] ? __fget+0x6e/0xa0
[87734.114755] do_fsync+0x38/0x70
[87734.114756] __x64_sys_fdatasync+0x13/0x20
[87734.114758] do_syscall_64+0x65/0x180
[87734.114760] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87734.114762] RIP: 0033:0x7f28924bf73f
[87734.114766] Code: Bad RIP value.
[87734.114767] RSP: 002b:00007f28810473e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87734.114769] RAX: ffffffffffffffda RBX: 0000000000000073 RCX: 00007f28924bf73f
[87734.114769] RDX: 0000000000000000 RSI: 000037ea5e9df880 RDI: 0000000000000073
[87734.114770] RBP: 00007f2881047430 R08: 0000000000000000 R09: 00000000f1b86835
[87734.114771] R10: 00000000f1b86835 R11: 0000000000000293 R12: 00007f28924bf700
[87734.114771] R13: 00007f2896b3a3f0 R14: 0000000000000000 R15: 000037ea5e9df8a0
[87734.114773] INFO: task TaskSchedulerFo:24195 blocked for more than 120 seconds.
[87734.114774] Tainted: P OE 4.19.23-1-MANJARO #1
[87734.114775] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114776] TaskSchedulerFo D 0 24195 9346 0x00000080
[87734.114777] Call Trace:
[87734.114779] ? __schedule+0x29b/0x8b0
[87734.114781] ? preempt_count_add+0x79/0xb0
[87734.114783] schedule+0x32/0x90
[87734.114786] cv_wait_common+0xf6/0x130 [spl]
[87734.114787] ? wait_woken+0x80/0x80
[87734.114807] zil_commit.part.8+0x80/0x940 [zfs]
[87734.114809] ? _raw_spin_lock+0x13/0x40
[87734.114809] ? _raw_spin_unlock+0x16/0x30
[87734.114812] ? tsd_set+0x203/0x4f0 [spl]
[87734.114813] ? preempt_count_add+0x79/0xb0
[87734.114833] zfs_fsync+0x70/0xd0 [zfs]
[87734.114853] zpl_fsync+0x63/0x90 [zfs]
[87734.114854] ? __fget+0x6e/0xa0
[87734.114855] do_fsync+0x38/0x70
[87734.114857] __x64_sys_fdatasync+0x13/0x20
[87734.114858] do_syscall_64+0x65/0x180
[87734.114859] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87734.114860] RIP: 0033:0x7f28924bf73f
[87734.114861] Code: Bad RIP value.
[87734.114862] RSP: 002b:00007f287f043cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87734.114863] RAX: ffffffffffffffda RBX: 000000000000005d RCX: 00007f28924bf73f
[87734.114864] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000005d
[87734.114864] RBP: 00007f287f043d40 R08: 0000000000000000 R09: 0000000000000080
[87734.114865] R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000000
[87734.114865] R13: d763a120f905d5d9 R14: 0000000000000000 R15: 0000000000001000
[87734.114878] INFO: task CIPCServer::Thr:13852 blocked for more than 120 seconds.
[87734.114879] Tainted: P OE 4.19.23-1-MANJARO #1
[87734.114880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87734.114881] CIPCServer::Thr D 0 13852 13418 0x20020080
[87734.114882] Call Trace:
[87734.114884] ? __schedule+0x29b/0x8b0
[87734.114885] ? __switch_to_asm+0x40/0x70
[87734.114886] ? preempt_count_add+0x79/0xb0
[87734.114887] schedule+0x32/0x90
[87734.114890] cv_wait_common+0xf6/0x130 [spl]
[87734.114892] ? wait_woken+0x80/0x80
[87734.114911] zil_commit.part.8+0x80/0x940 [zfs]
[87734.114913] ? _raw_spin_lock+0x13/0x40
[87734.114913] ? _raw_spin_unlock+0x16/0x30
[87734.114916] ? tsd_set+0x203/0x4f0 [spl]
[87734.114918] ? preempt_count_add+0x79/0xb0
[87734.114937] zfs_fsync+0x70/0xd0 [zfs]
[87734.114957] zpl_fsync+0x63/0x90 [zfs]
[87734.114958] ? __fget+0x6e/0xa0
[87734.114959] do_fsync+0x38/0x70
[87734.114960] __ia32_sys_fdatasync+0x13/0x20
[87734.114962] do_fast_syscall_32+0xa7/0x2a0
[87734.114963] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[87856.992274] INFO: task txg_sync:1006 blocked for more than 120 seconds.
[87856.992279] Tainted: P OE 4.19.23-1-MANJARO #1
[87856.992281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.992284] txg_sync D 0 1006 2 0x80000080
[87856.992287] Call Trace:
[87856.992297] ? __schedule+0x29b/0x8b0
[87856.992340] ? zio_taskq_dispatch+0x6f/0x90 [zfs]
[87856.992343] schedule+0x32/0x90
[87856.992347] io_schedule+0x12/0x40
[87856.992354] cv_wait_common+0xaa/0x130 [spl]
[87856.992358] ? wait_woken+0x80/0x80
[87856.992396] zio_wait+0x113/0x1b0 [zfs]
[87856.992431] dsl_pool_sync+0xb9/0x400 [zfs]
[87856.992473] spa_sync+0x48b/0xd30 [zfs]
[87856.992513] txg_sync_thread+0x2c6/0x4a0 [zfs]
[87856.992516] ? _raw_spin_unlock_irq+0x1d/0x30
[87856.992553] ? txg_thread_exit.isra.4+0x60/0x60 [zfs]
[87856.992558] ? __thread_exit+0x20/0x20 [spl]
[87856.992562] thread_generic_wrapper+0x6f/0x80 [spl]
[87856.992565] kthread+0x112/0x130
[87856.992567] ? kthread_park+0x80/0x80
[87856.992570] ret_from_fork+0x22/0x40
[87856.992714] INFO: task TaskSchedulerFo:9759 blocked for more than 120 seconds.
[87856.992717] Tainted: P OE 4.19.23-1-MANJARO #1
[87856.992718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.992720] TaskSchedulerFo D 0 9759 9346 0x00000080
[87856.992722] Call Trace:
[87856.992725] ? __schedule+0x29b/0x8b0
[87856.992728] schedule+0x32/0x90
[87856.992730] io_schedule+0x12/0x40
[87856.992735] cv_wait_common+0xaa/0x130 [spl]
[87856.992738] ? wait_woken+0x80/0x80
[87856.992775] zio_wait+0x113/0x1b0 [zfs]
[87856.992811] zil_commit.part.8+0x85c/0x940 [zfs]
[87856.992848] zfs_fsync+0x70/0xd0 [zfs]
[87856.992884] zpl_fsync+0x63/0x90 [zfs]
[87856.992888] ? __fget+0x6e/0xa0
[87856.992891] do_fsync+0x38/0x70
[87856.992893] __x64_sys_fdatasync+0x13/0x20
[87856.992896] do_syscall_64+0x65/0x180
[87856.992899] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87856.992901] RIP: 0033:0x7f28924bf73f
[87856.992907] Code: Bad RIP value.
[87856.992908] RSP: 002b:00007f28810473e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87856.992910] RAX: ffffffffffffffda RBX: 0000000000000073 RCX: 00007f28924bf73f
[87856.992911] RDX: 0000000000000000 RSI: 000037ea5e9df880 RDI: 0000000000000073
[87856.992912] RBP: 00007f2881047430 R08: 0000000000000000 R09: 00000000f1b86835
[87856.992913] R10: 00000000f1b86835 R11: 0000000000000293 R12: 00007f28924bf700
[87856.992914] R13: 00007f2896b3a3f0 R14: 0000000000000000 R15: 000037ea5e9df8a0
[87856.992918] INFO: task TaskSchedulerFo:24195 blocked for more than 120 seconds.
[87856.992920] Tainted: P OE 4.19.23-1-MANJARO #1
[87856.992921] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.992922] TaskSchedulerFo D 0 24195 9346 0x00000080
[87856.992924] Call Trace:
[87856.992927] ? __schedule+0x29b/0x8b0
[87856.992930] ? preempt_count_add+0x79/0xb0
[87856.992933] schedule+0x32/0x90
[87856.992938] cv_wait_common+0xf6/0x130 [spl]
[87856.992940] ? wait_woken+0x80/0x80
[87856.992975] zil_commit.part.8+0x80/0x940 [zfs]
[87856.992978] ? _raw_spin_lock+0x13/0x40
[87856.992979] ? _raw_spin_unlock+0x16/0x30
[87856.992984] ? tsd_set+0x203/0x4f0 [spl]
[87856.992986] ? preempt_count_add+0x79/0xb0
[87856.993021] zfs_fsync+0x70/0xd0 [zfs]
[87856.993056] zpl_fsync+0x63/0x90 [zfs]
[87856.993058] ? __fget+0x6e/0xa0
[87856.993060] do_fsync+0x38/0x70
[87856.993062] __x64_sys_fdatasync+0x13/0x20
[87856.993064] do_syscall_64+0x65/0x180
[87856.993067] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[87856.993068] RIP: 0033:0x7f28924bf73f
[87856.993071] Code: Bad RIP value.
[87856.993071] RSP: 002b:00007f287f043cf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[87856.993073] RAX: ffffffffffffffda RBX: 000000000000005d RCX: 00007f28924bf73f
[87856.993074] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000000000005d
[87856.993075] RBP: 00007f287f043d40 R08: 0000000000000000 R09: 0000000000000080
[87856.993076] R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000000
[87856.993077] R13: d763a120f905d5d9 R14: 0000000000000000 R15: 0000000000001000
[87856.993092] INFO: task CIPCServer::Thr:13852 blocked for more than 120 seconds.
[87856.993094] Tainted: P OE 4.19.23-1-MANJARO #1
[87856.993096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87856.993097] CIPCServer::Thr D 0 13852 13418 0x20020084
[87856.993099] Call Trace:
[87856.993102] ? __schedule+0x29b/0x8b0
[87856.993104] ? __switch_to_asm+0x40/0x70
[87856.993106] ? preempt_count_add+0x79/0xb0
[87856.993109] schedule+0x32/0x90
[87856.993113] cv_wait_common+0xf6/0x130 [spl]
[87856.993116] ? wait_woken+0x80/0x80
[87856.993150] zil_commit.part.8+0x80/0x940 [zfs]
[87856.993153] ? _raw_spin_lock+0x13/0x40
[87856.993154] ? _raw_spin_unlock+0x16/0x30
[87856.993159] ? tsd_set+0x203/0x4f0 [spl]
[87856.993161] ? preempt_count_add+0x79/0xb0
[87856.993196] zfs_fsync+0x70/0xd0 [zfs]
[87856.993230] zpl_fsync+0x63/0x90 [zfs]
[87856.993232] ? __fget+0x6e/0xa0
[87856.993234] do_fsync+0x38/0x70
[87856.993236] __ia32_sys_fdatasync+0x13/0x20
[87856.993238] do_fast_syscall_32+0xa7/0x2a0
[87856.993241] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
Also, the process is still stuck in D even after waiting far longer:
$ ps aux | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
root 1006 0.0 0.0 0 0 ? D Feb23 0:19 [txg_sync]
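In case it is useful to anyone chasing the same thing: for a task stuck in D like that, its in-kernel stack can usually be read directly, assuming the kernel exposes /proc/<pid>/stack (most distro kernels do):

```sh
# txg_sync showed up as PID 1006 in the ps output above; use your own PID
sudo cat /proc/1006/stack
```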
My 32 GB of memory is still only 40% full.
S.M.A.R.T. data for all disks is fine, and the unmount seems to be stuck on random disks each time.
My pool is also fine:
pool: tank
state: ONLINE
scan: resilvered 11,8T in 13h51m with 0 errors on Wed Jan 2 02:00:28 2019
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdx_crypt10 ONLINE 0 0 0
sdx_crypt11 ONLINE 0 0 0
sdx_crypt7 ONLINE 0 0 0
sdx_crypt6 ONLINE 0 0 0
cache
nvme-Samsung_SSD_960_EVO_500GB_S3EUNX0J215828E ONLINE 0 0 0
errors: No known data errors
My pool consists of four 10 TB LUKS-encrypted disks, plus one M.2 NVMe drive as cache.
Here is my iostat -mx output:
iostat -mx
Linux 4.19.23-1-MANJARO (I-KNOW-YOU.torgato.de) 25.02.2019 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
5,86 0,00 1,90 4,02 0,00 88,22
Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme1n1 0,66 3,77 0,02 0,10 0,01 2,92 0,91 43,67 0,33 1,19 0,61 24,97 26,17 136,18 60,30
nvme0n1 0,03 0,99 0,00 0,08 0,00 0,00 0,00 0,00 0,44 2,01 0,60 36,25 81,47 592,53 60,21
sda 0,22 4,78 0,01 0,10 0,00 0,03 0,00 0,71 4,66 2,96 0,00 47,71 21,38 0,43 0,22
sdb 0,22 4,80 0,01 0,10 0,00 0,03 0,00 0,72 4,58 3,01 0,00 47,46 21,32 0,43 0,22
sdc 0,22 4,78 0,01 0,10 0,00 0,03 0,00 0,72 4,63 2,96 0,00 47,53 21,37 0,43 0,22
sdd 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 1,69 0,00 0,00 19,63 0,00 1,65 0,00
sde 0,22 4,80 0,01 0,10 0,00 0,03 0,00 0,71 4,53 2,84 0,00 47,66 21,26 0,43 0,22
dm-0 0,51 4,54 0,01 0,10 0,00 0,00 0,00 0,00 23,68 0,73 0,02 20,92 22,54 2,59 1,31
dm-1 0,51 4,55 0,01 0,10 0,00 0,00 0,00 0,00 22,65 0,72 0,01 20,97 22,43 2,48 1,26
dm-2 0,51 4,55 0,01 0,10 0,00 0,00 0,00 0,00 24,22 0,73 0,02 20,96 22,48 2,65 1,34
dm-3 0,51 4,53 0,01 0,10 0,00 0,00 0,00 0,00 23,64 0,73 0,28 21,04 22,56 20,12 10,14
Shutdown log:
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: Stopped Cryptography Setup for sdx_crypt10.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt10 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:12:39 I-KNOW-YOU.torgato.de kernel: audit: type=1131 audit(1550880759.194:135): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt10 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd-cryptsetup[28103]: device-mapper: remove ioctl on sdx_crypt6 failed: Device or resource busy
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd-cryptsetup[28103]: Device sdx_crypt6 is still in use.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd-cryptsetup[28103]: Failed to deactivate: Device or resource busy
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: systemd-cryptsetup@sdx_crypt6.service: Control process exited, code=exited, status=1/FAILURE
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: systemd-cryptsetup@sdx_crypt6.service: Failed with result 'exit-code'.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de systemd[1]: Stopped Cryptography Setup for sdx_crypt6.
Feb 23 01:12:39 I-KNOW-YOU.torgato.de audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt6 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:12:39 I-KNOW-YOU.torgato.de kernel: audit: type=1131 audit(1550880759.217:136): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sdx_crypt6 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28167]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28168]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28170]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28171]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28174]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28175]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28177]: umount: /tank/Torge: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de umount[28178]: umount: /tmp: target is busy.
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:06 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28181]: umount: /tmp: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28182]: umount: /tank/Torge: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28184]: umount: /tank/Torge: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28185]: umount: /tmp: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Condition check resulted in Generate shutdown-ramfs being skipped.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting Temporary Directory (/tmp)...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Unmounting /tank/Torge...
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28187]: umount: /tmp: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting Temporary Directory (/tmp).
Feb 23 01:13:07 I-KNOW-YOU.torgato.de umount[28188]: umount: /tank/Torge: target is busy.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: tank-Torge.mount: Mount process exited, code=exited, status=32/n/a
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Failed unmounting /tank/Torge.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Received SIGINT.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Forcibly rebooting: Ctrl-Alt-Del was pressed more than 7 times within 2s
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd[1]: Shutting down.
Feb 23 01:13:07 I-KNOW-YOU.torgato.de kernel: systemd-shutdow: 49 output lines suppressed due to ratelimiting
Feb 23 01:13:07 I-KNOW-YOU.torgato.de systemd-shutdown[1]: Syncing filesystems and block devices.
HERE IT HANGS
SMART data for sdb / sdx_crypt6:
[I-KNOW-YOU log]# smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.23-1-MANJARO] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST10000VN0004-1ZD101
Serial Number: ZA20SV9S
LU WWN Device Id: 5 000c50 09396796e
Firmware Version: SC60
User Capacity: 10.000.831.348.736 bytes [10,0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Feb 25 00:35:55 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 584) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 881) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 083 064 044 Pre-fail Always - 190984434
3 Spin_Up_Time 0x0003 088 086 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 099 099 020 Old_age Always - 1689
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1383480407
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 10784 (194 247 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 099 099 020 Old_age Always - 1028
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 096 000 Old_age Always - 2851901801115
189 High_Fly_Writes 0x003a 085 085 000 Old_age Always - 15
190 Airflow_Temperature_Cel 0x0022 067 045 040 Old_age Always - 33 (Min/Max 18/39)
191 G-Sense_Error_Rate 0x0032 099 099 000 Old_age Always - 3678
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 76
193 Load_Cycle_Count 0x0032 082 082 000 Old_age Always - 37441
194 Temperature_Celsius 0x0022 033 055 000 Old_age Always - 33 (0 18 0 0 0)
195 Hardware_ECC_Recovered 0x001a 011 001 000 Old_age Always - 190984434
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 5122 (129 181 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 56499736146
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 254319596354
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Please, this is so damn annoying. Is there any data I can collect to help get to the bottom of this? Why is the kernel not even able to reboot?
Jonathon from the Manjaro Team had me check the I/O scheduler.
I did, and it turned out that while ZFS did set the I/O scheduler of the dm cryptsetup devices to none (I'm using LUKS-encrypted disks), it did not manage to change the scheduler of the underlying physical sdX disks to none as well. They were still set to mq-deadline.
I have now changed all of them to none as well, and so far the problem has not reappeared. It is far from long enough to be sure, but a notable amount of time has already passed and my hopes are up.
So if you have this issue and you use LUKS devices, check that the I/O scheduler is set to none on both layers, and change it if it is not.
Update: it can also be called "noop" on your system.
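A minimal sketch of checking and changing both layers (sda and dm-0 are placeholder device names here, substitute your own disks and dm-crypt mappings):
cat /sys/block/sda/queue/scheduler
cat /sys/block/dm-0/queue/scheduler
echo none > /sys/block/sda/queue/scheduler
echo none > /sys/block/dm-0/queue/scheduler
Note that these echo writes take effect immediately but do not survive a reboot.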
(I'm using luks encrypted disks)
Hmmm, me too! Is there anybody here who does NOT use LUKS but still sees this issue?
@CySlider do you use a ZFS pool on LUKS, or a LUKS vdev on ZFS?
check that the io scheduler is set to none
The scheduler for HDDs is called "noop", not "none". Maybe this is important.
(So I changed it to "noop" and am keeping my fingers crossed.)
On my system it is "none", but you are right, it might also be called "noop". You can check the available modes with: cat /sys/block/THE_DEVICE/queue/scheduler
@denizzzka I am NOT using LUKS, but I am using ZFS native encryption.
The scheduler for all (12) devices is set to mq-deadline. I will set these to none and see what happens!
If you followed my guide to persist the changes after reboot: I had an error in there and have corrected the post. Make sure you did not copy the mistake.
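For illustration only (this is not necessarily what the corrected guide says), one common way to persist the scheduler setting across reboots is a udev rule, e.g. in /etc/udev/rules.d/60-ioscheduler.rules:
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="dm-[0-9]*", ATTR{queue/scheduler}="none"
The file name and the exact match patterns are assumptions; adapt them to your own devices.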
(I'm using luks encrypted disks)
Hmmm, me too! Is there anybody here who does NOT use LUKS but still sees this issue?
@CySlider do you use a ZFS pool on LUKS, or a LUKS vdev on ZFS?
Nearly missed your edit.
I am not sure which of your options it is. I use LUKS transparently to ZFS: I set up each whole drive individually with LUKS, open them decrypted, and only after that added them as devices to ZFS. I hope this answers your question.
I started adding a new partition to the rpool mirror.
While it was resilvering I added a new pool and started transferring gigabyte-sized files to it from rpool. That caused this: https://gist.github.com/denizzzka/9fb23dbeb6db3ba6074eed04d6da47bb
Probably, if I were writing to rpool everything would hang, but rpool is only used for reads and was not affected.
Be aware that the message itself is not an indicator of something going wrong. It could also just mean your disks are really busy for two minutes. It is only a problem if it gets totally stuck, as it did in my case.
Be aware that the message itself is not an indicator of something going wrong.
After that, the copying process just stalls...
Still hangs with the noop queue scheduler for /dev/sd[a-z].
Still hangs with the noop queue scheduler for /dev/sd[a-z].
Same result here. I switched it over to bfq to see what happens.
The hangs reported here are very similar to ones we've seen reported in the past. They're caused by an I/O which is submitted to the kernel but, for some as-yet-unknown reason, neither completes, fails, nor times out. This results in the observed hang. Since the issue can't be reproduced easily, we have not been able to determine which component in the stack is responsible (ZFS, LUKS, the SCSI layer/driver, or hardware).
In order to better understand how this is possible, we've added diagnostic and optional recovery code. These changes are not in the 0.7.13 release, but they are in the current 0.8.0-rc releases.
@bglod since you're running ZFS's native encryption you should also have this debugging. By default it's largely disabled, but if a hung I/O is detected a new deadman event will be posted. If you encounter this issue again, it would help our analysis if you could check zpool events -v for such an event.
If you'd like to enable the recovery logic, which attempts to retry the hung I/O, you'll need to set the zfs_deadman_failmode property to continue. This should allow the system to continue operating after the zfs_deadman_ziotime_ms=300s time is reached.
zfs_deadman_failmode (charp)
    Controls the failure behavior when the deadman detects a "hung" I/O. Valid values are wait, continue, and panic.
    wait - Wait for a "hung" I/O to complete. For each "hung" I/O a "deadman" event will be posted describing that I/O.
    continue - Attempt to recover from a "hung" I/O by re-dispatching it to the I/O pipeline if possible.
    panic - Panic the system. This can be used to facilitate an automatic fail-over to a properly configured fail-over partner.
    Default value: wait.
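A minimal sketch of how this can be set, assuming the usual module parameter paths (adjust the file name for your distribution):
echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode
echo "options zfs zfs_deadman_failmode=continue" >> /etc/modprobe.d/zfs.conf
The first line changes the running module; the second makes the setting persist across reboots.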
Furthermore, we want to identify the root cause of this issue, so if anyone is running a 0.8.0-rc and is willing to provide additional debug logs, please let me know. There are a few more debugging options we can try enabling to get more detailed information about the outstanding I/Os.
I did not see a hang when the scheduler was set to bfq for the last 10 days or so. This doesn't mean bfq is the way to go, of course. zpool events did not show any deadman events, but it seems the event log only goes back as far as the last reboot. Because I hadn't experienced another hang, it's understandable that there were no deadman entries.
I upgraded my zfs from becdcec7b9cd6b8beaa2273495b97c06de24e9e5 to 060f0226e6396a3c7104fedc8d2af7063a27c1f9 (latest for Arch's zfs-linux-git) and set the zfs_deadman_failmode=continue module option. I also changed the scheduler back to the default of mq-deadline.
I set up a script to periodically check for deadman events and e-mail me if any are found. I will report back with any findings.
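A minimal sketch of such a check (not the author's actual script; the grep pattern and the mail command are assumptions):
#!/bin/sh
# run periodically from cron: mail the full event log if a deadman event appears
if zpool events | grep -q deadman; then
    zpool events -v | mail -s "ZFS deadman event detected" root
fi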
Update: still locks up. Running zpool events -v just hangs forever.
Hi, after upgrading to 0.8 I'm having a similar ZFS hang that I'm able to reproduce with mdtest over NFS, as detailed in https://github.com/zfsonlinux/zfs/issues/7473
I provided further logs in the issue above but let me know if there's anything else I can provide.
Once the system got stuck I ran:
# echo continue > /sys/module/zfs/parameters/zfs_deadman_failmode
But it remains stuck and no events have been recorded. In this case txg_sync is in the D state, but there is no process consuming CPU on the node. Perhaps I had to set zfs_deadman_failmode before ZFS got stuck? Should I set zio_deadman_log_all to 1?
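For reference, assuming these parameters are exposed under the usual module parameter path, they could presumably be inspected and toggled at runtime like this (a sketch, not a confirmed fix):
cat /sys/module/zfs/parameters/zfs_deadman_failmode
echo 1 > /sys/module/zfs/parameters/zio_deadman_log_all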
Can this issue be related to the recent kernel issue "Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption"? https://bugzilla.kernel.org/show_bug.cgi?id=201685
Can this issue be related to the recent kernel issue "Incorrect disk IO caused by blk-mq direct issue can lead to file system corruption"? https://bugzilla.kernel.org/show_bug.cgi?id=201685
I don't think so, I'm using 'none' for all the (nvme) devices in this zpool:
# cat /sys/block/nvme*/queue/scheduler
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
[none] mq-deadline kyber
I managed to reproduce the hang on 0.8.0-rc3_126_g5dbf8b4ed, this time with zpool events. Details in https://github.com/zfsonlinux/zfs/issues/7473
If it is still useful to reproduce this issue, here is my way to reproduce it:
Build (make) this nice compressor program https://github.com/byronknoll/cmix and run it on a large file, e.g.:
~/Dev/cmix/cmix -c 100Mb_file.bin result.cmix
cmix consumes a lot of RAM, maybe this is important.
Any updates? My issue is suddenly back after being gone for 6 weeks. Really annoying
New "ZFS sync hang" event, #2-2019 in five months:i
kernel 4.14.105-gentoo-r1 #1 SMP (elevator=deadline)
SPL: Loaded module v0.7.12-r0-gentoo
ZFS: Loaded module v0.7.12-r0-gentoo, ZFS pool version 5000, ZFS filesystem version 5
3631 root 20 0 0 0 0 R 100.0 0.0 793:51.76 kworker/u32:4
It hangs AGAIN, hogging 100% CPU.
Again while Bacula was updating its PostgreSQL database, a 12 GB DB on a ZFS filesystem.
Any sync command was hanging, and:
cat /proc/3631/stack
[] 0xffffffffffffffff
Same incident happened on this system:
24/11/2017 #1-2017
06/05/2018 #1-2018
06/05/2018 #2-2018
12/01/2019 #1-2019
17/05/2019 #2-2019
under different kernels and different ZFS/SPL versions, from v0.7.7-r0-gentoo to v0.7.12-r0-gentoo.
Just hoping this will help to debug this really annoying problem.
Only way to recover was a "reboot -f -n".
Only way to recover was a "reboot -f -n".
All my hang events caused the GUI and all consoles to hang as well.
Reading other related "ZFS sync hanging" issues, a sync hang may be related to a hardware problem, for example a flaky cable.
Not my case! I had the same events on two different systems, so I am pretty sure it is not a hardware problem.
Apparently my systems had no other problem besides the "sync hanging".
The only common factor I can see? They both have, on the same disks, a mix of ext4 filesystems and ZFS pools.
Could that be the "smoking gun"?
Issue reproduced on my system after upgrading to Linux kernel 4.19.0-5-amd64 and ZFS 0.8.0 built from source.
From my statistics, the issue shows up (when it shows up at all; it is really a rare event) ONLY on systems where there is a mix, on the same disks, of ZFS pools and ext3/4 filesystems.
I'm beginning to think the problem is not really a ZFS-ONLY issue.
Maybe it is a kernel bug which shows up only when ZFS and ext3/4 I/O requests mix on the same system for the same disks?
In other words, when on the VERY SAME I/O queue there is a mix of I/O requests, for the same kernel worker, directed to the same disk, coming from an ext3/4 filesystem and from ZFS?
ONLY on systems where there is a mix, on the same disks, of ZFS pools and ext3/4 filesystems.
My case also matches this: I use a ZFS rpool plus a zvol that contains a LUKS-encrypted ext4 /home partition.
I understand your layout, denizzzka. Mine is a little bit different: I have MBR-partitioned disks which contain native ZFS pools and native ext3/ext4 filesystems, no zvols at all.
But I guess the effect is the same: a mix of ext3/4 and ZFS I/O requests may be handled by the same kworker, which, from what we have seen, in some cases hangs for unknown reasons.
May I ask the others posting to this issue to report whether they have the same disk/filesystem layout?
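For example (assuming lsblk and the usual ZFS tools are available), something like the following should be enough to show the mix of filesystems and pool members per disk:
lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT
zpool status -P
The -P flag prints full device paths so the pool members can be matched against the lsblk output.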
Another piece of information: I have never seen this event on Linux kernels with versions < 4.xx. I still have a couple of Linux 3.16.5 kernels, with ZFS v0.6.3-r0-gentoo and the very same layout, and I have never seen this hang happen there.
Same for me, I have an ext4 root, everything else is ZFS.
No ext3/4 partitions here. Just vfat on /boot (because EFI) and ZFS for everything else.
System information
Describe the problem you're observing
sync hangs, mysql hangs, kworker at 100% CPU
Describe how to reproduce the problem
high-frequency create/destroy/clone operations
Include any warning/errors/backtraces from the system logs
dmesg:
[ 189.990968] Adjusting tsc more than 11% (8039035 vs 7759471)
[ 2522.644734] INFO: task mysqld:6608 blocked for more than 120 seconds.
[ 2522.644790] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2522.644854] mysqld D 0000000000000000 0 6608 5602 0x00000000
[ 2522.644860] ffff880feb913da0 0000000000000086 ffff881006eac500 ffff880feb913fd8
[ 2522.644865] ffff880feb913fd8 ffff880feb913fd8 ffff881006eac500 ffff880fffc8a180
[ 2522.644869] ffff880fffc8a000 ffff880fffc8a188 ffff880fffc8a028 0000000000000000
[ 2522.644873] Call Trace:
[ 2522.644882] [] schedule+0x29/0x70
[ 2522.644906] [] cv_wait_common+0x125/0x150 [spl]
[ 2522.644911] [] ? wake_up_atomic_t+0x30/0x30
[ 2522.644922] [] cv_wait+0x15/0x20 [spl]
[ 2522.644995] [] zil_commit.part.12+0x8b/0x830 [zfs]
[ 2522.645006] [] ? spl_kmem_alloc+0xc7/0x170 [spl]
[ 2522.645017] [] ? tsd_set+0x324/0x500 [spl]
[ 2522.645022] [] ? mutex_lock+0x12/0x2f
[ 2522.645075] [] zil_commit+0x17/0x20 [zfs]
[ 2522.645128] [] zfs_fsync+0x77/0xf0 [zfs]
[ 2522.645179] [] zpl_fsync+0x65/0x90 [zfs]
[ 2522.645186] [] do_fsync+0x65/0xa0
[ 2522.645191] [] SyS_fsync+0x10/0x20
[ 2522.645196] [] system_call_fastpath+0x16/0x1b
[ 2522.645202] INFO: task mysqld:7514 blocked for more than 120 seconds.
[ 2522.645245] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2522.645296] mysqld D ffff880fb554bda0 0 7514 5602 0x00000000
[ 2522.645299] ffff880fb554bd30 0000000000000086 ffff880ff4c8f300 ffff880fb554bfd8
[ 2522.645303] ffff880fb554bfd8 ffff880fb554bfd8 ffff880ff4c8f300 ffff880fb554bdb0
[ 2522.645307] ffff88107ff8dec0 0000000000000002 ffffffff811f9420 ffff880fb554bda0
[ 2522.645312] Call Trace:
[ 2522.645316] [] ? unlock_two_nondirectories+0x60/0x60
[ 2522.645321] [] schedule+0x29/0x70
[ 2522.645324] [] inode_wait+0xe/0x20
[ 2522.645328] [] wait_on_bit+0x60/0x90
[ 2522.645335] [] inode_wait_for_writeback+0xaf/0xf0
[ 2522.645339] [] ? wake_atomic_t_function+0x40/0x40
[ 2522.645345] [] inode_wait_for_writeback+0x26/0x40
[ 2522.645348] [] evict+0x95/0x170
[ 2522.645351] [] iput+0xf5/0x180
[ 2522.645357] [] do_unlinkat+0x1ae/0x2b0
[ 2522.645362] [] ? SyS_readlinkat+0xd1/0x140
[ 2522.645367] [] SyS_unlink+0x16/0x20
[ 2522.645371] [] system_call_fastpath+0x16/0x1b
[ 2522.645402] INFO: task sync:2653 blocked for more than 120 seconds.
[ 2522.645443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2522.645494] sync D ffffffff8120f8c0 0 2653 1455 0x00000000
[ 2522.645498] ffff880ffedd7d50 0000000000000082 ffff880f01449700 ffff880ffedd7fd8
[ 2522.645502] ffff880ffedd7fd8 ffff880ffedd7fd8 ffff880f01449700 ffff880ffedd7e80
[ 2522.645506] ffff880ffedd7e88 7fffffffffffffff ffff880f01449700 ffffffff8120f8c0
[ 2522.645509] Call Trace:
[ 2522.645515] [] ? generic_write_sync+0x60/0x60
[ 2522.645519] [] schedule+0x29/0x70
[ 2522.645523] [] schedule_timeout+0x209/0x2d0
[ 2522.645527] [] ? __queue_work+0x136/0x320
[ 2522.645530] [] ? queue_delayed_work+0xaa/0x1a0
[ 2522.645534] [] ? try_to_grab_pending+0xb1/0x160
[ 2522.645538] [] ? generic_write_sync+0x60/0x60
[ 2522.645543] [] wait_for_completion+0x116/0x170
[ 2522.645548] [] ? wake_up_state+0x20/0x20
[ 2522.645552] [] sync_inodes_sb+0xb7/0x1e0
[ 2522.645557] [] ? generic_write_sync+0x60/0x60
[ 2522.645561] [] sync_inodes_one_sb+0x19/0x20
[ 2522.645565] [] iterate_supers+0xb2/0x110
[ 2522.645570] [] sys_sync+0x44/0xb0
[ 2522.645575] [] system_call_fastpath+0x16/0x1b
[ 2642.739175] INFO: task mysqld:6608 blocked for more than 120 seconds.
[ 2642.739228] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2642.739284] mysqld D 0000000000000000 0 6608 5602 0x00000000
[ 2642.739290] ffff880feb913da0 0000000000000086 ffff881006eac500 ffff880feb913fd8
[ 2642.739296] ffff880feb913fd8 ffff880feb913fd8 ffff881006eac500 ffff880fffc8a180
[ 2642.739300] ffff880fffc8a000 ffff880fffc8a188 ffff880fffc8a028 0000000000000000
[ 2642.739305] Call Trace:
[ 2642.739315] [] schedule+0x29/0x70
[ 2642.739340] [] cv_wait_common+0x125/0x150 [spl]
[ 2642.739345] [] ? wake_up_atomic_t+0x30/0x30
[ 2642.739357] [] cv_wait+0x15/0x20 [spl]
[ 2642.739434] [] zil_commit.part.12+0x8b/0x830 [zfs]
[ 2642.739447] [] ? spl_kmem_alloc+0xc7/0x170 [spl]
[ 2642.739460] [] ? tsd_set+0x324/0x500 [spl]
[ 2642.739466] [] ? mutex_lock+0x12/0x2f
[ 2642.739526] [] zil_commit+0x17/0x20 [zfs]
[ 2642.739587] [] zfs_fsync+0x77/0xf0 [zfs]
[ 2642.739647] [] zpl_fsync+0x65/0x90 [zfs]
[ 2642.739654] [] do_fsync+0x65/0xa0
[ 2642.739659] [] SyS_fsync+0x10/0x20
[ 2642.739665] [] system_call_fastpath+0x16/0x1b
[ 2642.739670] INFO: task mysqld:7514 blocked for more than 120 seconds.
[ 2642.739727] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2642.739793] mysqld D ffff880fb554bda0 0 7514 5602 0x00000000
[ 2642.739798] ffff880fb554bd30 0000000000000086 ffff880ff4c8f300 ffff880fb554bfd8
[ 2642.739803] ffff880fb554bfd8 ffff880fb554bfd8 ffff880ff4c8f300 ffff880fb554bdb0
[ 2642.739808] ffff88107ff8dec0 0000000000000002 ffffffff811f9420 ffff880fb554bda0
[ 2642.739812] Call Trace:
[ 2642.739818] [] ? unlock_two_nondirectories+0x60/0x60
[ 2642.739823] [] schedule+0x29/0x70
[ 2642.739826] [] inode_wait+0xe/0x20
[ 2642.739831] [] wait_on_bit+0x60/0x90
[ 2642.739839] [] inode_wait_for_writeback+0xaf/0xf0
[ 2642.739843] [] ? wake_atomic_t_function+0x40/0x40
[ 2642.739850] [] inode_wait_for_writeback+0x26/0x40
[ 2642.739854] [] evict+0x95/0x170
[ 2642.739857] [] iput+0xf5/0x180
[ 2642.739862] [] do_unlinkat+0x1ae/0x2b0
[ 2642.739868] [] ? SyS_readlinkat+0xd1/0x140
[ 2642.739873] [] SyS_unlink+0x16/0x20
[ 2642.739878] [] system_call_fastpath+0x16/0x1b
[ 2642.739897] INFO: task sync:2653 blocked for more than 120 seconds.
[ 2642.739951] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2642.740017] sync D ffffffff8120f8c0 0 2653 1455 0x00000000
[ 2642.740021] ffff880ffedd7d50 0000000000000082 ffff880f01449700 ffff880ffedd7fd8
[ 2642.740026] ffff880ffedd7fd8 ffff880ffedd7fd8 ffff880f01449700 ffff880ffedd7e80
[ 2642.740031] ffff880ffedd7e88 7fffffffffffffff ffff880f01449700 ffffffff8120f8c0
[ 2642.740036] Call Trace:
[ 2642.740042] [] ? generic_write_sync+0x60/0x60
[ 2642.740047] [] schedule+0x29/0x70
[ 2642.740051] [] schedule_timeout+0x209/0x2d0
[ 2642.740056] [] ? __queue_work+0x136/0x320
[ 2642.740060] [] ? queue_delayed_work+0xaa/0x1a0
[ 2642.740064] [] ? try_to_grab_pending+0xb1/0x160
[ 2642.740069] [] ? generic_write_sync+0x60/0x60
[ 2642.740093] [] wait_for_completion+0x116/0x170
[ 2642.740100] [] ? wake_up_state+0x20/0x20
[ 2642.740105] [] sync_inodes_sb+0xb7/0x1e0
[ 2642.740110] [] ? generic_write_sync+0x60/0x60
[ 2642.740116] [] sync_inodes_one_sb+0x19/0x20
[ 2642.740121] [] iterate_supers+0xb2/0x110
[ 2642.740127] [] sys_sync+0x44/0xb0
[ 2642.740134] [] system_call_fastpath+0x16/0x1b
[ 2762.834495] INFO: task mysqld:6608 blocked for more than 120 seconds.
[ 2762.834547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2762.834604] mysqld D 0000000000000000 0 6608 5602 0x00000000
[ 2762.834609] ffff880feb913da0 0000000000000086 ffff881006eac500 ffff880feb913fd8
[ 2762.834615] ffff880feb913fd8 ffff880feb913fd8 ffff881006eac500 ffff880fffc8a180
[ 2762.834619] ffff880fffc8a000 ffff880fffc8a188 ffff880fffc8a028 0000000000000000
[ 2762.834624] Call Trace:
[ 2762.834634] [] schedule+0x29/0x70
[ 2762.834659] [] cv_wait_common+0x125/0x150 [spl]
[ 2762.834665] [] ? wake_up_atomic_t+0x30/0x30
[ 2762.834677] [] cv_wait+0x15/0x20 [spl]
[ 2762.834748] [] zil_commit.part.12+0x8b/0x830 [zfs]
[ 2762.834761] [] ? spl_kmem_alloc+0xc7/0x170 [spl]
[ 2762.834774] [] ? tsd_set+0x324/0x500 [spl]
[ 2762.834779] [] ? mutex_lock+0x12/0x2f
[ 2762.834843] [] zil_commit+0x17/0x20 [zfs]
[ 2762.834908] [] zfs_fsync+0x77/0xf0 [zfs]
[ 2762.834971] [] zpl_fsync+0x65/0x90 [zfs]
[ 2762.834978] [] do_fsync+0x65/0xa0
[ 2762.834983] [] SyS_fsync+0x10/0x20
[ 2762.834989] [] system_call_fastpath+0x16/0x1b
[ 2762.834994] INFO: task mysqld:7514 blocked for more than 120 seconds.
[ 2762.835051] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2762.835118] mysqld D ffff880fb554bda0 0 7514 5602 0x00000000
[ 2762.835122] ffff880fb554bd30 0000000000000086 ffff880ff4c8f300 ffff880fb554bfd8
[ 2762.835127] ffff880fb554bfd8 ffff880fb554bfd8 ffff880ff4c8f300 ffff880fb554bdb0
[ 2762.835132] ffff88107ff8dec0 0000000000000002 ffffffff811f9420 ffff880fb554bda0
[ 2762.835137] Call Trace:
[ 2762.835143] [] ? unlock_two_nondirectories+0x60/0x60
[ 2762.835148] [] schedule+0x29/0x70
[ 2762.835151] [] inode_wait+0xe/0x20
[ 2762.835156] [] wait_on_bit+0x60/0x90
[ 2762.835164] [] inode_wait_for_writeback+0xaf/0xf0
[ 2762.835168] [] ? wake_atomic_t_function+0x40/0x40
[ 2762.835175] [] inode_wait_for_writeback+0x26/0x40
[ 2762.835179] [] evict+0x95/0x170
[ 2762.835183] [] iput+0xf5/0x180
[ 2762.835188] [] do_unlinkat+0x1ae/0x2b0
[ 2762.835195] [] ? SyS_readlinkat+0xd1/0x140
[ 2762.835199] [] SyS_unlink+0x16/0x20
[ 2762.835205] [] system_call_fastpath+0x16/0x1b
[ 2762.835223] INFO: task sync:2653 blocked for more than 120 seconds.
[ 2762.835277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2762.835343] sync D ffffffff8120f8c0 0 2653 1455 0x00000000
[ 2762.835348] ffff880ffedd7d50 0000000000000082 ffff880f01449700 ffff880ffedd7fd8
[ 2762.835352] ffff880ffedd7fd8 ffff880ffedd7fd8 ffff880f01449700 ffff880ffedd7e80
[ 2762.835357] ffff880ffedd7e88 7fffffffffffffff ffff880f01449700 ffffffff8120f8c0
[ 2762.835362] Call Trace:
[ 2762.835369] [] ? generic_write_sync+0x60/0x60
[ 2762.835374] [] schedule+0x29/0x70
[ 2762.835378] [] schedule_timeout+0x209/0x2d0
[ 2762.835382] [] ? __queue_work+0x136/0x320
[ 2762.835386] [] ? queue_delayed_work+0xaa/0x1a0
[ 2762.835390] [] ? try_to_grab_pending+0xb1/0x160
[ 2762.835396] [] ? generic_write_sync+0x60/0x60
[ 2762.835409] [] wait_for_completion+0x116/0x170
[ 2762.835417] [] ? wake_up_state+0x20/0x20
[ 2762.835422] [] sync_inodes_sb+0xb7/0x1e0
[ 2762.835427] [] ? generic_write_sync+0x60/0x60
[ 2762.835433] [] sync_inodes_one_sb+0x19/0x20
[ 2762.835438] [] iterate_supers+0xb2/0x110
[ 2762.835443] [] sys_sync+0x44/0xb0
[ 2762.835451] [] system_call_fastpath+0x16/0x1b
[ 2882.929813] INFO: task mysqld:6608 blocked for more than 120 seconds.
[ 2882.929861] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2882.929913] mysqld D 0000000000000000 0 6608 5602 0x00000000
[ 2882.929919] ffff880feb913da0 0000000000000086 ffff881006eac500 ffff880feb913fd8
[ 2882.929924] ffff880feb913fd8 ffff880feb913fd8 ffff881006eac500 ffff880fffc8a180
[ 2882.929928] ffff880fffc8a000 ffff880fffc8a188 ffff880fffc8a028 0000000000000000
[ 2882.929932] Call Trace:
[ 2882.929942] [] schedule+0x29/0x70
[ 2882.929967] [] cv_wait_common+0x125/0x150 [spl]
[ 2882.929972] [] ? wake_up_atomic_t+0x30/0x30
[ 2882.929983] [] __cv_wait+0x15/0x20 [spl]
[ 2882.930049] [] zil_commit.part.12+0x8b/0x830 [zfs]
[ 2882.930059] [] ? spl_kmem_alloc+0xc7/0x170 [spl]
[ 2882.930070] [] ? tsd_set+0x324/0x500 [spl]
[ 2882.930075] [] ? mutex_lock+0x12/0x2f
[ 2882.930129] [] zil_commit+0x17/0x20 [zfs]
[ 2882.930181] [] zfs_fsync+0x77/0xf0 [zfs]
[ 2882.930232] [] zpl_fsync+0x65/0x90 [zfs]
[ 2882.930238] [] do_fsync+0x65/0xa0
[ 2882.930243] [] SyS_fsync+0x10/0x20
[ 2882.930248] [] system_call_fastpath+0x16/0x1b
[ 4802.367346] perf interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
sync process stack:
[] sync_inodes_sb+0xb7/0x1e0
[] sync_inodes_one_sb+0x19/0x20
[] iterate_supers+0xb2/0x110
[] sys_sync+0x44/0xb0
[] system_call_fastpath+0x16/0x1b
[] 0xffffffffffffffff
kworker process stack:
cat /proc/3445/stack
[] dmu_zfetch+0x455/0x4f0 [zfs]
[] dbuf_read+0x8ea/0x9f0 [zfs]
[] dnode_hold_impl+0xc6/0xc30 [zfs]
[] 0xffffffffffffffff
cat /proc/3445/stack
[] 0xffffffffffffffff
[root@ifcos ~]# cat /proc/3445/stack
[] zfs_zget+0xfc/0x250 [zfs]
[] zfs_get_data+0x57/0x2d0 [zfs]
[] zil_commit.part.12+0x41c/0x830 [zfs]
[] zil_commit+0x17/0x20 [zfs]
[] zpl_writepages+0xd6/0x170 [zfs]
[] do_writepages+0x1e/0x40
[] __writeback_single_inode+0x40/0x220
[] writeback_sb_inodes+0x25e/0x420
[] wb_writeback+0xff/0x2f0
[] bdi_writeback_workfn+0x115/0x460
[] process_one_work+0x17b/0x470
[] worker_thread+0x11b/0x400
[] kthread+0xcf/0xe0
[] ret_from_fork+0x58/0x90
[] 0xffffffffffffffff
cat /proc/3445/stack
[] dmu_zfetch+0x455/0x4f0 [zfs]
[] __dbuf_hold_impl+0x135/0x5a0 [zfs]
[] 0xffffffffffffffff
cat /proc/3445/stack
[] dbuf_find+0x1c9/0x1d0 [zfs]
[] __dbuf_hold_impl+0x42/0x5a0 [zfs]
[] 0xffffffffffffffff
cat /proc/3445/stack
[] dmu_zfetch+0x455/0x4f0 [zfs]
[] dbuf_read+0x756/0x9f0 [zfs]
[] dnode_hold_impl+0xc6/0xc30 [zfs]