from the trace: looks like txg is stuck (txg_wait_open & txg_wait_synced)... and blkdev_issue_discard seems to be slow.
My experience may be completely unrelated. I experienced a very noticeable slowdown of write operations to my SSD pool when I ran master versions of ZOL that don't support TRIM last summer, and I can see that you are running those too now. I solved my problem by switching to the branch https://github.com/dweeezil/zfs/tree/ntrim2-next by @dweeezil and running a TRIM. Before trimming my devices, I could observe write operations taking 0.5 - 4 seconds regardless of their size. I've never experienced io slowdowns since then. I use a very cheap SSD for my system - ata-TS256GMTS400_C056900183
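For reference, the TRIM support that eventually landed upstream (0.8.x) exposes this as a `zpool` subcommand; the exact syntax on the ntrim2-next branch may differ slightly, but it looks roughly like this sketch (pool name taken from this thread):

```sh
# Start a manual TRIM of every vdev in the pool...
zpool trim ssdtank
# ...and watch per-vdev trim progress (-t shows trim status)
zpool status -t ssdtank
```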
I have been studying this symptom for the last week or two, and it seems to occur consistently with several different kinds of virtualization: Proxmox, Xen, VMware, and VirtualBox. I have also seen it mostly described as being caused by bad disks, but these are high-quality SAS drives. I will look at TRIM too.
On the systems I am working on, I also just discovered this:
(columns: date, time, pool, alloc, free, read ops, write ops, read bandwidth, write bandwidth)
2018-03-14 05:04:47  lp_pool  255G  42.8G    0  276      0  24.2M
2018-03-14 05:04:52  lp_pool  255G  42.8G    0  242      0  29.7M
2018-03-14 05:04:57  lp_pool  255G  42.8G    2  198  9.60K  9.74M
2018-03-14 05:05:02  lp_pool  179G   119G  458  380  56.1M  9.08M
2018-03-14 05:05:07  lp_pool  179G   119G  504  119  62.0M  11.6M
2018-03-14 05:05:12  lp_pool  179G   119G  693   89  85.4M  10.8M
2018-03-14 05:05:17  lp_pool  179G   119G  482  269  57.6M  15.1M
The txg_sync hung at 05:04:50. It is striking that alloc and free change so radically at 05:05:02. Can this just be an accounting issue with virtualized disks?
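For context, timestamped per-pool numbers like the ones above can be collected with `zpool iostat` and a date timestamp; something along these lines (a 5-second interval is assumed to match the samples, and with `-T d` the timestamp is printed on its own line rather than as leading columns):

```sh
# Print pool capacity and I/O stats every 5 seconds with a date/time stamp
zpool iostat -T d lp_pool 5
```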
Here is the trace
Mar 14 05:04:50 server kernel: [76223.342494] INFO: task txg_sync:3546 blocked for more than 120 seconds.
Mar 14 05:04:50 server kernel: [76223.342555] Tainted: P O 4.4.0-31-generic #50-Ubuntu
Mar 14 05:04:50 server kernel: [76223.342603] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 05:04:50 server kernel: [76223.342659] txg_sync D ffff880c980c7aa8 0 3546 2 0x00000000
Mar 14 05:04:50 server kernel: [76223.342667] ffff880c980c7aa8 ffff881b39111b80 ffff8810b9858dc0 ffff880c98c76e00
Mar 14 05:04:50 server kernel: [76223.342673] ffff880c980c8000 ffff8810ba516d00 7fffffffffffffff ffff8817296c70c8
Mar 14 05:04:50 server kernel: [76223.342677] 0000000000000001 ffff880c980c7ac0 ffffffff81829a25 0000000000000000
Mar 14 05:04:50 server kernel: [76223.342681] Call Trace:
Mar 14 05:04:50 server kernel: [76223.342694] [
There's not quite enough information in this issue to say for certain, but #7307, which has been merged, could result in this lockup. @tuomari, if possible could you test the latest master, which has this fix applied, and see if you're still able to reproduce the issue?
@behlendorf I finally got the code to the server, and was not able to reproduce this problem.
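For anyone retesting against master, the build the kernel actually has loaded can be double-checked via the module version node and the load-time banner (paths and messages as on a typical ZoL install):

```sh
# Version of the zfs module the kernel actually has loaded
cat /sys/module/zfs/version
# The load-time banner in the kernel log reports both ZFS and SPL versions
dmesg | grep -i 'loaded module'
```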
System information
Describe the problem you're observing
I am running multiple KVM virtual machines from zvols. When moving large amounts of data from one zvol to another, ZFS stops writing data to all disks. There are no errors in `zpool events`, syslog, or anywhere else I could find on the host system. I was able to repeat the problem with the latest master (ZFS 0.7.0-338_g41532e5 and SPL 0.7.0-29_g378c6ed).
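One way to confirm that the pool really stops committing transaction groups during the stall (assuming the usual ZoL kstats are available) is to watch the per-pool txg history while copying:

```sh
# Keep recent txg history (this module option may be 0/disabled by default on some builds)
echo 100 > /sys/module/zfs/parameters/zfs_txg_history
# Each line is one txg with its state; if the newest entries stop progressing
# while writes are stalled, the sync thread is stuck rather than just slow
tail -n 5 /proc/spl/kstat/zfs/tank/txgs
```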
(Maybe) noteworthy points:

- `sync=disabled` (yes, I know this is dangerous and stupid...)
- `zfs send` running from `ssdtank` to `tank` in the background, limited with `pv` to 10M/s (a rough sketch of the pipeline is included below)
- `log` disks on the `tank` pool were completely full

My disk layout is as follows:
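A throttled send of the kind described in the second point above looks roughly like this (dataset and snapshot names here are placeholders, not the actual ones):

```sh
# Rate-limit a local zfs send to ~10 MB/s with pv; "data@snap" and
# "data-copy" are made-up names for illustration
zfs send ssdtank/data@snap | pv -L 10m | zfs recv tank/data-copy
```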
Include any warning/errors/backtraces from the system logs
Result from `echo w > /proc/sysrq-trigger`:
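For completeness, that task dump is gathered roughly like this (requires root; the dmesg filter is just one way to pull it back out):

```sh
# Dump the stacks of all blocked (uninterruptible) tasks into the kernel log...
echo w > /proc/sysrq-trigger
# ...then read the dump for the stuck sync thread back out of dmesg
dmesg | grep -A 30 txg_sync
```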