openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs recv hangs / deadlock #3272

Closed: onlime closed this issue 9 years ago

onlime commented 9 years ago

Hi there,

We are experiencing zfs send|receive deadlocks on two production servers with high load. Both servers are running Proxmox VE 3.4 with ZoL v0.6.3-1.2, ZFS RAID-1 with two SSD drives attached to LSI/Avago 9207-8i HBAs in each server.

System information:

CPUs:               2 x 6-core
Memory:             128GB (8 x Kingston KVR16R11D4/16HA)
VM/Hypervisor:      yes (OpenVZ)
ECC mem:            yes
Disks:              2 x Intel SSD DC S3610 1.6TB SSDSC2BX016T401
Controller:         LSI/Avago HBA 9207-8i (with latest 20.00.02.00-IT initiator target firmware)
Distribution:       Proxmox VE 3.4 (Debian GNU/Linux based)
Kernel version:     $ uname -a
                    Linux hn3 2.6.32-37-pve #1 SMP Wed Mar 18 08:19:56 CET 2015 x86_64 GNU/Linux
SPL/ZFS source:     Packages built from GIT repository
SPL/ZFS version:    $ dmesg | grep -E 'SPL:|ZFS:'
                    SPL: Loaded module v0.6.3-1.2
                    ZFS: Loaded module v0.6.3-1.2, ZFS pool version 5000, ZFS filesystem version 5
                    SPL: using hostid 0xa8c00802
System services:    Production (2 OpenVZ containers)
Short description:  zfs send|receive hangs (on zfs recv side) on filesystem replication with zrep

Both systems are configured identically with the only exception CPUs:

hn3: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
hn4: Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz

We are running two OpenVZ containers on two separate ZFS filesystems:

$ zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
rpool                971G   494G   144K  /rpool
rpool/ROOT           838G   494G   144K  /rpool/ROOT
rpool/ROOT/pve-1     958M   494G   958M  /
rpool/ROOT/pve-111   420G   494G   419G  /var/lib/vz/private/111
rpool/ROOT/pve-222   417G   494G   416G  /var/lib/vz/private/222
rpool/swap           133G   626G   649M  -

In particular, pve-222 (webserver) causes a lot of IO load during peak business hours. pve-111 (mailserver) may also generate high IO load from time to time. With the powerful SSDs underneath, though, IO is never a real issue.

We replicate both ZFS filesystems to the secondary host node, e.g. from hn4 to hn3, using ZREP, which essentially performs a plain zfs send|receive with the -I flag (sending all intermediary snapshots from the first snapshot to the second snapshot). If both OpenVZ containers are running on hn4 and none on hn3, replication always succeeds. Once we migrate pve-222 over to hn3, the migration completes, but as soon as we start the next replication (zfs send|receive), it hangs. More precisely, zfs recv hangs on the destination side (in this case hn4).
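
For clarity, the replication step that ZREP drives boils down to something like the following (a rough sketch; the snapshot names here are made up and ZREP's exact flags may differ from this):

$ zfs snapshot rpool/ROOT/pve-222@zrep_000124
$ zfs send -I rpool/ROOT/pve-222@zrep_000123 rpool/ROOT/pve-222@zrep_000124 \
      | ssh hn3 zfs recv -F rpool/ROOT/pve-222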

I have tested the same thing with a third rpool/ROOT/pve-333 filesystem containing only a small OpenVZ container (< 1GB). I was able to replicate it perfectly well from one host node to the other, failover with ZREP worked in both directions and zfs send|receive never hung.

Syslog on hn4 upon hanging zfs receive process:

Apr  6 14:57:58 hn4 kernel: INFO: task zfs:533834 blocked for more than 120 seconds.
Apr  6 14:57:58 hn4 kernel:      Tainted: P           ---------------    2.6.32-37-pve #1
Apr  6 14:57:58 hn4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  6 14:57:58 hn4 kernel: zfs           D ffff880ace574840     0 533834 533831    0 0x00000000
Apr  6 14:57:58 hn4 kernel: ffff8805f032d750 0000000000000082 ffff881a56b40bd0 0000000000000007
Apr  6 14:57:58 hn4 kernel: 0000000000000007 ffff8813bfcaccd0 ffff8815ed91f810 ffff881a56b40bc0
Apr  6 14:57:58 hn4 kernel: 0000000000000001 ffff8815ed91f7f0 000000010f82372a ffff8805f032d708
Apr  6 14:57:58 hn4 kernel: Call Trace:
Apr  6 14:57:58 hn4 kernel: [<ffffffff81563fe5>] rwsem_down_failed_common+0x95/0x1e0
Apr  6 14:57:58 hn4 kernel: [<ffffffff81564186>] rwsem_down_read_failed+0x26/0x30
Apr  6 14:57:58 hn4 kernel: [<ffffffff812a0494>] call_rwsem_down_read_failed+0x14/0x30
Apr  6 14:57:58 hn4 kernel: [<ffffffff81563864>] ? down_read+0x24/0x2b
Apr  6 14:57:58 hn4 kernel: [<ffffffffa0288b1f>] zfs_inactive+0x4f/0x250 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa029dc00>] ? zpl_inode_delete+0x0/0x30 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa029db3e>] zpl_clear_inode+0xe/0x10 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffff811cbdfc>] clear_inode+0xac/0x140
Apr  6 14:57:58 hn4 kernel: [<ffffffffa029dc20>] zpl_inode_delete+0x20/0x30 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffff811cd7e6>] generic_delete_inode+0xa6/0x1c0
Apr  6 14:57:58 hn4 kernel: [<ffffffff811cd955>] generic_drop_inode+0x55/0x70
Apr  6 14:57:58 hn4 kernel: [<ffffffff811cbd16>] iput+0xc6/0x100
Apr  6 14:57:58 hn4 kernel: [<ffffffffa028d774>] zfs_rezget+0x114/0x4f0 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa0280868>] zfs_resume_fs+0x288/0x3d0 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa0275f04>] zfs_ioc_recv+0x944/0xc40 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa01869c9>] ? zprop_name_to_prop+0x39/0x50 [zcommon]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa011398e>] ? strfree+0xe/0x10 [spl]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa011387b>] ? kmem_free_debug+0x4b/0x150 [spl]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa0233960>] ? rrw_exit+0x50/0x150 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa01179d8>] ? strdup+0x48/0xa0 [spl]
Apr  6 14:57:58 hn4 kernel: [<ffffffffa02771de>] zfsdev_ioctl+0x49e/0x510 [zfs]
Apr  6 14:57:58 hn4 kernel: [<ffffffff8104b2f4>] ? __do_page_fault+0x1e4/0x490
Apr  6 14:57:58 hn4 kernel: [<ffffffff811c3c6a>] vfs_ioctl+0x2a/0xa0
Apr  6 14:57:58 hn4 kernel: [<ffffffff811c429e>] do_vfs_ioctl+0x7e/0x5a0
Apr  6 14:57:58 hn4 kernel: [<ffffffff811c480f>] sys_ioctl+0x4f/0x80
Apr  6 14:57:58 hn4 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b

The zfs recv process was in D state (uninterruptible sleep):

hn4$ ps aux | grep zfs
root        1264  0.0  0.0      0     0 ?        S<   Apr03   0:00 [zfs_iput_taskq/]
root        5211  0.0  0.0      0     0 ?        S    Apr03   0:01 [flush-zfs-1]
root      348170  0.0  0.0      0     0 ?        S    Apr03   0:15 [flush-zfs-4]
root      533834  8.1  0.0  32792  1576 ?        Ds   14:55   0:19 zfs recv rpool/ROOT/pve-171
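
For completeness, the kernel stack of such a D-state process can also be pulled directly (assuming the running kernel exposes /proc/<pid>/stack; the PID is the one from the ps output above):

hn4$ cat /proc/533834/stack                         # kernel stack of the hung zfs recv
hn4$ ps -eo pid,stat,wchan,cmd | awk '$2 ~ /^D/'    # list all uninterruptible tasks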

Stack trace of another case when the same thing happened in the other direction (hn4->hn3):

Apr  3 14:54:11 hn3 kernel: INFO: task df:320651 blocked for more than 120 seconds.
Apr  3 14:54:11 hn3 kernel:      Tainted: P           ---------------    2.6.32-37-pve #1
Apr  3 14:54:11 hn3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr  3 14:54:11 hn3 kernel: df            D ffff8809f0f80680     0 320651      1    0 0x00000004
Apr  3 14:54:11 hn3 kernel: ffff8804de759d38 0000000000000086 0000000000000000 ffffffff811d2145
Apr  3 14:54:11 hn3 kernel: ffff88103cc2afb0 ffff88044e5e3034 ffff8804de759d28 ffffffff811befaa
Apr  3 14:54:11 hn3 kernel: 0000534c94577393 ffffffff8116d29b 0000000105718ffc 0000000000000a34
Apr  3 14:54:11 hn3 kernel: Call Trace:
Apr  3 14:54:11 hn3 kernel: [<ffffffff811d2145>] ? mntput_no_expire+0x25/0xc0
Apr  3 14:54:11 hn3 kernel: [<ffffffff811befaa>] ? __link_path_walk+0x4fa/0x880
Apr  3 14:54:11 hn3 kernel: [<ffffffff8116d29b>] ? handle_pte_fault+0x9b/0x1140
Apr  3 14:54:11 hn3 kernel: [<ffffffffa0125405>] cv_wait_common+0x115/0x1d0 [spl]
Apr  3 14:54:11 hn3 kernel: [<ffffffff810a6890>] ? autoremove_wake_function+0x0/0x40
Apr  3 14:54:11 hn3 kernel: [<ffffffffa0125515>] __cv_wait+0x15/0x20 [spl]
Apr  3 14:54:11 hn3 kernel: [<ffffffffa023573c>] rrw_enter_read+0x3c/0x150 [zfs]
Apr  3 14:54:11 hn3 kernel: [<ffffffffa02805d1>] zfs_statvfs+0x41/0x160 [zfs]
Apr  3 14:54:11 hn3 kernel: [<ffffffff8104b2f4>] ? __do_page_fault+0x1e4/0x490
Apr  3 14:54:11 hn3 kernel: [<ffffffffa029fb6e>] zpl_statfs+0xe/0x20 [zfs]
Apr  3 14:54:11 hn3 kernel: [<ffffffff811e6d06>] statfs_by_sb+0x76/0x120
Apr  3 14:54:11 hn3 kernel: [<ffffffff811e6f55>] vfs_statfs+0x25/0xc0
Apr  3 14:54:11 hn3 kernel: [<ffffffff811e7036>] user_statfs+0x46/0x90
Apr  3 14:54:11 hn3 kernel: [<ffffffff811e7100>] sys_statfs+0x20/0x40
Apr  3 14:54:11 hn3 kernel: [<ffffffff8100b182>] system_call_fastpath+0x16/0x1b

Could this be related to #1638? Is there any chance this is fixed in ZoL 0.6.4? We are aware you released 0.6.4 yesterday and are very glad to hear the news. But as we are running Proxmox VE 3.4 with a non-standard (non-stock Debian) 2.6.32 kernel (Proxmox actually runs a RHEL kernel), it will probably not be easy to install ZoL 0.6.4 on these machines. Any advice is greatly appreciated! We have been struggling with this issue for a long time, and this zfs recv deadlock has never happened on any other machine (we are running Proxmox VE 3.4 with ZFS RAID-1 and RAID-10 on 10+ more servers without any issues, even on two other servers with an identical hardware setup).
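
For reference, my understanding is that a source build against the running pve kernel would roughly look like the sketch below (assuming the matching kernel header package for the running pve kernel is installed; in the 0.6.x series SPL lives in its own repository, and the exact configure options may differ):

$ git clone https://github.com/zfsonlinux/spl.git && cd spl
$ ./autogen.sh && ./configure && make && make install
$ cd .. && git clone https://github.com/zfsonlinux/zfs.git && cd zfs
$ ./autogen.sh && ./configure && make && make install   # configure may need --with-spl=<path to spl source>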

Thanks! Best regards, Philip

onlime commented 9 years ago

I forgot to add this additional information... We are getting the following messages on the receiving server's console once this deadlock happens:

INFO: task txg_sync:1280 blocked for more than 120 seconds.
behlendorf commented 9 years ago

@onlime could you post a back trace from the txg_sync thread? That should shed some light on what's going on. Also, I wouldn't be surprised if your issue is already resolved in the 0.6.4 tag. The new tag compiles fine for kernels as old as 2.6.32 (to support older RHEL and SLES releases), so you should be able to upgrade.
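
Something like this should capture it (a sketch; exact availability depends on your kernel config, and adjust if more than one txg_sync thread is running):

$ cat /proc/$(pgrep -x txg_sync)/stack   # direct kernel stack of txg_sync, if /proc/<pid>/stack is available
$ echo w > /proc/sysrq-trigger           # requires SysRq enabled; dumps all blocked (D-state) tasks to dmesg
$ dmesg | tail -n 300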

onlime commented 9 years ago

@behlendorf Thanks a lot for giving us hope. As I was on vacation for 3 weeks, I put this on hold, asked Dietmar Maurer from Proxmox VE to deploy ZoL 0.6.4, and waited for the next Proxmox VE release. He released ZoL 0.6.4.1 in pve-no-subscription (the testing repo of Proxmox VE) on May 7th, and I was able to upgrade our servers last Friday, 2015-05-08. Since then, this zfs receive deadlock has not popped up any more. Great! We are even noticing a huge speedup in zfs send|receive after the upgrade from ZoL 0.6.3 to 0.6.4.1: some zpools now sync three times faster. Keep up the good work! Best regards, Philip
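
For anyone following along, the module versions actually loaded after the upgrade can be double-checked the same way as in the original report, e.g.:

$ dmesg | grep -E 'SPL:|ZFS:'
$ modinfo zfs | grep -i '^version'
$ cat /sys/module/zfs/version    # if the module exposes its version under /sys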

behlendorf commented 9 years ago

@onlime that's great news, thanks for the update. You can probably attribute the improved send/recv performance to the following commit: b738bc5a0f8ccd0281ed06831c34fbe31d2b2138 (if you're interested in that sort of thing!)

onlime commented 9 years ago

@behlendorf Sure, I was very much interested in the likely reason for this great performance boost. Thanks for the reference. I can confirm that the zfs recv deadlock has not shown up any more. As 3 work days have now passed and this hiccup hasn't happened again on our production servers, I guess this problem is solved for good in 0.6.4.1. Cheers!