openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.42k stars 1.72k forks source link

ZFS hung during zfs send/receive over SSH on new ZFS version #12521

Closed hb0nes closed 1 year ago

hb0nes commented 3 years ago

System information

Distribution Name | Debian Buster Distribution Version | Kernel Version | 4.19.0-8-amd64 Architecture | x86_64 OpenZFS Version | zfs-2.0.3-8~bpo10+1, kmod-2.0.3-8~bpo10+1

Describe the problem you're observing

We create and sync zfs snapshots every 15 minutes (with pyznap, it just uses zfs send and recv over ssh). ZFS hung on the receiving side - 15 minutes later we received errors on the sending side that the recv process was already (still) running.

Describe how to reproduce the problem

I don't know how. It hasn't happened before. The ZFS version is quite new so it is most likely a (rare) bug.

Include any warning/errors/backtraces from the system logs

The first kernel log stacktrace on the receiving end:

Aug 30 10:17:31 random kernel: [4583210.679133] INFO: task z_zrele:4450 blocked for more than 120 seconds.
Aug 30 10:17:31 random kernel: [4583210.679160]       Tainted: P           OE     4.19.0-8-amd64 #1 Debian 4.19.98-1
Aug 30 10:17:31 random kernel: [4583210.679181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 10:17:31 random kernel: [4583210.679204] z_zrele         D    0  4450      2 0x80000000
Aug 30 10:17:31 random kernel: [4583210.679206] Call Trace:
Aug 30 10:17:31 random kernel: [4583210.679215]  ? __schedule+0x2a2/0x870
Aug 30 10:17:31 random kernel: [4583210.679217]  schedule+0x28/0x80
Aug 30 10:17:31 random kernel: [4583210.679218]  io_schedule+0x12/0x40
Aug 30 10:17:31 random kernel: [4583210.679222]  __lock_page+0x118/0x170
Aug 30 10:17:31 random kernel: [4583210.679224]  ? page_cache_tree_insert+0xe0/0xe0
Aug 30 10:17:31 random kernel: [4583210.679228]  truncate_inode_pages_range+0x64e/0x820
Aug 30 10:17:31 random kernel: [4583210.679231]  ? _cond_resched+0x15/0x30
Aug 30 10:17:31 random kernel: [4583210.679232]  ? down_write+0xe/0x40
Aug 30 10:17:31 random kernel: [4583210.679235]  ? unmap_mapping_pages+0x5e/0x130
Aug 30 10:17:31 random kernel: [4583210.679236]  truncate_pagecache+0x43/0x60
Aug 30 10:17:31 random kernel: [4583210.679311]  zpl_evict_inode+0x2c/0x50 [zfs]
Aug 30 10:17:31 random kernel: [4583210.679316]  evict+0xd2/0x1a0
Aug 30 10:17:31 random kernel: [4583210.679322]  taskq_thread+0x2e7/0x530 [spl]
Aug 30 10:17:31 random kernel: [4583210.679326]  ? wake_up_q+0x70/0x70
Aug 30 10:17:31 random kernel: [4583210.679329]  ? taskq_thread_spawn+0x50/0x50 [spl]
Aug 30 10:17:31 random kernel: [4583210.679331]  kthread+0x112/0x130
Aug 30 10:17:31 random kernel: [4583210.679333]  ? kthread_bind+0x30/0x30
Aug 30 10:17:31 random kernel: [4583210.679335]  ret_from_fork+0x1f/0x40

This repeats itself until a new one appears:

Aug 30 10:33:37 random kernel: [4584177.320140] INFO: task zfs:35669 blocked for more than 120 seconds.
Aug 30 10:33:37 random kernel: [4584177.320161]       Tainted: P           OE     4.19.0-8-amd64 #1 Debian 4.19.98-1
Aug 30 10:33:37 random kernel: [4584177.320182] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 30 10:33:37 random kernel: [4584177.320204] zfs             D    0 35669  35658 0x80000000
Aug 30 10:33:37 random kernel: [4584177.320205] Call Trace:
Aug 30 10:33:37 random kernel: [4584177.320208]  ? __schedule+0x2a2/0x870
Aug 30 10:33:37 random kernel: [4584177.320209]  schedule+0x28/0x80
Aug 30 10:33:37 random kernel: [4584177.320213]  taskq_wait_outstanding+0x86/0xd0 [spl]
Aug 30 10:33:37 random kernel: [4584177.320216]  ? finish_wait+0x80/0x80
Aug 30 10:33:37 random kernel: [4584177.320288]  zfsvfs_teardown+0x56/0x320 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320331]  zfs_ioc_recv_impl+0x53e/0xf30 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320363]  ? dbuf_read+0x2b9/0x520 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320407]  zfs_ioc_recv+0x173/0x300 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320450]  ? spa_name_compare+0xa/0x20 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320452]  ? avl_find+0x58/0x90 [zavl]
Aug 30 10:33:37 random kernel: [4584177.320455]  ? refcount_inc_checked+0x5/0x30
Aug 30 10:33:37 random kernel: [4584177.320457]  ? apparmor_capable+0x6b/0xc0
Aug 30 10:33:37 random kernel: [4584177.320458]  ? refcount_inc_checked+0x5/0x30
Aug 30 10:33:37 random kernel: [4584177.320459]  ? apparmor_capable+0x6b/0xc0
Aug 30 10:33:37 random kernel: [4584177.320462]  ? security_capable+0x38/0x50
Aug 30 10:33:37 random kernel: [4584177.320462]  ? _cond_resched+0x15/0x30
Aug 30 10:33:37 random kernel: [4584177.320465]  ? __kmalloc_node+0x1e1/0x2b0
Aug 30 10:33:37 random kernel: [4584177.320468]  ? spl_kmem_alloc+0xc6/0x110 [spl]
Aug 30 10:33:37 random kernel: [4584177.320511]  zfsdev_ioctl_common+0x5b5/0x830 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320513]  ? kmalloc_large_node+0x37/0x60
Aug 30 10:33:37 random kernel: [4584177.320514]  ? __kmalloc_node+0x20a/0x2b0
Aug 30 10:33:37 random kernel: [4584177.320554]  zfsdev_ioctl+0x4f/0xe0 [zfs]
Aug 30 10:33:37 random kernel: [4584177.320558]  do_vfs_ioctl+0xa4/0x630
Aug 30 10:33:37 random kernel: [4584177.320559]  ? handle_mm_fault+0xd6/0x200
Aug 30 10:33:37 random kernel: [4584177.320561]  ksys_ioctl+0x60/0x90
Aug 30 10:33:37 random kernel: [4584177.320562]  __x64_sys_ioctl+0x16/0x20
Aug 30 10:33:37 random kernel: [4584177.320564]  do_syscall_64+0x53/0x110
Aug 30 10:33:37 random kernel: [4584177.320566]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 30 10:33:37 random kernel: [4584177.320568] RIP: 0033:0x7f6258a7b427
Aug 30 10:33:37 random kernel: [4584177.320572] Code: Bad RIP value.
Aug 30 10:33:37 random kernel: [4584177.320572] RSP: 002b:00007ffd2ac89198 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 30 10:33:37 random kernel: [4584177.320573] RAX: ffffffffffffffda RBX: 00007ffd2ac8c8f0 RCX: 00007f6258a7b427
Aug 30 10:33:37 random kernel: [4584177.320574] RDX: 00007ffd2ac89340 RSI: 0000000000005a1b RDI: 0000000000000005
Aug 30 10:33:37 random kernel: [4584177.320574] RBP: 00007ffd2ac8d930 R08: 0000000000000003 R09: 00007f6258b46d80
Aug 30 10:33:37 random kernel: [4584177.320575] R10: 0000562135da2010 R11: 0000000000000246 R12: 00007ffd2ac89340
Aug 30 10:33:37 random kernel: [4584177.320575] R13: 00007ffd2ac93790 R14: 0000000000000000 R15: 0000000000000000

After this, no more zfs kernel messages appeared.

rincebrain commented 3 years ago

I've not seen this on 4.19.0-17-amd64 despite a lot of sending, so maybe you can upgrade your way out of your problems?

(Not saying it's not a genuine issue, just offering a data point.)

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.