openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS unmount stuck with "Dentry still in use" #12449

Closed: crabique closed this issue 2 years ago

crabique commented 3 years ago
Type | Version/Name
Distribution Name | Ubuntu
Distribution Version | 18.04.5 LTS
Kernel Version | 5.3.0-73-generic
Architecture | x86_64
OpenZFS Version | 0.8.5

Describe the problem you're observing

For our backup system, we create a recursive ZFS snapshot and iterate over the child snapshots, mounting each one with mount -t zfs into its own directory, running a process on it, and then unmounting it with umount -fl. There are roughly 10,000 snapshots to go through every run; the processes are reasonably parallelized and the mountpoints do not collide.
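For illustration, the per-snapshot step boils down to something like this minimal sketch (the snapshot and mountpoint names are taken from the outputs below; run-backup-job is just a placeholder for our actual worker):

SNAP="storage-1/redacted@2021.07.14_04.00.00Z"
MNT="/snapshot/storage-1/redacted"

mkdir -p "$MNT"
mount -t zfs "$SNAP" "$MNT"    # mount the child snapshot at its own mountpoint
run-backup-job "$MNT"          # placeholder for the per-snapshot process
umount -fl "$MNT"              # force + lazy unmount so the loop never blocks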

Sometimes a ZFS snapshot cannot be unmounted with umount -fl, and an error shows up in the dmesg output. This happens seemingly at random; so far it has happened 3 times on 3 separate servers running different minor versions of ZFS 0.8. The details above are from the current live case, captured before we reboot the server.

The umount process does not react to kill signals; nothing short of a full system reboot clears it.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     2330446 36.3  0.0   1588   904 ?        R    09:39 172:18 umount -fl /snapshot/storage-1/redacted
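For reference, confirming that the process ignores signals looks something like this (the PID is taken from the ps output above):

kill -9 2330446                        # the signal is delivered but never acted upon
sleep 5
ps -o pid,stat,wchan,cmd -p 2330446    # the umount process is still listed, unchanged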

Initially we were using the .zfs/snapshot virtual directory to access the files, and the issue was present there too, with some snapshots getting stuck. It manifested differently, though: there was a D-state process like this one:

# ps axo ppid,pid,stat,wchan,cmd | grep D
   PPID     PID STAT WCHAN  CMD
      2     681 D<   call_u [spl_delay_taskq]
      2    4531 D    cv_tim [txg_sync]
3353092 3353093 D    rwsem_ /sbin/mount.zfs storage-1/redacted@2021.07.14_04.00.00Z /storage-1/redacted/.zfs/snapshot/2021.07.14_04.00.00Z -n -o rw
3380588 3380589 D    rwsem_ /sbin/mount.zfs storage-1/redacted@2021.07.14_04.00.00Z /storage-1/redacted/.zfs/snapshot/2021.07.14_04.00.00Z -n -o rw
      2 3927497 D<   rq_qos [z_wr_int]

Additionally, using the .zfs directory caused frequent NFS issues and elevated system load, so we concluded that the automount was the root of the problem and that mounting the snapshots elsewhere would kill two birds with one stone, since it would also let us unmount them "lazily".

Unfortunately, even though this helped with the load and the NFS issues, the unmount problem resurfaced in another form.

Describe how to reproduce the problem

Create roughly 10,000 datasets under a single parent dataset, take a recursive snapshot, then mount and unmount the child snapshots in a loop, as sketched below, until the trace from the next section shows up in dmesg.
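A rough sketch of that loop, using hypothetical names (tank/parent and /snapshot/dsN stand in for the real layout, and the datasets here are empty, unlike ours):

# create the child datasets and snapshot them recursively
for i in $(seq 1 10000); do zfs create "tank/parent/ds$i"; done
zfs snapshot -r tank/parent@repro

# mount and lazily unmount every child snapshot until the warning shows up
until dmesg | grep -q 'still in use'; do
    for i in $(seq 1 10000); do
        mkdir -p "/snapshot/ds$i"
        mount -t zfs "tank/parent/ds$i@repro" "/snapshot/ds$i"
        umount -fl "/snapshot/ds$i"
    done
done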

Include any warning/errors/backtraces from the system logs

[Thu Jul 29 09:38:58 2021] BUG: Dentry 00000000a55d315f{i=4d2,n=js}  still in use (1) [unmount of zfs zfs]
[Thu Jul 29 09:38:58 2021] ------------[ cut here ]------------
[Thu Jul 29 09:38:58 2021] WARNING: CPU: 4 PID: 2330446 at /build/linux-hwe-NVnjsY/linux-hwe-5.3.0/fs/dcache.c:1596 umount_check+0x69/0x80
[Thu Jul 29 09:38:58 2021] Modules linked in: sctp binfmt_misc nft_counter nft_chain_nat ipt_rpfilter xt_multiport nft_compat nf_tables ip_set_hash_ip ip_set_hash_net veth xt_physdev xt_addrtype xt_set ip_set_hash_ipportip ip_set_hash_ipportnet ip_set_hash_ipport ip_set_bitmap_port ip_set dummy xt_tcpudp iptable_raw xt_CT ip6table_nat ip6_tables rbd libceph xt_MASQUERADE xt_comment xt_mark iptable_nat iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo aufs overlay zfs(POE) zunicode(PO) zavl(PO) icp(POE) nls_iso8859_1 zcommon(POE) znvpair(POE) spl(OE) zlua(POE) kvm_amd ccp kvm joydev irqbypass input_leds serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nfsd br_netfilter bridge auth_rpcgss stp llc nfs_acl lockd ip_vs_sh grace ip_vs_wrr ip_vs_rr sunrpc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq
[Thu Jul 29 09:38:58 2021]  async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd floppy glue_helper virtio_net psmouse virtio_blk net_failover failover
[Thu Jul 29 09:38:58 2021] CPU: 4 PID: 2330446 Comm: umount Tainted: P           OE     5.3.0-73-generic #69-Ubuntu
[Thu Jul 29 09:38:58 2021] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1 04/01/2014
[Thu Jul 29 09:38:58 2021] RIP: 0010:umount_check+0x69/0x80
[Thu Jul 29 09:38:58 2021] Code: 08 48 8b 46 30 48 85 c0 74 27 48 8b 50 40 51 48 c7 c7 88 96 74 b4 48 89 f1 e8 16 03 e2 ff 48 c7 c7 b8 5b 70 b4 e8 0a 03 e2 ff <0f> 0b 58 31 c0 c9 c3 31 d2 eb d9 66 90 66 2e 0f 1f 84 00 00 00 00
[Thu Jul 29 09:38:58 2021] RSP: 0018:ffffb1b9cc297d48 EFLAGS: 00010286
[Thu Jul 29 09:38:58 2021] RAX: 0000000000000024 RBX: ffff90eea8663390 RCX: 0000000000000006
[Thu Jul 29 09:38:58 2021] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff90f94f917440
[Thu Jul 29 09:38:58 2021] RBP: ffffb1b9cc297d50 R08: 00000000000005f6 R09: 0000000000000004
[Thu Jul 29 09:38:58 2021] R10: ffffb1b9cc297e10 R11: 0000000000000001 R12: ffff90eea8663300
[Thu Jul 29 09:38:58 2021] R13: ffff90ee91015160 R14: ffff90ee910150c0 R15: ffff90eea8663358
[Thu Jul 29 09:38:58 2021] FS:  00007f5c38611b48(0000) GS:ffff90f94f900000(0000) knlGS:0000000000000000
[Thu Jul 29 09:38:58 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Jul 29 09:38:58 2021] CR2: 00007ffd8690d2a8 CR3: 00000003903f4000 CR4: 0000000000340ee0
[Thu Jul 29 09:38:58 2021] Call Trace:
[Thu Jul 29 09:38:58 2021]  d_walk+0xd5/0x270
[Thu Jul 29 09:38:58 2021]  ? shrink_lock_dentry.part.18+0xf0/0xf0
[Thu Jul 29 09:38:58 2021]  do_one_tree+0x24/0x40
[Thu Jul 29 09:38:58 2021]  shrink_dcache_for_umount+0x2d/0x90
[Thu Jul 29 09:38:58 2021]  generic_shutdown_super+0x1f/0x120
[Thu Jul 29 09:38:58 2021]  kill_anon_super+0x12/0x30
[Thu Jul 29 09:38:58 2021]  zpl_kill_sb+0x1a/0x20 [zfs]
[Thu Jul 29 09:38:58 2021]  deactivate_locked_super+0x48/0x80
[Thu Jul 29 09:38:58 2021]  deactivate_super+0x40/0x60
[Thu Jul 29 09:38:58 2021]  cleanup_mnt+0xbd/0x160
[Thu Jul 29 09:38:58 2021]  __cleanup_mnt+0x12/0x20
[Thu Jul 29 09:38:58 2021]  task_work_run+0x9d/0xc0
[Thu Jul 29 09:38:58 2021]  exit_to_usermode_loop+0x109/0x130
[Thu Jul 29 09:38:58 2021]  do_syscall_64+0x115/0x130
[Thu Jul 29 09:38:58 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Thu Jul 29 09:38:58 2021] RIP: 0033:0x7f5c3859b189
[Thu Jul 29 09:38:58 2021] Code: 05 48 89 c7 e8 a5 eb ff ff 5a c3 31 f6 50 b8 a6 00 00 00 0f 05 48 89 c7 e8 91 eb ff ff 5a c3 48 63 f6 50 b8 a6 00 00 00 0f 05 <48> 89 c7 e8 7c eb ff ff 5a c3 49 89 ca 50 48 63 ff 4d 63 c0 b8 2f
[Thu Jul 29 09:38:58 2021] RSP: 002b:00007ffd86910310 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[Thu Jul 29 09:38:58 2021] RAX: 0000000000000000 RBX: 00007f5c38612240 RCX: 00007f5c3859b189
[Thu Jul 29 09:38:58 2021] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00007f5c38579130
[Thu Jul 29 09:38:58 2021] RBP: 00007f5c38579a20 R08: 0000000000000000 R09: 00007f5c38579a6c
[Thu Jul 29 09:38:58 2021] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5c38579510
[Thu Jul 29 09:38:58 2021] R13: 00007f5c38579130 R14: 0000000000000000 R15: 00007ffd86910470
[Thu Jul 29 09:38:58 2021] ---[ end trace 642ea95a326b1e18 ]---
crabique commented 3 years ago

This happened again, this time on another server running ZFS 0.8.1:

[Sat Jul 31 05:50:14 2021] BUG: Dentry 000000009dee592c{i=1a5de,n=handlebars}  still in use (1) [unmount of zfs zfs]
[Sat Jul 31 05:50:14 2021] ------------[ cut here ]------------
[Sat Jul 31 05:50:14 2021] WARNING: CPU: 8 PID: 76888 at /build/linux-hwe-NVnjsY/linux-hwe-5.3.0/fs/dcache.c:1596 umount_check+0x69/0x80
[Sat Jul 31 05:50:14 2021] Modules linked in: sctp binfmt_misc xt_tcpudp iptable_raw xt_CT nft_counter nft_chain_nat ipt_rpfilter xt_multiport nft_compat nf_tables ip_set_hash_ip ip_set_hash_net veth xt_physdev xt_addrtype xt_set ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_bitmap_port ip_set dummy rbd libceph ip6table_nat ip6_tables xt_MASQUERADE xt_comment xt_mark iptable_nat iptable_filter bpfilter xt_conntrack nf_nat nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo aufs overlay zfs(PO) zunicode(PO) zavl(PO) icp(PO) zlua(PO) nls_iso8859_1 zcommon(PO) znvpair(PO) spl(O) kvm_amd ccp kvm irqbypass input_leds joydev serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi br_netfilter nfsd bridge auth_rpcgss stp llc nfs_acl lockd ip_vs_sh grace ip_vs_wrr ip_vs_rr sunrpc ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq
[Sat Jul 31 05:50:14 2021]  async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd psmouse glue_helper virtio_net virtio_blk net_failover failover floppy
[Sat Jul 31 05:50:14 2021] CPU: 8 PID: 76888 Comm: umount Tainted: P        W  O      5.3.0-73-generic #69-Ubuntu
[Sat Jul 31 05:50:14 2021] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1 04/01/2014
[Sat Jul 31 05:50:14 2021] RIP: 0010:umount_check+0x69/0x80
[Sat Jul 31 05:50:14 2021] Code: 08 48 8b 46 30 48 85 c0 74 27 48 8b 50 40 51 48 c7 c7 88 96 74 99 48 89 f1 e8 16 03 e2 ff 48 c7 c7 b8 5b 70 99 e8 0a 03 e2 ff <0f> 0b 58 31 c0 c9 c3 31 d2 eb d9 66 90 66 2e 0f 1f 84 00 00 00 00
[Sat Jul 31 05:50:14 2021] RSP: 0018:ffffb3063888fd48 EFLAGS: 00010286
[Sat Jul 31 05:50:14 2021] RAX: 0000000000000024 RBX: ffff9a588808de10 RCX: 0000000000000006
[Sat Jul 31 05:50:14 2021] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9a5acfa17440
[Sat Jul 31 05:50:14 2021] RBP: ffffb3063888fd50 R08: 0000000000000ed6 R09: 0000000000000004
[Sat Jul 31 05:50:14 2021] R10: ffffb30667763d68 R11: 0000000000000001 R12: ffff9a588808dd80
[Sat Jul 31 05:50:14 2021] R13: ffff9a4fc652dca0 R14: ffff9a4fc652dc00 R15: ffff9a588808ddd8
[Sat Jul 31 05:50:14 2021] FS:  00007fc1a44ecb48(0000) GS:ffff9a5acfa00000(0000) knlGS:0000000000000000
[Sat Jul 31 05:50:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Jul 31 05:50:14 2021] CR2: 00007f44aba35fe8 CR3: 00000006726fc000 CR4: 0000000000340ee0
[Sat Jul 31 05:50:14 2021] Call Trace:
[Sat Jul 31 05:50:14 2021]  d_walk+0xd5/0x270
[Sat Jul 31 05:50:14 2021]  ? shrink_lock_dentry.part.18+0xf0/0xf0
[Sat Jul 31 05:50:14 2021]  do_one_tree+0x24/0x40
[Sat Jul 31 05:50:14 2021]  shrink_dcache_for_umount+0x2d/0x90
[Sat Jul 31 05:50:14 2021]  generic_shutdown_super+0x1f/0x120
[Sat Jul 31 05:50:14 2021]  kill_anon_super+0x12/0x30
[Sat Jul 31 05:50:14 2021]  zpl_kill_sb+0x1a/0x20 [zfs]
[Sat Jul 31 05:50:14 2021]  deactivate_locked_super+0x48/0x80
[Sat Jul 31 05:50:14 2021]  deactivate_super+0x40/0x60
[Sat Jul 31 05:50:14 2021]  cleanup_mnt+0xbd/0x160
[Sat Jul 31 05:50:14 2021]  __cleanup_mnt+0x12/0x20
[Sat Jul 31 05:50:14 2021]  task_work_run+0x9d/0xc0
[Sat Jul 31 05:50:14 2021]  exit_to_usermode_loop+0x109/0x130
[Sat Jul 31 05:50:14 2021]  do_syscall_64+0x115/0x130
[Sat Jul 31 05:50:14 2021]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Sat Jul 31 05:50:14 2021] RIP: 0033:0x7fc1a4476189
[Sat Jul 31 05:50:14 2021] Code: 05 48 89 c7 e8 a5 eb ff ff 5a c3 31 f6 50 b8 a6 00 00 00 0f 05 48 89 c7 e8 91 eb ff ff 5a c3 48 63 f6 50 b8 a6 00 00 00 0f 05 <48> 89 c7 e8 7c eb ff ff 5a c3 49 89 ca 50 48 63 ff 4d 63 c0 b8 2f
[Sat Jul 31 05:50:14 2021] RSP: 002b:00007ffc6ea67e10 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[Sat Jul 31 05:50:14 2021] RAX: 0000000000000000 RBX: 00007fc1a44ed1c0 RCX: 00007fc1a4476189
[Sat Jul 31 05:50:14 2021] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 00007fc1a44ed6f0
[Sat Jul 31 05:50:14 2021] RBP: 00007fc1a4454970 R08: 0000000000000004 R09: 00007fc1a44549bc
[Sat Jul 31 05:50:14 2021] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc1a44544f0
[Sat Jul 31 05:50:14 2021] R13: 00007fc1a44ed6f0 R14: 0000000000000000 R15: 00007ffc6ea67f70
[Sat Jul 31 05:50:14 2021] ---[ end trace 645483d18c6a6aec ]---
crabique commented 3 years ago

Meanwhile, this has happened 4 more times with similar traces, and it keeps disrupting the backup process.

I also found the stale issue #6612, which looks like the same thing; the circumstances are identical to what one of the commenters there does with rsync.

wdoekes commented 3 years ago

Looks like I got one of these:

BUG: Dentry 000000003e7c6cf3{i=b0dc,n=Functional}  still in use (1) [unmount of zfs zfs]
------------[ cut here ]------------
WARNING: CPU: 8 PID: 18839 at fs/dcache.c:1598 umount_check.cold+0x33/0x3e
Modules linked in: xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set veth ipip tunnel4 ip_tunnel xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_nat iptable_mangle xt_comment xt_mark xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfr
ack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc iptable_filter bpfilter ip6table_filter ip6_tables aufs overlay ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel nls_iso8859_1 kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glu
t_leds 8250_dw mei_me mei intel_pch_thermal ie31200_edac ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_pad acpi_power_meter acpi_tad sch_fq_codel ip_tables x_tables autofs4 zfs(PO) zunicode(PO) zlua(PO) zavl(PO) icp(PO) zcommon(PO) znvpair(PO)
 spl(O) mlx5_ib ib_uverbs ib_core hid_generic usbhid hid ast drm_vram_helper i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect mlx5_core sysimgblt crc32_pclmul nvme fb_sys_fops drm intel_lpss_pci pci_hyperv_intf ahci i2c_i801 intel_lpss tls nvme_core idma64 libahci virt_dma mlxfw wmi pinctrl_cannonlake video pinctrl_i

CPU: 8 PID: 18839 Comm: dockerd Tainted: P           O      5.4.0-81-generic #91-Ubuntu
Hardware name: Supermicro SYS-5039MC-H12TRF/X11SCE-F, BIOS 1.2 11/21/2019
RIP: 0010:umount_check.cold+0x33/0x3e
Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 46 30 48 85 c0 74 1f 48 8b 50 40 55 48 c7 c7 40 96 d8 a0 48 89 e5 51 48 89 f1 e8 12 83 ff ff <0f> 0b 58 31 c0 c9 c3 31 d2 eb e1 49 8b 44 24 28 49 8b 77 28 48 c7
RSP: 0000:ffffa23662fdbd38 EFLAGS: 00010246
RAX: 0000000000000058 RBX: ffff965f12f16c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff96666ea178c8 RDI: ffff96666ea178c8
RBP: ffffa23662fdbd40 R08: ffff96666ea178c8 R09: ffffa23640cf4020
R10: ffff96665534e6a0 R11: 0000000000000001 R12: ffff965ed75149a0
R13: ffff965f12f16c90 R14: ffff965ed7514900 R15: ffff965f12f16c58
FS:  00007f1d40ff9700(0000) GS:ffff96666ea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f84b46bd0b0 CR3: 0000000fb6bf6004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 d_walk+0xd6/0x290
 ? dentry_free+0x70/0x70
 do_one_tree+0x25/0x40
 shrink_dcache_for_umount+0x2d/0x90
 generic_shutdown_super+0x1f/0x110
 kill_anon_super+0x18/0x30
 zpl_kill_sb+0x1b/0x20 [zfs]
 deactivate_locked_super+0x3b/0x80
 deactivate_super+0x3e/0x50
 cleanup_mnt+0x109/0x160
 __cleanup_mnt+0x12/0x20
 task_work_run+0x8f/0xb0
 exit_to_usermode_loop+0x131/0x160
 do_syscall_64+0x163/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x56465d4a0ed0
Code: 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 49 c7 c2 00 00 00 00 49 c7 c0 00 00 00 00 49 c7 c1 00 00 00 00 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
RSP: 002b:000000c000d1e700 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 000000c00005e500 RCX: 000056465d4a0ed0
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c0010b8ba0
RBP: 000000c000d1e758 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: ffffffffffffffff
R13: 0000000000000020 R14: 000000000000001f R15: 0000000000000055
---[ end trace fc3b9314ae4e7463 ]---

This machine is a Kubernetes node, and one of the Docker processes has been stuck since this occurrence.

I'm afraid a lot of the logs have already been rotated away, but I don't think I would have found much in them anyway.

I do, however, see that these nodes are very heavy on data copying at times, because for some (most? all?) pods the application contents are copied to a work location on startup. (At least 9k files copied in 2 seconds, several times an hour.)

There are not ridiculously many filesystems, as far as I can tell:

# zfs list -H -r -t all | wc -l
2597

# zfs list -H -r | grep docker | wc -l
1296

# zfs list -H -r -t all | grep docker | wc -l
2568

According to k8s, the pod is Terminating:

    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False

Not sure if any unmounting has taken place. All zfs access is done through k8s/docker.

Looks like the ZFS filesystem is broken:

# timeout -s9 5s mount -t zfs local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f /mnt/
Killed

# mount | grep mnt
(void)

# mount -t zfs local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f-init /mnt/

# mount | grep mnt
local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f-init on /mnt type zfs (rw,relatime,xattr,noacl)

# ps fax | grep zfs
2698835 pts/0    D      0:00 /sbin/mount.zfs local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f /mnt -o rw
2705175 pts/0    D      0:00 /sbin/mount.zfs local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f /mnt -o rw
2711607 pts/0    D      0:00 /sbin/mount.zfs local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f /mnt -o rw
2714752 pts/0    D      0:00 /sbin/mount.zfs local-storage/docker/606e56866c3605035938c405827a500408b7fc956afe536928fe4a17148a049f /mnt -o rw

Not sure if I can find any more info.
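If it would help, one more thing I could try to capture is the kernel side of those hung mount.zfs processes (PIDs from the ps output above), e.g.:

cat /proc/2698835/stack       # in-kernel stack of one of the stuck mount.zfs PIDs
echo w > /proc/sysrq-trigger  # (needs sysrq enabled) log all blocked (D-state) task stacks to dmesg
dmesg | tail -n 200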

In the meantime, other jobs with the same specs can be (and have been) started just fine. But pruning the filesystems is broken at this point (likely because of this 606e mount that won't allow access):

# docker image prune -af --filter 'until=12h' 
Error response from daemon: a prune operation is already running

Let me know if there is somewhere where I can look to find more helpful details.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

behlendorf commented 2 years ago

Closing. This should be resolved as of the 2.0 release.