Open satabdidas opened 2 years ago
2.1.3 and 2.1.4 contain a couple of fixes for deadlocks, you might want to try 2.1.4 and see if it magically goes away.
We reproduced this issue with 2.1.3. Trying with 2.1.4 is a good idea. Will try. Thanks
We are seeing the issue with ZFS 2.1.4 as well.
We upgraded the Linux kernel to 5.15.44 and used ZFS 2.1.4. We see almost the same stack traces, but most importantly we see memory corruption -
Jun 14 18:59:17 ip-198-19-23-132 kernel: list_add corruption. prev->next should be next (ffff000592c924c8), but was 0000000000000000. (prev=ffff0005c22ce870).
Jun 14 18:59:17 ip-198-19-23-132 kernel: ------------[ cut here ]------------
Jun 14 18:59:17 ip-198-19-23-132 kernel: kernel BUG at lib/list_debug.c:26!
Jun 14 18:59:18 ip-198-19-23-132 kernel: Internal error: Oops - BUG: 0 [#1] SMP
Jun 14 18:59:18 ip-198-19-23-132 kernel: Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss nfs_acl lockd grace raid0 dm_raid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq xt_LOG nf_log_syslog xt_limit iptable_filter iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_security bpfilter bcache crc64 sunrpc nls_ascii nls_cp437 vfat fat dm_mirror dm_region_hash dm_log dm_mod dax aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce gf128mul sha2_ce sha256_arm64 sha1_ce button zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) nbd binfmt_misc ena efivarfs
Jun 14 18:59:18 ip-198-19-23-132 kernel: CPU: 13 PID: 9342 Comm: nfsd Tainted: P OE 5.15.44-21.123.amzn2.aarch64 #1
Jun 14 18:59:18 ip-198-19-23-132 kernel: Hardware name: Amazon EC2 c6gn.16xlarge/, BIOS 1.0 11/1/2018
Jun 14 18:59:18 ip-198-19-23-132 kernel: pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jun 14 18:59:18 ip-198-19-23-132 kernel: pc : __list_add_valid+0x88/0x8c
Jun 14 18:59:18 ip-198-19-23-132 kernel: lr : __list_add_valid+0x88/0x8c
Jun 14 18:59:18 ip-198-19-23-132 kernel: sp : ffff80000ea6b600
Jun 14 18:59:18 ip-198-19-23-132 kernel: x29: ffff80000ea6b600 x28: 0000000000000003 x27: ffff800000f21290
Jun 14 18:59:18 ip-198-19-23-132 kernel: x26: 0000000000000000 x25: ffff0005c22ce870 x24: ffff000592c924c8
Jun 14 18:59:18 ip-198-19-23-132 kernel: x23: ffff000592c924c4 x22: ffff80000ea6b668 x21: 0000000000000002
Jun 14 18:59:18 ip-198-19-23-132 kernel: x20: ffff000592c924b0 x19: ffffffffffffff00 x18: 0000000000000030
Jun 14 18:59:18 ip-198-19-23-132 kernel: x17: 3863343239633239 x16: 3530303066666666 x15: 0720072007200720
Jun 14 18:59:18 ip-198-19-23-132 kernel: x14: 072e072907300737 x13: 0720072007200720 x12: 072e072907300737
Jun 14 18:59:18 ip-198-19-23-132 kernel: x11: ffff800009de8ed8 x10: 00000000ffff0000 x9 : ffff800008237338
Jun 14 18:59:18 ip-198-19-23-132 kernel: x8 : 0000000000000000 x7 : ffff800009868ed8 x6 : 000000000002fffd
Jun 14 18:59:18 ip-198-19-23-132 kernel: x5 : 0000000000000000 x4 : ffff00226fc091c8 x3 : 0000000000000000
Jun 14 18:59:18 ip-198-19-23-132 kernel: x2 : 0000000000000000 x1 : ffff000422f1c180 x0 : 0000000000000075
Jun 14 18:59:18 ip-198-19-23-132 kernel: Call trace:
Jun 14 18:59:18 ip-198-19-23-132 kernel: __list_add_valid+0x88/0x8c
Jun 14 18:59:18 ip-198-19-23-132 kernel: rwsem_down_read_slowpath+0xe4/0x4d0
Jun 14 18:59:18 ip-198-19-23-132 kernel: down_read+0x4c/0xd0
Jun 14 18:59:18 ip-198-19-23-132 kernel: zap_lockdir_impl+0x29c/0x340 [zfs]
Jun 14 18:59:18 ip-198-19-23-132 kernel: zap_lockdir+0xc0/0xe0 [zfs]
Jun 14 18:59:18 ip-198-19-23-132 kernel: zap_lookup+0x68/0x118 [zfs]
Jun 14 18:59:18 ip-198-19-23-132 kernel: zfs_id_overblockquota+0xcc/0x168 [zfs]
Jun 14 18:59:18 ip-198-19-23-132 kernel: zfs_write+0x238/0xabc [zfs]
Jun 14 18:59:18 ip-198-19-23-132 kernel: zpl_iter_write+0xf8/0x180 [zfs]
Jun 14 18:59:18 ip-198-19-23-132 kernel: do_iter_readv_writev+0x108/0x1ac
Jun 14 18:59:18 ip-198-19-23-132 kernel: do_iter_write+0x98/0x188
Jun 14 18:59:18 ip-198-19-23-132 kernel: vfs_iter_write+0x28/0x3c
Jun 14 18:59:18 ip-198-19-23-132 kernel: nfsd_vfs_write+0x2b0/0x518 [nfsd]
Jun 14 18:59:18 ip-198-19-23-132 kernel: nfsd4_write+0x154/0x1e4 [nfsd]
Jun 14 18:59:18 ip-198-19-23-132 kernel: nfsd4_proc_compound+0x47c/0x6b0 [nfsd]
Jun 14 18:59:18 ip-198-19-23-132 kernel: nfsd_dispatch+0x108/0x238 [nfsd]
Jun 14 18:59:18 ip-198-19-23-132 kernel: svc_process_common+0x4a4/0x6f0 [sunrpc]
Jun 14 18:59:18 ip-198-19-23-132 kernel: svc_process+0x88/0xe0 [sunrpc]
Jun 14 18:59:18 ip-198-19-23-132 kernel: nfsd+0x100/0x15c [nfsd]
Jun 14 18:59:18 ip-198-19-23-132 kernel: kthread+0x114/0x118
I looked at the call chain from rwsem_down_read_slowpath down to __list_add_valid:

rwsem_down_read_slowpath
  |_ rwsem_add_waiter
    |_ list_add_tail
      |_ __list_add
        |_ __list_add_valid

__list_add_valid throws the error because prev->next in the list does not point to the node it is being linked against. The call chain above is trying to add a waiter entry to krwlock_t->rw_rwlock->wait_list, and the kernel detects memory corruption in that wait_list.
This could be a kernel issue or a ZFS issue or something else.
I am looking for a way to debug ZFS and find out whether the memory corruption was caused by ZFS. Is there any way I can do that?
I don't recall any exciting fixes that would be relevant, but I'd suggest trying on 2.1.5 before you do much else.
You could try playing with kASAN, though I don't know if it'd catch this kind of problem, depending on precisely what it's doing. I believe that's supported on aarch64...
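For reference, a generic-KASAN kernel build typically needs options along these lines (a sketch; option names are from the upstream Kconfig, so verify them against your kernel tree and version):

```
# Hypothetical .config fragment for generic KASAN on arm64; exact option
# availability depends on the kernel version.
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y
```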
I tried KASAN with the 5.10 kernel but it didn't report anything related. I'll try 2.1.5.
Here's what I got from kasan -
[Thu Jun 2 14:16:25 2022] ==================================================================
[Thu Jun 2 14:16:25 2022] BUG: KASAN: slab-out-of-bounds in raid_status+0x64/0x1120 [dm_raid]
[Thu Jun 2 14:16:25 2022] Read of size 4 at addr ffff0004b29b8348 by task vgs/3983
[Thu Jun 2 14:16:25 2022] CPU: 20 PID: 3983 Comm: vgs Tainted: P OE 5.10.112-108.499.amzn2.aarch64 #1
[Thu Jun 2 14:16:25 2022] Hardware name: Amazon EC2 c6gn.16xlarge/, BIOS 1.0 11/1/2018
[Thu Jun 2 14:16:25 2022] Call trace:
[Thu Jun 2 14:16:25 2022] dump_backtrace+0x0/0x300
[Thu Jun 2 14:16:25 2022] show_stack+0x24/0x30
[Thu Jun 2 14:16:25 2022] dump_stack+0x100/0x170
[Thu Jun 2 14:16:25 2022] print_address_description.constprop.0+0x68/0x20c
[Thu Jun 2 14:16:25 2022] __kasan_report+0xe0/0x138
[Thu Jun 2 14:16:25 2022] kasan_report+0x44/0xd0
[Thu Jun 2 14:16:25 2022] __asan_load4+0x94/0xcc
[Thu Jun 2 14:16:25 2022] raid_status+0x64/0x1120 [dm_raid]
[Thu Jun 2 14:16:25 2022] retrieve_status+0xd4/0x2c8 [dm_mod]
[Thu Jun 2 14:16:25 2022] table_status+0x140/0x1e0 [dm_mod]
[Thu Jun 2 14:16:25 2022] ctl_ioctl+0x240/0x448 [dm_mod]
[Thu Jun 2 14:16:25 2022] dm_ctl_ioctl+0x1c/0x38 [dm_mod]
[Thu Jun 2 14:16:25 2022] __arm64_sys_ioctl+0xdc/0x120
[Thu Jun 2 14:16:25 2022] el0_svc_common.constprop.0+0xbc/0x308
[Thu Jun 2 14:16:25 2022] do_el0_svc+0x90/0xb8
[Thu Jun 2 14:16:25 2022] el0_svc+0x20/0x30
[Thu Jun 2 14:16:25 2022] el0_sync_handler+0x9c/0x120
[Thu Jun 2 14:16:25 2022] Allocated by task 3757:
[Thu Jun 2 14:16:25 2022] kasan_save_stack+0x28/0x60
[Thu Jun 2 14:16:25 2022] __kasan_kmalloc.constprop.0+0xc8/0xe8
[Thu Jun 2 14:16:25 2022] kasan_kmalloc+0x10/0x1c
[Thu Jun 2 14:16:25 2022] kmem_cache_alloc_trace+0x1bc/0x328
[Thu Jun 2 14:16:25 2022] create_strip_zones+0xb4/0xf40 [raid0]
[Thu Jun 2 14:16:25 2022] raid0_run+0x358/0x3f4 [raid0]
[Thu Jun 2 14:16:25 2022] md_run+0x588/0xecc
[Thu Jun 2 14:16:25 2022] raid_ctr+0x638/0x15b4 [dm_raid]
[Thu Jun 2 14:16:25 2022] dm_table_add_target+0x284/0x580 [dm_mod]
[Thu Jun 2 14:16:25 2022] table_load+0x21c/0x4a8 [dm_mod]
[Thu Jun 2 14:16:25 2022] ctl_ioctl+0x240/0x448 [dm_mod]
[Thu Jun 2 14:16:25 2022] dm_ctl_ioctl+0x1c/0x38 [dm_mod]
[Thu Jun 2 14:16:25 2022] __arm64_sys_ioctl+0xdc/0x120
[Thu Jun 2 14:16:25 2022] el0_svc_common.constprop.0+0xbc/0x308
[Thu Jun 2 14:16:25 2022] do_el0_svc+0x90/0xb8
[Thu Jun 2 14:16:25 2022] el0_svc+0x20/0x30
[Thu Jun 2 14:16:25 2022] el0_sync_handler+0x9c/0x120
[Thu Jun 2 14:16:25 2022] The buggy address belongs to the object at ffff0004b29b8300
which belongs to the cache kmalloc-128 of size 128
[Thu Jun 2 14:16:25 2022] The buggy address is located 72 bytes inside of
128-byte region [ffff0004b29b8300, ffff0004b29b8380)
[Thu Jun 2 14:16:25 2022] The buggy address belongs to the page:
[Thu Jun 2 14:16:25 2022] page:000000003e79ac1c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f29b8
[Thu Jun 2 14:16:25 2022] head:000000003e79ac1c order:1 compound_mapcount:0
[Thu Jun 2 14:16:25 2022] flags: 0x2fffff0000010200(slab|head)
[Thu Jun 2 14:16:25 2022] raw: 2fffff0000010200 dead000000000100 dead000000000122 ffff0003c000fc00
[Thu Jun 2 14:16:25 2022] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
[Thu Jun 2 14:16:25 2022] page dumped because: kasan: bad access detected
One more point to mention: when we turn off ZSTD, the frametest workload runs fine, but when we turn it on, we see the hang within a few seconds.
Tried running with zfs 2.1.5 and I could still recreate the deadlock with frametest.
It may change the behavior because of timing and extra memory use, but if you create a debug build (./configure --enable-debug) and, once the module is loaded but before running the test, run
echo "1" > /sys/module/zfs/parameters/reference_tracking_enable
then ZFS will record the name of the function that took the rwlock but never released it. Examine rc_list using crash.
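End to end, the suggested workflow might look like this (a sketch; the commands assume a standard OpenZFS source checkout and root privileges, and may differ on your system):

```shell
# Sketch of the debug-build workflow described above.
./configure --enable-debug
make -j"$(nproc)"
sudo make install

sudo modprobe zfs
# Enable reference tracking before starting the workload:
echo 1 | sudo tee /sys/module/zfs/parameters/reference_tracking_enable
# ...run the frametest workload, then inspect rc_list from the crash dump.
```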
Thank you for the pointer. Will do.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information

Describe the issue you are experiencing
Exposing ZFS on Linux over NFS on an AWS EC2 instance and doing rsync (or another sustained write-only load) appears to intermittently lead to a deadlock on zap->zap_rwlock. The server becomes unresponsive and we can't ssh to it; we have to reboot the host to bring it back online.

Describe how to reproduce the issue
Run cp commands in parallel from the client. It takes some time, but the host surely becomes unresponsive.

Including any warning/errors/backtraces from the system logs
We see the following stack trace when the issue occurs -
By commenting out the call to zfs_id_overblockquota from zfs_write, the deadlock does not reproduce (with the likely side effect that quota handling will be broken).

From the crash dump we suspect that another process is the owner of the lock zap_rwlock and never released it. Here is how we arrived at that suspicion. We inspected the zap_rwlock structure, which is of type krwlock_t; here is how we reached it.

Here's a snippet of bt -FFx output -

We then inspected ffff0017ab8ced00 as a dmu_buf_impl_t.

Then we inspected db_user, since it gets assigned to zap here - zap_t *zap = dmu_buf_get_user(db);

Then we checked the owner field, which is -281453659007423 (0xFFFF0004F6A25641), and we found a task for 0xFFFF0004F6A25640 -

And here's the current stack trace of the above task -

Since the above stack trace is not within any ZFS functions, we suspect this thread took the lock and exited without releasing it.
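The inspection steps above could, for example, be retraced in a crash(8) session along these lines (a hypothetical sketch; the addresses are the ones quoted above, and the struct names assume ZFS module debuginfo has been loaded):

```
crash> mod -S                                  # load module debuginfo (zfs)
crash> bt -FFx                                 # stacks with raw frame contents
crash> struct dmu_buf_impl ffff0017ab8ced00    # the dbuf; note its db_user field
crash> struct zap <db_user address>            # zap_t, which embeds zap_rwlock
```

As a side note, recent kernels use the low bits of the rw_semaphore owner field as flag bits (bit 0 marks reader-owned), which would explain why the owner value ends in 0x41 while the matching task_struct was found at the address ending in 0x40.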