openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
473 stars 17 forks source link

kstat_hold() blocks forever #6

Open lundman opened 3 years ago

lundman commented 3 years ago

I had a deadlock with zpool import -d c:/src/diskimage.bin funk with the stack:

    nt!KeWaitForMultipleObjects+0x287   C/C++/ASM
    OpenZFS!spl_cv_wait+0x104 [C:\src\openzfs\module\os\windows\spl\spl-condvar.c @ 123]    C/C++/ASM
    OpenZFS!kstat_hold+0xaf [C:\src\openzfs\module\os\windows\spl\spl-kstat.c @ 610]    C/C++/ASM
    OpenZFS!kstat_hold_bykid+0x54 [C:\src\openzfs\module\os\windows\spl\spl-kstat.c @ 646]  C/C++/ASM
    OpenZFS!kstat_delete+0xd2 [C:\src\openzfs\module\os\windows\spl\spl-kstat.c @ 1295] C/C++/ASM
    OpenZFS!procfs_list_destroy+0x5e [C:\src\openzfs\module\os\windows\spl\spl-proc_list.c @ 141]   C/C++/ASM
    OpenZFS!spa_read_history_destroy+0x4b [C:\src\openzfs\module\zfs\spa_stats.c @ 146] C/C++/ASM
    OpenZFS!spa_stats_destroy+0x3b [C:\src\openzfs\module\zfs\spa_stats.c @ 1013]   C/C++/ASM
    OpenZFS!spa_remove+0x3c3 [C:\src\openzfs\module\zfs\spa_misc.c @ 816]   C/C++/ASM
    OpenZFS!spa_tryimport+0x5f7 [C:\src\openzfs\module\zfs\spa.c @ 6222]    C/C++/ASM
    OpenZFS!zfs_ioc_pool_tryimport+0x76 [C:\src\openzfs\module\zfs\zfs_ioctl.c @ 1646]  C/C++/ASM

Where we are stuck here:

kstat_hold(avl_tree_t *t, ekstat_t *template)
{
    kstat_t *ksp;
    ekstat_t *e;

    mutex_enter(&kstat_chain_lock);
    for (;;) {
        ksp = avl_find(t, template, NULL);
        if (ksp == NULL)
            break;
        e = (ekstat_t *)ksp;
        if (e->e_owner == NULL) {
            e->e_owner = (void *)curthread;
            break;
        }
        cv_wait(&e->e_cv, &kstat_chain_lock);
    }

e->e_owner is the value of curthread already, so it is waiting for ourselves to wake us up. This could suggest a leak in kstat_hold somewhere,. perhaps from procfs_list_destroy - code we got from FreeBSD. Check and see if they have fixed anything in this area.

datacore-rm commented 2 years ago

Later commits done in spl-proc_list.c/procfs_list_destroy() could resolve this issue?

lundman commented 2 years ago

I think it is a bit racey, between mutex_exit and destroy. I've not come across it in quite a while, so hoping other things fixed it. Have you come across it?

datacore-rm commented 2 years ago

I have not come across this issue anytime. Thanks.

datacore-rm commented 2 years ago

In kstat_install() and kstat_delete() there is below check where there is no corresponding kstat_release(). Can there be any chance that kstat_hold_bykid() returns some another not null ksp?

if (kstat_hold_bykid(ksp->ks_kid, zoneid) != ksp) { cmn_err(CE_WARN, "kstat_delete(%p): does not exist", (void *)ksp); return; }