openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
402 stars 15 forks source link

Bug check: Re-initializing IO Remove Lock #358

Closed datacore-rm closed 3 months ago

datacore-rm commented 3 months ago
5: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught.  This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arguments:
Arg1: 00000000000000d7, Re-initializing Remove Lock.
Arg2: ffffe70244be9570, Address of the chk build Remove Lock structure used internally by Verifier.
Arg3: ffffe702000e4a28, Address of the Remove Lock structure specified by the driver.
    A Remove Lock cannot be re-initialized, even after calling
    IoReleaseRemoveLockAndWait, because other threads might still be
    using that lock (by calling IoAcquireRemoveLock). The driver should
    allocate the Remove Lock inside its device extension, and initialize
    it a single time. The lock will be deleted together with the device
    extension.
Arg4: 0000000000000000

VIRTUAL_MACHINE:  VMware

BUGCHECK_CODE:  c4

BUGCHECK_P1: d7

BUGCHECK_P2: ffffe70244be9570

BUGCHECK_P3: ffffe702000e4a28

BUGCHECK_P4: 0

PROCESS_NAME:  zfs.exe

STACK_TEXT:  
ffffaa86`b9413018 fffff802`08272eb4 : 00000000`000000c4 00000000`000000d7 ffffe702`44be9570 ffffe702`000e4a28 : nt!KeBugCheckEx
ffffaa86`b9413020 fffff802`08278098 : 00000000`00000001 00000000`00000000 00000000`00000020 ffffe702`000e4a28 : nt!VerifierBugCheckIfAppropriate+0xe0
ffffaa86`b9413060 fffff802`e2be301d : 00000000`00000000 00000000`00000000 ffffe702`00f17a00 fffff802`07a20015 : nt!VerifierIoInitializeRemoveLockEx+0x78
ffffaa86`b94130d0 fffff802`e2bd3188 : ffffe702`0000ba80 ffffe701`fc490740 ffffe701`fc490760 fffff802`07e00010 : ZFSin!wzvol_assign_targetid+0x4d [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_windows_zvol_scsi.c @ 169] 
ffffaa86`b9413130 fffff802`e2bd3e40 : 00000000`00000003 ffffe702`000e3878 ffffe702`005bc000 00000304`b6c70013 : ZFSin!zvol_os_attach+0xe8 [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zvol_os.c @ 636] 
ffffaa86`b9413230 fffff802`e2acb1fd : ffffe702`000e3878 00000000`00000000 00000304`b6c709d0 ffffaa86`b9413400 : ZFSin!zvol_os_create_minor+0x2f0 [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zvol_os.c @ 802] 
ffffaa86`b94132c0 fffff802`e2aafda8 : 00000000`00000000 00000000`00000000 ffffe702`0df238a0 ffffaa86`b9413450 : ZFSin!dmu_objset_create+0xbd [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\dmu_objset.c @ 1323] 
ffffaa86`b9413390 fffff802`e2aac1c2 : 00000000`00000000 ffffe701`40000010 ffffe702`5d3eaea0 fffff802`e2a94e7c : ZFSin!zfs_ioc_create+0x2c8 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zfs_ioctl.c @ 3460] 
ffffaa86`b9413470 fffff802`e2bcca44 : 00000000`00000000 ffffaa86`b94135f0 00000000`00000000 ffffe701`fc47e648 : ZFSin!zfsdev_ioctl_common+0x312 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zfs_ioctl.c @ 7804] 
ffffaa86`b9413510 fffff802`e2bbfcef : ffffaa86`b9414000 ffffaa86`b94135f0 ffffe702`5d3eaf70 00000000`00000000 : ZFSin!zfsdev_ioctl+0xc4 [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_ioctl_os.c @ 944] 
ffffaa86`b9413590 fffff802`e2bbf4cd : ffff11f7`9879402b ffffe702`54c24118 fffff802`07ebdda9 fffff802`08279a71 : ZFSin!ioctlDispatcher+0x54f [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_vnops_windows.c @ 4606] 
ffffaa86`b9413660 fffff802`07bb4407 : ffffe702`5d3eafb8 fffff802`0827322e ffffe702`00000001 ffffe702`00000001 : ZFSin!dispatcher+0x1fd [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_vnops_windows.c @ 5529] 
ffffaa86`b9413770 fffff802`08267458 : ffffe702`5d3eaea0 ffffe701`fca0bbb0 ffffe702`5d3eaea0 fffff802`0827391e : nt!IopfCallDriver+0x53
ffffaa86`b94137b0 fffff802`07c4f00d : ffffe702`5d3eaea0 00000000`00000002 ffffe701`fca0bbb0 ffffe702`513c3040 : nt!IovCallDriver+0x264
ffffaa86`b94137f0 fffff802`07ebdda9 : 00000000`00000000 ffffe702`5d3eaea0 ffffaa86`b9413b60 fffff802`08266dc5 : nt!IofCallDriver+0x1cebcd
ffffaa86`b9413830 fffff802`07e86491 : ffffe702`00000000 ffffaa86`b9413b60 00000000`00010000 ffffaa86`b9413b60 : nt!IopSynchronousServiceTail+0x189
ffffaa86`b94138d0 fffff802`07e86536 : 00007ff6`dcdbd450 00000000`00000000 00000000`00000000 00000000`00000000 : nt!IopXxxControlFile+0xc61
ffffaa86`b9413a00 fffff802`07c3cb35 : 00000000`00000000 00000000`00000000 00000000`00000000 0000007f`d032d448 : nt!NtDeviceIoControlFile+0x56
ffffaa86`b9413a70 00007ffe`2a7ff8e4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25
0000007f`d032da88 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x00007ffe`2a7ff8e4

FAULTING_SOURCE_LINE:  C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_windows_zvol_scsi.c

FAULTING_SOURCE_FILE:  C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_windows_zvol_scsi.c

FAULTING_SOURCE_LINE_NUMBER:  169

FAULTING_SOURCE_CODE:  
   165:     }
   166: 
   167:     IoInitializeRemoveLock(pIoRemLock, 'KLRZ', 0, 0);
   168: 
>  169:     if (STATUS_SUCCESS != IoAcquireRemoveLock(pIoRemLock, zv)) {
   170:         dprintf("ZFS: Unable to assign targetid - can't acquire "
   171:             "the remlock.\n");
   172:         ASSERT("Unable to assign targetid - can't acquire remlock.");
   173:     } else {
   174:         for (uint8_t l = 0;

SYMBOL_NAME:  ZFSin!wzvol_assign_targetid+4d
datacore-rm commented 3 months ago

zvol_assign_targetid() allocates memory for the IO_REMOVE_LOCK structure from kmem cache. Once it is freed, it can allocate the same cached memory in the next allocation call for a new zvol. Seems like this is making the driver verifier to flag that the same IO_REMOVE_LOCK structure is reinitialized second time.

lundman commented 3 months ago

The docs say

After the IoReleaseRemoveLockAndWait routine returns, the driver should consider the device to be in a 
state in which it is ready to be removed and cannot perform I/O operations. Therefore, the driver must 
not call IoInitializeRemoveLock to re-initialize the remove lock. Violation of this rule while the driver 
is being verified by [Driver Verifier]
(https://learn.microsoft.com/en-us/windows-hardware/drivers/what-s-new-in-driver-development) 
will result in a bug check.

So the PIO_REMOVE_LOCK should be used with the driver, and driver unloading, not with a device unloading? Doesn't sound like you can re-allocate remove locks?

datacore-rm commented 3 months ago

This issue can be repro by creating and destroying ZVOL multiple times in a loop. Added debug logs to print the allocation and freed pointers. When kmem cache re-allocates the memory, we get this crash.

for /l %%x in (1, 1, 500) do (
    echo %%x
    zfs.exe create -s -V 1TB tank/testzvol
    zfs destroy tank/testzvol
)

It is documented in the code file zfs_windows_zvol_scsi.c programming notes: the remove lock must be dynamically allocated because it cannot be reinitialized.

So I will test by replacing it with ExAllocatePoolWithTag().

lundman commented 3 months ago

Could just memset the memory of the removelock. kmem_cache will not zero memory (for speed).

datacore-rm commented 3 months ago

Tried memset() and also below test code. Still it bugchecked.

    PIO_REMOVE_LOCK pIoRemLock1 = kmem_zalloc(sizeof(*pIoRemLock1), KM_SLEEP);

    if (!pIoRemLock1) {
        return (0);
    }

    IoInitializeRemoveLock(pIoRemLock1, NULL, 0, 0);

    if (STATUS_SUCCESS != IoAcquireRemoveLock(pIoRemLock1, NULL)) {
        return (0);
    }

    IoReleaseRemoveLockAndWait(pIoRemLock1, NULL);

      //mimic kmem_zalloc() has returned same cached object
    //kmem_free(pIoRemLock1, sizeof(*pIoRemLock1));
    //pIoRemLock1 = kmem_zalloc(sizeof(*pIoRemLock1), KM_SLEEP);

       bzero(pIoRemLock1, sizeof(*pIoRemLock1));
    IoInitializeRemoveLock(pIoRemLock1, NULL, 0, 0);

In Bugcheck summary Arg2: description suggests that DV creates a shadow remove lock structure internally for tracking the actual remove lock.

5: kd> !pool ffffe70244be9570
Pool page ffffe70244be9570 region is Nonpaged pool
 ffffe70244be90b0 size:   c0 previous size:    0  (Allocated)  WfpL
 ffffe70244be9170 size:   c0 previous size:    0  (Allocated)  WfpL
 ffffe70244be9230 size:   c0 previous size:    0  (Allocated)  WfpL
 ffffe70244be92f0 size:   c0 previous size:    0  (Allocated)  WfpL
 ffffe70244be93b0 size:   c0 previous size:    0  (Allocated)  WfpL
 ffffe70244be9470 size:   c0 previous size:    0  (Allocated)  VfAT
*ffffe70244be9530 size:   c0 previous size:    0  (Allocated) *VfAT
        Pooltag VfAT : Verifier AVL trees, Binary : nt!Vf
 ffffe70244be95f0 size:   c0 previous size:    0  (Allocated)  WfpL

Tested by allocating using ExAllocatePoolWithTag(). When same memory was re-allocated, DV did not complain. I will test more and create a PR.