openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
494 stars 18 forks source link

Driver load BSOD with Aomei driver #384

Open lundman opened 5 months ago

lundman commented 5 months ago

I have been trying to find out why loading openfs.sys will BSOD if Aomei's BackUpper is installed. The crash is generally not instant, but a few second (1-10s) later. The stack is indicative of stack overflow, or list corruption and never related to either driver. Usually when receiving network IO.

I do not think there is a bug in Aomei, but rather simply triggered by them querying openzfs.sys.

1) However, I can ignore all Driver->MajorFunction calls from Aomei, and we will STILL crash. So it isn't anything returned in dispatcher. But what else is there?

2) If I do not call StorPortInitialize() we do not crash. I do not think there is a storport issue, but rather that it creates a device type that Aomei is interested in, and if we do not have storport, whatever code that triggers the BSOD is just not executed.

3) It doesn't always BSOD. Of 5 loads, it crashes about 4 times. Sometimes it loads fine and I can use it, benchmark etc. So presumably a race somewhere, but not to calling MajorFunctions.

I am unsure what to do next, a lot of things recommend verifier - but this is a driver load issue, so I do not know if that will help.

I doubt Aomei developers are interested in debugging.

Example stack of BSOD:

!analyze -v

``` BAD_POOL_CALLER (c2) The current thread is making a bad pool request. Typically this is at a bad IRQL level or double freeing the same allocation, etc. Arguments: Arg1: 000000000000000d, Attempt to release quota on a corrupted pool allocation. Arg2: ffffe3086bb6b5c0, Address of pool Arg3: 00000000fffff807, Pool allocation's tag Arg4: 7ac30bb7e5c716f2, Quota process pointer (bad). Debugging Details: ------------------ BUGCHECK_CODE: c2 BUGCHECK_P1: d BUGCHECK_P2: ffffe3086bb6b5c0 BUGCHECK_P3: fffff807 BUGCHECK_P4: 7ac30bb7e5c716f2 PROCESS_NAME: System STACK_TEXT: : ffffbe8b`fd852e00 fffff807`38d12870 ffff8b01`24367180 00000000`00000001 : nt!DbgBreakPointWithStatus : ffff8b01`00000003 ffffbe8b`fd852e00 fffff807`38e4e9f0 00000000`000000c2 : nt!KiBugCheckDebugBreak+0x12 : 00006504`000a0000 ffffbe8b`fd853680 7ac30bb7`e5c716f2 fffff807`394f42ba : nt!KeBugCheck2+0xba3 : 00000000`000000c2 00000000`0000000d ffffe308`6bb6b5c0 00000000`fffff807 : nt!KeBugCheckEx+0x107 : ffffe308`6bb6b5b0 ffffe308`fffff807 00000000`00000d40 00000000`00000000 : nt!ExReturnPoolQuota+0x222647 : ffffe308`00000000 ffffbe8b`fd853620 ffffe308`00000000 00000000`00000000 : nt!ExFreePoolWithTag+0x1e5 : fffff807`3b4732fa 00000000`00000000 ffffe308`66c08fe0 ffffe308`61886210 : NETIO!WfpPoolFree+0x1a : ffffe308`6bb6b5c0 ffffe308`6aee2568 ffffe308`6b4fae90 ffffe308`667a3a88 : NETIO!StreamDerefCalloutContext+0x2f : ffffe308`6b9fb4b0 ffffe308`6aee2568 00000000`00000125 ffffe308`6e62e130 : NETIO!StreamDeleteCalloutContext+0x76 : ffffe308`6b9fb4b0 ffffe308`6aee2568 ffffe308`6b4fae90 00000000`00000000 : NETIO!WfpNotifyFlowContextDelete+0x34 : ffffbcd2`2a00ff00 ffffe308`6b4fae90 ffffe308`6aee2520 ffffbe8b`fd853810 : NETIO!KfdAleNotifyFlowDeletion+0x1ae : ffffe308`67066010 ffffe308`67066010 ffffe308`66826500 00000000`00000000 : tcpip!TcpCleanupTcbWorkQueueRoutine+0x15a : ffffe308`66997cc0 fffff807`3b911d80 ffffe308`66997cc0 fffff807`38cf50ff : tcpip!TcpCloseTcbWorkQueueRoutine+0x11b : ffffe308`66a86260 ffffe308`61f6e650 00000000`00000000 ffffe308`00000000 : NETIO!NetiopIoWorkItemRoutine+0x57 : ffffe308`618a3c90 ffffe308`6b47c040 ffffbe8b`fd853b00 ffffe308`00000000 : nt!IopProcessWorkItem+0x8e : ffffe308`6b47c040 00000000`0000076e ffffe308`6b47c040 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155 : fffff807`37795180 ffffe308`6b47c040 fffff807`38d1b520 5b5b5b5b`5b5b5b5b : nt!PspSystemThreadStartup+0x57 : ffffbe8b`fd854000 ffffbe8b`fd84e000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34 SYMBOL_NAME: NETIO!WfpPoolFree+1a MODULE_NAME: NETIO IMAGE_NAME: NETIO.SYS ```

lundman commented 5 months ago
BSOD type 2

``` KERNEL_SECURITY_CHECK_FAILURE (139) A kernel component has corrupted a critical data structure. The corruption could potentially allow a malicious user to gain control of this machine. Arguments: Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove). Arg2: ffffbe8bfd845600, Address of the trap frame for the exception that caused the BugCheck Arg3: ffffbe8bfd845558, Address of the exception record for the exception that caused the BugCheck Arg4: 0000000000000000, Reserved Debugging Details: ------------------ BUGCHECK_CODE: 139 BUGCHECK_P1: 3 BUGCHECK_P2: ffffbe8bfd845600 BUGCHECK_P3: ffffbe8bfd845558 BUGCHECK_P4: 0 TRAP_FRAME: ffffbe8bfd845600 -- (.trap 0xffffbe8bfd845600) NOTE: The trap frame does not contain all registers. Some register values may be zeroed or incorrect. rax=ffffe3086b309310 rbx=0000000000000000 rcx=0000000000000003 rdx=ffffe3086bd4a780 rsi=0000000000000000 rdi=0000000000000000 rip=fffff80738d20b8c rsp=ffffbe8bfd845790 rbp=ffffe308667a3d40 r8=ade4b591e32e688b r9=ffffe3086674d920 r10=0000000000000000 r11=0000000000007070 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei pl nz na pe nc nt!RtlRemoveEntryHashTable+0x4c: fffff807`38d20b8c cd29 int 29h Resetting default scope EXCEPTION_RECORD: ffffbe8bfd845558 -- (.exr 0xffffbe8bfd845558) ExceptionAddress: fffff80738d20b8c (nt!RtlRemoveEntryHashTable+0x000000000000004c) ExceptionCode: c0000409 (Security check failure or stack buffer overrun) ExceptionFlags: 00000001 NumberParameters: 1 Parameter[0]: 0000000000000003 Subcode: 0x3 FAST_FAIL_CORRUPT_LIST_ENTRY PROCESS_NAME: System ERROR_CODE: (NTSTATUS) 0xc0000409 - The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application. EXCEPTION_CODE_STR: c0000409 EXCEPTION_PARAMETER1: 0000000000000003 EXCEPTION_STR: 0xc0000409 STACK_TEXT: : ffffbe8b`fd844c30 fffff807`38d12870 fffff807`37795180 00000000`00000001 : nt!DbgBreakPointWithStatus : fffff807`00000003 ffffbe8b`fd844c30 fffff807`38e4e9f0 00000000`00000139 : nt!KiBugCheckDebugBreak+0x12 : 00000000`2e7af43a 00000000`00000000 ffffe308`6bd4a7a0 fffff807`38e9adbc : nt!KeBugCheck2+0xba3 : 00000000`00000139 00000000`00000003 ffffbe8b`fd845600 ffffbe8b`fd845558 : nt!KeBugCheckEx+0x107 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x69 : ffffe308`61202140 00000000`00000000 00000000`00000000 fffff807`38c78349 : nt!KiFastFailDispatch+0xb2 : ffffe308`66c06fe0 00000000`00000001 ffffbe8b`fd845850 fffff807`3b473349 : nt!KiRaiseSecurityCheckFailure+0x346 : ffffe308`66c06fe0 badbadfa`badbadfa ffffbe8b`fd845850 fffff807`3b603747 : nt!RtlRemoveEntryHashTable+0x4c : ffffbe8b`fd845878 ffffbe8b`fd8458a9 00000000`00000000 00000000`00000000 : NETIO!FreeSomeCacheBucketEntries+0x7d : ffffe308`667a38e0 00000000`00000000 00000000`00000400 00000000`00000000 : NETIO!FreeSomeCacheEntries+0x3e : ffffe308`66a88640 00000000`0000ff00 00000000`00000000 fffff807`38ce0a70 : NETIO!UpdateCacheLruBucket+0x828 : ffffe308`66000000 ffffe308`667a3de8 00000000`00000000 ffffe308`6688ad80 : NETIO!LruCleanupWorkItemRoutine+0x9b : ffffe308`66a88640 ffffe308`66fcc520 00000000`00000000 ffffe308`00000000 : NETIO!NetiopIoWorkItemRoutine+0x57 : ffffe308`618a3c90 ffffe308`6aed3040 ffffbe8b`fd845b00 00796c70`00000000 : nt!IopProcessWorkItem+0x8e : ffffe308`6aed3040 00000000`000004f0 ffffe308`6aed3040 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155 : fffff807`37795180 ffffe308`6aed3040 fffff807`38d1b520 00000000`00000000 : nt!PspSystemThreadStartup+0x57 : ffffbe8b`fd846000 ffffbe8b`fd840000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34 SYMBOL_NAME: NETIO!FreeSomeCacheBucketEntries+7d MODULE_NAME: NETIO IMAGE_NAME: NETIO.SYS ```

lundman commented 5 months ago

verifier spat this out though:

ATTEMPTED_WRITE_TO_READONLY_MEMORY

``` ATTEMPTED_WRITE_TO_READONLY_MEMORY (be) An attempt was made to write to readonly memory. The guilty driver is on the stack trace (and is typically the current instruction pointer). When possible, the guilty driver's name (Unicode string) is printed on the BugCheck screen and saved in KiBugCheckDriver. Arguments: Arg1: fffff8073bbfd740, Virtual address for the attempted write. Arg2: 00000001be48a121, PTE contents. Arg3: ffffbe8bfc513fd0, (reserved) Arg4: 000000000000000a, (reserved) Debugging Details: ------------------ KEY_VALUES_STRING: 1 Key : Analysis.CPU.mSec Value: 7030 Key : Analysis.DebugAnalysisManager Value: Create Key : Analysis.Elapsed.mSec Value: 40305 Key : Analysis.Init.CPU.mSec Value: 546 Key : Analysis.Init.Elapsed.mSec Value: 112129 Key : Analysis.Memory.CommitPeak.Mb Value: 89 Key : WER.OS.Branch Value: ni_release Key : WER.OS.Timestamp Value: 2022-05-06T12:50:00Z Key : WER.OS.Version Value: 10.0.22621.1 BUGCHECK_CODE: be BUGCHECK_P1: fffff8073bbfd740 BUGCHECK_P2: 1be48a121 BUGCHECK_P3: ffffbe8bfc513fd0 BUGCHECK_P4: a PROCESS_NAME: System TRAP_FRAME: ffffbe8bfc513fd0 -- (.trap 0xffffbe8bfc513fd0) NOTE: The trap frame does not contain all registers. Some register values may be zeroed or incorrect. rax=fffff8073bbfd730 rbx=0000000000000000 rcx=48d68b48204f8b48 rdx=fffff8073bbfd73e rsi=0000000000000000 rdi=0000000000000000 rip=fffff8073ae1c049 rsp=ffffbe8bfc514160 rbp=0000000000000008 r8=fffff8073bbfd730 r9=ffffbe8bfc514240 r10=ffffe3086d890720 r11=ffffbe8bfc514240 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei ng nz na po nc storport!StorPortNotification+0x1eea9: fffff807`3ae1c049 c7401002200708 mov dword ptr [rax+10h],8072002h ds:fffff807`3bbfd740=4c000044 Resetting default scope LOCK_ADDRESS: fffff8073967a7c0 -- (!locks fffff8073967a7c0) Resource @ nt!PiEngineLock (0xfffff8073967a7c0) Exclusively owned Contention Count = 16 NumberOfExclusiveWaiters = 1 Threads: ffffe30861972080-01<*> Threads Waiting On Exclusive Access: ffffe308619da080 1 total locks PNP_TRIAGE_DATA: Lock address : 0xfffff8073967a7c0 Thread Count : 1 Thread address: 0xffffe30861972080 Thread wait : 0xb7a9 STACK_TEXT: : ffffbe8b`fc5137b0 fffff807`38d12870 ffff8b01`24367180 fffff807`3bbfd701 : nt!DbgBreakPointWithStatus : ffff8b01`00000003 ffffbe8b`fc5137b0 fffff807`38e4ea80 ffffbe8b`fc513d60 : nt!KiBugCheckDebugBreak+0x12 : 00000000`00000000 00000000`00000000 00000001`be48a121 fffff807`38e3a414 : nt!KeBugCheck2+0xba3 : 00000000`000000be fffff807`3bbfd740 00000001`be48a121 ffffbe8b`fc513fd0 : nt!KeBugCheckEx+0x107 : 00000001`be48a121 00000000`00000003 ffffbe8b`fc513f69 00000000`00000000 : nt!MiRaisedIrqlFault+0x17ed10 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!MmAccessFault+0x363 : 00000000`00000000 00000000`00000000 4c4f565a`00000020 00000000`0000007c : nt!KiPageFault+0x369 : 01dabc84`312c5356 ffffe308`6ddd3580 ffffe308`692b0f80 00000000`000000ff : storport!StorPortNotification+0x1eea9 : fffff801`aae81000 fffff801`00000001 0001a64c`00935000 ffffbe8b`fc5147e0 : OpenZFS!wzvol_HwReportLog+0x2ee [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c @ 605] : 00000000`00000000 00000000`00000000 00000000`00000000 ffffbe8b`00000000 : OpenZFS!ScsiOpReportLuns+0x5f [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol_scsi.c @ 866] : ffffe308`6af83040 fffff807`38de2cfb 00000001`bc077ee4 00000000`00989680 : OpenZFS!ScsiExecuteMain+0x249 [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol_scsi.c @ 346] : ffffe308`6d3e51a0 00000000`00000000 ffffe308`6d8901a0 ffffe308`6af83040 : OpenZFS!wzvol_HwStartIo+0x74 [C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c @ 721] : ffffbe8b`fc514638 ffffe308`6d1c3bd8 ffffe308`6d1c3b60 ffffe308`6b2efc90 : storport!RaidAdapterPostScatterGatherExecute+0x22a : ffffe308`61879da0 ffffbe8b`00080158 ffffbe8b`fc514670 ffffe308`61dfd100 : storport!RaUnitStartIo+0x310 : ffffe308`6d1c3bd8 ffffe308`6d1c3b60 00000000`00000006 ffffe308`6b2efc00 : storport!RaidStartIoPacket+0x50f : 00000030`61dfd1a0 00000001`00000054 ffffe308`6d1c3b60 fffff801`aab1f2f5 : storport!RaUnitScsiIrp+0x23e : ffffe308`61dfd1c0 ffffbe8b`fc514950 ffffe308`61dfd1c0 00000000`00989680 : storport!RaDriverScsiIrp+0x58 : 00000000`000000c0 fffff807`38c79cc7 ffffffff`ffffffff 00000000`00000032 : OpenZFS!dispatcher+0x5f9 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7484] : ffffe308`6d1c3b60 ffffe308`6d3e5050 ffffe308`6d3e5050 00000000`00000000 : nt!DifIRP_MJ_INTERNAL_DEVICE_CONTROLWrapper+0xdb : ffff9cce`673c6108 ffff9cce`67339e30 ffff9cf1`8436c098 fffff807`38d3a0ef : nt!IopfCallDriver+0x53 : ffffe308`6d1c3b60 ffffbe8b`fc514b90 ffff9cf1`8436c098 00000000`00000000 : nt!IovCallDriver+0x5f : fffff807`3ae78800 fffff807`3ae82000 ffffe308`6d936790 ffffe308`6d1c3b60 : nt!IofCallDriver+0x2272df : ffffe308`6d3e51a0 ffffbe8b`fc514d00 00000000`00000fff ffff8b01`24400000 : storport!RaSendIrpSynchronous+0x62 : ffffbe8b`fc514fb0 ffffbe8b`fc514fa0 ffffbe8b`fc514ff0 00000000`00000002 : storport!RaidBusEnumeratorIssueSynchronousRequest+0x23d : ffffe308`6b2efc90 00000000`00000000 ffffbe8b`fc514db0 ffffbe8b`fc514fb0 : storport!RaidBusEnumeratorIssueReportLuns+0x70 : 00000000`00000000 00000000`00000000 ffffbe8b`fc514fb0 ffffbe8b`00000000 : storport!RaidBusEnumeratorGetLunListFromTarget+0x76 : 00000000`00000000 00000000`00000000 00000000`00000000 ffffe308`6d8901a0 : storport!RaidBusEnumeratorGetLunList+0x54 : 00000000`00000000 ffffbe8b`fc515000 ffffe308`61dfd710 00000000`00000000 : storport!RaidAdapterEnumerateBus+0x99 : ffffe308`6d8901a0 ffffbe8b`fc5150f0 00000000`00000000 fffff801`aa57b10a : storport!RaidAdapterRescanBus+0xb1 : 00000000`00000000 ffffe308`70006c68 ffffe308`6ba6a010 ffffe308`705d0de8 : storport!RaidAdapterQueryDeviceRelationsIrp+0x190 : ffffe308`61dfd6f0 fffff807`394db950 ffffe308`61dfd710 ffffe308`61dfd710 : storport!RaidAdapterPnpIrp+0x144 : ffffe308`61dfd710 ffffbe8b`fc515380 ffffe308`61dfd710 00000000`00989680 : storport!RaDriverPnpIrp+0x94 : 00000000`00000000 00000000`00000000 00000000`0000ffff 00000000`000000ff : OpenZFS!dispatcher+0x5f9 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7484] : ffffe308`6ba6a010 ffffe308`6d890050 ffffe308`6d375060 00000000`000002e6 : nt!DifIRP_MJ_PNPWrapper+0xdb : 00000000`000000b0 00000000`00000600 00000000`000000b0 00000000`00000600 : nt!IopfCallDriver+0x53 : ffffe308`6ba6a1b8 fffff807`3bbfd7e2 00000000`0000043a 00000000`00000000 : nt!IovCallDriver+0x5f : ffffe308`00000000 00000000`00001001 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0x2272df : ffffe308`6ba6a010 fffff807`3bbfd7e2 ffffe308`6d375060 ffffe308`6b9c71d0 : ambakdrv+0x13ae : 00000000`00000002 ffffe308`ffffffff 00000000`00000000 00000000`00000001 : nt!IopfCallDriver+0x53 : 00000000`00000000 00000000`00000001 ffffbe8b`00000439 00000000`00000000 : nt!IovCallDriver+0x5f : fffff807`39090560 00000000`6d547845 00000000`00000000 fffff807`394c968d : nt!IofCallDriver+0x2272df : ffffe308`6ba6a010 fffff807`391338fe ffffe308`6b9c7080 ffffe308`6db8cb40 : volsnap!VolSnapPnp+0x142 : ffffe308`6b9c7080 ffffbe8b`fc5157a0 ffffe308`618a0080 fffff807`394e8c35 : nt!IopfCallDriver+0x53 : ffffe308`6ba6a010 ffffbe8b`fc5157a0 ffffe308`6b9c7080 00000000`00000000 : nt!IovCallDriver+0x5f : ffffe308`618a0080 ffffe308`6a8455f0 00000000`00000000 00000000`00000000 : nt!IofCallDriver+0x2272df : ffffe308`618a0080 00000000`00000000 ffffe308`6a8455f0 fffff807`394c968d : nt!PnpAsynchronousCall+0xe6 : ffffe308`6b56d520 00000000`00000000 ffffe308`618a0080 fffff807`38cdfd41 : nt!PnpSendIrp+0x9e : ffffe308`6a8455f0 ffffe308`6b56d548 ffffe308`6b56d520 ffffe308`6b56d520 : nt!PnpQueryDeviceRelations+0x51 : ffffe308`6b56d520 ffffbe8b`fc5158a8 ffffe308`6b56d520 00000000`00000000 : nt!PipEnumerateDevice+0xc8 : ffffe308`6b56d500 ffffe308`6b831e01 ffffbe8b`fc515940 fffff807`00000000 : nt!PipProcessDevNodeTree+0x187 : 00000001`00000003 ffffe308`6b56d520 ffffe308`61972080 00000000`00000000 : nt!PiRestartDevice+0xc94a3 : ffffe308`61972080 ffffe308`6189aaa0 fffff807`39767ac0 fffff807`00000000 : nt!PnpDeviceActionWorker+0x654 : ffffe308`61972080 00000000`000003d0 ffffe308`61972080 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155 : fffff807`37795180 ffffe308`61972080 fffff807`38d1b520 00000000`00000000 : nt!PspSystemThreadStartup+0x57 : ffffbe8b`fc516000 ffffbe8b`fc510000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34 FAULTING_SOURCE_LINE: C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c FAULTING_SOURCE_FILE: C:\src\openzfs\module\os\windows\zfs\zfs_windows_zvol.c FAULTING_SOURCE_LINE_NUMBER: 605 FAULTING_SOURCE_CODE: 601: pHBAExt, 602: pWnode, 603: 0xFF); 604: > 605: ExFreePoolWithTag(pWnode, MP_TAG_GENERAL); 606: } else { 607: } 608: #endif 609: } 610: SYMBOL_NAME: OpenZFS!wzvol_HwReportLog+2ee MODULE_NAME: OpenZFS IMAGE_NAME: OpenZFS.sys ```

lundman commented 5 months ago
unloading driver

``` nt!DbgBreakPointWithStatus C/C++/ASM nt!KiBugCheckDebugBreak+0x12 C/C++/ASM nt!KeBugCheck2+0xba3 C/C++/ASM nt!KeBugCheckEx+0x107 C/C++/ASM nt!ExReturnPoolQuota+0x222647 C/C++/ASM nt!ExFreePoolWithTag+0x1e5 C/C++/ASM storport!RaidDeleteDeferredQueue+0x1b C/C++/ASM storport!RaidDeleteAdapter+0x12e C/C++/ASM storport!RaidAdapterRemoveDeviceIrp+0x8e C/C++/ASM storport!RaidAdapterPnpIrp+0x16100 C/C++/ASM storport!RaDriverPnpIrp+0x94 C/C++/ASM > OpenZFS!dispatcher+0x5f9 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7484] C/C++/ASM nt!DifIRP_MJ_PNPWrapper+0xdb C/C++/ASM nt!IopfCallDriver+0x53 C/C++/ASM nt!IovCallDriver+0x5f C/C++/ASM nt!IofCallDriver+0x2272df C/C++/ASM nt!ViFilterDispatchPnp+0x121 C/C++/ASM nt!IopfCallDriver+0x53 C/C++/ASM nt!IovCallDriver+0x5f C/C++/ASM nt!IofCallDriver+0x2272df C/C++/ASM ambakdrv+0x16da C/C++/ASM ambakdrv+0x15b4 C/C++/ASM ambakdrv+0x1478 C/C++/ASM nt!DifIRP_MJ_PNPWrapper+0xdb C/C++/ASM nt!IopfCallDriver+0x53 C/C++/ASM nt!IovCallDriver+0x5f C/C++/ASM nt!IofCallDriver+0x2272df C/C++/ASM nt!ViFilterDispatchPnp+0x121 C/C++/ASM nt!IopfCallDriver+0x53 C/C++/ASM nt!IovCallDriver+0x5f C/C++/ASM nt!IofCallDriver+0x2272df C/C++/ASM volsnap!VolSnapPnp+0x9e C/C++/ASM nt!IopfCallDriver+0x53 C/C++/ASM nt!IovCallDriver+0x5f C/C++/ASM nt!IofCallDriver+0x2272df C/C++/ASM nt!IopSynchronousCall+0xf8 C/C++/ASM nt!IopRemoveDevice+0x108 C/C++/ASM nt!PnpRemoveLockedDeviceNode+0x1a8 C/C++/ASM nt!PnpDeleteLockedDeviceNode+0x52 C/C++/ASM nt!PnpDeleteLockedDeviceNodes+0xd3 C/C++/ASM nt!PnpProcessQueryRemoveAndEject+0x3b3 C/C++/ASM nt!PnpProcessTargetDeviceEvent+0x109 C/C++/ASM nt!PnpDeviceEventWorker+0x2ca C/C++/ASM nt!ExpWorkerThread+0x155 C/C++/ASM nt!PspSystemThreadStartup+0x57 C/C++/ASM nt!KiStartSystemThread+0x34 C/C++/ASM BAD_POOL_CALLER (c2) The current thread is making a bad pool request. Typically this is at a bad IRQL level or double freeing the same allocation, etc. Arguments: Arg1: 000000000000000d, Attempt to release quota on a corrupted pool allocation. Arg2: ffffad8f3c788d40, Address of pool Arg3: 00000000fffff807, Pool allocation's tag Arg4: a5ff4d454236bb53, Quota process pointer (bad). Bad previous allocation size @ffffad8f3c788d30, last size was 0 *** *** An error (or corruption) in the pool was detected; *** Attempting to diagnose the problem. *** *** Use !poolval ffffad8f3c788000 for more details. Pool page [ ffffad8f3c788000 ] is INVALID. Analyzing linked list... [ ffffad8f3c788000 ]: invalid previous size [ 0x98 ] should be [ 0x0 ] [ ffffad8f3c788000 --> ffffad8f3c7884e0 (size = 0x4e0 bytes)]: Corrupt region ```

sskras commented 5 months ago

@lundman commented 8 hours ago:

I am unsure what to do next, a lot of things recommend verifier - but this is a driver load issue, so I do not know if that will help.

I doubt Aomei developers are interested in debugging.

Maybe it's worth asking on some OSR Developer Community forum?

NTDEV looks like a close match:

Windows System Software Devs Interest List — Windows Driver Developer’s Forum

This is the category to post your questions about designing and developing all types of Windows drivers, particularly device drivers. Got a KMDF driver or UMDF driver question? Got a crash in driver that you need help ana…

Last time I communicated on OSR was circa 2000, so things might have changed since. But the community seemed very friendly. :)

lundman commented 5 months ago

Ah yeah, I want to chase down a bit further before I post over there

sskras commented 5 months ago

Oh! Thanks for the personal update :)

lundman commented 5 months ago

Probably on there still.. perhaps I come across too aristocraty....:)

lundman commented 5 months ago

This is the earliest crash (in terms of stack depth) I've seen:


Unknown bugcheck code (0)
Unknown bugcheck description
Arguments:
Arg1: 0000000000000000
Arg2: 0000000000000000
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

BUGCHECK_CODE:  0

BUGCHECK_P1: 0

BUGCHECK_P2: 0

BUGCHECK_P3: 0

BUGCHECK_P4: 0

PROCESS_NAME:  System

EXECUTE_ADDRESS: 40

FAILED_INSTRUCTION_ADDRESS: 
+0
00000000`00000040 ??              ???

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000008

EXCEPTION_PARAMETER2:  0000000000000040

STACK_ADDR_RAW_STACK_SYMBOL: ffffef8ba5ce44a8

STACK_COMMAND:  dps ffffef8ba5ce44a8-0x20 ; kb

STACK_TEXT:  
ffffef8b`a5ce4488  00000000`00000000
ffffef8b`a5ce4490  ffffb487`9ac6a060
ffffef8b`a5ce4498  ffffb487`9abdd200
ffffef8b`a5ce44a0  fffff805`6b16d7e2 volsnap!VolSnapPnp+0x142
ffffef8b`a5ce44a8  fffff805`68344247 nt!IopfCallDriver+0x53
ffffef8b`a5ce44b0  ffffb487`9abdd010
ffffef8b`a5ce44b8  ffffb487`970c7be0
ffffef8b`a5ce44c0  ffffb487`9ac6a1b0
ffffef8b`a5ce44c8  00000000`00000000
ffffef8b`a5ce44d0  00000000`00000000
ffffef8b`a5ce44d8  00000000`00000000
ffffef8b`a5ce44e0  00000000`00000002
ffffef8b`a5ce44e8  fffff805`68ad222f nt!IovCallDriver+0x5f
ffffef8b`a5ce44f0  00000000`00000014
ffffef8b`a5ce44f8  ffffb487`ffffffff
ffffef8b`a5ce4500  00000000`00000000

SYMBOL_NAME:  volsnap!VolSnapPnp+142

MODULE_NAME: volsnap

IMAGE_NAME:  volsnap.sys

BUCKET_ID_FUNC_OFFSET:  142

FAILURE_BUCKET_ID:  ACCESS_VIOLATION_VRF_NULL_IP_volsnap!VolSnapPnp

OS_VERSION:  10.0.22621.1

BUILDLAB_STR:  ni_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {e6a35f71-d940-8230-0f7a-a6fc733dd801}

Followup:     MachineOwner
---------
lundman commented 5 months ago

OK so this could be an indication

: : nt!VerifierIofCompleteRequest+0x10
: : OpenZFS!dispatcher+0x6dc [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7556] 
: : nt!DifIRP_MJ_INTERNAL_DEVICE_CONTROLWrapper+0xdb
: : nt!IopfCallDriver+0x53
: : nt!IovCallDriver+0x5f
: : nt!IofCallDriver+0x2272df
: : storport!RaSendIrpSynchronous+0x62
: : storport!RaidBusEnumeratorIssueSynchronousRequest+0x23d
: : storport!RaidBusEnumeratorIssueReportLuns+0x70
: : storport!RaidBusEnumeratorGetLunListFromTarget+0x76
: : storport!RaidBusEnumeratorGetLunList+0x54
: : storport!RaidAdapterEnumerateBus+0x99
: : storport!RaidAdapterRescanBus+0xb1
: : storport!RaidAdapterQueryDeviceRelationsIrp+0x190
: : storport!RaidAdapterPnpIrp+0x144
: : storport!RaDriverPnpIrp+0x94
: : OpenZFS!dispatcher+0x639 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 7545] 

We are called, we pass down to StorPort, which ends up calling us again, and we try to call StorPort again. This is not ideal - it is odd StorPort would call us at all.

lundman commented 5 months ago

OK so fixing that, and things around there, does not fix the memory corruption - that still exists somewhere - outside of dispatcher

sskras commented 5 months ago

OK so fixing that, and things around there, does not fix the memory corruption

It' would be interesting to see the changes nevertheless. I guess they will be committed sooner or later.

lundman commented 5 months ago

heh they are all pretty desperate - trying to narrow the issue down. next is commenting out large chunks in the hopes it either goes away, or not

lundman commented 5 months ago

OK, I have take out everything ZFS. No SPL, no ZFS, no dispatcher, no dprintf, WPP.

Left is the DriverStart() calling StorPortInitialize().

emptyBSOD

``` : 00000000`000000c2 00000000`0000000d ffffe308`6e999730 00000000`fffff807 : nt!KeBugCheckEx+0x107 : ffffe308`6e999720 00000000`fffff807 00000000`000004c0 00000000`00000001 : nt!ExReturnPoolQuota+0x222647 : 00000000`00000000 ffffe308`6e5d8720 ffffbe8b`00000000 00000000`00000000 : nt!ExFreePoolWithTag+0x1e5 : ffffe308`6e5d8938 ffffe308`6e5d8050 ffffe308`6d325030 ffffe308`6bc1d550 : storport!RaidDeleteDeferredQueue+0x1b : ffffe308`6e5d81a0 ffffe308`6e5d8050 ffffe308`6d325030 ffffe308`6bc1d550 : storport!RaidDeleteAdapter+0x12e : ffffe308`6bc1d550 00000000`00000002 00000000`00000000 00000000`00010000 : storport!RaidAdapterRemoveDeviceIrp+0x8e : ffffe308`61dfd6f0 fffff807`3a6361f5 ffffe308`61dfd710 ffffbe8b`fc5c4250 : storport!RaidAdapterPnpIrp+0x16100 : ffffe308`61dfd710 ffffbe8b`fc5c4250 ffffe308`61dfd710 fffff807`39766400 : storport!RaDriverPnpIrp+0x94 : ffffe308`6bc1d550 ffffe308`6e5d8050 ffffe308`6b7ddb40 00000000`00000000 : nt!DifIRP_MJ_PNPWrapper+0xdb : 00000000`00000000 ffffe308`6b7ddca0 00000000`00000000 fffff807`37795180 : nt!IopfCallDriver+0x53 : ffffe308`6bc1d6f8 ffffe308`6bc1d550 ffffe308`6b7ddc98 00000000`00000000 : nt!IovCallDriver+0x5f : ffffe308`6b7ddca0 ffffbe8b`fc5c4430 ffffbcd2`2b6bddf7 fffff807`38d0ca21 : nt!IofCallDriver+0x2272df : 00000000`00000000 ffffe308`6b7dd9f0 ffffe308`6e5e0040 ffffe308`6aed4180 : ambakdrv+0x16da : ffffe308`6bc1d740 fffff807`3bbfd73e ffffe308`6bc1d550 ffffe308`6b7dd9f0 : ambakdrv+0x15b4 : ffffe308`6bc1d550 fffff807`3bbfd73e ffffe308`6b7dd9f0 ffffe308`6e5e0190 : ambakdrv+0x1478 : 00000000`00000000 fffff807`38c20f8d 00000000`00000000 00000000`00000000 : nt!IopfCallDriver+0x53 : ffffe308`6b7dd9f0 00000000`00000001 00000000`00000000 00000000`00000000 : nt!IovCallDriver+0x5f : ffffe308`6b7dd9f0 00000000`00000001 ffffe308`6bc1d550 fffff807`390dbede : nt!IofCallDriver+0x2272df : ffffe308`6bc1d550 fffff807`3915ae44 ffffe308`6e5e0040 ffff9f04`636be000 : volsnap!VolSnapPnp+0x9e : ffffbe8b`fc5c46e0 00000000`00000000 ffffe308`6bc1d550 fffff807`394e8c35 : nt!IopfCallDriver+0x53 : ffffe308`6e5e0040 00000000`00000000 ffffe308`6e5e0040 00000000`00000000 : nt!IovCallDriver+0x5f : ffffe308`6e5e0040 00000000`00000000 ffffbe8b`fc5c46e0 ffffe308`6bc1d550 : nt!IofCallDriver+0x2272df : 00000000`00000002 ffffe308`6e5e1300 ffffe308`6a8329f0 ffffe308`6e5e1300 : nt!IopSynchronousCall+0xf8 : ffff9f04`63f0a010 ffffe308`6a8329f0 00000000`00000200 00000000`0000000a : nt!IopRemoveDevice+0x108 : ffffe308`6e5e1300 ffffe308`6a8329f0 00000000`00000000 cb3a4008`00200001 : nt!PnpRemoveLockedDeviceNode+0x1a8 : ffffe308`6a8329f0 ffffbe8b`fc5c4830 ffffe308`6e5e1300 00000000`00000000 : nt!PnpDeleteLockedDeviceNode+0x52 : ffffe308`6e5e1300 00000000`00000002 ffffe308`6e5e1300 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xd3 : ffffbe8b`fc5c49b0 ffffe308`6a832900 ffffe308`6bceb400 ffff9f04`00000001 : nt!PnpProcessQueryRemoveAndEject+0x3b3 : ffff9f04`63f0a010 ffff9f04`63781a10 ffffe308`6189aa00 00000000`00000000 : nt!PnpProcessTargetDeviceEvent+0x109 : ffffe308`6189aaa0 ffffe308`6684c340 ffffbe8b`fc5c4b00 00000000`00000000 : nt!PnpDeviceEventWorker+0x2ca : ffffe308`6684c340 00000000`00000235 ffffe308`6684c340 fffff807`38cf49e0 : nt!ExpWorkerThread+0x155 : ffff8b01`24367180 ffffe308`6684c340 fffff807`38d1b520 00000000`00000000 : nt!PspSystemThreadStartup+0x57 : ffffbe8b`fc5c5000 ffffbe8b`fc5bf000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x34 SYMBOL_NAME: ambakdrv+16da MODULE_NAME: ambakdrv IMAGE_NAME: ambakdrv.sys ```

Using KdPrinEx() to try the callbacks in StorPort:

MpHwCompServReq entered
MpQueueServiceIrp entered
wzvol_HwHandlePnP: entry
MpHwAdapterControl:  ControlType = 1
MpHwAdapterControl:  ScsiStopAdapter
wzvol_StopAdapter: entry
MpHwAdapterControl - OUT
MpHwAdapterControl:  ControlType = 3
MpHwAdapterControl:  ScsiSetBootConfig
MpHwAdapterControl - OUT
MpHwFreeAdapterResources entered, pHBAExt = 0xFFFFE3086DF54770
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000c2
lundman commented 5 months ago

Then there is this crash:

06 ffffbe8b`fefe00b0 fffff807`390feb62     nt!KiPageFault+0x369
07 ffffbe8b`fefe0240 fffff807`3a3545a9     nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x62
08 ffffbe8b`fefe0280 fffff807`3a354a18     FLTMGR!TargetedIOCtrlGenerateECP+0x8d
09 ffffbe8b`fefe02f0 fffff807`3a354d17     FLTMGR!FltpCreateFile+0xd4
0a ffffbe8b`fefe03e0 fffff807`3b01f4b1     FLTMGR!FltCreateFileEx2+0xe7
0b ffffbe8b`fefe0510 fffff807`391adabc     fileinfo!FIPfInterfaceOpen+0x401

  fffff807`390feb52  call nt!ExAllocateFromNPagedLookasideList (fffff807`38c6f230) 
  fffff807`390feb57  xor ecx,ecx 
  fffff807`390feb59  test rax,rax    # Check if rax is NULL
  fffff807`390feb5c  je nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf635 (fffff807`392ce135)  # jump if it is
* fffff807`390feb62  mov qword ptr [rax+10h],rcx  # deference rax + 16.
(* = PC)

0: kd> r
rax=0000000000000000 rbx=0000000000000003 rcx=0000000000000003
rdx=000000000000008a rsi=fffff8073bbfd701 rdi=0000000000000000
rip=fffff80738e3f520 rsp=ffffbe8bfefdf678 rbp=ffffbe8bfefdf7e0
 r8=0000000000000065  r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000003
r14=0000000000000000 r15=0000000000000000

My x64 is a bit rusty, but it seems it checks for rax being NULL, since it isn't, it does not jump. Then the next store to rax+0x10, now rax is NULL, and we crash! Neat. What would be changing my rax on the fly, if I read that correctly.

I believe only preemption/interrupts can execute between those statements, so is it destroying rax somewhere? i386/x64 lists rax as nonvolitile.

UPDATE

rax is not NULL, I had it in mind that the registers rewind when you jump into the callstack, of course they don't. And the error is WRITE_TO_READONLY_MEMORY, which isn't the error given for NULL access.

sskras commented 5 months ago

it checks for rax being NULL, since it isn't, it does not jump. Then the next store to rax+0x10, now rax is NULL, and we crash! Neat. What would be changing my rax on the fly, if I read that correctly.

Could you step through that code by chance?

My improvised guess is that:

lundman commented 5 months ago

here is the whole function

nt!FsRtlAllocateExtraCreateParameterFromLookasideList:
fffff807`390feb00  mov qword ptr [rsp+8],rbx 
fffff807`390feb05  mov qword ptr [rsp+10h],rbp 
fffff807`390feb0a  mov qword ptr [rsp+18h],rsi 
fffff807`390feb0f  mov qword ptr [rsp+20h],rdi 
fffff807`390feb14  push r14 
fffff807`390feb16  sub rsp,30h 
fffff807`390feb1a  mov rbx,qword ptr [rsp+60h] 
fffff807`390feb1f  lea esi,[rdx+48h] 
fffff807`390feb22  mov r10d,r8d 
fffff807`390feb25  mov rbp,r9 
fffff807`390feb28  and r10d,2 
fffff807`390feb2c  mov edi,42h 
fffff807`390feb31  mov r9d,2 
fffff807`390feb37  mov r14,rcx 
fffff807`390feb3a  mov ecx,dword ptr [rbx+28h] 
fffff807`390feb3d  cmove edi,r9d 
fffff807`390feb41  cmp esi,dword ptr [rbx+2Ch] 
fffff807`390feb44  ja nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf616 (fffff807`392ce116) 
# ^^ Sets rax and jumps to fffff807`390feb99
fffff807`390feb4a  mov rcx,rbx 
fffff807`390feb4d  test r10d,r10d 
fffff807`390feb50  jne nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0xb5 (fffff807`390febb5) 
fffff807`390feb52  call nt!ExAllocateFromNPagedLookasideList (fffff807`38c6f230) 
fffff807`390feb57  xor ecx,ecx 
fffff807`390feb59  test rax,rax 
fffff807`390feb5c  je nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x1cf635 (fffff807`392ce135) 
# ^^ Sets rax and jumps to fffff807`390feb99
fffff807`390feb62  mov qword ptr [rax+10h],rcx 
fffff807`390feb66  mov qword ptr [rax+8],rcx 
fffff807`390feb6a  mov qword ptr [rax],48706345h 
fffff807`390feb71  movups xmm0,xmmword ptr [r14] 
fffff807`390feb75  mov qword ptr [rax+40h],rcx 
fffff807`390feb79  lea rcx,[rax+48h] 
fffff807`390feb7d  mov qword ptr [rax+28h],rbp 
fffff807`390feb81  movups xmmword ptr [rax+18h],xmm0 
fffff807`390feb85  mov dword ptr [rax+30h],edi 
fffff807`390feb88  mov dword ptr [rax+34h],esi 
fffff807`390feb8b  mov qword ptr [rax+38h],rbx 
fffff807`390feb8f  mov rax,qword ptr [rsp+68h] 
fffff807`390feb94  mov qword ptr [rax],rcx 
fffff807`390feb97  xor eax,eax 
fffff807`390feb99  mov rbx,qword ptr [rsp+40h] 
fffff807`390feb9e  mov rbp,qword ptr [rsp+48h] 
fffff807`390feba3  mov rsi,qword ptr [rsp+50h] 
fffff807`390feba8  mov rdi,qword ptr [rsp+58h] 
fffff807`390febad  add rsp,30h 
fffff807`390febb1  pop r14 
fffff807`390febb3  ret  
fffff807`390febb4  int 3 
fffff807`390febb5  call nt!ExAllocateFromNPagedLookasideList (fffff807`38c6f230) 
fffff807`390febba  jmp nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x57 (fffff807`390feb57) 
fffff807`390febbc  int 3 
fffff807`390febbd  int 3 
fffff807`390febbe  int 3 
fffff807`390febbf  int 3 
fffff807`390febc0  int 3 
fffff807`390febc1  int 3 
fffff807`390febc2  int 3 
fffff807`390febc3  int 3 

nt!AlpcpLockForCachedReferenceBlob:
fffff807`390febc4  push rbx 

~~~snip~~~
fffff807`392ce135  mov rax, qword ptr [rsp+68h] 
fffff807`392ce13a  mov qword ptr [rax],rcx 
fffff807`392ce13d  mov eax,0C000009Ah    # STATUS_INSUFFICIENT_RESOURCES
fffff807`392ce142  jmp nt!FsRtlAllocateExtraCreateParameterFromLookasideList+0x99 (fffff807390feb99) 
fffff807`392ce147  int 3 

I don't see any jumps to fffff807390feb62 but it's not like devstudio has cross-reference ability.

sskras commented 5 months ago

I don't see any jumps to fffff807390feb62 but it's not like devstudio has cross-reference ability.

It looks you're right:

Very strange indeed.

lundman commented 5 months ago

Which usually means interrupt, either hardware, preemption, DPC or similar. But I've commented out all those things, calls to __cpuidex(), assembly, etc. Could explain why the crashes are random, and often with NETIO (most common interrupts on my system). But that tends to mean it is not my code, but every time I blame the compiler it ends up being my code. Peculiar. I do wonder if I can build the sample storport project that the zvol stuff came from, and trigger it outside of this project. At least could suggest Aomei issue as well.

lundman commented 5 months ago

Turned on storport verifylevel 1, and verifier to check ambdrv, volsnap and openzfs, this is one of the quickest crashes so far:

volsnap

``` wzvol_HwFindAdapter: pHBAExt = 0xFFFFB78D5E81E110, NbBuses/MaxTargets/MaxLuns=1/2/21. MpHwAdapterControl: ControlType = 0 MpHwAdapterControl: ScsiQuerySupportedControlTypes MpHwAdapterControl - OUT wzvol_HwInitialize: entry wzvol_HwReportAdapter: entry wzvol_HwReportLink: entry wzvol_HwReportLog: entry Access violation - code c0000005 (!!! second chance !!!) 00000000`00000040 ?? ??? ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s. EXCEPTION_CODE_STR: c0000005 EXCEPTION_PARAMETER1: 0000000000000008 EXCEPTION_PARAMETER2: 0000000000000040 STACK_ADDR_RAW_STACK_SYMBOL: fffff885929a1448 STACK_COMMAND: dps fffff885929a1448-0x20 ; kb STACK_TEXT: fffff885`929a1428 00000000`00000000 fffff885`929a1430 ffffb78d`5fa21040 fffff885`929a1438 ffffb78d`5e4a5200 fffff885`929a1440 fffff801`8270d7e2 volsnap!VolSnapPnp+0x142 fffff885`929a1448 fffff801`7f93b247 nt!IopfCallDriver+0x53 fffff885`929a1450 ffffb78d`5e4a5010 fffff885`929a1458 ffffb78d`5f672610 fffff885`929a1460 ffffb78d`5fa21190 fffff885`929a1468 00000000`00001001 fffff885`929a1470 ffffb78d`55a86a10 fffff885`929a1478 fffff801`7fbec06b nt!DifKeInitializeEventWrapper+0xbb fffff885`929a1480 fffff801`8270d785 volsnap!VolSnapPnp+0xe5 fffff885`929a1488 fffff801`800c922f nt!IovCallDriver+0x5f fffff885`929a1490 00000000`00000fff fffff885`929a1498 ffffa600`7e200000 fffff885`929a14a0 fffff801`8270d785 volsnap!VolSnapPnp+0xe5 SYMBOL_NAME: volsnap!VolSnapPnp+142 MODULE_NAME: volsnap IMAGE_NAME: volsnap.sys BUCKET_ID_FUNC_OFFSET: 142 FAILURE_BUCKET_ID: ACCESS_VIOLATION_VRF_NULL_IP_volsnap!VolSnapPnp 1: kd> r rax=0000000000000000 rbx=000000200000004c rcx=000000002f7f3938 rdx=00000000001e8480 rsi=000005010000004c rdi=0807060504030201 rip=0000000000000040 rsp=fffff885929a13f0 rbp=004f0056005a0008 r8=0000000000000000 r9=fffff78000000294 r10=fffff8017f9cfbb0 r11=0000000000000000 r12=0000000a00000000 r13=51b40aadfe995194 r14=498a014bc66015ee r15=ffffb78d5f672610 ```

This looks corrupt, stack maybe, so not sure it adds any more clues

Going up frame #4 in stack is even at address 0x0807060504030201 which seems "suspect".

It comes from volsnap:

fffff801`8270d7ca  lea rcx,[rsp+40h] 
fffff801`8270d7cf  mov qword ptr [rax-8],rcx 
fffff801`8270d7d3  mov byte ptr [rax-45h],0E0h 
fffff801`8270d7d7  mov rcx,qword ptr [rdi+20h] 
fffff801`8270d7db  call qword ptr [volsnap!_imp_IofCallDriver (fffff801`827090d0)]

IofCallDriver(DeviceObject, goes into rcx, which is 000000002f7f3938 (bad), but comes from rdi, which is 0807060504030201 (needs exorcism), and rsp at this frame is:

0xFFFFF885929A13F0  4c 00 00 00 20 00 00 00 08 00 5a 00 56 00 4f 00 4c 00 00 00 01 05 00 00 01 02 03 04 05 06 07 08 10 00 00 00 0a 0b 0c 0d 0e 0f  L... .....Z.V.O.L.........................

Maybe we return something on the stack we shouldn't.

lundman commented 5 months ago

Interesting that the last call to us wzvol_HwReportLog has

UCHAR myPortWWN[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

guenther-alka commented 5 months ago

What I have seen on powershell get-volume (Open-ZFS ..f74) Get-volume

`DriveLetter FriendlyName FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining Size


                      Unknown        Removable Healthy      Unknown                     0 B       0 B

I Volume2 NTFS Fixed Healthy OK 3.88 GB 3.91 GB C NTFS Fixed Healthy OK 118.94 GB 475.95 GB Y Speicherplatz NTFS Fixed Healthy OK 19.82 GB 19.87 GB FAT32 Fixed Healthy OK 71.03 MB 99.21 MB E Volume1 NTFS Fixed Healthy OK 3.88 GB 3.91 GB D Unknown Removable Healthy Unknown 0 B 0 B NTFS Fixed Healthy OK 86.31 MB 786 MB`

Drive D is a ZFS pool and the only entry with a driveletter, unknown filesystem type,, status unknown and size 0 not sure if this is relevant.

sskras commented 5 months ago

@guenther-alka, your output is so messed up that it took me a half of an hour to reconstruct the structure:

DriveLetter FriendlyName  FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining      Size
----------- ------------  -------------- --------- ------------ ----------------- -------------      ----
D                         Unknown        Removable Healthy      Unknown                     0 B       0 B
DriveLetter FriendlyName  FileSystemType DriveType HealthStatus OperationalStatus SizeRemaining      Size
----------- ------------  -------------- --------- ------------ ----------------- -------------      ----
I           Volume2       NTFS           Fixed     Healthy      OK                      3.88 GB   3.91 GB 
C                         NTFS           Fixed     Healthy      OK                    118.94 GB 475.95 GB
Y           Speicherplatz NTFS           Fixed     Healthy      OK                     19.82 GB  19.87 GB
                          FAT32          Fixed     Healthy      OK                     71.03 MB  99.21 MB 
E           Volume1       NTFS           Fixed     Healthy      OK                      3.88 GB   3.91 GB 
D                         Unknown        Removable Healthy      Unknown                     0 B       0 B
                          NTFS           Fixed     Healthy      OK                     86.31 MB    786 MB

Also I guess that you're posting to the wrong issue.

sskras commented 5 months ago

@lundman commented 2 days ago:

OK, I have take out everything ZFS. No SPL, no ZFS, no dispatcher, no dprintf, WPP.

Left is the DriverStart() calling StorPortInitialize().

[...]


*** Fatal System Error: 0x000000c2

@lundman commented 2 days ago:

But I've commented out all those things, calls to __cpuidex(), assembly, etc.

Umm, is there anything left to comment out except for DriverStart()?

But that tends to mean it is not my code, but every time I blame the compiler it ends up being my code. Peculiar.

This time I would rather blame logic coming from other drivers.

I do wonder if I can build the sample storport project that the zvol stuff came from, and trigger it outside of this project.

That would be very interesting.

At least could suggest Aomei issue as well.

To me it already suggests that Aomei is buggy. It also could be due to some NT kernel change in the newer Windows version. This is w11, right?

sskras commented 5 months ago

Found a forum in which might be lurking folks skilled in NT kernel crashes: https://www.sysnative.com/forums/forums/bsod-crashes-kernel-debugging.15/

Eg: Storport.sys BSOD case.

lundman commented 5 months ago

I took out the __attribute__(constructor) today, since it would make sense for that to crash during load and unload. Alas, crashing continues just to vex me further. :)

guenther-alka commented 5 months ago

If Windows crashes with ZFS for an unknown reason, my point was to check in what items ZFS behaves different from other volumes and undefined/unknown states are such a point.

lundman commented 5 months ago

What I have seen on powershell get-volume (Open-ZFS ..f74)

Yeah it's interesting the difference there, I do suspect we aren't handling MOUNTMGR correctly, so we will have to return to it. But that issue kicks in when something is mounted. I can crash just loading/unloading the driver, without ever using ZFS parts. But let's put that on the list to look into.

lundman commented 5 months ago

So, instead of calling our zvol's StorPortInitialize(), I grabbed SFBDVMiniport.c from https://github.com/waynemunro/service-fabric-2/blob/master/src/prod/src/BlockStore/SFBDMiniport/SFBDMiniport/SFBDVMiniport.c

and call it's StorPortInitialize() instead;

Driver Verifier: Applied for OpenZFS.sys, 0:0x3bb6dbb, 1:0xa, build 22635, key tnVYFWMVqHgmDJiKeQf9FB
OpenZFS: Started
OpenZFS_Fini
OpenZFS: Goodbye.
Driver Verifier: Removed for OpenZFS.sys, 0:0x3bb6dbb, 1:0xa, build 22635, key tnVYFWMVqHgmDJiKeQf9FB
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000003,0xFFFFBE8BFC90E600,0xFFFFBE8BFC90E558,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)
For analysis of this file, run !analyze -v
nt!DbgBreakPointWithStatus:
fffff807`38e3f520 cc              int     3
1: kd> k
 # Child-SP          RetAddr               Call Site
00 ffffbe8b`fc90dac8 fffff807`38f84822     nt!DbgBreakPointWithStatus
01 ffffbe8b`fc90dad0 fffff807`38f83ee3     nt!KiBugCheckDebugBreak+0x12
02 ffffbe8b`fc90db30 fffff807`38e35487     nt!KeBugCheck2+0xba3
03 ffffbe8b`fc90e2a0 fffff807`38e4ace9     nt!KeBugCheckEx+0x107
04 ffffbe8b`fc90e2e0 fffff807`38e4b2b2     nt!KiBugCheckDispatch+0x69
05 ffffbe8b`fc90e420 fffff807`38e49006     nt!KiFastFailDispatch+0xb2
06 ffffbe8b`fc90e600 fffff807`38d20b8c     nt!KiRaiseSecurityCheckFailure+0x346
07 ffffbe8b`fc90e790 fffff807`3b602535     nt!RtlRemoveEntryHashTable+0x4c
08 ffffbe8b`fc90e7c0 ffffe308`6ecd1c90     0xfffff807`3b602535
09 ffffbe8b`fc90e7c8 badbadfa`badbadfa     0xffffe308`6ecd1c90
0a ffffbe8b`fc90e7d0 ffffbe8b`fc90e850     0xbadbadfa`badbadfa
0b ffffbe8b`fc90e7d8 fffff807`3b603747     0xffffbe8b`fc90e850
0c ffffbe8b`fc90e7e0 00000000`00000000     0xfffff807`3b603747

0xbadbadfabadbadfa is certainly amusing.

But it suggests that it isn't the code. I tried added various Registry entries from the .inf file without luck. Now I am curious if that Service-Fabric-2's miniport driver loads with Aomei at all.

lundman commented 5 months ago

Err the actual stack

1: kd> k
 # Child-SP          RetAddr               Call Site
00 ffffbe8b`fc90dac8 fffff807`38f84822     nt!DbgBreakPointWithStatus
01 ffffbe8b`fc90dad0 fffff807`38f83ee3     nt!KiBugCheckDebugBreak+0x12
02 ffffbe8b`fc90db30 fffff807`38e35487     nt!KeBugCheck2+0xba3
03 ffffbe8b`fc90e2a0 fffff807`38e4ace9     nt!KeBugCheckEx+0x107
04 ffffbe8b`fc90e2e0 fffff807`38e4b2b2     nt!KiBugCheckDispatch+0x69
05 ffffbe8b`fc90e420 fffff807`38e49006     nt!KiFastFailDispatch+0xb2
06 ffffbe8b`fc90e600 fffff807`38d20b8c     nt!KiRaiseSecurityCheckFailure+0x346
07 ffffbe8b`fc90e790 fffff807`3b602535     nt!RtlRemoveEntryHashTable+0x4c
08 ffffbe8b`fc90e7c0 fffff807`3b6013ce     NETIO!FreeSomeCacheBucketEntries+0x7d
09 ffffbe8b`fc90e810 fffff807`3b607218     NETIO!FreeSomeCacheEntries+0x3e
0a ffffbe8b`fc90e840 fffff807`3b60442b     NETIO!UpdateCacheLruBucket+0x828
0b ffffbe8b`fc90e910 fffff807`3b6118f7     NETIO!LruCleanupWorkItemRoutine+0x9b
0c ffffbe8b`fc90e940 fffff807`38ce0afe     NETIO!NetiopIoWorkItemRoutine+0x57
0d ffffbe8b`fc90e990 fffff807`38cf4b35     nt!IopProcessWorkItem+0x8e
0e ffffbe8b`fc90ea00 fffff807`38d1b577     nt!ExpWorkerThread+0x155
0f ffffbe8b`fc90ebf0 fffff807`38e3a414     nt!PspSystemThreadStartup+0x57
10 ffffbe8b`fc90ec40 00000000`00000000     nt!KiStartSystemThread+0x34
lundman commented 5 months ago

Since it is not actually calling us, we can't skip ambakdrv.sys based on "name of caller". I suppose we can skip loading stoport, if we detect it is running on the system.

lundman commented 5 months ago

OK to get some progress, what I will do for now is, the default value of windows_zvol_enabled is 1, and if it is 1, and we detect ambakdrv.sys is loaded, zvols are skipped. If you set windows_zvol_enabled to 2, it will force load OpenZFS with zvols. 0 is disabled as expected.