Open lundman opened 1 year ago
I'm glad I could find another problem for you 😈
Kernel-mode code is always interruptible: an interrupt with a higher IRQL value can occur at any time, thereby causing another piece of kernel-mode code that has a higher system-assigned IRQL to be run immediately on that processor.
Interrupts are handled by the currently running thread on that processor. kmem_cache_alloc() is trying to take the same per-CPU cache lock in both calls.
Yeah, I guessed as much, but it probably means we need to detect this situation in dispatcher() and delay those calls with a DPC entry. At least that was my first thought; maybe checking other projects' sources will show what the standard way to handle that is.
Looks like the recent commit that adds the DPC/DISPATCH_LEVEL IRQL check to __dprintf() is not in this binary?
It should be, I made the patch to stop crashes on my system. So I would have the debug build with the patch installed.
Fine. The line numbers in the call stack for dprintf() seem to be off by many lines, so I wanted to confirm. Thanks.
At worst it could have been this build https://github.com/openzfsonwindows/openzfs/commit/935ef7dcf5ce64c33f19cfb32c674b29058c6f87
Scratch that
Yeah, that is weird. It should be line 680 in zfs_windows_zvol.c.
Anyway I can confirm that I did not touch that file.
I meant line 235 for the zfs_kmem_alloc() call in _dprintf(). Yes, the line number matches this commit: https://github.com/openzfsonwindows/openzfs/commit/935ef7dcf5ce64c33f19cfb32c674b29058c6f87. That commit does not solve the reported issue.
The DPC thing dies in the wait, which is after the "locking against ourselves" check, so this is not that problem, but rather that we re-entered the FS as the same thread.
Correct. I checked with some other folks, and here are a few points:
1) If the IRQL of the incoming interrupt is higher than the current IRQL, the current thread will execute the ISR. This is by Windows design.
2) HW_STARTIO() can be called at DISPATCH_LEVEL. https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/storport/nc-storport-hw_startio
3) After the ISR, the thread executes DPCs (nt!KiExecuteAllDpcs()), and normally the current IRQL will be DISPATCH_LEVEL, as in the #103 call stack, so we should not use a mutex.
Thanks, I pretty much suspected that was the case, so it is nice to have it confirmed. That 1) and 2) happen is expected, and we can handle it fine. The thing that is offensive to my Unix senses is that it does 3) by "borrowing" the thread. Brr.
The question is: when called in our dispatch(), could we detect that case 3) is what is happening, i.e. that the thread has temporarily been repurposed to do some other processing, so that we can defer these calls until it is done? Is there a CurrentlyInDPC() or similar? Or is it enough to check for DISPATCH_LEVEL and just defer the IRP with PENDING?
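For what it's worth, the IRQL check alone may be enough to spot case 3): KeGetCurrentIrql() returns DISPATCH_LEVEL while a DPC is running. Below is a small host-side sketch (plain user-space C; all sim_* names are invented for illustration, not the real WDK or OpenZFS symbols) of the "check the level, mark pending, drain later" idea:

```c
#include <assert.h>
#include <stddef.h>

/* Host-side simulation: models checking the current IRQL in dispatch()
 * and deferring work that arrives at DISPATCH_LEVEL instead of taking a
 * sleeping mutex. Names are illustrative only. */

enum { PASSIVE_LEVEL = 0, DISPATCH_LEVEL = 2 };
enum { STATUS_SUCCESS = 0, STATUS_PENDING = 0x103 };

static int sim_irql = PASSIVE_LEVEL;    /* stand-in for KeGetCurrentIrql() */

#define MAX_DEFERRED 16                 /* real code would bound/queue safely */
static int deferred[MAX_DEFERRED];
static int ndeferred;

static int handle_now(int irp_id)
{
    (void)irp_id;       /* real code would take mutexes, perform the I/O, ... */
    return STATUS_SUCCESS;
}

/* dispatch(): defer anything that arrives at raised IRQL. */
int sim_dispatch(int irp_id)
{
    if (sim_irql >= DISPATCH_LEVEL) {
        deferred[ndeferred++] = irp_id; /* the IoMarkIrpPending() equivalent */
        return STATUS_PENDING;
    }
    return handle_now(irp_id);
}

/* A worker running at PASSIVE_LEVEL later drains the deferred queue. */
int sim_drain(void)
{
    int done = 0;
    while (ndeferred > 0)
        done += (handle_now(deferred[--ndeferred]) == STATUS_SUCCESS);
    return done;
}
```

The real version would presumably mark the IRP pending and hand it to a passive-level worker (or a cancel-safe queue) rather than an array.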
To me the following sounds like a problem:
> kmem_cache_alloc() which needs to sleep waiting on a mutex
Provided I understand the function name correctly, meaning its purpose is to allocate NPOOL memory to place some cached data in, it seems quite strange to wait for anything to perform such an allocation. Is such a design/architecture really intended? To me, either there is enough memory (satisfying the request immediately) or there isn't (failing immediately). For a driver to explicitly and manually wait for memory at DPC level... I find that unusual.
Moving on: since the caller producing this condition is partmgr trying to perform DASD I/O, while the lock is held for (from my understanding) a completely different purpose, could part of the problem be that the same dispatch handler (and/or locking object) is pulling double duty, serving both FS requests and direct block access?
Additionally, that __dprintf should have to enter a DPC wait just to get some memory seems wrong in itself, even more so since it seems to request cache memory. I would allocate some NPOOL memory only once, then keep it in a lookaside buffer/list. I mean, surely __dprintf doesn't require such grotesque amounts of memory that it would have to wait for it? :-)
> Provided I understand the function name correctly, meaning its purpose is to allocate NPOOL memory to place some cached data in, it seems quite strange to wait for something to perform such an allocation
The Solaris kmem system is a full memory manager, which you can read about more here: https://people.eecs.berkeley.edu/~kubitron/courses/cs194-24-S14/hand-outs/bonwick_slab.pdf
So essentially, it is what the SunOS/Solaris kernel uses for physical memory: the slab allocator, magazine reuse, the cache allocator, etc. It is definitely overkill on a system which already HAS a memory manager, but it also gets us up and running fast, especially with the slab cache allocator. It also means we get all its debugging features, including leak tracking etc. The Solaris kernel has that down to perfection.
So, what's the deal with waiting for an allocation? Most allocations will simply find a free object in a magazine and return it. It uses a (Unix kernel style) mutex, what Windows would call a critical section, to ensure two competing threads don't modify the magazine struct at the same time. This means waiting, usually for a short time. Sometimes the magazines are exhausted and we need to go get more pages, which means calling the Windows allocator. Either way, there is definitely locking needed, and deferring of threads.
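To illustrate that fast/slow path split, here is a rough host-side model of a per-CPU magazine. It is based on the Bonwick paper's idea, not the actual spl-kmem code, and every sim_* name is invented:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified magazine: a small per-CPU array of pre-allocated objects,
 * guarded by a lock, refilled from a simulated "depot" when empty. */

#define MAG_SIZE 4
typedef struct {
    int locked;              /* stand-in for the per-CPU cache mutex */
    int rounds;              /* objects currently in the magazine */
    void *objs[MAG_SIZE];
} sim_magazine_t;

static int depot_refills;    /* how often we fell back to the "depot" */
static int backing[64];      /* pretend backing store for objects */
static int backing_used;

static void sim_mag_refill(sim_magazine_t *mag)
{
    depot_refills++;         /* real code: take depot lock, maybe call the OS */
    while (mag->rounds < MAG_SIZE)
        mag->objs[mag->rounds++] = &backing[backing_used++];
}

void *sim_cache_alloc(sim_magazine_t *mag)
{
    mag->locked = 1;         /* real code: mutex_enter(), which may sleep */
    if (mag->rounds == 0)
        sim_mag_refill(mag); /* slow path: magazine empty */
    void *obj = mag->objs[--mag->rounds];
    mag->locked = 0;         /* mutex_exit() */
    return obj;
}
```

The point of the design is that only the refill path is expensive; the common case is a lock, an array pop, and an unlock.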
Some places in the kernel, allocations can ask for KM_SLEEP (vs KM_NOSLEEP), which means: do not fail the allocation (by returning NULL), but rather wait until memory is available. So sleep, wait for a reap, or any other situation that returns memory; waiting is definitely needed sometimes. That is just how the ported code is written.
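The contract can be sketched like this (simulated, user-space C; the sim_* names and the tiny "pool" are made up for the example, only the KM_SLEEP/KM_NOSLEEP semantics are from the text above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* KM_SLEEP: never return NULL, wait/reclaim until memory appears.
 * KM_NOSLEEP: fail immediately with NULL if memory is short. */
enum { KM_SLEEP = 0, KM_NOSLEEP = 1 };

static size_t pool_free = 64;   /* simulated bytes currently available */

static void reap_caches(void)   /* pretend reclaim returns some memory */
{
    pool_free += 64;
}

void *sim_kmem_alloc(size_t size, int flags)
{
    while (pool_free < size) {
        if (flags == KM_NOSLEEP)
            return NULL;        /* caller must handle allocation failure */
        reap_caches();          /* KM_SLEEP: keep waiting/reaping */
    }
    pool_free -= size;
    return malloc(size);
}
```

This is why KM_SLEEP callers can block indefinitely, and why such a call is illegal in a context that must not sleep.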
We can't use critical sections, or other built-in Windows kernel types; they fiddle with IRQL etc., and that is not how Unix code is designed. A mutex in Unix can be held for a long time (as in, hours if needed), like a posix mutex. So for ZFS I implemented something equivalent to Unix mutexes using atomics, with Events to sleep on until signaled. That all works fine, most of the way. Until Windows decides to preempt the KeWaitForSingleObject(), which is fine, Unix does this too. Then it handles the interrupt, also fine, Unix does this too. Then, when it's bored, it decides to process any waiting DPCs as the same thread that is inside KeWaitForSingleObject(). Things the code has asked to "process later please, when we aren't in the middle of things", and it runs them on the existing thread. That is what is surprising and, tongue-in-cheek, feels rude :) A thread that is sleeping in a system call gets called again via dispatch(). 🤯
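A stripped-down model of that mutex scheme, with the owner check that produces the "locking against myself" panic. C11 atomics stand in for the real Event-based sleep, and all sim_* names are invented, not the actual spl-mutex code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* An atomic owner field plus a "panic on re-entry" check. The real code
 * sleeps on a KEVENT inside KeWaitForSingleObject() when contended; here
 * contention simply spins, which is enough to show the ownership logic. */

typedef struct {
    _Atomic uintptr_t owner;   /* 0 = unowned, else the owning thread id */
} sim_kmutex_t;

static int panicked;           /* stand-in for panic("locking against myself") */

void sim_mutex_enter(sim_kmutex_t *mp, uintptr_t thread_id)
{
    if (atomic_load(&mp->owner) == thread_id) {
        panicked = 1;          /* same thread re-entering: the crash seen here */
        return;
    }
    uintptr_t expected = 0;
    while (!atomic_compare_exchange_weak(&mp->owner, &expected, thread_id)) {
        expected = 0;          /* real code: KeWaitForSingleObject(&event) */
    }
}

void sim_mutex_exit(sim_kmutex_t *mp)
{
    atomic_store(&mp->owner, 0); /* real code: also KeSetEvent(&event) */
}
```

The failure mode in this issue is exactly the owner check firing: the DPC runs on the thread already recorded as the mutex owner, so from the mutex's point of view the "same thread" locks twice.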
That it trickles into dprintf() and wants to allocate memory isn't really the problem specifically; ANY ZFS code can and will ask for memory, or take any other mutex, and end up here. dprintf() just conveniently showed the issue early. Anyway, annoying; these calls need to be put back into a DPC and processed later.
Note to self: apparently I was using a zvol as a physical disk in hyper-v. maybe running macosx.
OK so been thinking about this one for a bit.
1) Essentially, the problem comes from KeGetCurrentThread() not being as "unique" as Unix wants it to be; if we could detect the difference between the thread running normally and a DPC borrowing the thread, we'd have no issues. Detecting a DPC probably can be done, maybe with something like Thread->ApcState.KernelApcInProgress, or by avoiding it all with Thread->KernelApcDisable / KeEnterCriticalRegionThread(). But it feels a bit hacky. Let's try not to.
2) We do sort of control the DPCs: either we create them ourselves, or we know where they will come from. At the moment it is probably just hwInitData.HwStartIo = wzvol_HwStartIo;. That function (and its descendants) is currently "like" 99% done without ZFS and ZFS mutexes. We should consider that path special; with care we can most likely avoid this issue, at the very least by not using dprintf() for debugging there. The same goes for any additional DPCs we may use in the future.
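One low-effort guard for that path: have the debug print avoid the allocator whenever the IRQL is raised. A simulated sketch (sim_* names invented, not the actual __dprintf implementation):

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* If we are at raised IRQL, format into a fixed stack buffer only, instead
 * of calling into the sleeping kmem allocator. */

enum { SIM_PASSIVE = 0, SIM_DISPATCH = 2 };
static int sim_irql = SIM_PASSIVE;      /* stand-in for KeGetCurrentIrql() */

/* Returns 1 if the normal (allocating) path was permitted,
 * 0 if it took the no-allocation stack-buffer path. */
int sim_dprintf(const char *fmt, ...)
{
    char stackbuf[256];
    va_list ap;
    int may_alloc = (sim_irql < SIM_DISPATCH);

    va_start(ap, fmt);
    vsnprintf(stackbuf, sizeof (stackbuf), fmt, ap);  /* always safe */
    va_end(ap);
    /* real code: if (may_alloc) a larger buffer could be allocated and
     * the message queued; otherwise emit stackbuf directly or drop it */
    return may_alloc;
}
```

This does not fix the underlying re-entry, but it keeps the debug path itself from ever needing the kmem mutex at DISPATCH_LEVEL.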
Side-note: there is also the issue that the mini STOR driver we use for zvols wants its own list of Driver->MajorFunction[x], and ZFS has its own MajorFunction[x] as well, so we have to decide which one applies in the ZFS dispatcher() and relay on to STOR_MajorFunction[x] if the IRP isn't meant for ZFS. Surely Windows has some way to handle this better that I haven't come across yet? We should clean this up one day, if someone points out how it would normally be done.
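The relay itself can be modeled as a simple demux on device identity; something like this sketch (hypothetical sim_* names, not the actual dispatcher() code):

```c
#include <assert.h>
#include <stddef.h>

/* Route an IRP either to the ZFS handler or to the saved STOR port
 * handler, based on which device object it targets. */

#define IRP_MJ_MAXIMUM 28

typedef struct sim_device { int is_zvol; } sim_device_t;
typedef int (*sim_dispatch_fn)(sim_device_t *, int major);

static sim_dispatch_fn stor_major[IRP_MJ_MAXIMUM];  /* saved STOR handlers */

static int zfs_handle(sim_device_t *d, int major)
{
    (void)d; (void)major;
    return 1;   /* "handled by the ZFS filesystem path" */
}

static int stor_handle(sim_device_t *d, int major)
{
    (void)d; (void)major;
    return 2;   /* "handled by the STOR port path" */
}

/* dispatcher(): relay zvol IRPs to the saved STOR table instead of
 * treating them as filesystem requests. */
int sim_dispatcher(sim_device_t *dev, int major)
{
    if (dev->is_zvol && stor_major[major] != NULL)
        return stor_major[major](dev, major);
    return zfs_handle(dev, major);
}
```

The design question from the comment above remains: whether Windows offers a cleaner way than manually saving and forwarding the other driver's MajorFunction table.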
Stack
```
: nt!KiExceptionDispatch+0x107
: nt!KiBreakpointTrap+0x305
: OpenZFS!panic+0x4f [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-debug.c @ 33]
: OpenZFS!spl_mutex_enter+0x5f [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-mutex.c @ 121]
: OpenZFS!kmem_cache_alloc+0x41 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2104]
: OpenZFS!zfs_kmem_alloc+0x15a [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2639]
: OpenZFS!__dprintf+0x179 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\zfs\zfs_debug.c @ 235]
: OpenZFS!wzvol_HwStartIo+0x65 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\zfs\zfs_windows_zvol.c @ 683]
: storport!RaidAdapterPostScatterGatherExecute+0x227
: storport!RaUnitStartIo+0x300
: storport!RaidStartIoPacket+0x4f6
: storport!RaUnitScsiIrp+0x215
: storport!RaDriverScsiIrp+0x5a
: OpenZFS!dispatcher+0x313 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\zfs\zfs_vnops_windows.c @ 5797]
: nt!IofCallDriver+0x55
: CLASSPNP!SubmitTransferPacket+0x2fb
: CLASSPNP!ServiceTransferRequest+0x2cc
: CLASSPNP!ClassReadWrite+0x166
: CLASSPNP!ClassGlobalDispatch+0x24
: nt!IofCallDriver+0x55
: partmgr!PmWrite+0x10f
: partmgr!PmGlobalDispatch+0x20
: nt!IofCallDriver+0x55
: passthruparser!PassthruParserHandleSRB+0x1ef
: passthruparser!PassthruParserExecuteScsiRequestDisk+0x9
: storvsp!VspStartJob+0x17d
: storvsp!VspKickJobsList+0x22aa
: storvsp!VspPvtKmclProcessingComplete+0x47
: vmbkmclr!InpFillAndProcessQueue+0xeb
: vmbkmclr!KmclpVmbusIsr+0x129
: vmbusr!ChReceiveChannelInterrupt+0x3c
: vmbusr!ParentRingInterruptDpc+0x81
: nt!KiExecuteAllDpcs+0x491
: nt!KiRetireDpcList+0x2a2
: nt!KxRetireDpcList+0x5
: nt!KiDispatchInterruptContinue
: nt!KiDpcInterruptBypass+0x25
: nt!KiVmbusInterruptDispatch+0xaa
: OpenZFS!spl_mutex_enter+0xf6 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-mutex.c @ 126]
: OpenZFS!kmem_cache_alloc+0x41 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2104]
: OpenZFS!zfs_kmem_zalloc+0x4e [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2582]
: OpenZFS!vdev_mirror_map_alloc+0x24 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\vdev_mirror.c @ 157]
: OpenZFS!vdev_mirror_map_init+0x4de [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\vdev_mirror.c @ 323]
: OpenZFS!vdev_mirror_io_start+0x1f [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\vdev_mirror.c @ 628]
: OpenZFS!zio_vdev_io_start+0x128 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\zio.c @ 3752]
: OpenZFS!__zio_execute+0x30d [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\zio.c @ 2213]
: OpenZFS!zio_execute+0x1b [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\zio.c @ 2125]
: OpenZFS!taskq_thread+0x51a [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-taskq.c @ 2097]
```
What appears to happen here is that we receive a dispatch() as per usual, and end up in kmem_cache_alloc(), which needs to sleep waiting on a mutex. Then something called KiVmbusInterruptDispatch() triggers. This, in turn, feels the need to call into the ZFS filesystem again, via IofCallDriver(), and we end up in our dispatcher(). Eventually, we need to wait on that same mutex as before, and panic with "locking against myself".

That we lock against ourselves in a dprintf() isn't so important; we need to handle the larger issue of KiVmbusInterruptDispatch(). What is it? Can we control it, or in some way detect it? Avoid it? Why is it running as the very same thread that is sleeping? Another "performance" thing Windows does? Do the Datacore guys know anything about this?