openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

InterruptDispatch FS re-entry #110

Open lundman opened 1 year ago

lundman commented 1 year ago
Stack

```
: nt!KiExceptionDispatch+0x107
: nt!KiBreakpointTrap+0x305
: OpenZFS!panic+0x4f [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-debug.c @ 33]
: OpenZFS!spl_mutex_enter+0x5f [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-mutex.c @ 121]
: OpenZFS!kmem_cache_alloc+0x41 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2104]
: OpenZFS!zfs_kmem_alloc+0x15a [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2639]
: OpenZFS!__dprintf+0x179 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\zfs\zfs_debug.c @ 235]
: OpenZFS!wzvol_HwStartIo+0x65 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\zfs\zfs_windows_zvol.c @ 683]
: storport!RaidAdapterPostScatterGatherExecute+0x227
: storport!RaUnitStartIo+0x300
: storport!RaidStartIoPacket+0x4f6
: storport!RaUnitScsiIrp+0x215
: storport!RaDriverScsiIrp+0x5a
: OpenZFS!dispatcher+0x313 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\zfs\zfs_vnops_windows.c @ 5797]
: nt!IofCallDriver+0x55
: CLASSPNP!SubmitTransferPacket+0x2fb
: CLASSPNP!ServiceTransferRequest+0x2cc
: CLASSPNP!ClassReadWrite+0x166
: CLASSPNP!ClassGlobalDispatch+0x24
: nt!IofCallDriver+0x55
: partmgr!PmWrite+0x10f
: partmgr!PmGlobalDispatch+0x20
: nt!IofCallDriver+0x55
: passthruparser!PassthruParserHandleSRB+0x1ef
: passthruparser!PassthruParserExecuteScsiRequestDisk+0x9
: storvsp!VspStartJob+0x17d
: storvsp!VspKickJobsList+0x22aa
: storvsp!VspPvtKmclProcessingComplete+0x47
: vmbkmclr!InpFillAndProcessQueue+0xeb
: vmbkmclr!KmclpVmbusIsr+0x129
: vmbusr!ChReceiveChannelInterrupt+0x3c
: vmbusr!ParentRingInterruptDpc+0x81
: nt!KiExecuteAllDpcs+0x491
: nt!KiRetireDpcList+0x2a2
: nt!KxRetireDpcList+0x5
: nt!KiDispatchInterruptContinue
: nt!KiDpcInterruptBypass+0x25
: nt!KiVmbusInterruptDispatch+0xaa
: OpenZFS!spl_mutex_enter+0xf6 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-mutex.c @ 126]
: OpenZFS!kmem_cache_alloc+0x41 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2104]
: OpenZFS!zfs_kmem_zalloc+0x4e [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-kmem.c @ 2582]
: OpenZFS!vdev_mirror_map_alloc+0x24 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\vdev_mirror.c @ 157]
: OpenZFS!vdev_mirror_map_init+0x4de [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\vdev_mirror.c @ 323]
: OpenZFS!vdev_mirror_io_start+0x1f [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\vdev_mirror.c @ 628]
: OpenZFS!zio_vdev_io_start+0x128 [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\zio.c @ 3752]
: OpenZFS!__zio_execute+0x30d [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\zio.c @ 2213]
: OpenZFS!zio_execute+0x1b [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\zfs\zio.c @ 2125]
: OpenZFS!taskq_thread+0x51a [C:\Users\andre\Desktop\openzfs-fix_dprintf_crash\openzfs-fix_dprintf_crash\module\os\windows\spl\spl-taskq.c @ 2097]
```

What appears to happen here is that we receive a dispatch() as per usual and end up in kmem_cache_alloc(), which needs to sleep waiting on a mutex. Then something called KiVmbusInterruptDispatch() triggers. This, in turn, feels the need to call into the ZFS filesystem again via IofCallDriver(), and we end up in our dispatcher(). Eventually we need to wait on that same mutex as before, and panic with "locking against myself".

That we lock against ourselves in a dprintf() isn't so important; we need to handle the larger issue of KiVmbusInterruptDispatch(). What is it? Can we control it, or in some way detect it? Avoid it? And why does it run on the very same thread that is sleeping? Is that another "performance" thing Windows does?

Do the Datacore guys know anything about this?

andrewc12 commented 1 year ago

I'm glad I could find another problem for you 😈

datacore-rm commented 1 year ago

Kernel-mode code is always interruptible: an interrupt with a higher IRQL value can occur at any time, thereby causing another piece of kernel-mode code that has a higher system-assigned IRQL to be run immediately on that processor.

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/always-preemptible-and-always-interruptible.

Interrupts are handled by the currently running thread on that processor. kmem_cache_alloc() is trying to take the same per-CPU cache lock in both calls.
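To make that concrete, a rough sketch of why the two calls collide (not the actual spl-kmem.c code; the SPL's kmutex_t / mutex_enter() / mutex_exit() primitives are assumed, and the cache layout, cpu_index() and MAX_CPUS are simplified stand-ins):

```c
/* Each kmem cache keeps a per-CPU magazine layer guarded by a mutex, so
 * the outer allocation and the nested, DPC-driven allocation on the same
 * processor both reach for the same cc_lock. */
#define MAX_CPUS 64              /* illustrative */

typedef struct kmem_cpu_cache {
    kmutex_t cc_lock;            /* taken by every alloc/free on this CPU */
    void *cc_loaded[16];         /* simplified "magazine" of cached objects */
    int cc_rounds;               /* objects currently in the magazine */
} kmem_cpu_cache_t;

typedef struct kmem_cache {
    kmem_cpu_cache_t cache_cpu[MAX_CPUS];  /* one slice per processor */
} kmem_cache_t;

void *
kmem_cache_alloc_sketch(kmem_cache_t *cp)
{
    /* The taskq thread and the DPC that interrupted it run on the same
     * CPU, hence reach the same cc_lock: the re-entry deadlocks. */
    kmem_cpu_cache_t *ccp = &cp->cache_cpu[cpu_index()];
    void *obj = NULL;

    mutex_enter(&ccp->cc_lock);  /* second enter on one thread -> panic */
    if (ccp->cc_rounds > 0)
        obj = ccp->cc_loaded[--ccp->cc_rounds];
    mutex_exit(&ccp->cc_lock);
    return (obj);
}
```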

lundman commented 1 year ago

Yeah, I guessed as much, but it probably means we need to "detect" this situation in dispatcher() and delay those requests, perhaps with a DPC entry; at least that was my first thought. Maybe checking other projects' sources will show what the standard way to handle this is.

datacore-rm commented 1 year ago

Looks like the recent commit that added the DPC/DISPATCH_LEVEL IRQL check in __dprintf() is not in this binary?
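For reference, the kind of guard being discussed might look like this (a sketch, not the actual commit; KeGetCurrentIrql() is the standard WDK call):

```c
#include <ntddk.h>

/* Skip the debug-print path entirely when at or above DISPATCH_LEVEL,
 * where the allocator's sleeping mutex must not be taken. */
void
__dprintf(const char *file, const char *func, int line, const char *fmt, ...)
{
    if (KeGetCurrentIrql() >= DISPATCH_LEVEL)
        return;  /* cannot safely allocate/sleep from DPC context */

    /* ... format the message and log it as usual ... */
}
```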

andrewc12 commented 1 year ago

It should be; I made that patch to stop the crashes on my system, so I would have the debug build with the patch installed.

datacore-rm commented 1 year ago

Fine. The line number in the call stack for dprintf() seems to be off by many lines, so I wanted to confirm. Thanks.

andrewc12 commented 1 year ago

At worst it could have been this build https://github.com/openzfsonwindows/openzfs/commit/935ef7dcf5ce64c33f19cfb32c674b29058c6f87

Scratch that

Yeah, that is weird. It should be line 680 in zfs_windows_zvol.c.

Anyway I can confirm that I did not touch that file.

datacore-rm commented 1 year ago

I meant line number 235 for the zfs_kmem_alloc() call in __dprintf(). Yes, the line number matches this commit: https://github.com/openzfsonwindows/openzfs/commit/935ef7dcf5ce64c33f19cfb32c674b29058c6f87. That commit does not solve the reported issue.

lundman commented 1 year ago

The DPC thing dies in the wait, which is after the "locking against ourselves" check, so this is not that problem, but rather that we re-entered the FS as the same thread.

datacore-rm commented 1 year ago

Correct. Checked with other folks and below are some points:

1) If the IRQL of the incoming interrupt is higher than the current IRQL, then the current thread executes the ISR. This is by Windows design.
2) HW_STARTIO() can be called at DISPATCH_LEVEL. https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/storport/nc-storport-hw_startio
3) After the ISR, the thread executes DPCs (nt!KiExecuteAllDpcs()), and normally the current IRQL would be DISPATCH_LEVEL, as in the #103 call stack, so we should not use a mutex.

https://www.osronline.com/article.cfm%5Earticle=529.htm

lundman commented 1 year ago

Thanks, I pretty much suspected that was the case, so it is nice to have it confirmed. That 1) and 2) happen is expected, and we can handle them fine. The thing that offends my Unix senses is that it goes and does 3) by "borrowing" the thread. Brr. The question is: when called in our dispatch(), could we detect that case 3) is what is happening, that Windows has temporarily hijacked the thread to do some other processing, so that we can defer these calls until it is done? Is there a CurrentlyInDPC() or similar? Is it enough to check for DISPATCH_LEVEL and just defer the IRP with PENDING?
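A minimal sketch of the "check for DISPATCH_LEVEL and defer with PENDING" idea, assuming a system work item is acceptable for these IRPs; wzvol_deferred_work, dispatch_at_passive and the context struct are hypothetical names, not existing code. (Newer WDKs also expose KeIsExecutingDpc(), which is close to the CurrentlyInDPC() asked about, though checking the IRQL answers the mutex-safety question either way.)

```c
#include <ntddk.h>

typedef struct deferred_irp_ctx {
    PIO_WORKITEM work_item;
    PIRP irp;
} deferred_irp_ctx_t;

static VOID
wzvol_deferred_work(PDEVICE_OBJECT DeviceObject, PVOID Context)
{
    deferred_irp_ctx_t *ctx = Context;

    /* Running on a system worker thread at PASSIVE_LEVEL: now it is
     * safe to take the Unix-style sleeping mutexes. */
    dispatch_at_passive(DeviceObject, ctx->irp);  /* hypothetical helper */

    IoFreeWorkItem(ctx->work_item);
    ExFreePoolWithTag(ctx, 'dfdz');
}

NTSTATUS
dispatcher(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    if (KeGetCurrentIrql() >= DISPATCH_LEVEL) {
        deferred_irp_ctx_t *ctx = ExAllocatePoolWithTag(NonPagedPoolNx,
            sizeof (*ctx), 'dfdz');

        if (ctx == NULL)  /* real code must complete the IRP with an error */
            return (STATUS_INSUFFICIENT_RESOURCES);

        ctx->work_item = IoAllocateWorkItem(DeviceObject);
        if (ctx->work_item == NULL) {
            ExFreePoolWithTag(ctx, 'dfdz');
            return (STATUS_INSUFFICIENT_RESOURCES);
        }
        ctx->irp = Irp;

        /* Mark pending and finish at PASSIVE_LEVEL later. */
        IoMarkIrpPending(Irp);
        IoQueueWorkItem(ctx->work_item, wzvol_deferred_work,
            DelayedWorkQueue, ctx);
        return (STATUS_PENDING);
    }

    /* ... normal PASSIVE_LEVEL handling ... */
    return (STATUS_SUCCESS);
}
```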

tamlin-mike commented 1 year ago

To me the following sounds like a problem:

> kmem_cache_alloc() which needs to sleep waiting on a mutex

Provided I understand the function name correctly, meaning its purpose is to allocate NPOOL memory to place some cached data in, it seems quite strange to wait for something to perform such an allocation. Is such a design/architecture really intended? To me, either there is enough memory (satisfying the request immediately) or there isn't (failing immediately). For a driver to explicitly and manually wait for memory at DPC level... I find that unusual.

Moving on: since the caller producing this condition is partmgr trying to perform a DASD IO, while the lock is held for (from my understanding) a completely different purpose, could part of the problem be that the same dispatch handler (and/or locking object) is pulling double duty, serving both the FS and direct block access?

Additionally, that __dprintf should have to enter a DPC wait just to get some memory seems wrong in itself, even more so since it seems to request cache memory. I would myself allocate some NPOOL memory only once and then keep it in a lookaside buffer/list. I mean, surely __dprintf doesn't require such grotesque amounts of memory that it would have to wait for it? :-)
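For what it's worth, the lookaside approach might look something like this (a sketch, assuming a fixed upper bound on the size of a formatted debug line; all names here are illustrative):

```c
#include <ntddk.h>

#define DPRINTF_BUF_SIZE 1024

static NPAGED_LOOKASIDE_LIST dprintf_lookaside;

VOID
dprintf_buffers_init(VOID)
{
    /* Fixed-size nonpaged buffers, recycled by the list instead of
     * going through the general allocator (and its locks) per print. */
    ExInitializeNPagedLookasideList(&dprintf_lookaside, NULL, NULL, 0,
        DPRINTF_BUF_SIZE, 'fpbd', 0);
}

PCHAR
dprintf_buffer_get(VOID)
{
    return (ExAllocateFromNPagedLookasideList(&dprintf_lookaside));
}

VOID
dprintf_buffer_put(PCHAR buf)
{
    ExFreeToNPagedLookasideList(&dprintf_lookaside, buf);
}
```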

lundman commented 1 year ago

> Provided I understand the function name correctly, meaning its purpose is to allocate NPOOL memory to place some cached data in, it seems quite strange to wait for something to perform such an allocation

The Solaris kmem system is a full memory manager, which you can read about more here: https://people.eecs.berkeley.edu/~kubitron/courses/cs194-24-S14/hand-outs/bonwick_slab.pdf

So essentially, it is what the SunOS/Solaris kernel uses for physical memory: slab allocator, reuse magazines, cache allocator and so on. It is definitely overkill on a system which already HAS a memory manager, but it also gets us up and running fast, especially with the slab cache allocator. It also means we get all the debugging with it, including leak tracking etc. The Solaris kernel has that down to perfection.

So, what's the deal with waiting for allocation? Most allocations will simply find free space in a magazine and return it. It uses a (Unix kernel style) mutex, what Windows would call a critical section, to ensure two competing threads don't modify the magazine struct at the same time. This means waiting, usually for a short time. Sometimes the magazines are exhausted and we need to go get more pages etc., which means calling the Windows allocator. Either way, there is definitely locking needed, and deferring of threads.

In some places in the kernel, allocations can ask for KM_SLEEP (vs KM_NOSLEEP), which means: do not fail the allocation (by returning NULL) but rather wait until memory is available, sleeping until a reap or something else returns memory. So we definitely need to be able to wait sometimes. That is just how the ported code is written.
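The contract is roughly this (a sketch; the helpers are hypothetical stand-ins, not the actual spl-kmem.c code):

```c
/* KM_NOSLEEP may return NULL; KM_SLEEP loops until memory turns up. */
void *
zfs_kmem_alloc_sketch(size_t size, int kmflag)
{
    void *p;

    for (;;) {
        p = magazine_or_windows_alloc(size);   /* hypothetical fast path */
        if (p != NULL || (kmflag & KM_NOSLEEP))
            return (p);                        /* NOSLEEP: NULL is allowed */
        /* KM_SLEEP: never return NULL; wait until a reap or a free
         * makes memory available, then try again. */
        wait_for_memory_event();               /* hypothetical */
    }
}
```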

We can't use critical sections or the other built-in lock types in the Windows kernel; they fiddle with IRQL etc., which is not how Unix code is designed. A mutex in Unix can be held for a long time (as in, hours if needed), like a POSIX mutex. So for ZFS I implemented something equivalent to Unix mutexes using atomics, with Events to sleep on until signaled. That all works fine, most of the way. Until Windows decides to preempt the KeWaitForSingleObject(), which is fine, Unix does this too.

Then it handles the interrupt, also fine, Unix does this too.

Then, when it's bored, it decides to process any waiting DPCs as the same thread that is inside KeWaitForSingleObject(): things the code has asked to "process later please, when we aren't in the middle of things", and it does them on the existing thread. That is what is surprising and, tongue-in-cheek, feels rude :) A thread that is sleeping in a system call gets called again via dispatch(). 🤯
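For the curious, the atomics-plus-Event mechanism described above could be sketched roughly like this (illustrative, not the actual spl-mutex.c; the owner field is what makes the "locking against myself" panic in the stack trace possible):

```c
#include <ntddk.h>

typedef struct spl_mutex_sketch {
    PVOID m_owner;       /* owning thread, NULL when free */
    KEVENT m_event;      /* signaled on release to wake a waiter */
} spl_mutex_sketch_t;

void
sketch_mutex_init(spl_mutex_sketch_t *m)
{
    m->m_owner = NULL;
    KeInitializeEvent(&m->m_event, SynchronizationEvent, FALSE);
}

void
sketch_mutex_enter(spl_mutex_sketch_t *m)
{
    PVOID self = KeGetCurrentThread();

    if (m->m_owner == self)
        panic("mutex: locking against myself");  /* the check in the trace */

    /* Atomically claim the mutex; on failure, sleep on the Event.
     * This KeWaitForSingleObject() is the wait a DPC can preempt. */
    while (InterlockedCompareExchangePointer(&m->m_owner, self, NULL) != NULL)
        KeWaitForSingleObject(&m->m_event, Executive, KernelMode,
            FALSE, NULL);
}

void
sketch_mutex_exit(spl_mutex_sketch_t *m)
{
    InterlockedExchangePointer(&m->m_owner, NULL);
    KeSetEvent(&m->m_event, 0, FALSE);  /* wake one waiter to retry */
}
```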

That it trickles into dprintf() and wants to allocate memory isn't really the problem specifically; ANY ZFS code can/will ask for memory, or indeed take any other mutex, and end up here. dprintf() just conveniently surfaced the issue early.

Anyway, it's annoying; these calls need to be put back into a DPC and processed later.

andrewc12 commented 1 year ago

Note to self: apparently I was using a zvol as a physical disk in Hyper-V, maybe running Mac OS X.

lundman commented 1 year ago

OK, so I've been thinking about this one for a bit.

1) Essentially, the problem comes from KeGetCurrentThread() not being as "unique" as Unix wants it to be; if we could tell the difference between the thread running normally and a DPC borrowing the thread, we'd have no issues. Detecting the DPC case can probably be done, maybe with something like Thread->ApcState.KernelApcInProgress, or avoiding it all with Thread->KernelApcDisable / KeEnterCriticalRegionThread(). But it feels a bit hacky. Let's try not to.

2) We do sort of control DPCs. Either we make them ourselves, or we know where they will come from; at the moment, probably just hwInitData.HwStartIo = wzvol_HwStartIo;. That function (and its descendants) is currently "like" 99% done without ZFS and ZFS mutexes. We should consider that path "special"; with care we can most likely avoid this issue, and at the very least not use dprintf() for debugging there. The same goes for any additional DPCs we may use in the future.

Side note: there is also the fact that the STOR miniport driver we use for zvols wants its own list of Driver->MajorFunction[x], and ZFS also has its own MajorFunction[x], so we have to decide which one applies in the ZFS dispatcher() and relay on to STOR_MajorFunction[x] if the IRP isn't meant for ZFS. Surely Windows has some way to handle this better that I haven't come across yet? Should clean this up one day, if someone points out how it would normally be done. The relay amounts to something like the sketch below.
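A sketch of that relay (STOR_MajorFunction and device_is_zfs() are illustrative names, not the actual code):

```c
static PDRIVER_DISPATCH STOR_MajorFunction[IRP_MJ_MAXIMUM_FUNCTION + 1];

NTSTATUS
dispatcher_sketch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PIO_STACK_LOCATION sp = IoGetCurrentIrpStackLocation(Irp);

    /* IRPs aimed at the STOR miniport's devices are handed back to
     * the dispatch table saved before ZFS hooked the driver. */
    if (!device_is_zfs(DeviceObject))
        return (STOR_MajorFunction[sp->MajorFunction](DeviceObject, Irp));

    /* ... ZFS filesystem handling ... */
    return (STATUS_SUCCESS);
}
```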