openzfsonwindows / ZFSin

OpenZFS on Windows port
https://openzfsonwindows.org

PFN_LIST_CORRUPT (4e) #201

Open lundman opened 4 years ago

lundman commented 4 years ago

As reported by Datacore, on version 0.21.

I created a pool with dedup=on and carved out a zvol using these commands.

    zpool create -O dedup=on -O casesensitivity=insensitive -O compression=lz4 -O atime=off -o ashift=12 zdedup PHYSICALDRIVEx
    zfs create -s -V 3GB -b 512 zdedup/Z._629e5981-fa3f-11e9-84be-00155d050e1d_-00000060

I run I/O to the presented Windows disk and after a short while, even after I/O has been stopped, I get this BSOD. I took a quick look at it and it seems the slab is being reused while it is being disposed of.


```
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PFN_LIST_CORRUPT (4e)
Typically caused by drivers passing bad memory descriptor lists (ie: calling
MmUnlockPages twice with the same list, etc).
If a kernel debugger is available get the stack trace.
Arguments:
Arg1: 000000000000009a, Arg2: 0000000000084fab
Arg3: 0000000000000006
Arg4: 0000000000000002

Debugging Details:
------------------

DUMP_CLASS: 1
DUMP_QUALIFIER: 401
BUILD_VERSION_STRING: 14393.2879.amd64fre.rs1_release_inmarket.190313-1855
SYSTEM_MANUFACTURER: Microsoft Corporation
VIRTUAL_MACHINE: HyperV
SYSTEM_PRODUCT_NAME: Virtual Machine
SYSTEM_SKU: None
SYSTEM_VERSION: Hyper-V UEFI Release v4.0
BIOS_VENDOR: Microsoft Corporation
BIOS_VERSION: Hyper-V UEFI Release v4.0
BIOS_DATE: 01/30/2019
BASEBOARD_MANUFACTURER: Microsoft Corporation
BASEBOARD_PRODUCT: Virtual Machine
BASEBOARD_VERSION: Hyper-V UEFI Release v4.0
DUMP_TYPE: 1
BUGCHECK_P1: 9a
BUGCHECK_P2: 84fab
BUGCHECK_P3: 6
BUGCHECK_P4: 2
BUGCHECK_STR: 0x4E_9a
CPU_COUNT: 6
CPU_MHZ: da4
CPU_VENDOR: GenuineIntel
CPU_FAMILY: 6
CPU_MODEL: 3f
CPU_STEPPING: 2
CPU_MICROCODE: 6,3f,2,0 (F,M,S,R)  SIG: FFFFFFFF'00000000 (cache) FFFFFFFF'00000000 (init)
DEFAULT_BUCKET_ID: CODE_CORRUPTION
PROCESS_NAME: System
CURRENT_IRQL: 2
ANALYSIS_SESSION_HOST: THIERRYW10
ANALYSIS_SESSION_TIME: 01-17-2020 16:30:11.0543
ANALYSIS_VERSION: 10.0.10586.567 amd64fre
LAST_CONTROL_TRANSFER: from fffff803db09f80b to fffff803db161950

STACK_TEXT:
ffffc600`4bee4368 fffff803`db09f80b : 00000000`0000004e 00000000`0000009a 00000000`00084fab 00000000`00000006 : nt!KeBugCheckEx
ffffc600`4bee4370 fffff803`db09ef2c : ffffc600`4bee4501 00000000`00000100 00000000`00000100 00000000`00100000 : nt!MiInsertNonPagedPoolOnSlist+0x51b
ffffc600`4bee4460 fffff803`db254503 : ffff9306`3ff00000 ffffc600`4bee4570 00000000`00000000 fffff80b`00000000 : nt!ExFreeLargePool+0x52c
ffffc600`4bee4530 fffff80b`a26ecabe : ffff9306`3ff00000 00000000`000000de ffff9305`fb61e580 00000000`00000001 : nt!ExFreePoolWithTag+0x503
ffffc600`4bee4610 fffff80b`a26fccfc : ffff9306`3ff00000 00000000`00100000 00000001`00000001 ffff9305`fca6f770 : ZFSin!osif_free+0x1e [C:\src\ZFSin\ZFSin\spl\module\spl\spl-seg_kmem.c @ 178]
ffffc600`4bee4640 fffff80b`a26fc345 : fffff80b`a2bd7b70 ffff9306`3ff00000 fffff80b`00100000 fffff80b`a26f5587 : ZFSin!xnu_free_throttled+0xcc [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 2660]
ffffc600`4bee4680 fffff80b`a26f70de : fffff80b`a2bd7b70 ffff9306`3ff00000 ffff9305`00100000 00000000`00000001 : ZFSin!vmem_xfree+0x325 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1693]
ffffc600`4bee46f0 fffff80b`a26f555a : fffff80b`a2bd7b70 ffff9306`3ff80000 00000001`00020000 ffff9305`fca6f770 : ZFSin!vmem_free+0x6e [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1772]
ffffc600`4bee4720 fffff80b`a26fc345 : fffff80b`a2bd1120 ffff9306`3ff80000 00000001`00020000 ffff9306`33e36300 : ZFSin!vmem_bucket_free+0x3a [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 3020]
ffffc600`4bee4760 fffff80b`a26f70de : fffff80b`a2bdf4f0 ffff9306`3ff80000 ffff9306`00020000 ffff9306`33e2e700 : ZFSin!vmem_xfree+0x325 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1693]
ffffc600`4bee47d0 fffff80b`a26fc345 : fffff80b`a2bdf4f0 ffff9306`3ff80000 00000001`00020000 ffff9306`33e36280 : ZFSin!vmem_free+0x6e [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1772]
ffffc600`4bee4800 fffff80b`a26f70de : fffff80b`a2c2f000 ffff9306`3ff80000 ffff9306`00020000 ffff9306`33e2e680 : ZFSin!vmem_xfree+0x325 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1693]
ffffc600`4bee4870 fffff80b`a26fc345 : fffff80b`a2c2f000 ffff9306`3ff80000 00000001`00020000 ffff9305`fca6f770 : ZFSin!vmem_free+0x6e [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1772]
ffffc600`4bee48a0 fffff80b`a26f70de : fffff80b`a2c31000 ffff9306`3ff80000 ffff9305`00020000 fffff803`db07614c : ZFSin!vmem_xfree+0x325 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1693]
ffffc600`4bee4910 fffff80b`a26e0e99 : fffff80b`a2c31000 ffff9306`3ff80000 fffff80b`00020000 00000000`00000080 : ZFSin!vmem_free+0x6e [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vmem.c @ 1772]
ffffc600`4bee4940 fffff80b`a26e13b0 : ffff9305`fcb2d040 ffff9306`3cdfb498 ffff9305`fc5fb040 fffff80b`a26ebcc6 : ZFSin!kmem_slab_destroy+0x179 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-kmem.c @ 1315]
ffffc600`4bee49a0 fffff80b`a26dda1e : ffff9305`fcb2d040 ffff9306`3ff80000 ffff9305`fc5fb040 fffff80b`a26dbf05 : ZFSin!kmem_slab_free+0x510 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-kmem.c @ 1544]
ffffc600`4bee4a30 fffff80b`a26dc11b : ffff9305`fcb2d040 ffff9306`42d68970 00000000`00000001 fffff80b`a26f0f42 : ZFSin!kmem_magazine_destroy+0x1ee [C:\src\ZFSin\ZFSin\spl\module\spl\spl-kmem.c @ 1756]
ffffc600`4bee4aa0 fffff80b`a26db308 : ffff9305`fcb2d040 ffff9305`fc5fb040 00000000`00008598 fffff80b`a26ebcc6 : ZFSin!kmem_depot_ws_reap+0x12b [C:\src\ZFSin\ZFSin\spl\module\spl\spl-kmem.c @ 1863]
ffffc600`4bee4b00 fffff80b`a26f144f : ffff9305`fcb2d040 ffff9305`00000002 ffff9305`fca6f830 ffffc600`4bee4bb8 : ZFSin!kmem_cache_reap+0x148 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-kmem.c @ 2846]
ffffc600`4bee4b50 fffff803`db02a005 : ffff9305`fca6f770 fffff80b`a26f0fc0 ffff9305`fc394dc0 ffff9305`fc394dc0 : ZFSin!taskq_thread+0x48f [C:\src\ZFSin\ZFSin\spl\module\spl\spl-taskq.c @ 1610]
ffffc600`4bee4c10 fffff803`db168c36 : ffffc600`4b3c6180 ffff9305`fc5fb040 fffff803`db029fc4 ffff9305`fc47a030 : nt!PspSystemThreadStartup+0x41
ffffc600`4bee4c60 00000000`00000000 : ffffc600`4bee5000 ffffc600`4bedf000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: kb
MODULE_NAME: memory_corruption
IMAGE_NAME: memory_corruption
FOLLOWUP_NAME: memory_corruption
DEBUG_FLR_IMAGE_TIMESTAMP: 0
MEMORY_CORRUPTOR: LARGE
FAILURE_BUCKET_ID: MEMORY_CORRUPTION_LARGE
BUCKET_ID: MEMORY_CORRUPTION_LARGE
PRIMARY_PROBLEM_CLASS: MEMORY_CORRUPTION_LARGE
TARGET_TIME: 2020-01-17T19:32:24.000Z
OSBUILD: 14393
OSSERVICEPACK: 0
SERVICEPACK_NUMBER: 0
OS_REVISION: 0
SUITE_MASK: 272
PRODUCT_TYPE: 3
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
OSEDITION: Windows 10 Server TerminalServer SingleUserTS
OS_LOCALE:
USER_LCID: 0
OSBUILD_TIMESTAMP: 2019-03-14 01:46:51
BUILDDATESTAMP_STR: 190313-1855
BUILDLAB_STR: rs1_release_inmarket
BUILDOSVER_STR: 10.0.14393.2879.amd64fre.rs1_release_inmarket.190313-1855
ANALYSIS_SESSION_ELAPSED_TIME: 262a
ANALYSIS_SOURCE: KM
FAILURE_ID_HASH_STRING: km:memory_corruption_large
FAILURE_ID_HASH: {e29154ac-69a4-0eb8-172a-a860f73c0a3c}

Followup: memory_corruption
---------
```

lundman commented 4 years ago

I assume you can still trigger it even if compression=off ?

lundman commented 4 years ago

Hmm actually, I should also ask about dedup. I've never tried it on Windows so far, so it's a bit unknown.

lundman commented 4 years ago

Running tests, before I can hit that crash, I often die in:

    hal!HalpTimerClockInterrupt+0xf7    C/C++/ASM
    nt!KiCallInterruptServiceRoutine+0xa5   C/C++/ASM
    nt!KiInterruptSubDispatchNoLockNoEtw+0xfa   C/C++/ASM
    nt!KiInterruptDispatchNoLockNoEtw+0x37  C/C++/ASM
    nt!KiAcquireKobjectLockSafe+0x31    C/C++/ASM
    nt!KeSetEvent+0x65  C/C++/ASM
>   ZFSin!spl_mutex_exit+0x66 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-mutex.c @ 141] C/C++/ASM
    ZFSin!zio_wait_for_children+0x19a [C:\src\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 682]   C/C++/ASM
    ZFSin!zio_done+0x3e [C:\src\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 4253]    C/C++/ASM

and I'm a bit surprised that a call to KeSetEvent() can hang forever; I would not have expected it to have a wait in it at all.

lundman commented 4 years ago

OK, not perhaps so surprising:

1: kd> dt mp
Local var @ 0xfffff98102c14430 Type kmutex*
0xffffb707`0ac10408 
   +0x000 m_lock           : mutex_t
   +0x018 m_owner          : 0xdeadbeef`deadbeef Void
   +0x020 initialised      : 0xdeadbeef

That would be memory being freed, then an attempt to unlock it.

And walking up the stack, we get all the way to zio_ddt_write(cio); with a freed zio.

lundman commented 4 years ago

Updating all dedup-related commits from ZOL (c14a719 - 6304af8). The memory corruption is no longer the first problem I run into, so perhaps that is "better".

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_RECORD:  fffff981027bb608 -- (.exr 0xfffff981027bb608)
ExceptionAddress: fffff8004e9d56be (storport!RaidLogMiniportCompletion+0x000000000000002e)
   ExceptionCode: c0000005 (Access violation)

NULL ptr reference;
: storport!RaidLogMiniportCompletion+0x2e 
: storport!RaidAdapterRequestComplete+0x199 : storport!StorPortNotification+0x103 
: ZFSin!wzvol_WkRtn+0x21a [C:\src\ZFSin\ZFSin\zfs\module\zfs\zfs_windows_zvol_scsi.c @ 819] 
: ZFSin!wzvol_GeneralWkRtn+0x6d [C:\src\ZFSin\ZFSin\zfs\module\zfs\zfs_windows_zvol_scsi.c @ 848] 
: nt!IopProcessWorkItem+0xff
: nt!ExpWorkerThread+0x105

   817:     StorPortNotification(RequestComplete, pHBAExt, pSrb);

Since pHBAExt is our struct, it presumably refers to something in pSrb.

0: kd> dt pSrb
Local var @ 0xfffff981027bba68 Type _SCSI_REQUEST_BLOCK*
0xffffb706`cd7c6ec0 
   +0x000 Length           : 0x58
   +0x002 Function         : 0 ''
   +0x003 SrbStatus        : 0x1 ''
   +0x004 ScsiStatus       : 0 ''
   +0x005 PathId           : 0 ''
   +0x006 TargetId         : 0 ''
   +0x007 Lun              : 0 ''
   +0x008 QueueTag         : 0xe2 ''
   +0x009 QueueAction      : 0x20 ' '
   +0x00a CdbLength        : 0xa ''
   +0x00b SenseInfoBufferLength : 0x12 ''
   +0x00c SrbFlags         : 0x40200382
   +0x010 DataTransferLength : 0x11000
   +0x014 TimeOutValue     : 0x41
   +0x018 DataBuffer       : 0x00000000`000cc000 Void
   +0x020 SenseInfoBuffer  : 0xffffb706`c5178298 Void
   +0x028 NextSrb          : (null) 
   +0x030 OriginalRequest  : 0xffffb706`c5be1260 Void
   +0x038 SrbExtension     : (null) 
   +0x040 InternalStatus   : 0x1c3f0
   +0x040 QueueSortKey     : 0x1c3f0
   +0x040 LinkTimeoutValue : 0x1c3f0
   +0x044 Reserved         : 0
   +0x048 Cdb              : [16]  "*"
datacore-tlaurent commented 4 years ago

OK, not perhaps so surprising:

1: kd> dt mp
Local var @ 0xfffff98102c14430 Type kmutex*
0xffffb707`0ac10408 
   +0x000 m_lock           : mutex_t
   +0x018 m_owner          : 0xdeadbeef`deadbeef Void
   +0x020 initialised      : 0xdeadbeef

That would be memory freed, then attempted to unlocked.

And walking the stack up, we get all the way to zio_ddt_write(cio); with a freed zio.

I have another instance of a BSOD where the same zio is currently being processed by 2 different taskq_thread instances and it crashes when trying to set an event on a mutex_exit. I am not familiar with the requirements for the taskq_thread logic but on the surface it seems flawed for a specific zio pointer to be handled by 2 different threads. I have not tried with all the updates you made recently but this seems different. I'll create another thread if you confirm my suspicion about the flawed double-ownership for the zio.
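If it helps narrow that suspicion down, here is a minimal debug-only sketch of how one could assert that a zio is never executed by two threads at once. The io_executor field and the two helpers are hypothetical; nothing like this exists in zio_t today:

```
/*
 * Debug-only sketch (io_executor is an assumed extra PVOID field in zio_t):
 * atomically record the executing thread and assert nobody else owns it.
 */
static inline void
zio_execute_begin(zio_t *zio)
{
	PVOID prev = InterlockedCompareExchangePointer(
	    &zio->io_executor, PsGetCurrentThread(), NULL);
	ASSERT3P(prev, ==, NULL);	/* fires if two taskq threads share a zio */
}

static inline void
zio_execute_end(zio_t *zio)
{
	InterlockedExchangePointer(&zio->io_executor, NULL);
}
```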

lundman commented 4 years ago

Two taskq's should not be processing the same zio, I don't think. But the recent commits do fix issues in precisely that area, like:

cd0315a : Reduce taskq and context-switch cost of zio pipe

So it would be interesting if you guys could rebase and test again.

Also, the code that commit 146839320069426ba6dbd902fb904f68ed8c029f fixes is just embarrassing - any IO could randomly be marked as failure :) So definitely worth checking how things behave after that commit.

datacore-tlaurent commented 4 years ago

Regarding this comment related to a SetEvent() that blocks forever: https://github.com/openzfsonwindows/ZFSin/issues/201#issuecomment-576089430

I got the same thing today (after rebasing at 1/22 level) and my debugger shows that the mutex pointer passed in refers to memory that is all initialized with 0xdeadbeef. The zio itself is also all filled in with 0xdeadbeef so it looks like the structure has been freed by another thread (if freeing writes 0xdeadbeef, I'm using the debug build). This is surely why the entire host is hosed up (like in my case). I got it while importing the pool after restart, not doing any external I/O. I'll take a deeper look at the live dump I got. Here's the stack.

    Child-SP          RetAddr           Call Site
    00 ffffab0197de8d88 fffff803a6aadf7c nt!DbgBreakPointWithStatus
    01 ffffab0197de8d90 fffff803a6aab258 nt!KeAccumulateTicks+0x39c
    02 ffffab0197de8df0 fffff803a72224e5 nt!KeClockInterruptNotify+0xb8
    03 ffffab0197de8f40 fffff803a6a60566 hal!HalpTimerClockIpiRoutine+0x15
    04 ffffab0197de8f70 fffff803a6b6214a nt!KiCallInterruptServiceRoutine+0x106
    05 ffffab0197de8fb0 fffff803a6b62637 nt!KiInterruptSubDispatchNoLockNoEtw+0xea
    06 ffffab019b7916e0 fffff803a6b0b2b3 nt!KiInterruptDispatchNoLockNoEtw+0x37
    07 ffffab019b791870 fffff803a6a750a1 nt!KiAcquireKobjectLockSafe+0x33
    08 ffffab019b7918a0 fffff809bb74bcf6 nt!KeSetEvent+0x51
    09 ffffab019b791930 fffff809bb9d379a ZFSin!spl_mutex_exit+0x66 [F:\repos\zfsin\ZFSin\spl\module\spl\spl-mutex.c @ 141]
    0a ffffab019b791960 fffff809bb9c6602 ZFSin!zio_wait_for_children+0x19a [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 682]
    0b ffffab019b7919d0 fffff809bb9c8962 ZFSin!zio_vdev_io_done+0x62 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 3573]
    0c ffffab019b791a40 fffff809bb9ce4c3 ZFSin!__zio_execute+0x342 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 1997]
    0d ffffab019b791ad0 fffff809bb91cebe ZFSin!zio_execute+0x13 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 1927]
    0e ffffab019b791b00 fffff809bb9c5d1a ZFSin!vdev_mirror_io_start+0x24e [F:\repos\zfsin\ZFSin\zfs\module\zfs\vdev_mirror.c @ 601]
    0f ffffab019b791bb0 fffff809bb9c8962 ZFSin!zio_vdev_io_start+0x16a [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 3451]
    10 ffffab019b791cd0 fffff809bb9d0aa2 ZFSin!__zio_execute+0x342 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 1997]
    11 ffffab019b791d60 fffff809bb8dbd6a ZFSin!zio_nowait+0x132 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zio.c @ 2060]
    12 ffffab019b791dd0 fffff809bb856270 ZFSin!spa_load_verify_cb+0x39a [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 2164]
    13 ffffab019b791e60 fffff809bb856802 ZFSin!traverse_visitbp+0x600 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 285]
    14 ffffab019b792010 fffff809bb856802 ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    15 ffffab019b7921c0 fffff809bb8546ec ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    16 ffffab019b792370 fffff809bb856d2f ZFSin!traverse_dnode+0x1bc [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 494]
    17 ffffab019b792400 fffff809bb856802 ZFSin!traverse_visitbp+0x10bf [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 356]
    18 ffffab019b7925b0 fffff809bb856802 ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    19 ffffab019b792760 fffff809bb856802 ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    1a ffffab019b792910 fffff809bb856802 ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    1b ffffab019b792ac0 fffff809bb856802 ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    1c ffffab019b792c70 fffff809bb8546ec ZFSin!traverse_visitbp+0xb92 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 321]
    1d ffffab019b792e20 fffff809bb85709c ZFSin!traverse_dnode+0x1bc [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 494]
    1e ffffab019b792eb0 fffff809bb854f96 ZFSin!traverse_visitbp+0x142c [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 397]
    1f ffffab019b793060 fffff809bb85451d ZFSin!traverse_impl+0x736 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 648]
    20 ffffab019b793140 fffff809bb854413 ZFSin!traverse_dataset_resume+0x8d [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 678]
    21 ffffab019b7931a0 fffff809bb855255 ZFSin!traverse_dataset+0x43 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 685]
    22 ffffab019b7931e0 fffff809bb8db6ec ZFSin!traverse_pool+0x1f5 [F:\repos\zfsin\ZFSin\zfs\module\zfs\dmu_traverse.c @ 740]
    23 ffffab019b7932c0 fffff809bb8d9c2a ZFSin!spa_load_verify+0x17c [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 2210]
    24 ffffab019b793360 fffff809bb8da544 ZFSin!spa_ld_verify_pool_data+0x3a [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 3720]
    25 ffffab019b7933a0 fffff809bb8d9cc3 ZFSin!spa_load_impl+0x384 [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 4190]
    26 ffffab019b793420 fffff809bb8d9e55 ZFSin!spa_load+0x53 [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 2398]
    27 ffffab019b793480 fffff809bb8d6256 ZFSin!spa_load_best+0xc5 [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 4337]
    28 ffffab019b793510 fffff809bb9628d9 ZFSin!spa_import+0x256 [F:\repos\zfsin\ZFSin\zfs\module\zfs\spa.c @ 5663]
    29 ffffab019b7935c0 fffff809bb96b826 ZFSin!zfs_ioc_pool_import+0x109 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zfs_ioctl.c @ 1687]
    2a ffffab019b793610 fffff809bb98c0c9 ZFSin!zfsdev_ioctl+0x716 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zfs_ioctl.c @ 7658]
    2b ffffab019b793700 fffff809bb9898c7 ZFSin!ioctlDispatcher+0x1f9 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zfs_vnops_windows.c @ 3910]
    2c ffffab019b793790 fffff803a6e7d080 ZFSin!dispatcher+0x127 [F:\repos\zfsin\ZFSin\zfs\module\zfs\zfs_vnops_windows.c @ 4717]
    2d ffffab019b793820 fffff803a6e7c41c nt!IopSynchronousServiceTail+0x1a0
    2e ffffab019b7938e0 fffff803a6e7b676 nt!IopXxxControlFile+0xd9c
    2f ffffab019b793a20 fffff803a6b70503 nt!NtDeviceIoControlFile+0x56
    30 ffffab019b793a90 00007ffe93fb5be4 nt!KiSystemServiceCopyEnd+0x13
    31 000000488c38b238 00007ffe910d4943 ntdll!NtDeviceIoControlFile+0x14
    32 000000488c38b240 00007ffe91a16840 KERNELBASE!DeviceIoControl+0x73
    33 000000488c38b2b0 00007ff63e71ccc6 KERNEL32!DeviceIoControlImplementation+0x80
    34 000000488c38b300 0000000000000000 zpool+0x26ccc6

lundman commented 4 years ago

Related to fffff8004e9d56be (storport!RaidLogMiniportCompletion+0x000000000000002e)

I was single-stepping through a zvolWrite request when it stopped in

zio.c
2545:   lr = rangelock_enter(&zv->zv_rangelock, position, count,
2546:       RL_WRITER);

suddenly the VM continued to run, as my thread had been swapped out while waiting.

Eventually, it died calling StorPortNotification(RequestComplete, pHBAExt, pSrb);

Dumping all the stacks, I then expected to find a write thread stuck waiting for the rangelock:

                                        ZFSin!spl_cv_wait+0xf3
                                        ZFSin!rangelock_enter_writer+0x23b
                                        ZFSin!rangelock_enter+0x170
                                        ZFSin!zvol_write_win+0x10e
                                        ZFSin!wzvol_WkRtn+0x32b
                                        ZFSin!wzvol_GeneralWkRtn+0x6d

I wonder, then, if there is a timeout and retry involved, which won't go well if we are still waiting for a lock. Interestingly, we also have

                                        nt!KeWaitForSingleObject+0x255
                                        ZFSin!spl_mutex_enter+0x12b
                                        ZFSin!dbuf_create+0x588
                                        ZFSin!__dbuf_hold_impl+0x598
                                        ZFSin!dbuf_hold_impl+0x87
                                        ZFSin!dbuf_hold_level+0x50
                                        ZFSin!dbuf_hold+0x29
                                        ZFSin!dmu_buf_hold_array_by_dnode+0x27c
                                        ZFSin!dmu_write_win_dnode+0x7f
                                        ZFSin!dmu_write_win_dbuf+0x84
                                        ZFSin!zvol_write_win+0x25a
                                        ZFSin!wzvol_WkRtn+0x32b

                                        nt!KeSetEvent+0xb3
                                        ZFSin!spl_mutex_exit+0x120
                                        ZFSin!kmem_cache_alloc+0xab
                                        ZFSin!arc_hdr_alloc+0xa5
                                        ZFSin!arc_alloc_buf+0x5d
                                        ZFSin!dbuf_noread+0x24b
                                        ZFSin!dmu_buf_will_fill+0x1d7
                                        ZFSin!dmu_write_win_dnode+0x243
                                        ZFSin!dmu_write_win_dbuf+0x84
                                        ZFSin!zvol_write_win+0x25a
                                        ZFSin!wzvol_WkRtn+0x32b
                                        ZFSin!wzvol_GeneralWkRtn+0x6d

and about 8 other threads looking the same.

Do we have a rangelock leak? Being locked forever is certainly not expected. Or is it starvation?
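For context, the zvol write path is expected to hold the range lock across the whole DMU write. A simplified sketch of that pattern (not the actual ZFSin function bodies; do_dmu_write() is a hypothetical stand-in) shows why any early-return path that skips the release would look exactly like this hang:

```
/*
 * Simplified sketch of the zvol write path's locking (hypothetical helper
 * do_dmu_write; other names as seen in the stacks above).
 */
static int
zvol_write_range_sketch(zvol_state_t *zv, uint64_t position, uint64_t count)
{
	locked_range_t *lr;
	int error;

	lr = rangelock_enter(&zv->zv_rangelock, position, count, RL_WRITER);

	/* May block on dbuf/ARC mutexes, as the other stacks show. */
	error = do_dmu_write(zv, position, count);

	/* Must run on every path; a skipped exit blocks later writers forever. */
	rangelock_exit(lr);

	return (error);
}
```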

lundman commented 4 years ago

Hmm that appears to be a memory allocation blocking forever, while holding the rangelock. What is odd is that it is calling KeSetEvent() and getting stuck.

lundman commented 4 years ago

Ok, I can confirm that I have 2 zvol_write_win() threads with identical offset and size, so I am guessing that something times out somewhere and re-issues the request. The crash comes when one of the requests is eventually completed, followed by the second one trying to call StorPortNotification(RequestComplete, ...) again, and we crash. We will eventually have to do something to protect against this. Can we send a notification that we received the request and are working on it?
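One possible guard, sketched here on the assumption that we can keep a small completion flag alongside each outstanding SRB (the wzvol_complete_once() helper and the flag are hypothetical, not existing ZFSin code), is to make the completion path idempotent so a duplicate of the same request can never complete the SRB twice:

```
/*
 * Sketch: complete an SRB at most once.  'completed' is a hypothetical
 * per-request flag (initialised to 0) kept next to the SRB we track.
 */
static VOID
wzvol_complete_once(PVOID pHBAExt, PSCSI_REQUEST_BLOCK pSrb,
    volatile LONG *completed)
{
	/* First caller flips 0 -> 1 and completes; any duplicate bails out. */
	if (InterlockedCompareExchange(completed, 1, 0) == 0)
		StorPortNotification(RequestComplete, pHBAExt, pSrb);
}
```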

Meanwhile, my biggest issue is running out of RAM, as the VM is cramped, and that appears to be due to ints defined as atomic (#defined away to nothing under Windows) and thread concurrency leaving them non-zero forever. I will correct this with atomics next.
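For the atomics part, a rough sketch of what backing the Solaris-style counters with the Interlocked intrinsics could look like (illustrative only; it assumes the SPL's fixed-width typedefs are in scope):

```
/*
 * Sketch: map Solaris-style 64-bit atomics onto Windows Interlocked ops,
 * instead of letting them compile down to plain, racy integer arithmetic.
 */
static inline uint64_t
atomic_inc_64_nv(volatile uint64_t *target)
{
	return ((uint64_t)InterlockedIncrement64((volatile LONG64 *)target));
}

static inline uint64_t
atomic_dec_64_nv(volatile uint64_t *target)
{
	return ((uint64_t)InterlockedDecrement64((volatile LONG64 *)target));
}

static inline void
atomic_add_64(volatile uint64_t *target, int64_t delta)
{
	InterlockedExchangeAdd64((volatile LONG64 *)target, (LONG64)delta);
}
```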

lundman commented 4 years ago

Ok, so on KeSetEvent being 0xdeadbeefdeadbeef.

This appears to be the case where:

* thread A *

    void spl_mutex_exit(kmutex_t *mp)
    {
        VERIFY3P(mp->m_owner, != , 0xdeadbeefdeadbeef);  // Passes here, NOT freed
        mp->m_owner = NULL;  // Mutex available for use
        VERIFY3U(KeGetCurrentIrql(), <= , DISPATCH_LEVEL); // extra debug check, OK
        KeSetEvent((PRKEVENT)&mp->m_lock, SEMAPHORE_INCREMENT, FALSE);
    }

So when we enter KeSetEvent() the zio is in fact allocated and looks correct. The surprise here is that KeSetEvent() has a bit of blocking logic inside it somewhere, and can easily be preempted.

However, another thread now runs:

* thread B *

    zio_destroy(zio_t *zio)
    {
        mutex_destroy(&zio->io_lock);
        cv_destroy(&zio->io_cv);
        kmem_cache_free(zio_cache, zio);
    }

This destroys the mutex, frees the memory, and scribbles 0xdeadbeefdeadbeef all over the mutex and the zio. All of this happens before thread A has completed the call to KeSetEvent(); that KEVENT is now destroyed, and the call never returns.

Possibly spl_mutex_destroy() will need a little extra logic to ensure all calls to mutex_exit() have completed.

As a temporary hack, adding the following appears to help:

    void spl_mutex_destroy(kmutex_t *mp)
    {
        if (!mp) return;

        if (mp->initialised != MUTEX_INITIALISED)
            panic("%s: mutex not initialised\n", __func__);

    +   mutex_enter(mp);
    +   mutex_exit(mp);

        mp->initialised = MUTEX_DESTROYED;

        if (mp->m_owner != 0)
            panic("SPL: releasing held mutex");

In this case, spl_mutex_destroy() will pass the initialised test; it is, after all, "correct". And it will pass the owner == NULL test, as the owner was cleared before calling KeSetEvent().

Here I added grabbing the mutex and releasing it, to make sure any other mutex_exit() has left. This is a temporary hack; possibly better flow control is needed. Is KeEnterCriticalSection() around KeSetEvent() legal?

tamlin-mike commented 4 years ago

Hmm that appears to be a memory allocation blocking forever, while holding the rangelock. What is odd is that it is calling KeSetEvent() and getting stuck.

That seems like a recipe for disaster. Imagine an allocation attempt resulting in swapping (even if you request nonpaged, Mm might decide to evict some paged memory to make room for the nonpaged request). To make it more fun, imagine a pagefile on that very FS.

Perhaps reordering the code to avoid such a situation could be beneficial.

Possibly spl_mutex_destroy() will need a little extra logic to ensure all calls to mutex_exit() have completed. As a temporary hack ...

    mutex_enter(mp);
    mutex_exit(mp);

That hack will still most likely fail under extreme load. Remember the mantra of NT kmode "Always preemptible, always interruptible".

I think to solve this a lock on the entity holding the mutex is required.

KeEnterCriticalSection() around KeSetEvent() is that legal?

KeEnterCriticalRegion + KeSetEvent seems to be valid, but I fail to see the purpose. The only thing that would guard against would be two threads calling KeSetEvent at the same time.

What I believe is required is understanding why it enters zio_destroy in this state. Is there a missing refcount on some object (the one calling SetEvent would have increased a refcount somewhere, I'd hope)? Could this be related to the use of plain integers (and no explicit InterlockedInc/Dec, I'd presume) instead of atomics?

As usual, just some random thoughts.

datacore-tlaurent commented 4 years ago

If we need to guard against a mutex within the zio being used while another thread wants to destroy the zio, then in the kernel we have the I/O remove locks, which are good at exactly that. We should then protect the entire zio. Since the mutex here is owned by a zio, that remove lock could live in the zio structure. Each time processing starts for a zio, the remove lock would be taken (IoAcquireRemoveLock), and if that succeeds we are good to go. When we finish using the zio we would call IoReleaseRemoveLock.
The zio_destroy routine would call IoAcquireRemoveLock first, then make sure the zio is no longer in any queue from which it could be fetched, then call IoReleaseRemoveLockAndWait. After that last call we would be sure that no other thread has a reference to the zio and that the zio can no longer be retrieved from any queue, so it is then safe to free its memory. That's one way to do it.
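A rough sketch of that idea (illustrative only; the wrapper type and field names are hypothetical, not the actual ZFSin structures):

```
/* Sketch: guard a zio's lifetime with an IO_REMOVE_LOCK, as described above. */
typedef struct zio_guard {
	IO_REMOVE_LOCK	zg_rmlock;	/* hypothetical field in/next to the zio */
} zio_guard_t;

static void
zio_guard_init(zio_guard_t *zg)
{
	IoInitializeRemoveLock(&zg->zg_rmlock, 'oizG', 0, 0);
}

/* Every user of the zio brackets its work with acquire/release. */
static NTSTATUS
zio_guard_acquire(zio_guard_t *zg, PVOID tag)
{
	return (IoAcquireRemoveLock(&zg->zg_rmlock, tag));
}

static void
zio_guard_release(zio_guard_t *zg, PVOID tag)
{
	IoReleaseRemoveLock(&zg->zg_rmlock, tag);
}

/* Destroy path: dequeue the zio, then wait for every other holder to drain. */
static void
zio_guard_destroy(zio_guard_t *zg, PVOID tag)
{
	(VOID) IoAcquireRemoveLock(&zg->zg_rmlock, tag);
	/* ... make sure the zio can no longer be found in any queue ... */
	IoReleaseRemoveLockAndWait(&zg->zg_rmlock, tag);
	/* Now no other thread holds a reference; safe to free the zio. */
}
```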

lundman commented 4 years ago

I was unaware of IoAcquireRemoveLock and it looks like a good candidate, but we do need to fix the problem for all mutexes, as zio is probably just the most visible. So it would need to live in the mutex struct.

The difference between Unix and Windows here appears to be that on Unix mutex_exit() is a (near) instant call, whereas under Windows it appears (to me) to resume other-thread execution and gets preempted.

KeEnterCriticalRegion ... guard against would be two threads

Ah, I assumed it disabled preemption. But if it does not, then no, it won't do here.

It would be easy to use atomics to know when all mutex_exit() calls have finished, but ugly to sleep on them waiting for zero.
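For what it's worth, a sketch of that atomics idea applied directly to the mutex. m_exiters is an assumed extra volatile LONG field in kmutex_t (it does not exist today), and the busy-wait in destroy is exactly the ugly part:

```
/* Sketch only: count in-flight spl_mutex_exit() calls so destroy can wait. */
void
spl_mutex_exit(kmutex_t *mp)
{
	InterlockedIncrement(&mp->m_exiters);	/* assumed new field */
	mp->m_owner = NULL;
	KeSetEvent((PRKEVENT)&mp->m_lock, SEMAPHORE_INCREMENT, FALSE);
	InterlockedDecrement(&mp->m_exiters);
}

void
spl_mutex_destroy(kmutex_t *mp)
{
	/* Spin until no thread is still inside KeSetEvent() above. */
	while (InterlockedCompareExchange(&mp->m_exiters, 0, 0) != 0)
		YieldProcessor();

	mp->initialised = MUTEX_DESTROYED;
	/* ... existing teardown ... */
}
```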

why it enters zio_destroy in this state. Is there a missing refcount on some object

I've not looked, but the standard setup is:

thread A:

    cv_wait(zio->mutex);
    mutex_destroy(zio->mutex);
    zio_free(zio);

thread B:

    mutex_exit(zio->mutex);

Which is very common in ZFS. Thread A is waiting for the mutex to be released. Thread B releases it, but gets preempted inside KeSetEvent() as it appears to be directly involved in pushing A into motion. The code is correct in terms of how mutex and condvars work in Solaris. My implementation of Solaris mutex/condvars isn't quite right here.

lundman commented 4 years ago

Ah I see, so it appears that events in Windows belong to lists, perhaps accounting related, as well as a list of waiters. So the KeSetEvent() itself is simple:

https://github.com/ZoloZiak/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/ke/eventobj.c#L406

But if we need to Wait we enter here:

https://github.com/ZoloZiak/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/ke/waitsup.c#L325

So there are list operations, especially if there is more than one waiter. Which means one of our waiting threads can resume, free the zio and mutex, and blow away all the list data. Meanwhile, KiWaitTest() will loop to do more processing and get upset.

It would appear that passing TRUE to KeSetEvent() would perhaps work (we would not end up in KiWaitTest(), right?), but there is no reason to wait, unless we invent one.

tamlin-mike commented 4 years ago

Could the following perhaps be of help?

From https://docs.microsoft.com/en-us/windows-hardware/drivers/ifs/fsrtlenterfilesystem

Every file system driver entry point routine must call FsRtlEnterFileSystem immediately before acquiring a resource required in performing a file I/O request and call FsRtlExitFileSystem immediately afterward. This ensures that the routine cannot be suspended while running and thus block other file I/O requests.

Edit: On the subject of Remove Locks, it seems they are intended to protect the device object. While I haven't looked at the implementation, that could suggest they contain code less suitable ("heavier"/slower/larger) for general I/O handling.
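For reference, the canonical pattern from those docs looks roughly like this (generic ERESOURCE shown purely for illustration; this is not ZFSin code):

```
/*
 * Canonical FSD pattern: block normal kernel APCs so the thread cannot be
 * suspended while it holds a file-system resource.
 */
static void
example_fsd_operation(PERESOURCE resource)
{
	FsRtlEnterFileSystem();
	(VOID) ExAcquireResourceExclusiveLite(resource, TRUE);

	/* ... perform the file I/O work ... */

	ExReleaseResourceLite(resource);
	FsRtlExitFileSystem();
}
```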

lundman commented 4 years ago

We call FsRtlEnterFileSystem() waaay earlier, in the Dispatcher. But that appears more attached to handling the IRPs and FILE_OBJECTs, as opposed to the guts of the ZIO pipeline.

I can see two/three ways to fix this, but if there is a standard Windows way to address it, I think we should try that first.

tamlin-mike commented 4 years ago

Possibly related: Taking some time to look at the actual code, it seems worrying that arc_read places a zio_t in acb->acb_zio_head, followed by calling zio_wait with that very same rzio a few lines below.

zio_wait destroys the zio (which is something I found more than a little surprising, based on the function name).

From my understanding, now we have a pointer to a non-existing zio_t in a live arc_callback_t.

If my understanding is correct, this could definitely be a contributing factor to the LZ4 crash.

lundman commented 4 years ago

You create a zio, load it up, then you call either zio_wait() or zio_nowait() to make the I/O happen. The zio is then spent. However, the callbacks are done by zio.c before the zio is freed, and acb_zio_head is only used in the arc_read() logic. That looks normal; there is a video presentation about the zio framework from the dev summit if you want more details than I can provide :)
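A tiny sketch of that ownership rule (illustrative only; issue_zio_sketch() is a hypothetical wrapper, not a ZFS function):

```
/*
 * Sketch of the rule described above: once a zio has been handed to
 * zio_wait() or zio_nowait(), it is spent and must not be touched again.
 */
static int
issue_zio_sketch(zio_t *zio, boolean_t synchronous)
{
	if (synchronous)
		return (zio_wait(zio));	/* pipeline runs, callbacks fire, zio freed */

	zio_nowait(zio);		/* asynchronous; zio is already unusable here */
	return (0);
}
```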

datacore-tlaurent commented 4 years ago

This issue has drifted onto different subjects from its original description, and I am still seeing the original problem, which looks like a double free and still happens at the 0.22 level. So I'll close this one here and open a new one to track the PFN_LIST_CORRUPT (4e).

datacore-tlaurent commented 4 years ago

I can make it happen running I/O to a zvol; the pool does not have dedup on, nor compression, just a regular zpool with one disk in it.

lundman commented 4 years ago

Yes, so, we fixed the zio/mutex_destroy problem, but we are talking about a different problem? Perhaps closing this noisy issue and starting with a new one is a good idea.

datacore-tlaurent commented 4 years ago

Yes, the PFN_LIST_CORRUPT is a different problem and I still get it. I want to repro it using code straight from master, so the dump file refers to source files everyone can reference, instead of our DataCore fork. I'll open the new issue when I'm ready with all the information gathered. So far I have tried to instrument the code to catch the culprit, but with no success yet. Driver Verifier does not detect anything wrong during the run. Stay tuned, but you can close #201 now.