openzfsonwindows / ZFSin

OpenZFS on Windows port
https://openzfsonwindows.org
1.21k stars 68 forks source link

BSOD: CACHE_MANAGER #187

Closed inoperable closed 4 years ago

inoperable commented 4 years ago

Hello,

I'm getting a CACHE_MANAGER BSOD when trying to perform any kind of write operation on a dataset using the latest "nightly" build.

When using earlier builds (also official releases) I got a lock after few minutes of operation (both write/read). Windows locks at some point and that's it, nothing useful in the "log" (actually I don't know in which one to look at - since there are gazillions of possible subscriptions from microsoft making even less sense - maybe someone can point me where some useful info can be dig out)

After a lock I need to reboot (surprise...) or (surprise!) signout (although signouts / userspace exit seems not always to drop out of the lock) - but now after trying the latest build I get CACHE_MANAGER, which, from what I gathered can relate to driver not being signed (totally makes sense to BSOD within an CACHE_MANAGER error, right?)

Locking is annoying but can be frowned upon and cost a relog, BSOD on the other hand far less manageable - Disabled integrity checks and enabled test signing with bcdedit /set, what else can I do?

lundman commented 4 years ago

Yes, don't use the latest release, go back to OpenZFSOnWindows-debug-20191016.exe

Been fighting CcMgr all week, I think I have better understanding of how it should work, but now it can just soft-hang in CcInitialize which is perplexing. I might have to go find someone who knows how this should work.

inoperable commented 4 years ago

Got it. Hope you figure it out without too much effort, I need an anger management session after few days in visual studio so you are my personal hero for doing all this within microsoft uhmm... development tooling.

inoperable commented 4 years ago

2 semi-related questions.

1) does the write-cache setting influence the caching issue in any way? 2) makes sense to keep the cache ticked off or is this not related/indifferent to zfs directly?

By "write-cache" I mean the settings you find in device manager: [ DiskDrive > Properties > Policies ]

caching2
lundman commented 4 years ago

Ok I've been fighting CcMgr this entire time, I think I have the normal filesystem use down ok without crash, I'm still a little uncertain about vflush() - ie, export and umount. It seems quite slow waiting for Windows to vacate all the caches. So could it wait forever trying to unmount?

http://www.lundman.net/OpenZFSOnWindows-debug-20191120.exe

inoperable commented 4 years ago

I'll try to gather some debug info from the VM and take a close look when the locking occurs. Meanwhile, maybe you could take a look on how WinBtrfs manages caching? Maybe that will give you some idea/inspiration if there is any parallels to look at.

inoperable commented 4 years ago
2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CACHE_MANAGER (34)
    See the comment for FAT_FILE_SYSTEM (0x23)
Arguments:
Arg1: 0000000000000273
Arg2: ffffffffc0000420
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.Sec
    Value: 1

    Key  : Analysis.DebugAnalysisProvider.CPP
    Value: Create: 8007007e on RECTIFIER

    Key  : Analysis.DebugData
    Value: CreateObject

    Key  : Analysis.DebugModel
    Value: CreateObject

    Key  : Analysis.Elapsed.Sec
    Value: 1

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 71

    Key  : Analysis.System
    Value: CreateObject

DUMP_FILE_ATTRIBUTES: 0x8
  Kernel Generated Triage Dump

BUGCHECK_CODE:  34

BUGCHECK_P1: 273

BUGCHECK_P2: ffffffffc0000420

BUGCHECK_P3: 0

BUGCHECK_P4: 0

EXCEPTION_RECORD:  ffffffffc0000420 -- (.exr 0xffffffffc0000420)
Cannot read Exception record @ ffffffffc0000420

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXWINLOGON: 1

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  Far.exe

STACK_TEXT:  
ffff800d`ea16f388 fffff805`0be3213c : 00000000`00000034 00000000`00000273 ffffffff`c0000420 00000000`00000000 : nt!KeBugCheckEx
ffff800d`ea16f390 fffff805`19ab8964 : ffffa486`cf45ba20 ffffa486`00000000 ffffa486`00100000 00000000`00000001 : nt!CcCopyReadEx+0x1c46dc
ffff800d`ea16f440 ffffa486`cf45ba20 : ffffa486`00000000 ffffa486`00100000 00000000`00000001 ffff8c00`f8c40040 : ZFSin+0x258964
ffff800d`ea16f448 ffffa486`00000000 : ffffa486`00100000 00000000`00000001 ffff8c00`f8c40040 ffffa486`cf45ba50 : 0xffffa486`cf45ba20
ffff800d`ea16f450 ffffa486`00100000 : 00000000`00000001 ffff8c00`f8c40040 ffffa486`cf45ba50 ffffa486`cf6c9080 : 0xffffa486`00000000
ffff800d`ea16f458 00000000`00000001 : ffff8c00`f8c40040 ffffa486`cf45ba50 ffffa486`cf6c9080 ffffa486`ce980730 : 0xffffa486`00100000
ffff800d`ea16f460 ffff8c00`f8c40040 : ffffa486`cf45ba50 ffffa486`cf6c9080 ffffa486`ce980730 00000000`00100000 : 0x1
ffff800d`ea16f468 ffffa486`cf45ba50 : ffffa486`cf6c9080 ffffa486`ce980730 00000000`00100000 ffffa486`d001f9c0 : 0xffff8c00`f8c40040
ffff800d`ea16f470 ffffa486`cf6c9080 : ffffa486`ce980730 00000000`00100000 ffffa486`d001f9c0 00000000`00000000 : 0xffffa486`cf45ba50
ffff800d`ea16f478 ffffa486`ce980730 : 00000000`00100000 ffffa486`d001f9c0 00000000`00000000 ffffa486`8d41db10 : 0xffffa486`cf6c9080
ffff800d`ea16f480 00000000`00100000 : ffffa486`d001f9c0 00000000`00000000 ffffa486`8d41db10 00000000`00000000 : 0xffffa486`ce980730
ffff800d`ea16f488 ffffa486`d001f9c0 : 00000000`00000000 ffffa486`8d41db10 00000000`00000000 fffff805`00000000 : 0x100000
ffff800d`ea16f490 00000000`00000000 : ffffa486`8d41db10 00000000`00000000 fffff805`00000000 00000001`00000000 : 0xffffa486`d001f9c0

SYMBOL_NAME:  ZFSin+258964

MODULE_NAME: ZFSin

IMAGE_NAME:  ZFSin.sys

STACK_COMMAND:  .thread ; .cxr ; kb

BUCKET_ID_FUNC_OFFSET:  258964

FAILURE_BUCKET_ID:  0x34_ZFSin!unknown_function

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {b7987bc4-7cf3-67a1-c583-5d0346d04b45}

Followup:     MachineOwner
---------
inoperable commented 4 years ago

Well, mounting/unmounting does work, I can dir/ls the dataset, but it crashed when trying to copy (read) a file FROM the dataset TO the SystemDrive (ie C:). It's not an FAT file system though - it's the default NTFS, so where does the FAT_FILE_SYSTEM comes from?

CACHE_MANAGER (34)
    See the comment for FAT_FILE_SYSTEM (0x23)

FAT_FILE_SYSTEM (0x23) happens when there are FS errors, gonna check the systemdrive from Winre first and make sure there are no errors. Or does the FAT_FILE_SYSTEM here relates to the source - the zfs dataset?

UPDATE: Made sure that the drive is error free (chkdsk c: /x /f /r), scrubbed the source dataset, neither detected any errors, after mounting the dataset, Windows crashed exactly the same way as before (again, does not happen when mounting/unmounting or running dir/ls, happens immediately when trying to run copy r:\testfile.zip c:\ from command-line cmd.exe

lundman commented 4 years ago

Ah so that code bit in fs_read() after all. Ok, I'll restore it.

lundman commented 4 years ago

Ok, I restored the CcMgr call in fs_read and fs_write - fastfat has them, so clearly they are meant to be there. Should help with the reading problem I hope. http://www.lundman.net/OpenZFSOnWindows-debug-20191121.exe

inoperable commented 4 years ago

The packaged version you posted in your reply above seems (previous one) to work BUT, for some reason something changed in between and Windows cant find/import the dataset (neither when booting from VM or directly into Windows, Linux however can import all of them).

On the dataset that could be imported the driver works without any ill effects, need to let it run for some time but I copied files back/forth, checked some .zip files and made tried some md5sum's, all is fine.

I tried to force-import with by using -d -> zpool import -f -d "\\.\PhysicalDrive3" stuff, but this won't work. I'm still not sure which form is the proper one: "//./PhysicalDisk" ? "\.\\PhysicalDrive" ? "/dev/PhysicalDisk" ?

zpool command in windows lists the attached device as "/dev/PhysicalDrive3, Windows documentation is more twisted then Escher's Art..and I'm totally confused about which "device names" to use in Windows btw., I don't even know which command dig those names in Windows. PoweShell get-disk | get-partition but the Objects returned by those functions lack any Path property to be used (or I don't know which property to use). So wmic convulated output is the only way to get the proper device paths?

Gonna test the current release you posted. I'm not sure if it's the dataset configuration that's causing trouble (since I can't import it from Windows no more), I tried to copy/move/write and no BSOD happened - gonna check the new one.

lundman commented 4 years ago

Yes, the Windows version has no "-d" on import for disks, as devices can not live anywhere but the special hidden area. "-d" can be used with files. zpool import should list anything it can find, and just use zpool import PHYSICALDRIVEx to attempt import. No longer expecting backslashes or /dev .

inoperable commented 4 years ago

In short: BSOD immediately as in previous releases, meaning: 1) can import all datasets, 2) when trying to read/copy it BSOD's with CACHE_MANAGER,

1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CACHE_MANAGER (34)
    See the comment for FAT_FILE_SYSTEM (0x23)
Arguments:
Arg1: 0000000000000273
Arg2: ffffffffc0000420
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.Sec
    Value: 1

    Key  : Analysis.DebugAnalysisProvider.CPP
    Value: Create: 8007007e on RECTIFIER

    Key  : Analysis.DebugData
    Value: CreateObject

    Key  : Analysis.DebugModel
    Value: CreateObject

    Key  : Analysis.Elapsed.Sec
    Value: 1

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 71

    Key  : Analysis.System
    Value: CreateObject

    Key  : Dump.Attributes.InsufficientDumpfileSize
    Value: 1

DUMP_FILE_ATTRIBUTES: 0xc
  Insufficient Dumpfile Size
  Kernel Generated Triage Dump

BUGCHECK_CODE:  34

BUGCHECK_P1: 273

BUGCHECK_P2: ffffffffc0000420

BUGCHECK_P3: 0

BUGCHECK_P4: 0

EXCEPTION_RECORD:  ffffffffc0000420 -- (.exr 0xffffffffc0000420)
Cannot read Exception record @ ffffffffc0000420

BLACKBOXBSD: 1 (!blackboxbsd)

BLACKBOXNTFS: 1 (!blackboxntfs)

BLACKBOXPNP: 1 (!blackboxpnp)

BLACKBOXWINLOGON: 1

CUSTOMER_CRASH_COUNT:  1

PROCESS_NAME:  svchost.exe

STACK_TEXT:  
ffffd401`73fa7358 fffff801`3563213c : 00000000`00000034 00000000`00000273 ffffffff`c0000420 00000000`00000000 : nt!KeBugCheckEx
ffffd401`73fa7360 fffff801`42628a33 : ffffb004`9da39bd0 fffff801`00000000 ffffb004`0000000c fffff801`423d1019 : nt!CcCopyReadEx+0x1c46dc
ffffd401`73fa7410 ffffb004`9da39bd0 : fffff801`00000000 ffffb004`0000000c fffff801`423d1019 ffff8301`0b487af8 : ZFSin+0x258a33
ffffd401`73fa7418 fffff801`00000000 : ffffb004`0000000c fffff801`423d1019 ffff8301`0b487af8 ffffb004`9da39c00 : 0xffffb004`9da39bd0
ffffd401`73fa7420 ffffb004`0000000c : fffff801`423d1019 ffff8301`0b487af8 ffffb004`9da39c00 ffffb004`9f451080 : 0xfffff801`00000000
ffffd401`73fa7428 fffff801`423d1019 : ffff8301`0b487af8 ffffb004`9da39c00 ffffb004`9f451080 ffffb004`5efcd950 : 0xffffb004`0000000c
ffffd401`73fa7430 ffff8301`0b487af8 : ffffb004`9da39c00 ffffb004`9f451080 ffffb004`5efcd950 ffffb004`0000000c : ZFSin+0x1019
ffffd401`73fa7438 ffffb004`9da39c00 : ffffb004`9f451080 ffffb004`5efcd950 ffffb004`0000000c ffffb004`5f0f3608 : 0xffff8301`0b487af8
ffffd401`73fa7440 ffffb004`9f451080 : ffffb004`5efcd950 ffffb004`0000000c ffffb004`5f0f3608 ffffb004`a127b2c0 : 0xffffb004`9da39c00
ffffd401`73fa7448 ffffb004`5efcd950 : ffffb004`0000000c ffffb004`5f0f3608 ffffb004`a127b2c0 ffffb004`00000000 : 0xffffb004`9f451080
ffffd401`73fa7450 ffffb004`0000000c : ffffb004`5f0f3608 ffffb004`a127b2c0 ffffb004`00000000 00000000`00000000 : 0xffffb004`5efcd950
ffffd401`73fa7458 ffffb004`5f0f3608 : ffffb004`a127b2c0 ffffb004`00000000 00000000`00000000 00000000`00000000 : 0xffffb004`0000000c
ffffd401`73fa7460 ffffb004`a127b2c0 : ffffb004`00000000 00000000`00000000 00000000`00000000 ffffb004`00000000 : 0xffffb004`5f0f3608
ffffd401`73fa7468 ffffb004`00000000 : 00000000`00000000 00000000`00000000 ffffb004`00000000 fffff801`00000000 : 0xffffb004`a127b2c0
ffffd401`73fa7470 00000000`00000000 : 00000000`00000000 ffffb004`00000000 fffff801`00000000 00000001`00000000 : 0xffffb004`00000000

SYMBOL_NAME:  ZFSin+258a33

MODULE_NAME: ZFSin

IMAGE_NAME:  ZFSin.sys

STACK_COMMAND:  .thread ; .cxr ; kb

BUCKET_ID_FUNC_OFFSET:  258a33

FAILURE_BUCKET_ID:  0x34_ZFSin!unknown_function

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {b7987bc4-7cf3-67a1-c583-5d0346d04b45}

Followup:     MachineOwner
---------
lundman commented 4 years ago

Oh dang, left the trashcan test code on as well.

So I'm having issues replicating the read crash here, any special instructions?

lundman commented 4 years ago

Oh ah, if I copy something to zfs, then I can copy it off (as it is still cached) - but if I export/import, and try to copy something off zfs, it dies;

    nt!KeBugCheckEx+0x107   C/C++/ASM
    nt!CcCopyReadEx+0x193405    C/C++/ASM
    ZFSin!fs_read+0x483 [c:\src\zfsin\zfsin\zfs\module\zfs\zfs_vnops_windows.c @ 3015]  C/C++/ASM
    ZFSin!fsDispatcher+0xca1 [c:\src\zfsin\zfsin\zfs\module\zfs\zfs_vnops_windows.c @ 4643] C/C++/ASM
    ZFSin!dispatcher+0x178 [c:\src\zfsin\zfsin\zfs\module\zfs\zfs_vnops_windows.c @ 4766]   C/C++/ASM
lundman commented 4 years ago

I believe it is caused by:

 if (Length + FileOffset->QuadPart > FileObject->SectionObjectPointer->SharedCacheMap->FileSize.QuadPart))
    KeBugCheckEx(CACHE_MANAGER, 0x273, 0xFFFFFFFFC0000420, 0UL, 0UL);

So, what does that actually mean...

lundman commented 4 years ago
        if (!CcCopyReadEx(fileObject,
                &byteOffset,
                bufferLength,

 kd> dt byteOffset
   +0x000 QuadPart         : 0n0

 kd> dt bufferLength
0x40000

 kd> dt _SHARED_CACHE_MAP 0xffffb706`c550f790

 +0x008 FileSize         : _LARGE_INTEGER 0x0

So that does seem to track. I wonder if I have to set the real size when I call CcInitialiseCacheMap.. hmmm

lundman commented 4 years ago

Ok, now we always set the true size when setting up CcMgr (assuming I didn't miss any calls)

http://www.lundman.net/OpenZFSOnWindows-debug-20191122.exe

inoperable commented 4 years ago

No BSOD and all datasets are detected and imported. Seems you nailed it 👍

Gonna run with it for a while a report back if any locking happens.

UPDATE:

No BSOD, no locking, not any issues so far.