Open dima333750 opened 4 years ago
Привет,
-O dedup=on
Right now, there seems to be issues with dedup. Or perhaps, dedup makes issues happen more. Issue #201 (near the end) talks about one.
But PAGE_FAULT_IN_NONPAGED_AREA (50)
is new to me.
: ffff850a`cee673a0 00000000`00000000 : nt!KiPageFault+0x349
: 00000000`00000000 00000000`00000000 : nt!RtlInitUnicodeString+0x19
: 00000000`00000000 00000000`0000009a : ZFSin+0x1003b0
That is surprising. Perhaps sourceString
is wrong. I will look up where ZFSin+0x1003b0
is.
0: kd> dt ZFSin+0x1003b0
dmu_object_dnsize_from_db
void ZFSin!dmu_object_dnsize_from_db+0(
dmu_buf*,
int*)
Can't be right, no RtlInitUnicodeString()
there - guess I have to run the identical binary in the installer.
I did a new build to fix the mutex. But I am unsure where exactly it crashes here. You could try ".reload /f ZFSin.sys" to see if symbol matching works.
I do not really understand what mutex is
I reinstalled OpenZFSOnWindows-debug-20200128.exe from https://openzfsonosx.org/wiki/Windows_builds
again
"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe" -y "C:\Program Files\OpenZFS On Windows\symbols\" -z "C:\Windows\Minidump\012720-31609-01.dmp"
Nothing has changed, but I tried !sym noisy
and received
************* Path validation summary **************
Response Time (ms) Location
OK C:\Program Files\OpenZFS On Windows\symbols
DBGHELP: Symbol Search Path: c:\program files\openzfs on windows\symbols
DBGENG: ntoskrnl.exe - Partial symbol image load missing image info
DBGHELP: No header for ntoskrnl.exe. Searching for dbg file
DBGHELP: c:\program files\openzfs on windows\symbols\ntoskrnl.dbg - file not found
DBGHELP: c:\program files\openzfs on windows\symbols\exe\ntoskrnl.dbg - path not found
DBGHELP: c:\program files\openzfs on windows\symbols\symbols\exe\ntoskrnl.dbg - path not found
DBGHELP: .\ntoskrnl.dbg - file not found
DBGHELP: .\exe\ntoskrnl.dbg - path not found
DBGHELP: .\symbols\exe\ntoskrnl.dbg - path not found
DBGHELP: ntoskrnl.exe missing debug info. Searching for pdb anyway
DBGHELP: c:\program files\openzfs on windows\symbols\ntoskrnl.pdb - file not found
DBGHELP: c:\program files\openzfs on windows\symbols\exe\ntoskrnl.pdb - file not found
DBGHELP: c:\program files\openzfs on windows\symbols\symbols\exe\ntoskrnl.pdb - file not found
DBGHELP: ntoskrnl.pdb - file not found
*** WARNING: Unable to verify timestamp for ntoskrnl.exe
DBGHELP: nt - no symbols loaded
I also tested it under load and noticed that it works for a friend. Its RAM consumption increased from 2GB to 6.5GB with the same amount of data, but it doesn't seem to crash. I find it hard to understand.
OK ZFSin.sys -> great! Maybe you will get symbols next crash. Does it still crash for you?
ntoskrnl.exe "no symbols" -> does not matter, we only care about ZFSin.sys.
OK. I think I began to understand, now I just use
"C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe" -y "srv*c:\cache*https://msdl.microsoft.com/download/symbols;C:\Program Files\OpenZFS On Windows\symbols" -z "C:\Windows\Minidump\012820-6703-01.dmp"
then
!analyze -v
Also, I got SYSTEM_SERVICE_EXCEPTION (3b) when trying to use zpool destroy -f dedu (previously it just hung) I attach the result debug_zpool destroy -f.txt
Perfect, this is what we need.
fffff807`40e8dbdd 483988e0000000 cmp qword ptr [rax+0E0h],rcx ds:002b:deadbeef`deadbd57=????????????????
ZFSin!vflush+0x8d [C:\src\ZFSin\ZFSin\spl\module\spl\spl-vnode.c @ 1437]
ZFSin!zfs_vfs_unmount+0x198 [C:\src\ZFSin\ZFSin\zfs\module\zfs\zfs_vfsops.c @ 3161]
ZFSin!zfs_windows_unmount+0xbe [C:\src\ZFSin\ZFSin\zfs\module\zfs\zfs_vnops_windows_mount.c @ 1293]
ZFSin!zfs_ioc_unmount+0x26 [C:\src\ZFSin\ZFSin\zfs\module\zfs\zfs_ioctl.c @ 6208]
Hmmm. "deadbeef" means memory was freed, but it accessed this memory anyway.
1436 // skip vnodes not belonging to this mount
1437 if (mp && rvp->v_mount != mp)
1438 continue;
I'll have to think more on this.
After a fairly long real load, I again got PAGE_FAULT_IN_NONPAGED_AREA (50) Attach file debug_PAGE_FAULT_IN_NONPAGED_AREA_OpenZFSOnWindows-debug-20200128.txt
: nt!RtlInitUnicodeString+0x19
: ZFSin!file_rename_information+0x239 [C:\src\ZFSin\ZFSin\zfs\module\zfs\zfs_vnops_windows_lib.c @ 2334]
Perfect, I have something to go on. I'll let you know if I find something.
Ah, ok, the FileName
in the FILE_RENAME_INFORMATION
is not always null-terminated, and I pass it;
RtlInitUnicodeString(&uFileName, ren->FileName);
and it crashes. How annoying.
I tried different options and found strong memory usage when turning on LZ4 Command zpool create -O compress=lz4 -O casesensitivity=insensitive -O atime=off -o ashift=12 dedu PHYSICALDRIVE4 lz4.txt If a debugging method to find out the reason?
OK, rename fix is available here:
Well, it seems that it works, I no longer receive BSOD! But sometimes I get errors when writing files, it’s still hard for me to collect data about this (it seems to me that the OpenZFSOnWindows-release-20200115.exe version didn’t have this problem, now I’ll check it)
I think there are problems that are easy to reproduce:
1) High memory consumption, I have a pool created: zpool create -O casesensitivity=insensitive -O atime=off -o ashift=12 dedu PHYSICALDRIVE4 2.62G of data on disk consumes 3.5G of RAM
2) When data is deleted, it is not deleted from the pool. I even waited 12 hours, but after writing data to the pool, they remain there forever.
2) I can check on this. We had this problem in the past, perhaps it has come back.
What is symptom of "errors when writing files" ?
I checked the version of OpenZFSOnWindows-release-20200115.exe, it has no problems writing to a file. My sql programs crash without error, presumably when I write to a file or read, I'm not sure.
sql programs - so maybe mmap file (file sections). What software is that using? Is it something I can try myself ?
This problem is difficult to reproduce until I just discovered
in version http://www.lundman.net/OpenZFSOnWindows-debug-20200115.exe isatty () fix - this problem is missing
in version http://www.lundman.net/OpenZFSOnWindows-debug-20200123.exe Random IO error fix, dedup work - already exists
I will assume that the problem is this change: Random IO error fix, dedup work
1)
I can confirm deleting files does not remove space, the files still have SectionObjectPointers.DataSectionObject
and get put on unlinked_drain list. If you zpool export
zpool import
the file data will be freed. Attempting to correct this problem today.
2) "Random IO error" fixes that ZFS would issue IO to device, but check the wrong place for the result of IO. It would get wrong OK, and wrong error code. It is possible that IO failed, but the code would incorrectly return OK.
Files are deleted
I no longer see the error PAGE_FAULT_IN_NONPAGED_AREA (50)
When using zpool destroy -f, SYSTEM_SERVICE_EXCEPTION (3b) reappeared, the pool was empty, long before that I deleted all the data. debug_zpool destroy -f_bsod2.txt
Error echo test_text >> J:\1.txt - the zero file remained
I have not yet looked at -f force
for unmount/destroy, so not sure what it will do.
Will try your write test tomorrow.
On version OpenZFSOnWindows-debug-20200130.exe, I got PAGE_FAULT_IN_NONPAGED_AREA (50) when I tried to remove the driver. The driver could only be uninstalled by installing OpenZFSOnWindows-release-20200115.exe. PAGE_FAULT_IN_NONPAGED_AREA on OpenZFSOnWindows-debug-20200130.exe.txt During driver uninstallation, the pool was disconnected (unmounted)
Unload? OK. Since I rollback VMs I do not often test unloading. Thanks for reporting it.
Today I tested the version of OpenZFSOnWindows-debug-20200206.exe
ZFSin!spl_mutex_enter+0x18 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-mutex.c @ 108]
No what?, I fixed this one already! Stop having this crash!
ZFSin!spl_mutex_enter+0x18: cmp dword ptr [rax+20h],23456789h ds:002b:deadbeef`deadbf67=????????
: ZFSin!spl_mutex_enter+0x18 [C:\src\ZFSin\ZFSin\spl\module\spl\spl-mutex.c @ 108]
: ZFSin!dmu_sync_done+0x8f [C:\src\ZFSin\ZFSin\zfs\module\zfs\dmu.c @ 2031]
: ZFSin!arc_write_done+0xfda [C:\src\ZFSin\ZFSin\zfs\module\zfs\arc.c @ 7082]
: ZFSin!zio_done+0x230d [C:\src\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 4495]
Well, deadbeefdeadbf67
means we have freed the memory, then we enter spl_mutex_enter()
with it anyway, what is most unusual is that this is the zio_done()
thread itself, the thread that eventually frees the zio
.
Unless...
dmu.c:2031: mutex_enter(&db->db_mtx);
Ok this is something new, not the zio problem after all.
: ZFSin!dmu_sync_done+0x8f [C:\src\ZFSin\ZFSin\zfs\module\zfs\dmu.c @ 2031]
Having issues trying to figure this one out, I can not replicate it here.
Would it be possible to get the contents of cbuf at crash:
https://openzfsonosx.org/wiki/Windows_BSOD#Debug_Print_Buffer
or possibly all the stacks;
.logopen c:\src\stacks.txt ; !stacks 2; .logclose
I believe the stacks
command can take name pattern like ZFSin* if you want to hide other process names.
It is clear that the db
has been freed() before we enter dmu_sync_done()
. The only place that releases a db is dbuf_destroy()
which also grabs the mutex - so is it a starvation...
sorry for the long answer
dt ZFSin!cbuf
- received ???? Memory read error fffff80162bff000
then .writemem D:\cbuf.txt 0xffffe089`f0010000 L100000
.logopen D: \ stacks.txt; ! stacks 2; .logclose
And got it only
Opened log file 'D:\stacks.txt'
Proc.Thread .Thread Ticks ThreadState Blocker
Unable to get value of PsActiveProcessHead.Flink
Threads Processed: 0 Closing open log file D:\stacks.txt
I send the whole dump
[020720-6984-01.zip](https://github.com/openzfsonwindows/ZFSin/files/4203030/020720-6984-01.zip)
Not sure I still have the grasp of using minidumps from other people.
But looking at this minidump, it seems:
dt mp
Local var @ 0xffffcd0201a92610 Type kmutex*
0xdeadbeef`deadbf47
mutex_enter(&db->db_mtx);
dt db
Symbol not found at address 00000000000000db.
3: kd> dd db
00000000`000000db ???????? ???????? ???????? ????????
dmu_buf_impl_t *db = dr->dr_dbuf;
3: kd> dt dr
Local var @ 0xffffcd0201a92640 Type dbuf_dirty_record*
0xffffe589`44d00540
3: kd> dd dr
ffffcd02`01a92640 44d00540 ffffe589 e267f54f 97204a04
ffffcd02`01a92650 2f90408f 9ae16a3b deadbeef deadbeef
ffffcd02`01a92660 00000000 00000000 1a3d1019 fffff806
So it doesn't seem to be that db was freed, but rather dr->dr_buf
is wiped out. That there is a 0xdeadbeef right before is interesting too, possibly not initialised after alloc.
Bah, so dt db
in windbg will attempt to lookup address 0xdb. One should use dt -n db
3: kd> dt -n db
Local var @ 0xffffcd0201a92658 Type dmu_buf_impl*
0xdeadbeef`deadbeef
0xdeadbeef`deadbeef
+0x000 db : dmu_buf
+0x000 db_object : ??
+0x008 db_offset : ??
+0x010 db_size : ??
+0x018 db_data : ????
+0x020 db_objset : ????
+0x028 db_dnode_handle : ????
+0x030 db_parent : ????
+0x038 db_hash_next : ????
+0x040 db_blkid : ??
+0x048 db_blkptr : ????
+0x050 db_level : ??
+0x058 db_mtx : kmutex
3: kd> dd 0xffffcd0201a92658
ffffcd02`01a92658 deadbeef deadbeef 00000000 00000000
ffffcd02`01a92668 1a3d1019 fffff806 00000000 00000000
ffffcd02`01a92678 2f90408f 9ae16a3b 00000000 00000000
Which is a different thing. We have a valid db
- that is not the issue.
The first member in db is 0xdeadbeef, ie, db.db_object
which might not be so interesting.
I think then the mutex is at +0x58, which is:
3: kd> dd 0xffffcd0201a92658 + 0x58
ffffcd02`01a926b0 7fa61729 5878b321 00000000 00000000
ffffcd02`01a926c0 00000000 00000000 00000000 00000000
which implies it has not been initialised? But if we die in panic()
due to not initialised mutex (we need to see cbuf too) we should die more gracefully than PAGE_FAULT.
Stepping into mutex_enter
3: kd> dt mp
Local var @ 0xffffcd0201a92610 Type kmutex*
0xdeadbeef`deadbf47
+0x000 m_lock : mutex_t
+0x018 m_owner : ????
+0x020 initialised : ??
+0x024 set_event_guard : ??
Memory read error deadbeefdeadbf6b
3: kd> dd 0xffffcd0201a92610
ffffcd02`01a92610 deadbf47 deadbeef 1b1ce0f8 ffffe589
I wonder if one can tag cbuf
allocation to be part of minidumps.
My addresses and debuggers addresses do not match, I guess I made a mistake. But still not an initialised mutex.
Does not appear to be an API to tag memory allocations to be included in minidump. Seems changing to use "kernel memory" would include cbuf
. https://support.microsoft.com/en-us/help/254649/overview-of-memory-dump-file-options-for-windows
I'm going to assume the problem is "not initialised mutex" and see if I can work out how that can happen.
ZOL has an issue for db->db_mtx - not quite the same, but could be related. https://github.com/openzfs/zfs/issues/7814
Today I tested the latest version of OpenZFSOnWindows-debug-20200326 again, error SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M (1000007e) unfortunately remains. debug_27.03.2020.txt 032720-7093-01.zip
Hi, I speak Russian and use a translator. I apologize in advance for possible errors. I saw the latest release https://github.com/openzfsonwindows/ZFSin/releases/tag/0.21 and this is a huge progress, I installed it. But that's all, there is a problem that has long tormented me. I use the following line to create:
zpool create -O dedup=on -O compress=gzip-9 -O casesensitivity=insensitive -O atime=off -o ashift=12 dedu PHYSICALDRIVE4
In the pool I store temporary data that is perfectly deduplicated, but they are constantly replaced or deleted and this gives a high load of I / O and when the load reaches a certain point in time it causes explosive memory consumption. In part, I am able to reduce the likelihood of this problem occurring using gzip-9. I think this problem could be related to https://github.com/openzfsonwindows/ZFSin/commit/c14a7194b5d8dcf1bf6d2d7c7c030fc0d547c461 I tried to debug a memory dump using material: https://openzfsonosx.org/wiki/Windows_BSOD But I failed, at first I tried "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe" -y "C:\Program Files\OpenZFS On Windows\symbols\" -z "C:\Windows\Minidump\012720-31609-01.dmp" but got Symbol Loading Error later "C:\Program Files (x86)\Windows Kits\10\Debuggers\x64\windbg.exe" -z "C:\Windows\Minidump\012720-31609-01.dmp" it seems that something happened, I saved both output files and attach them together with the dump. debug Symbol Loading Error.txt debug.txt 012720-31609-01.zip
I can repeat BSOD to find the error if you write how to do it.