openzfsonwindows / openzfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

BSOD when copying a specific JPG file to a ZFS pool on Windows 10 #339

Open VykosX opened 6 months ago

VykosX commented 6 months ago

System information

Type                  Version/Name
Distribution Name     Microsoft Windows 10 Pro
Distribution Version  10.0.19045 Build 19045
Kernel Version        22H2
Architecture          x64
OpenZFS Version       zfs-windows-2.2.2-rc1

Describe the problem you're observing

BSOD on Windows 10 Pro 22H2 when copying a specific JPG file to a ZFS pool. Specifically the BSOD reads SYSTEM_SERVICE_EXCEPTION on OpenZFS.sys.


I've been trying out OpenZFS for Windows by backing up my personal files to a ZFS pool I created to test how robust and stable the native Windows implementation is. While copying my files over with FreeFileSync, I suddenly hit a BSOD. I have narrowed the cause down to a specific JPG file that crashes the driver when it is copied to the pool. I am attaching the file in both JPG and ZIP formats, in case GitHub recompresses the image, so that the crash can be reproduced and the root cause found.

I suspect there might be something unusual about the JPG in question (a 700x700 Folder.jpg album cover for the OST of the videogame Legend of Zelda - A Link to the Past). Perhaps it became partially corrupted over time from bad sectors or bit rot? The file opens fine on Windows, but it looks strangely low-res for its size, and I noticed that the thumbnail in Explorer renders only the top of the image and fills the rest with solid green.

Nevertheless you'd probably want any file to be able to be copied to the pool without issues, regardless of its contents, so I figured I should report this.

For reference, the pool was created with the following command:

zpool.exe create -O casesensitivity=insensitive -O normalization=formD -O compression=zstd -O atime=off -o ashift=12 Data PHYSICALDRIVE1

then mounted with

zpool import Data

Describe how to reproduce the problem

Copy the attached JPG file to a ZFS drive with ZSTD compression enabled (might be related? unsure.)

You should get a SYSTEM_SERVICE_EXCEPTION BSOD regardless of where the file is placed.

Folder Folder.zip


EDIT: Before uploading, I compared the hash of the attached JPG against the original on my drive and they match, so the attached image should trigger the BSOD as is.

andrewc12 commented 6 months ago

Thanks for the report.

Firstly can I ask you to check if the extracted version causes a BSOD? There might be a problem with metadata or alternate data streams, and they sometimes get lost when creating a zip file.

lundman commented 6 months ago

Stellar bug report, thanks. I will give this a go in the office and see if I can reproduce it.

lundman commented 6 months ago

    nt!KiPageFault+0x43d    C/C++/ASM
    OpenZFS!vnode_put+0xe [C:\src\openzfs\module\os\windows\spl\spl-vnode.c @ 1088]    C/C++/ASM
    OpenZFS!zfs_vnop_lookup_impl+0x2219 [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 1412]    C/C++/ASM
    OpenZFS!zfs_vnop_lookup+0x21c [C:\src\openzfs\module\os\windows\zfs\zfs_vnops_windows.c @ 1843]    C/C++/ASM

VykosX commented 6 months ago

> Thanks for the report.
>
> Firstly can I ask you to check if the extracted version causes a BSOD? There might be a problem with metadata or alternate data streams, and they sometimes get lost when creating a zip file.

Thanks for the insight!

I checked the metadata through XNViewMP and compared it with other similar cover images on my music folder and did not notice any anomalies.

I then checked the Alternate Data Streams with Nirsoft's AlternateStreamViewer and the local file does have the :Zone.Identifier:$DATA stream which Windows uses to tag files downloaded from the internet, but I checked the contents and they appear to be correct:

[ZoneTransfer]
ZoneId=3

I checked the rest of the albums folder and scanned the files therein for Alternate Data Streams and numerous other files came up with the same data stream (including other such Folder.jpg files) and they were copied correctly to the zpool while still preserving the data streams. So it seems unlikely to me that this could be the culprit.
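For anyone who wants to check such a stream programmatically, the two lines quoted above can be parsed with a few lines of C. This is only a minimal sketch: `parse_zone_id` is a hypothetical helper written for this thread, it works on text that has already been read from the stream, and it does not open the stream itself. In the Windows URL security zone numbering, a `ZoneId` of 3 is the Internet zone, which matches a file downloaded from the web.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the ZoneId value from the text of a
 * Zone.Identifier stream such as "[ZoneTransfer]\r\nZoneId=3\r\n".
 * Returns the zone number, or -1 if no ZoneId entry is found.
 */
static int parse_zone_id(const char *text)
{
    const char *entry = strstr(text, "ZoneId=");
    int id;

    if (entry == NULL || sscanf(entry, "ZoneId=%d", &id) != 1)
        return -1;
    return id;
}
```

Calling `parse_zone_id` on the contents shown above yields 3; a stream without a `ZoneId` line yields -1.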

My layman's guess is that this particular file contains some corrupt data, and that the compression algorithm assumes valid input and hits an error the driver does not handle gracefully, leading to a BSOD.

I have another album for the SNES version of the game's OST, and I'm attaching its cover for comparison (the corrupt cover is from the GBA gamerip of the game). I suspect both covers may once have been the same file. You can see this one is much more detailed while having the same dimensions, yet surprisingly its file size is only half that of the (purportedly) corrupt version. The fact that the corrupt one still renders in low detail could perhaps be due to interlacing or some kind of fallback preview?

Folder

> Stellar bug report, thanks. I will give this a go in the office and see if I can reproduce it.

Thanks a lot, I hope you are able to reproduce it!

lundman commented 6 months ago

Looks like it passes a RelatedFileObject along in Open (the file), with a filename :Zone.Identifier:$DATA (open stream), odd but not invalid. This means we do not have the Parent VP dvp set, and crash when we try to release it (release NULL).
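The failure mode described here, releasing a parent vnode pointer that was never set, can be illustrated with a small stand-alone sketch. Everything in it is hypothetical: `demo_vnode`, `demo_vnode_put` and `demo_release_parent` are stand-ins invented for illustration, not the driver's real types or functions, and this is not the change made in the actual fix. It only shows the general pattern of guarding the release when `dvp` may legitimately be NULL.

```c
#include <stddef.h>

/* Hypothetical stand-in for a vnode; not the driver's real structure. */
struct demo_vnode {
    int iocount;    /* outstanding references held on this vnode */
};

/*
 * Hypothetical analogue of a release routine: drops one reference.
 * It dereferences its argument unconditionally, so calling it with
 * NULL faults, which is the kind of crash seen in the stack trace.
 */
static int demo_vnode_put(struct demo_vnode *vp)
{
    vp->iocount--;
    return 0;
}

/*
 * Guarded release for the case where no parent was resolved, for
 * example when the open names a stream relative to an already open
 * file. Returns 1 if a reference was dropped, 0 if there was no parent.
 */
static int demo_release_parent(struct demo_vnode *dvp)
{
    if (dvp == NULL)
        return 0;
    demo_vnode_put(dvp);
    return 1;
}
```

With this guard, `demo_release_parent(NULL)` is a no-op instead of a fault, while a valid parent still has its reference dropped exactly once.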

lundman commented 6 months ago

2442735

lundman commented 6 months ago

OpenZFSOnWindows-debug-2.2.99-1-g24427350f6 has been placed in Releases. Give it a go.

derritter88 commented 6 months ago

Looks good for me - it passed the "magic" 414 MB copy mark where it usually crashed. Will keep watching it if there is another BSOD or if it finishes without any problems.

derritter88 commented 6 months ago

For me, your latest changes in version OpenZFSOnWindows-debug-2.2.99-1-g24427350f6 solved the BSOD issues I previously had - thanks for that!

derritter88 commented 6 months ago

Unfortunately, yet another BSOD. Today it happened with a different drive, if that matters (8 TB SATA III). I was stopping a copy job in order to delete the copied files.

Before that, ZFS had hung several times: it copied data, hung, continued copying, and so on.

Logs below: cbuf.txt stack.txt info.txt

VykosX commented 6 months ago

Apologies for the delay in responding, I wanted to get a little further in testing before I added my input.

For all intents and purposes, the last release seems to have addressed the bulk of the issues. I have run into one more BSOD since then while copying files, but I have not been able to narrow it down or reproduce it. The latest update did fix the original issue, and so far I've been able to transfer around 16 TB and about 5 million files to the zpool without issue. I'll keep an eye out for more BSODs or errors and report them as I go.

Great job and thank you very much for making ZFS on Windows natively a reality!

VykosX commented 6 months ago

There is one issue I have noticed, however. Listing all the files on the disk takes an inordinate amount of resources: well over 48GB of RAM on my machine. More concerning is that this RAM, presumably taken by the driver, is completely invisible to the system and does not show up as part of any process in Task Manager. Even more concerning is that this memory is never released, even after running zpool export on the drive. You need a full reboot to release it. Can anything be done about this? For reference, I am using WizTree to list the files, but I believe it may also happen with FreeFileSync.

lundman commented 6 months ago

That does sound like a memory leak. Could you make it suck up 48GB of RAM, then run the kstat.exe tool and attach the output here? Maybe we can see where it goes. If it is from directory listings, that will narrow the leak down.

VykosX commented 6 months ago

I had FreeFileSync run a full comparison between my source and destination drives (the destination being the zpool) to see if all the files I had been copying for almost a week had been moved correctly. I was watching my memory usage grow and grow in System Informer and sweating bullets hoping the system wouldn't crash before the comparison was over. Because the RAM being used (or leaked) is in the form of Physical Memory rather than virtual commit, I wasn't certain the pagefile would save me. Luckily, the comparison finished just shy of running out of RAM entirely. I've piped the output from kstat.exe into a text file and I'm attaching it along with a screenshot of the system after the comparison. You can see that System Informer reports only about 2GB being used for the entire system. Everything else is being consumed invisibly by the OpenZFS driver. I hope the log file is useful in finding this pesky leak!

kstat.txt Mem

lundman commented 6 months ago

Worst offenders are:

    mem_inuse                       1451720704 spl_default_arena
    mem_inuse                       8862535680 bucket_16384
    mem_inuse                       17639956480 bucket_32768
    mem_inuse                       25832443904 kmem_default
    mem_inuse                       25835241472 kmem_va
    mem_inuse                       26838171648 heap
    mem_inuse                       27418558464 bucket_heap

Which makes bucket_32768 stand out quite a bit, 16GB there, and next closest is 16384 with 8GB.
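The round figures can be checked against the kstat values with a quick conversion. The sketch below assumes that `mem_inuse` is reported in bytes (the thread does not state the unit) and uses binary gigabytes (GiB, 2^30 bytes); `bytes_to_gib` is a hypothetical helper name.

```c
#include <stdint.h>

/* Convert a byte count to GiB (1 GiB = 2^30 bytes). */
static double bytes_to_gib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

Under that assumption, 17639956480 for bucket_32768 comes to roughly 16.4 GiB and 8862535680 for bucket_16384 to roughly 8.25 GiB, in line with the 16GB and 8GB quoted above.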

derritter88 commented 6 months ago

Maybe linked to #333 ?

VykosX commented 3 months ago

I was wondering if there have been any further developments on this issue, especially with regard to the massive memory leaks? Thank you very much!