openzfsonwindows / ZFSin

OpenZFS on Windows port
https://openzfsonwindows.org

KERNEL_SECURITY_CHECK_FAILURE when importing pool #215

Open vmlemon opened 4 years ago

vmlemon commented 4 years ago

After spending some time testing the WSL changes (creating a tarball of /mnt/c) and running a Wireshark dump on a new pool, I found that eventually applications would stop responding, and window contents would be replaced with black boxes.

After rebooting, if I try to import the pool, I can reliably reproduce a BSOD, which leads me to suspect that data is being corrupted somewhere, prior to writing.

Using OpenZFS 20200219, on Windows 10 1909 (Build 18363.657).

MiniDumps are attached…

KSCF1.zip

vmlemon commented 4 years ago

From digging into this, there seem to be some bugs in the way that snapshots are handled in the OpenZFS Windows port (a regression?), since the pool history reports that several snapshots were made, if I access it under Linux - but most of them don't appear in the list of available file systems, and "zfs list" throws a bunch of I/O errors whilst enumerating them.

vmlemon commented 4 years ago

I also notice that the "zfs snapshot" command hangs the first time it gets invoked, and subsequent snapshots don't seem to be properly accounted for when running "zfs list -t all".

lundman commented 4 years ago

Oh that's interesting - so you suspect something with snapshots? I will give snapshots a run around the yard and see what happens.

vmlemon commented 4 years ago

Not sure if it makes a difference, but the pool was created entirely with the Windows OpenZFS driver, with compression and deduplication enabled, plus the dataset was set to be case-insensitive.

lundman commented 4 years ago

On a new pool, I can't see any basic snapshot problems:

# ./zpool create -O dedup=on -O casesensitivity=insensitive -O compression=lz4 -O atime=off -o ashift=12 tank PHYSICALDRIVE1

$ ./zfs snapshot tank@one

$ ./zfs list -t all
NAME       USED  AVAIL  REFER  MOUNTPOINT
tank       444K  18.9G   108K  /tank
tank@one      0      -   108K  -

$ ./zfs snapshot tank@two

$ ./zfs list -t all
NAME       USED  AVAIL  REFER  MOUNTPOINT
tank       444K  18.9G   108K  /tank
tank@one      0      -   108K  -
tank@two      0      -   108K  -

If you create a test pool, does the first snapshot command still hang?

vmlemon commented 4 years ago

Not really sure what to think - either it needs the pool to be aged for a while, or it's a quirk of the drive itself (although it was a brand-new, factory-fresh one that had just been unpacked, exclusively for testing) - but, from trying with a VHD on NTFS...

Created a new pool, on a 20GB VHD, with:

C:\Windows\system32>zpool create -O dedup=on -O casesensitivity=sensitive -O compression=lz4 Swirl PHYSICALDRIVE2
Expanded path to '\\?\PHYSICALDRIVE2'
DeviceName said OK with '*'
DriveGeometry said OK
LayoutInfo said OK
DiskExtents said NG

Attached to WSL1:

tyson@LAPTOP-E8D0SOEE:/mnt$ sudo mkdir /mnt/e
[sudo] password for tyson:
tyson@LAPTOP-E8D0SOEE:/mnt$ sudo mount -t drvfs -o metadata E: /mnt/e

Created a small TAR archive:

tyson@LAPTOP-E8D0SOEE:/mnt/e$ sudo tar --xattrs --preserve-permissions -cvf TestArchive.tar /mnt/c/Users/Lenovo/Links/ /mnt/c/Users/Lenovo/IntelGraphicsProfiles/
tar: Removing leading `/' from member names
/mnt/c/Users/Lenovo/Links/
/mnt/c/Users/Lenovo/Links/desktop.ini
/mnt/c/Users/Lenovo/Links/Desktop.lnk
/mnt/c/Users/Lenovo/Links/Downloads.lnk
/mnt/c/Users/Lenovo/IntelGraphicsProfiles/
/mnt/c/Users/Lenovo/IntelGraphicsProfiles/Brighten Video.man.igpi
/mnt/c/Users/Lenovo/IntelGraphicsProfiles/Darken Video.man.igpi
/mnt/c/Users/Lenovo/IntelGraphicsProfiles/Enhance Video Colors.man.igpi

Created one snapshot:

C:\Windows\system32>zfs snapshot Swirl@Test

NAME         USED  AVAIL  REFER  MOUNTPOINT
Swirl        133K  18,9G    44K  /Swirl
Swirl@Test      0      -    44K  -


Another:

NAME          USED  AVAIL  REFER  MOUNTPOINT
Swirl         133K  18,9G    44K  /Swirl
Swirl@Test       0      -    44K  -
Swirl@Test1      0      -    44K  -

NAME          USED  AVAIL  REFER  MOUNTPOINT
Swirl        1,47M  18,9G  1,26M  /Swirl
Swirl@Test       0      -    44K  -
Swirl@Test1      0      -    44K  -
Swirl@Test2  87,5K      -  1,24M  -


Started capturing traffic to a file, using Wireshark (10 interfaces)...

Took a few more snapshots:

NAME          USED  AVAIL  REFER  MOUNTPOINT
Swirl        2,68M  18,9G  1,82M  /Swirl
Swirl@Test       0      -    44K  -
Swirl@Test1      0      -    44K  -
Swirl@Test2  87,5K      -  1,24M  -
Swirl@Test3    27K      -  1,81M  -
Swirl@Test4      0      -  1,82M  -
Swirl@Test5      0      -  1,82M  -


Maybe the problem is with using a large-capacity USB drive (5TB) for the other pool, or a bug in the previous version of OpenZFS that I tried?

"zpool list" reports:

C:\Windows\system32>zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
NewRiver  4,53T  1,11T  3,42T        -         -     -  24%  1.13x  DEGRADED  -
Swirl     19,5G  4,89M  19,5G        -         -    0%   0%  1.00x  ONLINE    -


Used Chrome to download a 2MB+ file onto the drive, and took another snapshot...

NAME          USED  AVAIL  REFER  MOUNTPOINT
Swirl        13,9M  18,9G  12,4M  /Swirl
Swirl@Test       0      -    44K  -
Swirl@Test1      0      -    44K  -
Swirl@Test2  87,5K      -  1,24M  -
Swirl@Test3    27K      -  1,81M  -
Swirl@Test4      0      -  1,82M  -
Swirl@Test5      0      -  1,82M  -
Swirl@Test6   115K      -  11,9M  -


Obviously, file systems aren't perfectly deterministic beasts, so there's probably something else causing problems - but I'm unsure what's so special about the 5TB disk, that snapshots aren't working properly.

vmlemon commented 4 years ago

Also tried to download a file onto it using Edge, which seems to work, as does taking snapshots. Going to see what happens if I add a second dataset to the new pool.

vmlemon commented 4 years ago

Creating the second dataset worked, at least:

C:\Windows\system32>zfs create -o casesensitivity=sensitive -o driveletter=U Swirl/Second
filesystem successfully created, but not shared

vmlemon commented 4 years ago

Downloaded a file onto the new dataset, and things kinda look OK, so far...

C:\Windows\system32>zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
NewRiver              1,26T  3,27T  1,18T  /NewRiver
NewRiver/Insensitive   603M  3,27T   602M  none
Swirl                 50,8M  18,8G  49,6M  /Swirl
Swirl/Second           225K  18,8G   225K  /Swirl/Second

C:\Windows\system32>zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH    ALTROOT
NewRiver  4,53T  1,11T  3,42T        -         -     -  24%  1.13x  DEGRADED  -
Swirl     19,5G  51,2M  19,4G        -         -    0%   0%  1.00x  ONLINE    -

vmlemon commented 4 years ago

Managed to create a load of snapshots, on both volumes:

Swirl                             51,1M  18,8G  49,8M  /Swirl
Swirl@Test                            0      -    44K  -
Swirl@Test1                           0      -    44K  -
Swirl@Test2                       87,5K      -  1,24M  -
Swirl@Test3                         27K      -  1,81M  -
Swirl@Test4                           0      -  1,82M  -
Swirl@Test5                           0      -  1,82M  -
Swirl@Test6                        115K      -  11,9M  -
Swirl@Test7                        142K      -  28,3M  -
Swirl@Test8                         97K      -  38,1M  -
Swirl@Test9                         99K      -  38,1M  -
Swirl/Second                       225K  18,8G   225K  /Swirl/Second
Swirl/Second@Grabs                    0      -   225K  -
Swirl/Second@Grabs1                   0      -   225K  -
Swirl/Second@Grabs12                  0      -   225K  -
Swirl/Second@Grabs123                 0      -   225K  -
Swirl/Second@Grabs1234                0      -   225K  -

vmlemon commented 4 years ago

Probably need to keep poking at it, to see if I can reproduce it with this small test pool.

Not sure if memory gets exhausted, or there are race conditions happening somewhere, as you expand the size of the pool and throw more concurrent I/O at it...
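One way to probe that theory would be a loop that interleaves writes and snapshots on the small test pool - a minimal sketch for cmd.exe, assuming the Swirl pool from above is imported with its root dataset mounted at E:\ (the iteration count, file sizes, and snapshot names are arbitrary, not from this thread):

```shell
REM Hypothetical stress loop: create a 1 MiB file, then snapshot, 50 times over.
REM Run from an elevated prompt; %i would become %%i inside a batch file.
for /L %i in (1,1,50) do (
  fsutil file createnew E:\stress_%i.bin 1048576
  zfs snapshot Swirl@stress%i
)
zfs list -t all
```

If the hang is load-dependent, some iteration of the snapshot command should stall; if it only affects the 5TB USB pool, this loop should complete cleanly.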

vmlemon commented 4 years ago

Trying to export the pool hangs at "zunmount(Swirl,U:\ ) running", and "zpool list" claims that it's still "ONLINE"...

vmlemon commented 4 years ago

Can't close the cmd.exe window when it's stalled, and it seems to be possible to issue the export multiple times, as well as create snapshots, at least.

vmlemon commented 4 years ago

If it makes a difference, in terms of caching policy, the VHD SCSI device is set to "optimised for best performance", but the USB drive is set to "optimised for quick removal".

vmlemon commented 4 years ago

(And, Windows refuses to let me change it, on the virtual drive)

vmlemon commented 4 years ago

If I reboot (which happens cleanly, despite the "zpool export" stall) and reattach the VHD, it looks like I still can't change the caching policy - however, importing the ZFS pool works fine...

If I get a chance, I'll see what happens if I try importing the bad pool with "optimised for best performance" enabled.

vmlemon commented 4 years ago

Can also export the test pool, after rebooting:

C:\Windows\system32>zpool export Swirl
zunmount(Swirl,U:\ ) running
zunmount(Swirl,U:\ ) returns 0
zunmount(Swirl/Second,E:\ ) running
zunmount(Swirl/Second,E:\ ) returns 0

lundman commented 4 years ago

So if I read that right, it isn't easy to reproduce. I can definitely get stalls if I exhaust my system's memory; ZFS will just sort of... stop doing anything, but the rest of the machine is fine.

If you find something that you can reproduce with few steps, that would be helpful - but bugs don't always behave nicely like that.

vmlemon commented 4 years ago

Yeah, I noticed that I can get spurious checksum errors if I have a bad USB cable and leave things sitting for a while, or memory runs low, and I've seen Chrome hang when starting large downloads - but my data still seems fine, otherwise.

For instance, on a fairly new drive:


C:\Windows\system32>zpool status -v
  pool: NewRiver
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub canceled on Sun Jan 12 02:17:44 2020
config:

        NAME                   STATE     READ WRITE CKSUM
        NewRiver               DEGRADED     0     0     0
          /dev/physicaldrive1  DEGRADED     0     0     0  too many errors

errors: Permanent errors have been detected in the following files:

        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/fmli/fmli/ups
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/etc/fs/vxfs
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/etc/inet
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/fmli/fmli/text/De/error
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/usr/lib/vidi
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/usr/sadm
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/usr/share/lib
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/fmli/fmli/text/De/var
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/var/tmp
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/opt/dpt/bin
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/sbin
        D:\ /ConsoleTests1/Neuer Ordner (2)/Neuer Ordner/SNX/MINI_EXT/fmli/fmli/text/En
        NewRiver@MultiFace:<0x0>

lundman commented 4 years ago

Hmm, we definitely had a bug that would cause random errors like that (reading the wrong return code from I/O), but that was fixed near the start of the year, so I would not expect it to still be there. I guess I need a better way to see which version of ZFSin.sys is running, to confirm.

vmlemon commented 4 years ago

Bizarrely, the pool stays online if I accidentally yank/loosen the USB connection, but the I/O errors keep piling up - I'm unsure if that's a bug or a feature. Linux will just violently lock up, if I try that kind of thing.

lundman commented 4 years ago

Ah ok, so it's a loose cable? phew :)

vmlemon commented 4 years ago

I wouldn't be surprised, knowing the quality of cheap consumer laptops these days - I always feel uneasy about even slightly moving mine if I have an external drive connected, even with ZFS working its magic.

vmlemon commented 4 years ago

Those SINIX system files that I pulled from an ISO were uncorrupted, and still open fine, but the errors just appeared out of the blue, after moving a bunch of them around into another directory.

I invariably end up imaging my drives, onto new ones, after a few years of use, to shake out problems, anyway. ;)