microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

vhdx files becoming corrupted since 2.0.4.0 pre-release install #10609

Open jtabox opened 1 year ago

jtabox commented 1 year ago

Windows Version

Microsoft Windows [Version 10.0.22621.2361]

WSL Version

2.0.4.0

Are you using WSL 1 or WSL 2?

WSL 2

Kernel Version

5.15.123.1-1

Distro Version

Ubuntu 22.04

Other Software

No response

Repro Steps

I can't really reproduce this; I was mostly wondering whether anyone else has had their virtual hard disks gradually become corrupted with the pre-release version of WSL2.

I installed v2.0.4.0 two days ago and already had an Ubuntu 22.04 distro (the standard one from the Store) installed. After some time I suddenly started getting filesystem read-only errors. I've had this distro installed (same vhdx file) for almost a year now and never had a similar issue (or any other, for that matter). The only things that have changed are the pre-release version and two settings activated in .wslconfig, autoMemoryReclaim=gradual and sparseVhd=true (I also ran --manage --set-sparse true for my already existing image).

Google said it's probably a corrupted disk image. e2fsck found errors and supposedly repaired them, but the read-only errors persisted. I loaded a copy of the vhdx file that I had from before the pre-release install, and soon enough it also started throwing read-only errors. The debug console showed the root filesystem was being mounted with errors; I would correct them with e2fsck, but they kept coming back.

I ended up nuking both files and did a fresh distro install. It went fine, but at some point it started throwing corruption errors too; this time dpkg wouldn't run because of corrupted files. I've now spent the last two days uninstalling and reinstalling WSL and testing out distros, but I keep having the same issue.

I wonder if it could be the sparseVhd option. Has anyone else had similar issues? I've deactivated the option for now and am watching to see whether I get a corrupted file again; if I do, I'll probably revert to the previous release version.
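
For anyone who wants to check quickly whether the sparse option is active on their machine, here is a minimal Python sketch that parses the text of a .wslconfig file and reports the two settings mentioned above. It assumes the settings sit under an `[experimental]` section and that the file is plain INI; the function names and the sample text are hypothetical and not taken from this report.

```python
import configparser


def experimental_flags(wslconfig_text):
    """Return the options found under [experimental] in .wslconfig-style text."""
    # interpolation=None so that values containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    # Keep key case as written (configparser lowercases keys by default).
    parser.optionxform = str
    parser.read_string(wslconfig_text)
    if not parser.has_section("experimental"):
        return {}
    return dict(parser.items("experimental"))


def sparse_vhd_enabled(wslconfig_text):
    """True if sparseVhd is set to true (key and value compared case-insensitively)."""
    for key, value in experimental_flags(wslconfig_text).items():
        if key.lower() == "sparsevhd":
            return value.strip().lower() == "true"
    return False


# Hypothetical sample mirroring the two settings described in the report.
sample = """\
[experimental]
autoMemoryReclaim=gradual
sparseVhd=true
"""

print(sparse_vhd_enabled(sample))
```

To use it against a real configuration, read the file from the user profile directory and pass its contents to `sparse_vhd_enabled`. Note that this only inspects the global setting; it says nothing about whether an existing image was switched with --set-sparse.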

Expected Behavior

Mainly I'd expect my vhdx files not to become corrupted 😅

Actual Behavior

They became corrupted.

Diagnostic Logs

No response

benhillis commented 1 year ago

@jtabox - is it possible your distro vhd is full?

OneBlue commented 1 year ago

/logs

jtabox commented 1 year ago

> @jtabox - is it possible your distro vhd is full?

No, I don't think so. As I said, this happened with fresh installs of Ubuntu 22.04 directly from the Store, so I hadn't installed many things. Maybe a miniconda installation at most; the vhdx files were never above 2 GB.

As for logs, I haven't had a similar issue since I deactivated sparseVhd in the configuration file. I've installed a lot of things by now, the image is at 30 GB at the moment, and it seems to be working fine, so I don't know how much use any logs would be now. I'm not sure if my corruption issues were directly related to the sparseVhd option, or indirectly, in some obscure way. But the fact is I haven't had any corruption for the last day or so, whereas previously, with sparseVhd enabled, the image would become corrupted within an hour or two. I'll try to make a backup of my working installation, re-enable sparseVhd, and see if the issues come back.

OneBlue commented 1 year ago

Thank you @jtabox. Can you try to reproduce the issue under log collection? We'd need to see logs covering the period in which the disk goes from healthy to corrupt in order to root-cause the issue.

/logs

ASleepyCat commented 1 year ago

I've also had this issue when enabling sparseVhd and --set-sparse on Ubuntu. It's also affecting files outside the VM for me:

These issues started from the very first 2.0.0 pre-release version.

Edit: My drive got corrupted again, although this time I had disabled sparseVhd on my distro. I guess my C:/ drive corruption issues are unrelated?

NGRhodes commented 1 year ago

This is a clean Win11 install from today, fully updated and running 2.0.4, with sparseVhd in my .wslconfig and --set-sparse applied to Ubuntu.

I run touch test successfully, activate a conda env, and try running PyCharm from Windows (remote connection to WSL2). Then I try touch test2 and get a read-only error. All of this happens within a few minutes. Here are my logs for the above.

WslLogs-2023-10-11_22-06-43.zip

This is chkdsk straight after:


The type of the file system is NTFS.

WARNING!  /F parameter not specified.
Running CHKDSK in read-only mode.

Stage 1: Examining basic file system structure ...
Attribute list for file 6196 is corrupt.
Attribute list for file 6197 is corrupt.
  270592 file records processed.
File verification completed.
 Phase duration (File record verification): 5.54 seconds.
File record segment 1783E is an orphan.
File record segment 1783F is an orphan.
File record segment 2FBD0 is an orphan.
File record segment 2FBD1 is an orphan.
  8188 large file records processed.
 Phase duration (Orphan file record recovery): 10.84 milliseconds.

Errors found.  CHKDSK cannot continue in read-only mode.

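
When comparing several chkdsk runs, it can help to pull out only the lines that signal damage. The following Python sketch does that for saved chkdsk output. The patterns are assumptions derived solely from the messages visible in the log above (a corrupt attribute list, orphaned file record segments, and the final "Errors found" line); other chkdsk messages are not covered, and the function name is hypothetical.

```python
import re

# Patterns taken only from the messages visible in the log above.
CORRUPTION_PATTERNS = [
    re.compile(r"is corrupt\.", re.IGNORECASE),
    re.compile(r"is an orphan\.", re.IGNORECASE),
    re.compile(r"^Errors found\.", re.IGNORECASE),
]


def chkdsk_findings(output):
    """Return the lines of chkdsk output that match a known corruption message."""
    findings = []
    for line in output.splitlines():
        stripped = line.strip()
        if any(p.search(stripped) for p in CORRUPTION_PATTERNS):
            findings.append(stripped)
    return findings


# A shortened excerpt of the output quoted above.
sample = """\
Stage 1: Examining basic file system structure ...
Attribute list for file 6196 is corrupt.
  270592 file records processed.
File record segment 1783E is an orphan.
Errors found.  CHKDSK cannot continue in read-only mode.
"""

for finding in chkdsk_findings(sample):
    print(finding)
```

An empty result only means none of these three messages appeared; it is not proof that the volume is healthy.
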
jtabox commented 1 year ago

@NGRhodes So you're getting similar errors, I assume? With sparseVhd active? At least I'm not the only one; I haven't seen any similar feedback, so I was worried it was something specific to my PC. Do you use antivirus software besides Microsoft's Defender? I've installed Avast Free Antivirus recently, and it's been a bit too eager to block and meddle in stuff in general, so I was wondering if it's related in some way.

@ASleepyCat Luckily I haven't had any issues with anything outside the vhdx file getting corrupted. Gotta admit it sounds a bit far-fetched; I'd assume sparseVhd only affects the vhdx files, though I might be totally wrong here.

NGRhodes commented 1 year ago

I ran wsl --manage Ubuntu --set-sparse false and I can still reproduce the error.

@jtabox - I have tried with Kaspersky and Windows Defender, and the drive goes read-only in both cases.

zirco77 commented 1 year ago

I had a similar issue

Context:

Then I started to get random problems and did a few shutdowns/restarts of WSL, until I figured out that the file system was locking into read-only mode after less than a minute of use following each "reboot" of the WSL distro.

Attempts to fix:

I ended up copying most of my data/config files from the first distro to the second (Ubuntu 22.04 as well), reinstalled what I needed, and deleted the first one. It was simply broken.

I've used the second distro every day for over a week and it's working totally fine under WSL 2.0.3. I strongly suspect --set-sparse true is the cause, and I didn't take any chances by enabling it on the second distro. I haven't had time to try to reproduce the issue, though.

zavocc commented 1 year ago

This also marks areas of the C: drive as dirty! In my case, setting sparse not only caused read-only errors everywhere on the distro filesystem but also caused minor corruption of the host filesystem: chkdsk (run in Windows RE) reports free space that could not be properly freed up.

I did, however, manage to fix the read-only file system errors by running e2fsck from the WSL system distro (wsl --system). I mounted the vhdx with wsl --mount --vhd .\ext4.vhdx --bare and then ran e2fsck /dev/sdc -f -y, and it works, though it can corrupt some files (in my case, oh-my-zsh now prints a lot of errors).
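
The two steps described above can be written down as a small Python sketch that only builds and prints the command lines, without running anything. Passing e2fsck directly as an argument to `wsl --system` in a single invocation is an assumption (the comment runs it from inside the system distro), the device name differs between machines, and the function name is hypothetical. Since the repair itself can damage files, as noted above, it is safer to work on a copy of the vhdx.

```python
def repair_commands(vhdx_path, device):
    """Build (but do not run) the mount and fsck commands from the comment above.

    vhdx_path: path to the distro's ext4.vhdx on the Windows side.
    device:    block device the bare-mounted disk shows up as (e.g. /dev/sdc);
               this varies per machine and must be checked before use.
    """
    # Attach the vhdx to the WSL 2 VM without mounting any filesystem on it.
    mount = ["wsl", "--mount", "--vhd", vhdx_path, "--bare"]
    # Assumption: run e2fsck through the system distro in one invocation.
    fsck = ["wsl", "--system", "e2fsck", device, "-f", "-y"]
    return [mount, fsck]


for cmd in repair_commands(r".\ext4.vhdx", "/dev/sdc"):
    print(" ".join(cmd))
```

Printing the commands first makes it possible to confirm the device name before anything touches the disk.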

ChGen commented 8 months ago

I experimented with pre-release versions of WSL2 and the sparseVhd option too, and I also ran fstrim in WSL2. It seems that the NTFS file system on my host, Windows 11 23H2, is now quite corrupted (and the ext4 vhd too, btw), to the point that sfc and dism cannot repair it. Quite dangerous stuff...

AlexeyMatskevich commented 7 months ago

I enabled sparseVhd a few months ago. For the last month, my filesystem in Ubuntu has been getting corrupted every other programming session, especially when running Docker Desktop. Prior to enabling this option, I had been using this system for over a year and had no problems. The Windows file system has also started to get corrupted when I use WSL; when I don't use WSL, this behaviour is not observed.

jtabox commented 7 months ago

I love WSL as a concept and for the amazing utility it offers for free, and I truly appreciate the work being poured into it. But honestly, I'm staying as far away from sparseVhd as humanly possible, at least for the time being.

Since I opened this issue a few months ago, every one of the 3-4 times I changed my mind and decided to give sparseVhd one more try, it has always ended with me straight up deleting the test distro's vhdx file within the first hour of use and having to create a new one from scratch (after deactivating sparseVhd of course) because it's impossible to fix its corruption issues.

There surely must be some kind of interaction between sparseVhd and something on my part, but I can't figure out what it is, after multiple tries. Luckily the host system doesn't seem to have been corrupted so far, but I don't dare use my main Windows PC for my tries.

BtbN commented 7 months ago

Just chiming in here to say that I've observed exactly what's being described here as well. Any distro with sparseVhd enabled will eventually suffer FS corruption. It also resulted in FS corruption of the host NTFS, to the point that I was almost about to throw out my SSD.

Krmloo commented 6 months ago

Same problem, both with WSL2 corruption and the host drive.

btrude commented 6 months ago

I am also experiencing the same WSL and host corruption as everyone else since switching to --set-sparse true.

widewind2015 commented 5 months ago

Unfortunately, my vhdx file gets corrupted after a few hours when I enable --set-sparse true.

BtbN commented 5 months ago

> unfortunately, my vhdx file gets corrupted after hours when I enable --set-sparse true.

Run a chkdsk on your host fs while you still can, and delete any vhds that were in sparse mode.

Krmloo commented 5 months ago

So I've managed to somewhat curb the corruption

CheyenneForbes commented 4 months ago

My VHDXs are showing 0 bytes also, can the data be recovered?

albertocavalcante commented 3 months ago

It has been almost a year since this feature was released, and from what it looks like we still have no fix for the corruption problem?

The recommendation at https://github.com/MicrosoftDocs/WSL/issues/1855 should at least be changed.

jtabox commented 3 months ago

> It has been almost a year since this feature has been released and for what it looks like we have no fix to the corruption problem yet?
>
> The recommendation at MicrosoftDocs/WSL#1855 should at least be changed.

Ever since I opened this thread, I've been consistently and periodically getting notifications of a new post here, so I'm really curious what the cause might be. Still, we're a small minority that's getting the corruption issue, so I assume there must be something specific in our PCs that interacts with WSL in such a catastrophic way. There would be way more open issues if this was a widespread problem.

At this point I've just given up on the sparseVhd option completely, and if I'm being honest, even if the issue is fixed in a future update, I still won't be activating the option. The consequences are way too annoying and disruptive, and I don't have any spare PCs to test on.

So as long as sparseVhd is not implemented as a default option, I'm fine with it taking time to resolve.

BtbN commented 3 months ago

This is still an experimental feature for which you need to go out of your way to set a flag. And then when your disk corrupts, you also need to notice it, and make the connection that this is the cause.

For me, on multiple systems, turning on the sparseVhd feature very reliably corrupts the filesystem of both guest and host, so I highly doubt it's system dependent.

aont commented 3 months ago

Just FYI.

I was also facing this issue and gave up using sparseVhd. To me, it seemed that using Docker made the corruption more frequent. Docker may not be the direct cause, but it could be key to reproducing this issue.

ChGen commented 3 months ago

Yes, I've noticed this issue while playing with docker and fstrim commands.

BtbN commented 3 months ago

Docker probably just creates and deletes A LOT of files, so it gives the sparse stuff a lot more work to do and a lot more chances to make a mess.

devilhyt commented 3 months ago

Same issue here.

I set the VHD to sparse, got filesystem read-only errors in WSL, fixed them with e2fsck. Then Win11 reported a hard drive issue, and after the automatic repair, I couldn’t boot into it anymore🥲.

MtkN1 commented 2 months ago

+1

After setting sparseVhd=true and wsl --manage <distro> --set-sparse true, my WSL became read-only and eventually corrupted my Windows system. Despite clean installing Windows and rebuilding the WSL environment, it became read-only and corrupted again. I'm also using Docker Desktop.

While WSL has saved me a lot of time, this option has cost me a lot of time again 😇

zavocc commented 2 months ago

This issue has been open, and fairly heated, for almost a year, and none of the WSL developers seems to be looking after it.

ChGen commented 7 hours ago

Reproduced it again on Win 11 WSL2 setup with host corruption. Critically dangerous feature!