microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

WSL2 corrupts ext4 filesystem #5895

Open livius-ungureanu opened 3 years ago

livius-ungureanu commented 3 years ago

Environment

Windows build number: `[Environment]::OSVersion` reports Platform: Win32NT, Version: 10.0.19041.0, VersionString: Microsoft Windows NT 10.0.19041.0 (ServicePack is empty)

Your Distribution version: `lsb_release -r` reports Release: 20.04

Whether the issue is on WSL 2 and/or WSL 1: WSL 2; `cat /proc/version` reports Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020

Steps to reproduce

I am running the Linux version of IntelliJ in WSL2, connected to an X410 server for the GUI. While IntelliJ is running some apps, WSL2 suddenly stops. After I start WSL2 again, I see that:

- the filesystem has also corrupted some of my files:

cat someProjectFile 2f��)��l�l�.;�K{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/....



**WSL logs**: 

# Expected behavior

WSL2 should not corrupt the ext4 file system. This makes WSL2 quite unreliable; it would be good to have it fixed as soon as possible.

# Actual behavior
Every 2-3 days the ext4 file system gets corrupted.

Jessidhia commented 3 years ago

I have also encountered this when yarn installing large dependency trees; especially when large native library builds are involved (e.g. when puppeteer is a dependency). After the build finishes, / is usually already remounted ro, e2fsck finds errors to correct, and even after restarting wsl (wsl --shutdown cycle) several files, usually my shell history file and some files inside node_modules, are corrupted; the latter forcing the install to be re-done, which leads to non-deterministic loops of manually installing, rechecking, restarting, until everything works.

onomatopellan commented 3 years ago

The only corruption problem I've had so far with WSL2 was once when my vhdx expanded while little disk space was available on the system drive. By default WSL2 uses a dynamic vhdx that can grow up to 256 GB, but it seems the Linux distro never knows what's really happening on the host disk. If that was the real problem, at least in the latest Insider build you can mount an external disk to avoid this happening again, for example when compiling a big project.

Having little disk space available on C: could be one reason. Another could be that the vhdx was fully expanded and trying to write more to it caused the corruption. To test that, try expanding the vhdx disk size and see if it lasts more days before corruption occurs. https://docs.microsoft.com/en-us/windows/wsl/compare-versions#expanding-the-size-of-your-wsl-2-virtual-hardware-disk
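
Both conditions above (host drive nearly full, vhdx near its cap) can be checked from the Windows side before a big build. The following is only a rough sketch: the 256 GB cap is the default mentioned in this thread, the function name `vhdx_headroom` is made up for illustration, and the on-disk file size of a dynamic vhdx is only an approximation of how far it has expanded.

```python
import os
import shutil

# Assumption: 256 GB is the default cap mentioned in this thread.
DEFAULT_VHDX_CAP_BYTES = 256 * 1024**3


def vhdx_headroom(vhdx_path, cap_bytes=DEFAULT_VHDX_CAP_BYTES):
    """Report how far the vhdx file can still grow (hypothetical helper)."""
    size = os.path.getsize(vhdx_path)
    # Free space on the volume that holds the vhdx file.
    host_free = shutil.disk_usage(os.path.dirname(os.path.abspath(vhdx_path))).free
    room_below_cap = max(cap_bytes - size, 0)
    return {
        "vhdx_bytes": size,
        "room_below_cap": room_below_cap,
        "host_free_bytes": host_free,
        # The file can only grow while both limits leave room.
        "growable_bytes": max(min(cap_bytes - size, host_free), 0),
    }
```

If `growable_bytes` is small compared with what a build is expected to write, either limit could be the one that gets hit first.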

livius-ungureanu commented 3 years ago

Some relevant facts before running into this:

I cannot see other relevant facts.

Today I've run again into this:

[    0.874250] Adding 4194304k swap on /swap/file.  Priority:-2 extents:2 across:4202496k
[    1.157347] JBD2: Invalid checksum recovering block 71532 in log
[    1.159743] JBD2: recovery failed
[    1.159745] EXT4-fs (sdb): error loading journal
[    1.300933] JBD2: Invalid checksum recovering block 67930 in log
[    1.301045] JBD2: recovery failed
[    1.301047] EXT4-fs (sdb): error loading journal
[    1.407173] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[    2.843982] JBD2: Invalid checksum recovering data block 51785 in log
[    3.058614] JBD2: recovery failed
[    3.058620] EXT4-fs (sdb): error loading journal
[    4.419174] JBD2: Invalid checksum recovering data block 9776 in log
[    4.613561] JBD2: Invalid checksum recovering data block 1879416 in log
[    4.709278] JBD2: recovery failed
[    4.709284] EXT4-fs (sdb): error loading journal
[    4.884286] JBD2: journal transaction 586032 on sdb-8 is corrupt.
[    4.884288] EXT4-fs (sdb): error loading journal
[    5.444378] JBD2: journal transaction 588851 on sdb-8 is corrupt.
[    5.444380] EXT4-fs (sdb): error loading journal
[    6.768213] JBD2: Invalid checksum recovering data block 51785 in log
[    7.118482] JBD2: recovery failed
[    7.118488] EXT4-fs (sdb): error loading journal
[    7.141381] ERROR: MountExt4:1659: mount(/dev/sdb) failed 5
[   10.670841] JBD2: Invalid checksum recovering block 74456 in log
[   10.671840] JBD2: recovery failed
[   10.671842] EXT4-fs (sdb): error loading journal
[   11.460877] JBD2: Invalid checksum recovering block 79887 in log
[   11.462274] JBD2: recovery failed
[   11.462276] EXT4-fs (sdb): error loading journal
[   12.885092] JBD2: Invalid checksum recovering data block 51785 in log
[   12.887995] JBD2: Invalid checksum recovering data block 51785 in log
[   12.895065] JBD2: Invalid checksum recovering data block 51785 in log
[   12.913122] JBD2: Invalid checksum recovering data block 51785 in log
[   13.198654] JBD2: recovery failed
[   13.198657] EXT4-fs (sdb): error loading journal
[   14.461139] EXT4-fs (sdb): 1 orphan inode deleted
[   14.461141] EXT4-fs (sdb): recovery complete
[   14.472229] EXT4-fs (sdb): mounted filesystem with ordered data mode. Opts: discard,errors=remount-ro,data=ordered
[   14.574547] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 237: bad block bitmap checksum
[   14.576137] Aborting journal on device sdb-8.
[   14.577767] EXT4-fs (sdb): Remounting filesystem read-only
[   14.580144] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 243: bad block bitmap checksum
[   14.584150] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 244: bad block bitmap checksum
[   14.586089] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 246: bad block bitmap checksum
[   14.589313] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 248: bad block bitmap checksum
[   14.591079] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 249: bad block bitmap checksum
[   14.593645] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 250: bad block bitmap checksum
[   14.595744] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 252: bad block bitmap checksum
[   14.597753] EXT4-fs error (device sdb): ext4_validate_block_bitmap:376: comm init: bg 254: bad block bitmap checksum

I will have to re-install wsl2 as it has become painful.
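
For anyone comparing logs across occurrences, a small script can tally the kinds of errors in a dmesg dump like the one above. This is just a sketch: the category labels are invented for illustration, and the patterns only cover the message shapes that appear in this log.

```python
import re
from collections import Counter

# Matches lines such as "[   14.577767] EXT4-fs (sdb): Remounting filesystem read-only".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Labels are hypothetical; patterns are taken from the messages in the log above.
PATTERNS = {
    "journal_checksum": re.compile(r"JBD2: Invalid checksum recovering"),
    "journal_recovery_failed": re.compile(r"JBD2: recovery failed"),
    "journal_load_error": re.compile(r"EXT4-fs \(\w+\): error loading journal"),
    "bitmap_checksum": re.compile(r"bad block bitmap checksum"),
    "remounted_read_only": re.compile(r"Remounting filesystem read-only"),
}


def summarize_dmesg(text):
    """Count how many dmesg lines fall into each error category."""
    counts = Counter()
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines without a "[timestamp]" prefix
        message = match.group(2)
        for label, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[label] += 1
    return counts
```

Feeding it the output of `dmesg` after each crash would show whether the journal errors or the bitmap checksum errors dominate.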

onomatopellan commented 3 years ago

WSL2 vhdx disk sure needed more space to breathe.

Were C and D contiguous partitions? What do you see now in Disk Manager? Dynamic (olive) or basic disk (blue)?

livius-ungureanu commented 3 years ago

image

livius-ungureanu commented 3 years ago

Bad news is that I have:

onomatopellan commented 3 years ago

When that happens, is the ext4.vhdx disk completely enlarged? (file size over 256 GB)

livius-ungureanu commented 3 years ago

No, the current size on disk is 7.90 GB

livius-ungureanu commented 3 years ago

Weird. It does not allow me to resize.

DISKPART> Select vdisk file="C:\Users\liviu.ungureanu\AppData\Local\Packages\CanonicalGroupLimited.Ubuntu20.04onWindows_79rhkp1fndgsc\LocalState\ext4.vhdx"

DiskPart successfully selected the virtual disk file.

DISKPART> expand vdisk maximum=100000

DiskPart has encountered an error: The parameter is incorrect. See the System Event Log for more information.

Note: Ubuntu-20.04 Stopped 2

onomatopellan commented 3 years ago

Too weird. Remember you can always export a distro to another partition/disk with the wsl.exe --export option. I suspect that C and D weren't merged correctly so I'd move the distro to another disk, if possible.

livius-ungureanu commented 3 years ago

image

onomatopellan commented 3 years ago

Is the disk an HDD, SSD, or NVMe?

livius-ungureanu commented 3 years ago

SSD

onomatopellan commented 3 years ago

Ok, thanks. If you have another disk with enough space try to export the distro there.

About the DiskPart error, did you try with something like expand vdisk maximum=300000? (300 GB)

livius-ungureanu commented 3 years ago

I do not have another disk :-) 300000 worked as it is a valid value indeed.
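
One plausible reading of why 100000 was rejected and 300000 accepted: the value is in megabytes and has to exceed the disk's current maximum, which by default is 256 GB (262144 MB). A minimal sketch that builds the DiskPart script with that check, assuming this reading is right; the function name and the default for `current_max_mb` are illustrative, not taken from any tool:

```python
def build_expand_script(vhdx_path, new_max_mb, current_max_mb=256 * 1024):
    """Build a DiskPart script that expands a vhdx (hypothetical helper).

    Assumption: the new maximum (in MB) must be larger than the current one.
    """
    if new_max_mb <= current_max_mb:
        raise ValueError(
            f"new maximum {new_max_mb} MB is not larger than "
            f"the current maximum {current_max_mb} MB"
        )
    lines = [
        f'select vdisk file="{vhdx_path}"',
        f"expand vdisk maximum={new_max_mb}",
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be saved to a file and run with `diskpart /s <file>` from an elevated prompt while the distro is stopped.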

I think I've narrowed the problem down a bit.

wsl2 should suddenly stop:

[process exited with code 1]

PS C:\Users\liviu.ungureanu> wsl -l -v
  NAME            STATE      VERSION
  Ubuntu-20.04    Stopped    2

Though in this case the file system does not appear to be corrupted, even though WSL2 dies suddenly.

Question: my SSD is encrypted. Could this contribute in any way to the problem?

onomatopellan commented 3 years ago

A large file? Sounds similar to #5410. It's still open because it's hard to reproduce. Like in that thread, try using the .wslconfig file to reduce the processors and memory used by WSL2.
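
For reference, the suggestion boils down to a `[wsl2]` section with `memory` and `processors` keys in `%UserProfile%\.wslconfig`. A small sketch that renders such a file; the values 4GB and 2 are placeholders for illustration, not recommendations from this thread:

```python
import configparser
import io


def render_wslconfig(memory="4GB", processors=2):
    """Render the text of a .wslconfig that caps WSL2 memory and CPU count."""
    config = configparser.ConfigParser()
    # Placeholder values; pick limits that suit the host machine.
    config["wsl2"] = {"memory": memory, "processors": str(processors)}
    buffer = io.StringIO()
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_wslconfig())
```

The change only takes effect after the WSL2 VM is restarted, for example with `wsl --shutdown`.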

livius-ungureanu commented 3 years ago

ok, I'll give it a try with .wslconfig

craigloewen-msft commented 3 years ago

We took a look at this problem but weren't able to diagnose any obvious WSL related problems. It's definitely hard for us to repro this problem. As @onomatopellan mentioned, we're also wondering if this is disk related. If you are able to, could you please try running the same repro steps on another disk?

anaisbetts commented 3 years ago

@craigloewen-msft I hit this too, #5026 seems to be related. It seems to be more reliably triggered if you run shutdown /r /t 0 to reboot the machine. I don't believe this is related to the host disk being corrupted, I think that under certain Windows shutdown commands, the WSL2 VM is getting its power yanked instead of getting properly shut down

luigimannoni commented 3 years ago

To add to the above, a long machine suspend or hibernation also triggers the corruption.

livius-ungureanu commented 3 years ago

I decrypted my disk just to rule this out. I still get corruption when I/O suddenly becomes intensive (e.g. IntelliJ is re-indexing a large project due to some dependency update).

The corruption can be more or less virulent. In this case dmesg reported a small one: [ 2.691613] EXT4-fs (sdb): 1 orphan inode deleted

@craigloewen-msft I haven't had time to try another disk, as I am working on an office laptop. But as the picture a bit above shows, the disk looks healthy.

craigloewen-msft commented 3 years ago

Thanks for the additional info! We'll keep trying on the dev team to see if we can get a repro for this (I've tried the shutdown trick @anaisbetts mentioned a few times but haven't had any 'luck' yet). If anyone else can find a way to repro this consistently please comment it and tag me in it.

anaisbetts commented 3 years ago

@craigloewen-msft That's surprising, on my computer it hits nearly 100% of the time. Use the VM for something real, wait a bit, shutdown /r /t 0, your .zhistory file's last line is garbage

craigloewen-msft commented 3 years ago

We've identified the issue that has been causing this, and have put in a fix! I'll leave this issue open as our landing zone for the WSL repo, and will be posting updates here and on the popular issue on the WSL 2 kernel repo: https://github.com/microsoft/WSL2-Linux-Kernel/issues/168

craigloewen-msft commented 3 years ago

This is fixed in Insiders preview build 21292. Can any folks here who see this issue install this build and let us know if that resolves it for you? Thank you!

livius-ungureanu commented 3 years ago

@craigloewen-msft Good news!

Unfortunately I will have to wait for the official roll-out since on my laptop I am not allowed to run windows insider builds. Is there any other way to install/upgrade WSL2 to try out the new fix?

craigloewen-msft commented 3 years ago

Unfortunately not yet. I will update these threads with details if this fix becomes available on more versions.

benhillis commented 3 years ago

Reopening while fix is being confirmed.

gmargari commented 3 years ago

Not sure if this is related or I should open another issue, but just got my git repo corrupted. My disk is an SSD.

$ lsb_release -r
Release:        20.04
$ cat /proc/version
Linux version 4.4.0-18362-Microsoft (Microsoft@Microsoft.com) (gcc version 5.4.0 (GCC) ) #1049-Microsoft Thu Aug 14 12:01:00 PST 2020
$ dmesg
[    0.011281]  Microsoft 4.4.0-18362.1049-Microsoft 4.4.35
[    0.218674] <3>init: (1) ERROR: UtilCreateProcessAndWait:489: /bin/mount failed with status
[    0.218678] 2000

PovarovDenis commented 3 years ago

I'm still having this problem - my git is being corrupted from time to time in WSL2.

denis@DESKTOP-ANM2KR6:~/Projects/ui-admin$ git status
error: object file .git/objects/dd/45712db7f3718ac2c1eb512898eba180c241a9 is empty
error: object file .git/objects/dd/45712db7f3718ac2c1eb512898eba180c241a9 is empty
error: object file .git/objects/dd/45712db7f3718ac2c1eb512898eba180c241a9 is empty
fatal: loose object dd45712db7f3718ac2c1eb512898eba180c241a9 (stored in .git/objects/dd/45712db7f3718ac2c1eb512898eba180c241a9) is corrupt
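
The "object file ... is empty" errors point at zero-length files under `.git/objects`. A quick way to list all of them after a crash is sketched below; `find_empty_loose_objects` is a made-up helper name, and it only looks at loose objects, not packfiles.

```python
import os

HEX_DIGITS = set("0123456789abcdef")


def find_empty_loose_objects(repo_path):
    """Return the ids of zero-length loose objects in a git repository."""
    objects_dir = os.path.join(repo_path, ".git", "objects")
    empty = []
    for fanout in sorted(os.listdir(objects_dir)):
        # Loose objects live in two-hex-character directories such as "dd".
        if len(fanout) != 2 or not set(fanout) <= HEX_DIGITS:
            continue
        fanout_dir = os.path.join(objects_dir, fanout)
        if not os.path.isdir(fanout_dir):
            continue
        for name in sorted(os.listdir(fanout_dir)):
            path = os.path.join(fanout_dir, name)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                empty.append(fanout + name)
    return empty
```

Running it right after each restart would show how many objects were truncated, which is useful detail to attach to a report.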

craigloewen-msft commented 3 years ago

@PovarovDenis, what version of Windows are you using? Are you on Windows Insiders using the latest version? (If you're not sure, please just paste the output of running ver in CMD.)

kimkwanka commented 3 years ago

Is this fix already included in the non-Insiders version? I ask because my .zhistory file has been corrupted twice in the last 2 days. I'm running Windows 10 Pro [Version 10.0.19042.928].

craigloewen-msft commented 3 years ago

No, this isn't fixed in non-Insiders builds yet. We would need to verify that this is working as expected on Insider builds.

We also haven't had much luck getting a repro for this, if you are able to get a consistent repro at all then that would be very helpful to us! For what it's worth I run the latest Insider builds on my main productivity machine, use WSL daily, and run ZSH and haven't seen this corruption issue which to me indicates that when the next major version of Windows is available your issue should be fixed! :)

livius-ungureanu commented 3 years ago

@craigloewen-msft

This has become a way of life :-) as it happens at least once every 2 days. Imagine your daily setup is roughly 6 Docker containers (about 800 MB each) and 3-4 IntelliJ instances eager to update their indexes... and suddenly [process exited with code 1]. Then the whole setup needs to be brought back up.

Literally everything runs within WSL2, as on a Linux box:

If you can somehow provide me with a fixed WSL2, I am keen to install it. Unfortunately, I only have this environment on my office laptop, where Windows Insider builds are not possible.

craigloewen-msft commented 3 years ago

@livius-ungureanu as of right now the only way to get the latest changes is to be on Windows Insiders.

Do you have a separate machine that you can put on Insiders and run the same workflow on it?

I tried to repro this on the latest builds by upgrading intellij and building some containers simultaneously but wasn't able to hit this issue on my machine.

mbwhite commented 3 years ago

Not sure if this helps or not, but I've encountered this issue on my home machine and not on my work laptop. Both run Windows 10 Pro, and I've set up WSL2 the same way on both. The work laptop gets significantly more usage and hasn't hit this issue.

The home machine has hit this. The only difference I can see is that Docker Desktop for Windows was installed on the home machine (it isn't now). I've seen other reports of this error that also mention Docker Desktop.

Might spark an idea?

mthorning commented 3 years ago

Yes! This is exactly my situation, I hadn't considered Docker Desktop could be the problem, thanks.

luigimannoni commented 3 years ago

I am somewhat inclined to point at Docker as well. In fact, the times I've had data loss were on different Docker projects with containers running. I've made a habit of stopping Docker containers gracefully, exiting Docker Desktop, and switching off WSL before rebooting or shutting down.

However there are people mentioning bash history becoming corrupt too, which does not sound correlated to docker and never happened in my case.
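
The shutdown habit described above can be scripted so the order is always the same. This is only a sketch of that workaround, not a confirmed fix: the helper names are invented, exiting Docker Desktop itself is left as a manual step, and the reboot command is the one quoted earlier in this thread.

```python
import subprocess


def pre_reboot_commands(container_ids, reboot=False):
    """Return the commands to run, in order, before the host goes down."""
    commands = []
    if container_ids:
        # Stop containers gracefully first.
        commands.append(["docker", "stop", *container_ids])
    # Then shut down the WSL2 VM so ext4 is unmounted cleanly.
    commands.append(["wsl", "--shutdown"])
    if reboot:
        commands.append(["shutdown", "/r", "/t", "0"])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(pre_reboot_commands(ids, reboot=True))` from Windows would stop the containers, shut down WSL, and only then reboot.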

Anuiran commented 3 years ago

I don’t run docker, just Ubuntu 20.04 in WSL2 and PhpStorm and had git corrupted today after rebooting my pc.

craigloewen-msft commented 3 years ago

As of right now it seems fixed on the latest Windows Insider builds. If you're seeing this issue, please comment with your Windows build number, ensure that you're on the latest Windows Insider build, and include as many repro steps as you can! Thanks!

ksze commented 2 years ago

Any news on whether the fix is in release/stable build yet?

sarim commented 2 years ago

@jjaaccoobb you are running WSL version 1, which doesn't use ext4. This thread is about the ext4 filesystem in WSL2.

sozercan commented 2 years ago

> As of right now it seems fixed on the latest Windows Insider builds.

@craigloewen-msft do we need to update to Windows 11 to get this fix? Will this fix be available in future Windows 10 builds?

mhsdesign commented 2 years ago

I started getting these problems with corrupted git and a corrupt zsh history two days ago (https://github.com/microsoft/WSL/issues/5026) ... weird, because I can't recall changing anything meaningful in my setup. Before that everything worked fine. I read something about shutting WSL down carefully with wsl --shutdown, but while it was working I wasn't following any rules.

I frequently use WSL2 with: mariadb 10.3, php 7.4, git, and vscode in WSL mode.

(Oh, and yes, I should update my Windows, maybe that fixes something. I'm using: Microsoft Windows [Version 10.0.19042.1110])

f-liva commented 2 years ago

Same here

f-liva commented 2 years ago

> As of right now it seems fixed on the latest Windows Insider builds.
>
> @craigloewen-msft do we need to update to Windows 11 to get this fix? Will this fix be available in future Windows 10 builds?

Windows 11 suffers the same issue

craigloewen-msft commented 2 years ago

@f-liva when are you seeing this on Windows 11? Do you have any repro steps that we could use to help diagnose this problem?

f-liva commented 2 years ago

I'm using WSLg to run GitKraken from Debian Linux subsystem.

I have some Git repositories that weigh about 400 MB and contain many files. When I work on such a repository with GitKraken, performing normal stash, commit, push, or pull operations, the subsystem often crashes due to an EXT4-related error. GitKraken suddenly closes, PhpStorm tells me that it can no longer read files from WSL, and Docker crashes. When I restart Docker I'm often notified of an EXT4-related problem.

There is no fixed procedure to follow to replicate this problem. It occurs very often and in different operations. For example, if a first crash is caused by a file stash on the repository, the stash works the next time, and then it may crash on commit or push. In short, it all happens in connection with Git repositories that have thousands of versioned files.

As I understand it, the more operations that are active on the WSL filesystem, the more likely it is that it will crash.

The PC is new and so is the SSD, so I rule out any kind of hardware problem.

Is there any way to record these crashes in any WSL logs? If yes I could collect some and send them to you.

craigloewen-msft commented 2 years ago

Do you have an HDD or SSD?

And could you please enable logging using the instructions found here and try to reproduce a crash and then send it to me? That might give us more clues on what's going on.

I'll also try grabbing a large git repository and doing the operations you just listed.

EDIT: I just tried reproing this for a while. I installed and ran GitKraken, had Docker Desktop installed and running, git cloned the VS Code repo, made hundreds of thousands of small files and some larger files (2 GB in size), committed them, stashed them, made new branches, switched branches, and then set that all on auto repeat for 30 minutes. I was on my HDD and wasn't able to detect any crashes or corruption. We aren't able to repro this issue, so if you are seeing it, could you please list the exact steps you take when it happens so we can try to replicate it? Thank you!

f-liva commented 2 years ago

I have a brand new SSD

Check out this feedback https://aka.ms/AAdncc7