microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

WSL2 corrupts ext4 filesystem #5895

Open livius-ungureanu opened 3 years ago

livius-ungureanu commented 3 years ago

Environment

Platform ServicePack Version      VersionString
Win32NT              10.0.19041.0 Microsoft Windows NT 10.0.19041.0

lsb_release -r
Release: 20.04

cat /proc/version
Linux version 4.19.104-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Wed Feb 19 06:37:35 UTC 2020


Steps to reproduce

I am using the Linux version of IntelliJ running in WSL2, connected to an X410 server for the GUI. While IntelliJ is running some apps, WSL2 suddenly stops. After I start WSL2 again, I see that

- the filesystem has also corrupted some of my files:

cat someProjectFile 2f��)��l�l�.;�K{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/��~Gh{�7���/....



<!-- 
If you'd like to provide logs you can provide an `strace(1)`  log of the failing command (if `some_command` is failing, then run `strace -o some_command.strace -f some_command some_args`, and link the contents of `some_command.strace` in a gist. 
More info on `strace` can be found here: https://www.man7.org/linux/man-pages/man1/strace.1.html
You can use Github gists to share the output: https://gist.github.com/
-->

<!--
Collect WSL logs by following these instructions: https://github.com/Microsoft/WSL/blob/master/CONTRIBUTING.md#8-detailed-logs  
-->
**WSL logs**: 

#  Expected behavior

Do not corrupt the ext4 file system. This makes WSL2 quite unreliable; it would be good to have it fixed as soon as possible.

<!-- A description of what you're expecting, possibly containing screenshots or reference material. -->

# Actual behavior
Every 2-3 days the ext4 file system gets corrupted.

<!-- What's actually happening? -->
f-liva commented 2 years ago

Did the log come in handy?

mhsdesign commented 2 years ago

BTW, I have built myself a shortcut for wsl --shutdown. Since I started using it before shutting down Windows, I haven't had any problems (maybe an automatic PS shutdown action would do too).

f-liva commented 2 years ago

WSL + GitKraken + PhpStorm is the key to making WSL crash very often!

Just try it.

maidzen commented 2 years ago

We have 4 virtual machines (VMware Horizon) running Win10 Enterprise, with WSL2 Debian + PhpStorm.

After a reboot of the VM there is a high chance, on all VMs, that some IDE settings files are corrupted.

I couldn't reproduce the issue outside of the VM.

craigloewen-msft commented 2 years ago

The logs unfortunately only let us know that the system was corrupted, not how it got corrupted.

I will try WSL + GitKraken + PHPStorm in a VM and see what happens to see if I can repro this.

f-liva commented 2 years ago

Thanks

For my part, I will continue to send you all crash reports when they occur.

avatsaev commented 2 years ago

Getting this at least two times a week on wsl2:

error: object file .git/objects/3c/f8503dd9cc39c04f998292333e3581479e5fc1 is empty
error: object file .git/objects/3c/f8503dd9cc39c04f998292333e3581479e5fc1 is empty
fatal: loose object 3cf8503dd9cc39c04f998292333e3581479e5fc1 (stored in .git/objects/3c/f8503dd9cc39c04f998292333e3581479e5fc1) is corrupt

also on zsh_history

f-liva commented 2 years ago

It seems the WSL kernel panics. They told me they will roll out an update soon.

Blitheness commented 2 years ago

@avatsaev A comment on #5026 recommends running

find .git/objects/ -type f -empty | xargs rm; git fetch -p; git fsck --full

in your git repository, in an attempt to repair it. I just had this problem and running that worked.
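
A minimal sketch of a variant of the cleanup step, assuming GNU find (as shipped with Ubuntu). The function name `remove_empty_git_objects` is hypothetical and not part of git or WSL. It deletes only zero-byte files under `.git/objects` and prints each path it removes. `git fetch -p` and `git fsck --full` still have to be run afterwards, as in the command quoted from #5026.

```shell
# Hypothetical helper: delete zero-byte loose objects in a git repository.
# Argument: path to the repository (defaults to the current directory).
remove_empty_git_objects() {
    repo="${1:-.}"
    objects="$repo/.git/objects"
    if [ ! -d "$objects" ]; then
        echo "no .git/objects directory under $repo" >&2
        return 1
    fi
    # -print lists each empty file, -delete then removes it.
    # When nothing matches, find simply prints nothing.
    find "$objects" -type f -empty -print -delete
}
```

Called as `remove_empty_git_objects ~/dev/myrepo` (an example path), it stays silent when there are no empty objects, whereas the piped `xargs rm` form reports an error in that case.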

"rm: missing operand" means nothing was deleted (the part of the above command before the pipe character returned nothing)

colemickens commented 2 years ago

Please god, I have nowhere else to put this, but please don't enable save+restore by default. I dual boot this hard drive between Hyper-V and native, and the default save/restore action just completely trashed my entire ext4.

It's very upsetting that snapshots were blocked because of the use of a physical disk but save+restore wasn't. Imagine my horror as I realize my live services are supposedly somehow running after a restart (when the installed OS requires a decryption password to boot). Yes, Hyper-V had just held onto the VM RAM and then started replaying against the HDD in a completely different state. Absolutely awful, just sitting down to try more data recovery.

mrjrieke commented 2 years ago

So, this has been happening a lot lately. After the most recent windows update, I found I had to re-install several apps. This happens often enough now that I just keep the installers handy in the Download folder and re-install when things fail. Good news is that re-installs are trivial and finish in a couple seconds. But it's a pain and I'd rather not. I've re-installed idea @livius-ungureanu several times. I've also found that shutting down apps naturally (shutdown apps before calling: wsl --shutdown), can help prevent the problem but not always.

I really like being able to run some apps in WSL2 and having it nestled in Windows is mostly pretty handy. Would love for my linux apps not to get corrupted like this. For those using WSL-2... I really enjoyed the latest update (KB5007186) mostly aside from the broken app installs. WSL-2 seems to be much kinder to windows overall (or vice versa) both in memory and CPU utilization.

nacitar commented 2 years ago

I've been watching this ticket for over a year now... and I'm starting to think it has become a catch-all for filesystem issues that are encountered in WSL, independent of the cause, as long as EXT4 is being used. There's not a recognized pattern that has been arrived at to assign cause, and reproduction attempts are unsuccessful. I don't pretend to know the issue here... but I've begun to suspect that it's not a single WSL related issue... but a smattering of various workflows/hard drive failures/power losses at inopportune moments that get blamed on WSL because of the lack of a clear cause, combined with the "niche" nature of WSL.

dcharlespyle commented 2 years ago

Well, now this is odd. I seem now to be experiencing similar. But this did not happen to me on either a standard hard disk or an SSHD. But it started the day after upgrading to an SSD. But I couldn't take the sometimes very slow performance of the Seagate SSHD I was using anymore! I'm now using a brand-new Samsung 870 EVO SSD, with the most up-to-date firmware available applied. Diagnostics say the drive is working properly.

The very next day after installing the SSD and moving the files to it, WSL stopped working. Everything worked fine during the night and early morning of the data migration. Made use of several apps from the command line and WSLg. But then I shut down the system for a while to get some rest.

On booting up again this afternoon, WSL now no longer functions―at all. I cannot now even get a listing of any Linux Distributions. I cannot import or export anything. Restarting the service doesn't help. WSL just sits there accomplishing and displaying nothing. And I cannot even read the ext4 file system in the WSL VM at all. It seems to me that WSL doesn't seem to like running for very long on SSDs. Unfortunately, I cannot now go back to the previous SSHD. Going to try reinstalling WSL, if at all possible. But I may also have to install Windows from scratch.

Mark90 commented 2 years ago

I use VSCode (remote) and Windows Terminal to connect to and work on an Ubuntu VM in WSL2.

Over 2 weeks ago I had a different issue where WSL became unreachable, which I resolved by restarting some stuff.

Last few days I started seeing the issue of ~/.zsh_history being corrupted. Today I booted up my machine, went into Windows Terminal and found the shell history was corrupt again, but this time also a git repository I worked on had corrupted objects in its .git/ folder (other git repositories I didn't touch were fine). Was able to resolve using the aforementioned find method.

Unfortunately I don't have a way to reproduce this. (yet) Just rebooted my computer without shutting down WSL, but no corruption this time. All I can offer is the software and hardware I'm using. This is a personal machine so if I can do anything to help investigate let me know.

OS: Microsoft Windows [Version 10.0.19043.1415].

VMs: Saw some suggestions that docker or other VMs might be related, these are the ones I have (only using Ubuntu-20.04).

C:\Users\Mark>wsl -l -v
  NAME                   STATE           VERSION
* Ubuntu-20.04           Running         2
  Ubuntu-18.04           Stopped         1
  docker-desktop-data    Stopped         2
  docker-desktop         Stopped         2

Host disk: Also saw suggestions that the disk type or health on which WSL is installed could play a role. I'm using an SSD, an ADATA-SX8200PNP 1TB. In use since 19-sep-2020, SMART values are fine, Adata's diagnostics tool shows no problems. 953GB effective space, 905GB allocated for C drive, 48GB unallocated for the drive to maintain itself. C drive has 97GB free space.

Guest disk: Has enough space, and it seems to be in good shape (this was done before reboot).

root@DESKTOP:~# badblocks -sv /dev/sdb
Checking blocks 0 to 268435455
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
root@DESKTOP:~# cd /home/mark/dev
root@DESKTOP:/home/mark/dev# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        251G  5.6G  233G   3% /
root@DESKTOP:/home/mark/dev#

In dmesg as far as I can tell there's nothing concerning. Only some FS-Cache Duplicate cookie warnings that are considered harmless.

[    2.542543] FS-Cache: Duplicate cookie detected
[    2.542545] FS-Cache: O-cookie c=000000002e553703 [p=000000003e213e5a fl=222 nc=0 na=1]
[    2.542546] FS-Cache: O-cookie d=00000000fdf3c0be n=00000000d02613b3
[    2.542547] FS-Cache: O-key=[10] '34323934393337353330'
[    2.542549] FS-Cache: N-cookie c=00000000b48df181 [p=000000003e213e5a fl=2 nc=0 na=1]
[    2.542550] FS-Cache: N-cookie d=00000000fdf3c0be n=0000000051ca3ab8
[    2.542550] FS-Cache: N-key=[10] '34323934393337353330'
[    2.689208] FS-Cache: Duplicate cookie detected
[    2.689210] FS-Cache: O-cookie c=00000000afa7d30e [p=000000003e213e5a fl=222 nc=0 na=1]
[    2.689211] FS-Cache: O-cookie d=00000000fdf3c0be n=000000002803e293
[    2.689211] FS-Cache: O-key=[10] '34323934393337353435'
[    2.689213] FS-Cache: N-cookie c=000000006f2a40b5 [p=000000003e213e5a fl=2 nc=0 na=1]
[    2.689213] FS-Cache: N-cookie d=00000000fdf3c0be n=00000000bae1450c
[    2.689213] FS-Cache: N-key=[10] '34323934393337353435'
fedemarco commented 2 years ago

Hi @Mark90, I have the same issues constantly. The only way I have found to fix it so far is to follow these steps:

https://github.com/microsoft/WSL/issues/5092#issuecomment-743937383

Note that this will break the affected git repositories, which will then need to be redownloaded.

I also run VSCode in WSL, and have all my dev environment set up there.

zeejay09 commented 1 year ago

I am also experiencing this. Every 3 to 5 days, WSL2 corrupts my ext4 filesystem. If it can't be fixed by e2fsck, I resort to uninstalling and reinstalling it and setting up my work environment all over again.

anodynos commented 1 year ago

Is this issue still a thing after so many years? I just spent days migrating my linux dev environment to WSL just to find out that WSL 2 corrupts the file system ;-(

@craigloewen-msft Has the fix for Windows Insiders been released in the normal Windows 10 update?

maikebing commented 1 year ago

wsl's dmesg

[   13.304275] blk_update_request: I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[   13.306171] blk_update_request: I/O error, dev sde, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[   13.307202] Buffer I/O error on dev sde, logical block 134184960, lost sync page write
[   13.307818] JBD2: Error -5 detected when updating journal superblock for sde-8.
[   13.308443] Aborting journal on device sde-8.
[   13.308846] Buffer I/O error on dev sde, logical block 134184960, lost sync page write
[   13.309433] JBD2: Error -5 detected when updating journal superblock for sde-8.
[   13.309991] EXT4-fs error (device sde): ext4_put_super:1188: comm Xwayland: Couldn't clean up the journal
[   13.310661] EXT4-fs (sde): Remounting filesystem read-only
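
To pick lines like these out of a long kernel log, a small filter can help. This is a minimal sketch: the function name `storage_errors` is hypothetical, and the pattern list is an assumption built only from the messages shown in the log above, so it is not an exhaustive list of ext4 or block-layer errors.

```shell
# Hypothetical helper: read kernel log text on stdin and print only the
# lines that match the storage error messages seen in the log above.
storage_errors() {
    grep -E 'I/O error|JBD2: Error|Aborting journal|EXT4-fs error|Remounting filesystem read-only'
}

# Example use inside the WSL2 distribution (not executed here):
#   dmesg | storage_errors
```

The function returns a non-zero status when nothing matches, which makes it usable in a script as a quick "is the log clean?" check.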
maikebing commented 1 year ago

https://github.com/docker/for-win/issues/13340

vault-thirteen commented 1 year ago

@anodynos, it looks like @maikebing and I are having this issue today, so ... well ... let us hope that Microsoft will fix this soon, as this bug is critical to the entire WSL.

https://github.com/docker/for-win/issues/13345

makuepfer commented 1 year ago

The Docker for Windows issue seems to be resolved in the newest pre-release version of WSL, 1.2.1.0.

Credits to: docker/for-win#13386

vault-thirteen commented 1 year ago

Good news, @makuepfer! How long does it usually take for the stable version of WSL to come out?

maikebing commented 1 year ago
Please investigate the following 2 issues:

1 : The test: is the VM time synchronized?
    Failed with: parsing time output: 2023-04-15T11:01:45+00:00: parsing time "2023-04-15T11:01:45+00:00" as "2006-01-02T15:04:05UTC": cannot parse "+00:00" as "UTC"

The VM time must be in sync with the host, otherwise Docker Desktop will not work correctly.

Ensure you are using a modern WSL 2 kernel (see "wsl --update"). If this problem persists,
try manually synchronizing the VM clock with "sudo hwclock -s".

2 : The test: is the WSL 2 Linux filesystem corrupt?
    Failed with: [   15.045125] EXT4-fs error (device sdc): ext4_put_super:1188: comm Xwayland: Couldn't clean up the journal

If the WSL 2 Linux filesystem is corrupt then Docker Desktop cannot start.
There is a known issue fixed in Windows Insider builds which can cause filesystem corruption, see:
https://github.com/microsoft/WSL/issues/5895 .

Try running "wsl --shutdown" to stop your WSL Virtual Machine. When it restarts it will
run a filesystem check and hopefully fix the problem.
gibso commented 1 year ago

What I currently do, when docker is not starting anymore because of this:

  1. End Docker Desktop tasks in Task Manager, but keep Docker Desktop Service alive
  2. Open the PowerShell, shutdown wsl and delete the docker distributions

    wsl --shutdown
    wsl --unregister docker-desktop-data
    wsl --unregister docker-desktop
  3. Start Docker Desktop again
softwindy-99 commented 1 year ago

There seems to be a strange phenomenon here: although the check program shows an 'EXT4-fs error', my Docker Desktop program can still start normally.

What should I do? Or should I do nothing?

Please investigate the following 1 issue:

1 : The test: is the WSL 2 Linux filesystem corrupt?
    Failed with: [   53.530731] EXT4-fs error (device sdc): ext4_put_super:1188: comm weston: Couldn't clean up the journal

If the WSL 2 Linux filesystem is corrupt then Docker Desktop cannot start.
There is a known issue fixed in Windows Insider builds which can cause filesystem corruption, see:
https://github.com/microsoft/WSL/issues/5895 .

Try running "wsl --shutdown" to stop your WSL Virtual Machine. When it restarts it will
run a filesystem check and hopefully fix the problem.

WSL

WSL version: 1.2.5.0
Windows version: 10.0.22621.1702

Docker

Docker version 23.0.5, build bc4487a
markuszeller commented 11 months ago

I did not notice any corrupt files. Shutting down WSL does not help: I get the same error and the Docker engine still cannot start.

Update: The file system was not corrupt at all. I've removed Docker Desktop 4.22.0 and downgraded to 4.21.1 which worked immediately!

mkl- commented 10 months ago

I have a similar error. In my case, it seems that the WSL2 filesystem file gets corrupted, which prevents WSL2 from booting and produces a timeout error.

WSL 2 timeout error

The operation timed out because a response was not received from the virtual machine or container.
Error code: Wsl/Service/CreateInstance/HCS_E_CONNECTION_TIMEOUT

Here is a relatively simple way to restore a corrupted filesystem. It worked for me.

Restoring a corrupted WSL2 ext4 filesystem.

Based on:
https://github.com/microsoft/WSL/discussions/8839#discussioncomment-3703511
https://superuser.com/questions/274615/accidentally-reformatted-a-ext4-drive-to-ntfs-lost-all-data-what-are-my-option/1204121#1204121

Assume that the name of the broken WSL2 distribution is ubuntu-main and that it is the default distribution.

  1. Install some other WSL2 Linux distribution. Assume that its name is Ubuntu.
  2. Locate the WSL2 .vhdx file: C:\wsl\ubuntu-main\ext4.vhdx
  3. Connect this disk to the other WSL2 distribution:
     3.1 Start the other distribution: (WINDOWS)$ wsl -d Ubuntu
     3.2 Note down the list of disks: (LINUX)$ lsblk
     3.3 Connect the failed disk to the running WSL2. In another Windows terminal run: (WINDOWS)$ wsl -d Ubuntu --mount --bare --vhd C:\wsl\ubuntu-main\ext4.vhdx
     3.4 Switch back and find which new disk appeared: (LINUX)$ lsblk
         In my case, it was sdd.
  4. Fix the filesystem on the corrupted disk: (LINUX)$ sudo fsck.ext4 -v /dev/sdd
  5. Shut down WSL2 to disconnect the disks: (WINDOWS)$ wsl --shutdown
  6. Check that the previously broken WSL2 starts: (WINDOWS)$ wsl
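
Steps 3.2 and 3.4 amount to comparing two `lsblk` listings by eye. A minimal sketch of doing that comparison in the shell, assuming `lsblk -dn -o NAME` is used to list one disk name per line; the function name `new_disks` is hypothetical.

```shell
# Hypothetical helper: print every disk name that appears in the second
# listing but not in the first. Each argument is a whitespace-separated
# list of names, for example the output of `lsblk -dn -o NAME`.
new_disks() {
    before="$1"
    after="$2"
    for name in $after; do
        seen=no
        for old in $before; do
            if [ "$name" = "$old" ]; then
                seen=yes
            fi
        done
        if [ "$seen" = no ]; then
            echo "$name"
        fi
    done
}

# Example use inside the helper distribution (not executed here):
#   before="$(lsblk -dn -o NAME)"
#   ...attach the .vhdx from the Windows terminal, as in step 3.3...
#   after="$(lsblk -dn -o NAME)"
#   new_disks "$before" "$after"
```

Check the printed name against the expected disk size before running fsck on it, since fsck on the wrong device could make things worse.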
Edstub207 commented 9 months ago

Hello! We have spotted this with https://github.com/docker/for-win/issues/13716 - Currently unsure if the two are related, but possible. Would be good to have a fix.

silverlight commented 8 months ago

What I currently do, when docker is not starting anymore because of this:

  1. End Docker Desktop tasks in Task Manager, but keep Docker Desktop Service alive
  2. Open the PowerShell, shutdown wsl and delete the docker distributions
    wsl --shutdown
    wsl --unregister docker-desktop-data
    wsl --unregister docker-desktop
  3. Start Docker Desktop again

amazing!

markuszeller commented 8 months ago

I found another solution that happily worked on my PC.

Optional steps, but I wanted to remove Kali

ultimaweapon commented 7 months ago

I am also having this problem. In my case, some of my system files become zero-sized:

[    0.668389] FS-Cache: Duplicate cookie detected
[    0.668857] FS-Cache: O-cookie c=00000004 [p=00000002 fl=222 nc=0 na=1]
[    0.669245] FS-Cache: O-cookie d=0000000081326a9d{9P.session} n=000000007f9623b6
[    0.669560] FS-Cache: O-key=[10] '34323934393337333538'
[    0.669746] FS-Cache: N-cookie c=00000005 [p=00000002 fl=2 nc=0 na=1]
[    0.669962] FS-Cache: N-cookie d=0000000081326a9d{9P.session} n=00000000f6072358
[    0.670230] FS-Cache: N-key=[10] '34323934393337333538'
[    0.696057] scsi 0:0:0:2: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
[    0.701544] sd 0:0:0:2: Attached scsi generic sg2 type 0
[    0.702146] sd 0:0:0:2: [sdc] 2147483648 512-byte logical blocks: (1.10 TB/1.00 TiB)
[    0.702485] sd 0:0:0:2: [sdc] 4096-byte physical blocks
[    0.702700] sd 0:0:0:2: [sdc] Write Protect is off
[    0.702895] sd 0:0:0:2: [sdc] Mode Sense: 0f 00 00 00
[    0.703158] sd 0:0:0:2: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    0.704581] sd 0:0:0:2: [sdc] Attached SCSI disk
[    0.716851] EXT4-fs (sdc): mounted filesystem with ordered data mode. Opts: discard,errors=remount-ro,data=ordered. Quota mode: none.
[    0.916809] hv_pci 33d24e75-1c62-4ccc-a2c7-6197505d93a6: PCI VMBus probing: Using version 0x10004
[    0.918318] hv_pci 33d24e75-1c62-4ccc-a2c7-6197505d93a6: PCI host bridge to bus 1c62:00
[    0.918863] pci_bus 1c62:00: root bus resource [mem 0x9ffe08000-0x9ffe0afff window]
[    0.919214] pci_bus 1c62:00: No busn resource found for root bus, will use [bus 00-ff]
[    0.920055] pci 1c62:00:00.0: [1af4:1049] type 00 class 0x010000
[    0.920822] pci 1c62:00:00.0: reg 0x10: [mem 0x9ffe08000-0x9ffe08fff 64bit]
[    0.921408] pci 1c62:00:00.0: reg 0x18: [mem 0x9ffe09000-0x9ffe09fff 64bit]
[    0.922025] pci 1c62:00:00.0: reg 0x20: [mem 0x9ffe0a000-0x9ffe0afff 64bit]
[    0.924269] pci_bus 1c62:00: busn_res: [bus 00-ff] end is updated to 00
[    0.924587] pci 1c62:00:00.0: BAR 0: assigned [mem 0x9ffe08000-0x9ffe08fff 64bit]
[    0.925215] pci 1c62:00:00.0: BAR 2: assigned [mem 0x9ffe09000-0x9ffe09fff 64bit]
[    0.925717] pci 1c62:00:00.0: BAR 4: assigned [mem 0x9ffe0a000-0x9ffe0afff 64bit]
[    0.961297] /sbin/ldconfig:
[    0.961301] File /usr/lib/libreadline.so is empty, not checked.

[    0.964553] /sbin/ldconfig:
[    0.964566] File /usr/lib/libgettextpo.so is empty, not checked.

[    0.966586] /sbin/ldconfig:
[    0.966588] File /usr/lib/libgettextlib.so is empty, not checked.
intellild commented 1 month ago

Holy crap, it has been years. Is there any update on this? Every time I tried WSL, it resulted in random contents in git repositories and shell history, and I have already switched my workstation to a Linux desktop. It works smoothly, without the ugly font rendering and the power-wasting random background processes from Windows.