microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

flood of hv_storvsc errors, high cpu usage even if idle #9173

Open xworld21 opened 1 year ago

xworld21 commented 1 year ago

Version

Microsoft Windows [Version 10.0.19044.2311]

WSL Version

Kernel Version

5.15.74.2-microsoft-standard-WSL2

Distro Version

Ubuntu 22.04

Edit: and Debian Bullseye, Fedora Remix for WSL 37, with both systemd enabled and disabled.

Other Software

Edit: McAfee Antivirus, Sophos Safeguard. The issue disappeared after I switched antivirus (from McAfee to MS Defender) and management software (removed Sophos Safeguard, but BitLocker still enabled).

Repro Steps

Open Ubuntu (or any other distro -- I have seen this in Debian and the Fedora Remix for WSL).

Expected Behavior

Normal performance.

Actual Behavior

High CPU usage from rsyslog and journald, around 15-20% CPU, reflected by high CPU usage by Vmmem. Disabling kernel logging in rsyslog and journald quiets the machine, but I still see some 10% CPU from Vmmem. The cause seems related to the following message flooding the kernel log:

[  271.065940] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#599 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
[  271.066203] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#599 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
[  271.066412] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#599 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

I couldn't find any info about this particular error, so I am reporting this here.

The only way to stop this, at least that I could find, is to shut down the virtual machine altogether.

Edit: the above messages are SCSI read errors (see kernel line generating the message).
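For readers trying to interpret the flood, the fields in each message can be decoded mechanically. The following is a minimal Python sketch and not part of the original report; the function name decode_storvsc_line and the lookup tables are illustrative and only cover values relevant to this thread. It relies on these standard meanings: SCSI opcode 0x28 is READ(10), SCSI status 0x2 is CHECK CONDITION, SRB status 0x4 is SRB_STATUS_ERROR, and 0xc0000001 is the NTSTATUS code STATUS_UNSUCCESSFUL.

```python
import re

# Illustrative lookup tables (assumption: only the values relevant here).
SCSI_OPCODES = {0x00: "TEST UNIT READY", 0x28: "READ(10)"}
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}
SRB_STATUS = {0x01: "SRB_STATUS_SUCCESS", 0x04: "SRB_STATUS_ERROR"}
NT_STATUS = {0x00000000: "STATUS_SUCCESS", 0xC0000001: "STATUS_UNSUCCESSFUL"}

# Matches the hv_storvsc message format quoted in this issue.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+hv_storvsc\s+(?P<dev>[0-9a-f-]+):\s+"
    r"tag#(?P<tag>\d+)\s+cmd\s+(?P<cmd>0x[0-9a-f]+)\s+status:\s+"
    r"scsi\s+(?P<scsi>0x[0-9a-f]+)\s+srb\s+(?P<srb>0x[0-9a-f]+)\s+"
    r"hv\s+(?P<hv>0x[0-9a-f]+)"
)


def decode_storvsc_line(line):
    """Decode one hv_storvsc kernel log line, or return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    cmd = int(m["cmd"], 16)
    scsi = int(m["scsi"], 16)
    srb = int(m["srb"], 16)
    hv = int(m["hv"], 16)
    return {
        "timestamp": float(m["ts"]),
        "device": m["dev"],
        "tag": int(m["tag"]),
        # Unknown codes fall back to their hex representation.
        "command": SCSI_OPCODES.get(cmd, hex(cmd)),
        "scsi_status": SCSI_STATUS.get(scsi, hex(scsi)),
        "srb_status": SRB_STATUS.get(srb, hex(srb)),
        "hv_status": NT_STATUS.get(hv, hex(hv)),
    }
```

Feeding it one of the lines above shows a READ(10) that the host side failed with a generic error, which matches the "SCSI read error" reading but does not by itself say why the read failed.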

Diagnostic Logs

WslLogs-2022-11-18_21-28-14.zip

xufan6 commented 1 year ago

Same issue for me. WSL had systemd enabled.

leonardodalinky commented 1 year ago

Same. Want solutions.

LarsErikP commented 1 year ago

I got this on a regular Ubuntu 22.04 VM as well (on Windows Server 2012R2). Seems to be solved by removing the (unused) SCSI controller.

xworld21 commented 1 year ago

To thicken the plot, the issue has completely disappeared on my machine after my IT issued an update, which for the most part replaced an old management tool with Intune. I am investigating the exact changes that were applied – there may have been subtle changes about BitLocker as well.

PS: the error message is indeed a SCSI read error.

LarsErikP commented 1 year ago

I got this on a regular Ubuntu 22.04 VM as well (on Windows Server 2012R2). Seems to be solved by removing the (unused) SCSI controller.

Never mind. It's still flooding the logs..

xworld21 commented 1 year ago

I received more details about my machine. We changed both antivirus (from McAfee to Defender) and some management software dealing with encryption (we got rid of Sophos Safeguard). The antivirus sounds like a plausible source of read errors towards the VM disk.

Do the other people here have a non-MS antivirus running?

Jerry-Ma commented 1 year ago

I am seeing the same thing here, in WSL Ubuntu 22.04. I also have systemd enabled; not sure if this is related.

xworld21 commented 1 year ago

I wasn't very clear in my original report, but systemd is not related to this. Even if you disable it, you will still get a flood in the kernel log (run dmesg to verify). Within the VM you will not see high CPU usage once journald is not running, but you will still get high CPU usage from Vmmem on the Windows side. This may be related to your Windows installation – I encourage anybody who has commented here to add details about their encryption and antimalware setup. (Grasping at straws here!)

cwarlich commented 1 year ago

To mitigate the issue, you may run WSL using a self-compiled kernel (see https://github.com/microsoft/WSL2-Linux-Kernel) and apply the patch below. It stops the message flood in dmesg (and in /var/log/syslog and /var/log/kern.log if running WSL with systemd). It also reduces the idle load to a third of what I saw before. Certainly not the ultimate fix, but still much better than nothing :-).

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6110dfd903f7..81c4701b1a0b 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1201,11 +1201,14 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
            vstor_packet->vm_srb.srb_status != SRB_STATUS_SUCCESS) {

                /*
-                * Log TEST_UNIT_READY errors only as warnings. Hyper-V can
-                * return errors when detecting devices using TEST_UNIT_READY,
-                * and logging these as errors produces unhelpful noise.
+                * Log TEST_UNIT_READY and READ_10 errors only as warnings.
+                * Hyper-V can return errors when detecting devices using
+                * TEST_UNIT_READY, and WSL returns READ_10 errors on some
+                * systems when systemd is enabled. Logging these as errors
+                * produces unhelpful noise.
                 */
-               int loglevel = (stor_pkt->vm_srb.cdb[0] == TEST_UNIT_READY) ?
+               int loglevel = (stor_pkt->vm_srb.cdb[0] == TEST_UNIT_READY ||
+                               stor_pkt->vm_srb.cdb[0] == READ_10) ?
                        STORVSC_LOGGING_WARN : STORVSC_LOGGING_ERROR;

                storvsc_log(device, loglevel,

vidarlo commented 1 year ago

Same behavior suddenly appeared after installing a WSL update using wsl.exe --update. Version info:

WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19045.2364

Messages:

[   97.647632] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#315 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

Repeated 100 times per second approx.

A good fix is very welcome, because this is not useable.
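The message rates quoted in this thread (from a few per second to thousands per second) can be estimated from the dmesg timestamps rather than by eye. The sketch below is an illustration only; the function name storvsc_rate is hypothetical, and it assumes the default dmesg timestamp format, in seconds since boot, shown in the log lines above.

```python
import re

# Captures the timestamp of every hv_storvsc line in dmesg-style output.
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+hv_storvsc\b")


def storvsc_rate(dmesg_text):
    """Return hv_storvsc messages per second, or None if it cannot be estimated."""
    stamps = [float(m.group(1)) for m in TS_RE.finditer(dmesg_text)]
    if len(stamps) < 2:
        return None
    span = stamps[-1] - stamps[0]
    if span <= 0:
        return None
    # N messages delimit N - 1 intervals over the observed time span.
    return (len(stamps) - 1) / span
```

One way to use it is to pass in the captured output of dmesg, for example via subprocess.run(["dmesg"], capture_output=True, text=True).stdout; depending on the distro configuration, reading the kernel log may require root.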

thoralt commented 1 year ago

Same here, getting more than 2000 error messages per second, battery is draining fast.

Edit:

WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.2364

Anti virus: MS Defender
Encryption: BitLocker

Does anybody know if this problem occurred in earlier versions of WSL? If not, is there any way to downgrade?

cquick01 commented 1 year ago

Also seeing the logs flooded with similar messages, hundreds per second. Bogging down the whole system.

[  249.512760] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
[  249.513169] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
[  249.513565] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#683 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

wsl.exe --version
WSL version: 1.0.3.0
Kernel version: 5.15.79.1
WSLg version: 1.0.47
MSRDC version: 1.2.3575
Direct3D version: 1.606.4
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.2486

Edit: We also have BitLocker enabled, along with Sophos SafeGuard 8.20. Sophos Intercept X is running instead of MS Defender.

cswrd commented 1 year ago

My workaround for the high cpu usage problem was to downgrade WSL. I don't mean WSL2 to WSL1, but some older WSL2 version with an older Linux kernel.

Some background information on my specific case: I upgraded WSL because of a VS Code issue that prevented opening WSL2 folders from the Windows Explorer context menu. VS Code reports Failed to connect to the remote extension host server (Error: Missing proxy instance MainThreadDebugService). The bug came back after the downgrade, but I now work around it differently: opening the folders from within VS Code or through the Windows recently used (pinned) folders works without any issues for me.

vidarlo commented 1 year ago

My workaround for the high cpu usage problem was to downgrade WSL (https://github.com/microsoft/WSL/issues/9383#issuecomment-1400806801). I don't mean WSL2 to WSL1, but some older WSL2 version with an older Linux kernel.

I attempted older releases, all the way back to 0.70.5, downloaded from the releases page. Still same issue on all of them.

cswrd commented 1 year ago

@vidarlo On Windows 10 those releases didn't work for me either. AFAIK all of them require Windows 11. However, as mentioned in the downgrade link above, the cab files from the Windows Update Catalog worked for me.

5kind commented 1 year ago

My WSL crashed the other day and I couldn't log in to any distribution. I used wsl -t arch to terminate and wsl --unregister arch to unregister Arch Linux, but it got completely stuck without any response, and I had to restart my computer. Today I found two very big log files in my WSL Ubuntu:

-rw-r----- 1 syslog adm 120805030800  2月  5 00:00 /var/log/kern.log.1
-rw-r----- 1 syslog adm 120805369192  2月  5 00:00 /var/log/syslog.1

I found that the last hundred lines of the two log files kept repeating the following:

Feb 5 00:00:16 ... kernel: [141600.214394] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#511 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

I'm not sure, but I believe the two are related. I have systemd enabled. I don't know whether antivirus has anything to do with this; I don't think I ran any particular program that day. It just appeared suddenly and crashed.
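For scale, each of the two files listed above is 120,805,030,800 bytes or more, which is roughly 112.5 GiB. A small script can flag runaway logs like these before they fill the disk. This is only a sketch: the helper names large_files and to_gib and the threshold in the usage note are made up for illustration, not taken from any WSL tooling.

```python
from pathlib import Path


def to_gib(size_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return size_bytes / 2**30


def large_files(directory, threshold_bytes):
    """Return (path, size) pairs for regular files directly inside `directory`
    that are larger than `threshold_bytes`, biggest first."""
    found = []
    for entry in Path(directory).iterdir():
        try:
            if entry.is_file():
                size = entry.stat().st_size
                if size > threshold_bytes:
                    found.append((entry, size))
        except OSError:
            # Skip entries that vanish or cannot be inspected.
            continue
    return sorted(found, key=lambda pair: pair[1], reverse=True)
```

For example, large_files("/var/log", 10 * 2**30) would list anything over 10 GiB in /var/log; reading that directory may require elevated permissions on some distros.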

lm1baker commented 1 year ago

My colleague has exactly the same hardware (corporate-purchased laptop with the same CPU, motherboard, SSD and RAM), kernel, Windows 10 and WSL versions as me. We are both running Ubuntu 20.04 LTS in Windows. He gets a few messages like this per second in dmesg:

[   96.415597] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#46 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

I do not.

Also, disk active time in Task Manager is 100% only when WSL is running, while any disk operation in WSL is very slow, perhaps 10% of what it should be (for example, a database query on the same db on local storage takes 30 seconds vs 3).

WSL version:

> wsl --version
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19045.2965

~ uname -r
5.15.90.1-microsoft-standard-WSL2

lm1baker commented 1 year ago

The cause was a bad block on the Windows host SSD. When WSL tried to read the part of the VHDX corresponding to the bad block, it appeared to get stuck in this condition.

blackisle51 commented 1 year ago

Bit of an edge case, but I had a similar issue on an Ubuntu Hyper-V guest (on a Windows 10 host) with two USB ext4-formatted HDDs directly mounted in the VM. The USB discs are "offline" in the Windows host disc manager and attached as physical discs to the SCSI controller in the Hyper-V config. With WSL installed on the host, the Ubuntu VM guest was generating the flood of hv_storvsc messages; removing WSL from the Win10 host appears to have stopped them. I will continue to monitor this (as it was filling the /var/log partition every few days), but fingers crossed this has got round the issue.

*** Edit - errors returned, looks like it was a bad disc too.

lucdew commented 11 months ago

I had the same issue, and the SSD disk usage was also 100% in Task Manager. It turned out to be bad-sector errors on the host's drive as well.

It was fixed by running a disk check on the C drive from the Windows host as admin, followed by a reboot:

chkdsk C: /f /r /x

Then, when I booted the Ubuntu distro, the root filesystem was mounted read-only due to filesystem errors, which I confirmed with:

sudo e2fsck /dev/sdd -p

I had to repair the filesystem with:

sudo e2fsck /dev/sdd -y

Then it went back to normal.
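The read-only remount described above can be detected from inside the distro before reaching for e2fsck. The following is a minimal sketch, assuming the usual /proc/mounts layout of whitespace-separated fields (device, mount point, filesystem type, options); the function name readonly_mounts is made up for illustration.

```python
def readonly_mounts(mounts_text):
    """Return the mount points whose option list contains the exact option 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, options = fields[1], fields[3]
        # Compare whole options so that 'errors=remount-ro' is not a false hit.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result
```

Calling readonly_mounts(open("/proc/mounts").read()) and checking whether "/" is in the result is one way to use it. Some pseudo-filesystems are legitimately mounted read-only, so only the entry for the root filesystem is of interest here.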

gitovska commented 9 months ago

I had the same issue, and the SSD disk usage was also 100% in Task Manager. It turned out to be bad-sector errors on the host's drive as well.

It was fixed by running a disk check on the C drive from the Windows host as admin, followed by a reboot:

chkdsk C: /f /r /x

Then, when I booted the Ubuntu distro, the root filesystem was mounted read-only due to filesystem errors, which I confirmed with:

sudo e2fsck /dev/sdd -p

I had to repair the filesystem with:

sudo e2fsck /dev/sdd -y

Then it went back to normal.

I've tried this to no avail. Does anybody have an update? Is this a confirmed hardware issue?

tgaff commented 9 months ago

I've tried this to no avail. Does anybody have an update? Is this a confirmed hardware issue?

I don't think this is always a hardware issue. I tested all drives involved where I was seeing this and found no errors. Currently I've scheduled the Linux VM to reboot every day and Windows once a week. I also saw a Hyper-V patch in one batch of updates but didn't dig further. Not sure if either of those has "fixed" it, but I haven't seen this error in over a month now.

micha-1987 commented 8 months ago

I am having the same problem with exactly the same error messages.

sudo tail -n 100 /var/log/kern.log | less
Feb 19 08:24:44 md2s67dc kernel: [ 855.493826] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#208 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001
Feb 19 08:24:44 md2s67dc kernel: [ 855.493995] hv_storvsc fd1d2cbd-ce7c-535c-966b-eb5f811c95f0: tag#208 cmd 0x28 status: scsi 0x2 srb 0x4 hv 0xc0000001

This behaviour flooded my 1TB disk in a couple of days. I also checked my other computers, where WSL2 has been running for quite a while, and discovered the same issue: syslog and kern.log are about 50GB in size, even though that hasn't led to a disk space problem so far. I tried to chkdsk my C:\ drive without success. I also can't run sudo e2fsck /dev/sdc -p because it says that the drive is currently in use, and I don't know how to unmount it; sudo umount /dev/sdc says that the target is busy.

Finally, I also want to use Docker, which often ends up with unresponsive containers when WSL is repeatedly writing error messages. Many thanks for any recommendations.

ganyyy commented 7 months ago

I had the same issue, and the SSD disk usage was also 100% in Task Manager. It turned out to be bad-sector errors on the host's drive as well.

It was fixed by running a disk check on the C drive from the Windows host as admin, followed by a reboot:

chkdsk C: /f /r /x

Then, when I booted the Ubuntu distro, the root filesystem was mounted read-only due to filesystem errors, which I confirmed with:

sudo e2fsck /dev/sdd -p

I had to repair the filesystem with:

sudo e2fsck /dev/sdd -y

Then it went back to normal.

I wanted to express my gratitude for the suggestion to use chkdsk to address disk issues related to WSL2. I applied the chkdsk tool on the physical drive where my WSL2 disk resides. This approach successfully resolved my disk usage issues. Thanks for sharing such a useful solution!

K-PANIK commented 3 months ago

Same errors; system instability (lag) detected simply by navigating to the WSL kernel syslog.

WSL version: 2.1.5.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22635.3858