microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

Clock skew issues megathread #10006

Closed craigloewen-msft closed 9 months ago

craigloewen-msft commented 1 year ago

Megathread

Current status: waiting on backport for kernel patch to mitigate issue.

We're creating this megathread to track the clock skew issues in WSL in one place, and will keep this parent comment current with any updates.

Background

Sometimes the WSL clock can become skewed after resume from sleep (specifically S0). See some example related issues for more info: https://github.com/microsoft/WSL/issues/8318 https://github.com/microsoft/WSL/issues/8204 https://github.com/microsoft/WSL/issues/7255

Potential workarounds

Use systemd to force clock sync

See this comment: https://github.com/microsoft/WSL/issues/8204#issuecomment-1338334154

Set the hardware clock via a command

Run sudo hwclock -s. More info here.

Run ntpdate on distro start up

Edit /etc/wsl.conf to have this content:

[boot]
command="ntpdate ntp.ubuntu.com"

This will force a clock reset on start up of the distro.

Build a private kernel with this patch

dhensen commented 1 year ago

@pmartincic Btw, thanks for looking into this!

gorbunovav commented 1 year ago

@dhensen e.g. having Docker running seems to be mitigating this issue.

Clockwork-Muse commented 1 year ago

Using WSL-app via the Microsoft Store instead of via the enabled windows features

Probably this one? Or maybe some form of A/B feature testing.

In my case:

Using a new AMD machine instead of Intel

My machine has an on-die Intel GPU and an Nvidia dGPU.

having Docker running seems to be mitigating this issue.

Didn't mitigate it (Docker's the reason I enabled systemd in the first place).

gorbunovav commented 1 year ago

Didn't mitigate it (Docker's the reason I enabled systemd in the first place).

Hmm. For me it does. I was rarely seeing this issue on my desktop, because I usually have Docker constantly running there. And if there was a clock skew, it usually meant that Docker had been stopped and I was just launching it.

Recently, I've been working on a laptop and prefer not to run Docker locally, and I encounter this issue much more frequently.

In both cases (desktop, laptop) the WSL is installed via Store. Intel CPU in both PCs.

patricklangsonos commented 1 year ago

@craigloewen-msft this thread has a lot of workarounds and speculation, but it's not clear what info you need from the community to root cause the problem.

Is there any chance you can share what evidence we can look at on machines that are working correctly? What systemd unit or kernel module (hv_utils?) should be logging a notification that a time sync is needed now rather than when a downstream unit such as chronyd or a tool such as ntpdate finds clock skew after the fact?

patricklangsonos commented 1 year ago

Here's another recent correction. I totaled up the sleep time trying to correlate the drift between chronyd's journal entries and Windows sleep counters.

Here's what I saw from chronyd:

Sep 19 04:20:34 LENOVO-T14 chronyd[329]: System clock wrong by 124.099253 seconds
Sep 19 08:09:19 LENOVO-T14 chronyd[329]: System clock wrong by 63297.185604 seconds

Shortly after that I ran ntpdate to correct the clock and get the most current drift:

patrick@LENOVO-T14:~$ date
Tue Sep 19 08:58:41 PDT 2023
patrick@LENOVO-T14:~$ sudo ntpdate time.windows.com
[sudo] password for patrick:
20 Sep 10:46:30 ntpdate[4830]: step time server 168.61.215.74 offset +92854.955337 sec

I used powercfg /systempowerreport to get a list of times the machine was sleeping.

2023-09-19 

00:03:30 - duration 3:00:03
03:03:35 - duration 5:51:38
12:39:26 - duration 0:02:09
16:12:26 - duration 6:04:38
22:19:32 - duration 4:41:32

2023-09-20

03:01:04 - duration 6:48:44

10:41:58 - report generated

The total duration of all those sleeps on 9-19/20 is 26:28:44, which is relatively close to the time correction required (25:47:35)
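
For anyone who wants to repeat the arithmetic on their own machine, here is a minimal Python sketch. The durations and the ntpdate offset are copied from the output above; the helper names (parse_hms, format_hms) are just mine for illustration.

```python
from datetime import timedelta

# Sleep durations copied from the powercfg /systempowerreport list above.
SLEEP_DURATIONS = ["3:00:03", "5:51:38", "0:02:09", "6:04:38", "4:41:32", "6:48:44"]

# Offset reported by ntpdate above, in seconds.
NTPDATE_OFFSET_SECONDS = 92854.955337


def parse_hms(text: str) -> timedelta:
    """Convert an 'H:MM:SS' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta: timedelta) -> str:
    """Format a timedelta as 'H:MM:SS', rounded to the nearest second."""
    total = round(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


total_sleep = sum((parse_hms(d) for d in SLEEP_DURATIONS), timedelta())
correction = timedelta(seconds=NTPDATE_OFFSET_SECONDS)

print("total sleep:", format_hms(total_sleep))
print("correction: ", format_hms(correction))
print("difference: ", format_hms(total_sleep - correction))
```

This gives 26:28:44 of sleep against a 25:47:35 correction, so the two differ by 0:41:09.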

This is on a Lenovo T14 AMD-based machine, versions:

wsl --version
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2283

ghost commented 1 year ago

@patricklangsonos, Ideally we'll want to know from people which powerstate transitions cause this. But I don't know a good way to gather that information yet.

mungojam commented 1 year ago

@patricklangsonos, Ideally we'll want to know from people which powerstate transitions cause this. But I don't know a good way to gather that information yet.

For the high CPU symptom, I believe it is hibernation that causes it. I rarely suspend my work pc but often hibernate it and get the symptom multiple times per week

Viqsi commented 1 year ago

The total duration of all those sleeps on 9-19/20 is 26:28:44, which is relatively close to the time correction required (25:47:35)

I opted to do a little numerology because of a pet theory in the back of my mind that it's an issue with sleep times that go beyond a certain time threshold, and noticing that if you remove the 00:02:09 sleep, the difference is exactly 39 minutes. If that theory is correct, then (on that machine, anyways) any sleep state that lasts longer than 7 minutes 48 seconds would lead to a time correction needed for the duration of the remainder of the sleep state.

I doubt it's actually that simple but it piqued my curiosity.
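
In case anyone wants to check the numbers, here is a rough Python sketch of that theory. It assumes the model is simply "only the part of each sleep beyond a fixed threshold goes missing"; the function names are made up for this post.

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string into whole seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def predicted_correction(sleeps, threshold):
    """Sum the part of each sleep (in seconds) that exceeds the threshold."""
    return sum(max(0, sleep - threshold) for sleep in sleeps)


# Sleep durations and required correction quoted from the report above.
sleeps = [hms_to_seconds(d) for d in
          ("3:00:03", "5:51:38", "0:02:09", "6:04:38", "4:41:32", "6:48:44")]
correction = hms_to_seconds("25:47:35")

# Leave out the 0:02:09 sleep, then spread the leftover evenly over the rest.
long_sleeps = [s for s in sleeps if s != hms_to_seconds("0:02:09")]
leftover = sum(long_sleeps) - correction
threshold = leftover / len(long_sleeps)

print("leftover seconds:", leftover)
print("threshold seconds:", threshold)
print("prediction matches:", predicted_correction(sleeps, threshold) == correction)
```

The leftover is 2340 seconds (39 minutes) across five sleeps, i.e. 468 seconds (7 minutes 48 seconds) each. The match is by construction, since the threshold is derived from the same data, so a second day or a second machine would be needed to actually test it.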

patricklangsonos commented 1 year ago

@patricklangsonos, Ideally we'll want to know from people which powerstate transitions cause this. But I don't know a good way to gather that information yet.

Yeah, I think that's a possibility. I have a desktop machine on the same Win11 & WSL builds that is not reproducing it. It's also a different CPU though - the Lenovo T14 is a Ryzen 5xxx pro series, the desktop is a 7xxx series.

cmullendore commented 1 year ago

I've been vexed by this issue for a long time as well. I'm pretty sure that for Linux OSs running directly on hardware, particularly those that are intended for clients that do S states all of the time, this is a fixed issue. There is the possibility that constantly changing the datetime on an OS instance could freak out some long-lived services, but on WSL, which isn't necessarily designed to run 24x7 anyway, that should be a lesser issue. So... A question and some thoughts... (NOTE: I do C#, not hardware, I'm not a true Linux dev, and I won't claim to be a Linux or necessarily hardware expert... but I'm gonna give ideas rather than just complaints).

Ideas:

I'm a superfan of WSL and use it for a lot of dev (frequently running mysqld, better or worse) so the time sync issue does matter to me. A Windows guest in Hyper-V has this solved. "Enlightened" OSs are supposed to benefit from the services offered by Hyper-V intelligently... so I'm thinking there MUST be a way to get this to work.

wpwoodjr commented 1 year ago

Use systemd to force clock sync

See this comment: https://github.com/microsoft/WSL/issues/8204#issuecomment-1338334154

cmullendore commented 1 year ago

@wpwoodjr I hear you and have done similar workarounds... but the fundamental truth is that this shouldn't be a burden on users, not everyone is experienced enough to implement such a fix, and if a Windows guest can do it automatically WSL should too, given that it's a Microsoft feature.

Being a geek, I get hacks... I'm saying there must be a way for either MS or the distros to solve this without us having to do hacks. Plus, if MS considered the above to be a sufficient fix, I'm guessing they'd close this thread. :)

wpwoodjr commented 1 year ago

@cmullendore @G-Rath Just reminding everyone who comes to this thread who may not have seen the post at the top. I don't know why MS can't reproduce the issue, but the systemd fix works great with minimal impact.

multizone-uk commented 1 year ago

This is still an issue. (Unable to use auth due to time drift).

Build: 22621
Branch: ni_release
Release: Ubuntu 22.04.2 LTS
Kernel: Linux 5.15.90.1-microsoft-standard-WSL2
Uptime: 8d 17h 29m

Fixed (until next drift) by:

sudo apt-get install ntpdate
sudo ntpdate time.windows.com

ghost commented 1 year ago

I want to clarify and ask, has anyone experienced drift that was not associated with a suspend/resume scenario? Just want to make sure I'm not missing something as I read through and troubleshoot this.

Clockwork-Muse commented 1 year ago

For a good amount of time hwclock -s was working for me, but, as has been mentioned in this thread, it's stopped working for me now.

Of course, the fun times started because it reared its head by making my devcontainer stop working, because cert validation failed - which was evidenced by the logs claiming "unable to get local certificate", which had me all confused. (The clock was way off, but the actual problem was caused by me being on my corporate network, which has SSL interception, and me not having the corp certs installed)

ghost commented 1 year ago

For those seeing time discontinuity on resume. What power states do you see listed as available on powercfg /a, What windows versions are you running? cmd.exe /c ver I've tracked down one source of this but want to confirm there are not others.

@Clockwork-Muse, @benc-uk, @0xabu, @lewissbaker You all mentioned seeing that the virtualized hardware clock was out of sync. How do you reproduce this? That would be a separate bug from the hypervisor failing to deliver time notifications to the guest on resume mentioned above. What windows versions are you running? cmd.exe /c ver Everyone on a desktop OS? Inbox WSL? If lifted WSL what versions?
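
If it helps with comparing replies, here is a small Python sketch that reduces pasted powercfg /a text to just the available states. It assumes the two "The following sleep states are ..." header lines that powercfg prints on an S0 machine; the function name and the sample text are only for illustration.

```python
def available_sleep_states(powercfg_output):
    """Return the state names listed under the 'available' header."""
    states = []
    in_available = False
    for raw in powercfg_output.splitlines():
        line = raw.strip()
        if line.startswith("The following sleep states are available"):
            in_available = True
        elif line.startswith("The following sleep states are not available"):
            in_available = False
        elif in_available and line:
            states.append(line)
    return states


# Sample text in the shape powercfg /a prints on a Modern Standby machine.
sample = """The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

The following sleep states are not available on this system:
    Standby (S3)
        This standby state is disabled when S0 low power idle is supported.
"""

print(available_sleep_states(sample))
```

Only the first section is kept; the reasons listed under the unavailable states are ignored.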

patricklangsonos commented 1 year ago

(btw - ver reports the same version as wsl --version)

Here is the AMD-based system that reproduces the problem.

wsl --version
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2283

The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

As an aside, I'd be happy to try S3. S0 is a feature that serves no purpose for me since I don't use any "modern apps" benefitting from S0

lorengordon commented 1 year ago

For those seeing time discontinuity on resume. What power states do you see listed as available on powercfg /a, What windows versions are you running? cmd.exe /c ver I've tracked down one source of this but want to confirm there are not others.

❯ powercfg /a
The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

The following sleep states are not available on this system:
    Standby (S1)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S2)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S3)
        This standby state is disabled when S0 low power idle is supported.

    Hybrid Sleep
        Standby (S3) is not available.
        The hypervisor does not support this standby state.
❯ cmd.exe /c ver

Microsoft Windows [Version 10.0.22621.2283]

cmullendore commented 1 year ago

@lorengordon in response to your above request, powercfg is identical, win ver 10.0.22621.2361

I'm def not a linux expert but I'd love to provide anything helpful re: troubleshooting data. For kicks, I was thinking a systemd service that logs any valuable information to a file, on repeat?


[Unit]
Description=Chris TimeWatch Service
After=network.target

[Install]
WantedBy=multi-user.target

[Service]
Type=simple
User=root
Group=root
PIDFile=/run/timewatch.pid
ExecStart=/root/timewatch.sh
TimeoutSec=infinity
Restart=on-failure
RuntimeDirectory=/root
RuntimeDirectoryMode=755
LimitNOFILE=10000

timewatch.sh:
#!/bin/sh
# Log the date and the cpu lines of /proc/stat once per second.
while true; do { date; grep cpu /proc/stat; } >> /root/date.txt; sleep 1; done

Maybe with the right metrics inside the instance, enabling debugConsole, and ensuring the right host event logs are enabled, valuable repro data could be provided?
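
One metric that might be worth logging is the moment the wall clock gets stepped. Below is a rough Python sketch, under the assumption that a hard sync steps the wall clock (time.time) but not the monotonic clock (time.monotonic), so a step shows up as the two clocks advancing by different amounts between samples. It would only show when a correction is applied, not the drift before it. The function names and the sample numbers are invented for illustration.

```python
import time


def sample_clocks(count, interval=1.0):
    """Collect (wall, monotonic) pairs, one every `interval` seconds."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), time.monotonic()))
        time.sleep(interval)
    return samples


def find_clock_steps(samples, tolerance=1.0):
    """Return (index, step_seconds) wherever wall time moved differently
    from monotonic time by more than `tolerance` seconds."""
    steps = []
    for i in range(1, len(samples)):
        wall_prev, mono_prev = samples[i - 1]
        wall, mono = samples[i]
        step = (wall - wall_prev) - (mono - mono_prev)
        if abs(step) > tolerance:
            steps.append((i, step))
    return steps


# Made-up samples: the wall clock jumps forward between the 2nd and 3rd.
example = [(1000.0, 50.0), (1001.0, 51.0), (1125.0, 52.0), (1126.0, 53.0)]
print(find_clock_steps(example))
```

On a live instance, something like find_clock_steps(sample_clocks(600)) would watch for ten minutes.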

Happy to help and provide whatever. Downvote this comment and I'll get the point. :)

dhensen commented 1 year ago

Lenovo Z13 AMD 6850U laptop:

C:\Users\dinoh>powercfg /a
The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

The following sleep states are not available on this system:
    Standby (S1)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S2)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S3)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Hybrid Sleep
        Standby (S3) is not available.
        The hypervisor does not support this standby state.

C:\Users\dinoh>cmd.exe /c ver

Microsoft Windows [Version 10.0.22621.2283]

C:\Users\dinoh>wsl --version
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2283

sabaronett commented 1 year ago

For those seeing time discontinuity on resume. What power states do you see listed as available on powercfg /a, What windows versions are you running? cmd.exe /c ver I've tracked down one source of this but want to confirm there are not others.


PS > powercfg /a
The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

The following sleep states are not available on this system:
    Standby (S1)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S2)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S3)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Hybrid Sleep
        Standby (S3) is not available.
        The hypervisor does not support this standby state.

PS > cmd.exe /c ver

Microsoft Windows [Version 10.0.22621.2361]

PS > wsl --version
WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.22621.2361

ghost commented 1 year ago

Thanks everyone for the speedy replies! I think I'm reproducing the same issue you're seeing on resume, we've found a cause. Don't have a fix or an eta yet. It's getting bounced around at the moment.

Still looking at the issues reported by @Clockwork-Muse, @benc-uk, @0xabu, @lewissbaker to see if I can reproduce the virtualized hardware clock losing sync.

As an aside, I'd be happy to try S3. S0 is a feature that serves no purpose for me since I don't use any "modern apps" benefitting from S0

@patricklangsonos, if you know how to access S3 on that machine, sure? I think your build is new enough to have had the change that was written in 2019 that worked on S3 when I tested it.

cmullendore commented 1 year ago

...at the risk of being annoying... but really just trying to be helpful...

I downloaded the WSL2 kernel, poked around, added some additional log entries to the debug console, compiled, and deployed on my host. (https://github.com/microsoft/WSL2-Linux-Kernel/compare/linux-msft-wsl-5.15.y...cmullendore:WSL2-Linux-Kernel:linux-msft-wsl-5.15.y)

It appears that on initial startup, the vm/container tests clock sync appx. every 5 seconds. This is visible in the debug window and in /var/log/syslog.

When the system goes to sleep BEFORE the wsl instance timeout, it seems that the wsl instance is simply paused, with no apparent suspend/resume event. However, this breaks the vmbus connection that is used to actually do the time sync. On resume, the vmbus should be re-opened (https://github.com/microsoft/WSL2-Linux-Kernel/blob/a3f9ed689ace4f44ec5462ab95a59fbf072987ba/drivers/hv/hv_util.c#L676) but either it's not re-opened (this is also visible in that the debug console loses connection but does not resume connection on wake) and/or the actual time sync process loop is not re-initiated.

My best guess is (and again, just trying to be helpful) if there were a mechanism to get the WSL instance kernel to resume the loop to verify/correct the time sync, things would be fixed.

(sent with the best of intentions)

EDIT 2023-10-08: While I still believe the above is true, I've also noticed that the clocksource or (virtual) hardware device never receives suspend/resume events. I added logging to the suspend/resume events in hyperv_timer.c and even after numerous suspend/resumes with the terminal session open, those events never emit log entries. The goal was to see if I could inject a reconnection of the vmbus, but if those events aren't being called at all, it's useless.

Finally, to note: in the Linux kernel documentation, and in both the code and the code comments (https://github.com/cmullendore/WSL2-Linux-Kernel/blob/2782560181946a4183a87e5c995929cfc1eb581e/drivers/clocksource/hyperv_timer.c#L356) in hyperv_timer.c, there is a clearly stated preference to use the TSC for the clock, not the pure MSR. This may be appropriate and correct for Linux in general, but I'm wondering if this is breaking WSL. Specifically, the Hyper-V specs page (https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/timers) says this about using the TSC Partition Timer Enlightenment:

This facility is not intended to be used as a source of wall clock time, since the reference time computed using this facility will appear to stop during the time that a guest partition is saved until the subsequent restore.

This is the exact behavior we're seeing. I'm wondering if the VMP is suspending the WSL container without passing a change in sleep state? The equivalent of clicking "Save" on a Hyper-V VM? This would track all issues... On WSL container save the virtual processors are paused without a notification of sleep state change... The suspension causes the vmbus channel to break/timeout. On resume, again no change in sleep state... vmbus channel isn't re-opened because the resume method isn't called... and so the WSL container knows no difference, doesn't call the vmbus to recheck the time, and when the container attempts to use its virtual hardware clock, the TSC kicks in and returns time ignorant of the suspend time anyway.

I have hacked code (https://github.com/cmullendore/WSL2-Linux-Kernel/blob/2782560181946a4183a87e5c995929cfc1eb581e/drivers/clocksource/hyperv_timer.c#L558) that precludes the use of the TSC (the kernel has a fallback to the MSR) but I'm unable to demonstrate that this fixes the issue because the vmbus channel isn't re-established such that the MSR can even have the opportunity to present updated time. Note that in the hacked kernel, this print statement is never called (https://github.com/microsoft/WSL2-Linux-Kernel/blob/a3f9ed689ace4f44ec5462ab95a59fbf072987ba/drivers/hv/hv_util.c#L646), indicating that the suspend state is never being communicated.

Custom WSL kernel code fork I'm pushing all of my changes to here (https://github.com/cmullendore/WSL2-Linux-Kernel/tree/linux-msft-wsl-5.15.y). I have no intention at this point of pushing any of this code back to the official repo (I would, but I'm not at that level) so I'm not using proper branches. Please don't judge. This is discovery hacking at this point. :)

multizone-uk commented 1 year ago

Sorry to be late. But same as one above so hopefully helpful. Dell XPS 13 9370.

Microsoft Windows [Version 10.0.22621.2283]

C:\>powercfg /a
The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

The following sleep states are not available on this system:
    Standby (S1)
    Standby (S2)
    Standby (S3)

    Hybrid Sleep
        Standby (S3) is not available.
        The hypervisor does not support this standby state.

Clockwork-Muse commented 1 year ago

Was away from machine until now. Dell XPS 15 9500

cmd.exe /c ver

Microsoft Windows [Version 10.0.22621.2283]

powercfg /a
The following sleep states are available on this system:
    Standby (S0 Low Power Idle) Network Connected
    Hibernate
    Fast Startup

The following sleep states are not available on this system:
    Standby (S1)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S2)
        The system firmware does not support this standby state.
        This standby state is disabled when S0 low power idle is supported.

    Standby (S3)
        This standby state is disabled when S0 low power idle is supported.

    Hybrid Sleep
        Standby (S3) is not available.
        The hypervisor does not support this standby state.

I'm not able to test S3 at this time

ghost commented 1 year ago

@cmullendore, Thanks for digging in. I may end up needing more information about your system to see if I can reproduce what you see. When I was triaging, timesync messages continued to be sent after the guest paused/resumed. The problem I observe is that the wrong flag is sent when the guest resumes along with the host, meaning ICTIMESYNCFLAG_SYNC was not being sent. Granted, I'm testing against the most recent internal builds. I'll go back and test the build you're on in a couple of days.

cmullendore commented 1 year ago

@pmartincic Yeah the timesync was happening repeatedly, appx every 5 seconds, until the vmbus disconnected from the suspend. At this point I'm simply trying to trigger an S0 event AT ALL. There doesn't seem to be anything I can do or any features I can flip on that will enable a proper sleep state that might trigger the resume functions. S0 simply never makes it to the container.

I'm having a quirk right now where I've added some logging to the suspend_init areas and I can see the messages that were there before, but not the ones I've added. I'm thinking my build step(s) are missing targets, but I'm not sure... sometimes it picks up the logging, sometimes it doesn't... even after a make clean.

I'm still playing. It's fun hacking and I want to figure this out. 😄

ghost commented 1 year ago

@cmullendore, how are you triggering sleep on the host? I'm not an expert on this. However, because you see S0 on the host, doesn't mean you'll see that on the guest.

cmullendore commented 1 year ago

@pmartincic It may sound cheesy, but I'm just closing my laptop for about 5 minutes. I've confirmed in event viewer that the host OS kernel reports an S0 entry, so at least that much should be good. That said, you could be right... but I'm operating off of the expectation that the sleep state in the host is propagated to the guest(s). In an enlightened OS (and the WSL kernel is using most of the available enlightenments) I believe this is standard behavior for Hyper-V. However, again, you could be right... but if the host OS is in S0 then the guests must either also be in a sleep state (which I'm aspiring for) or they must simply be suspended, which is what I think is causing this issue. Unless the sleep state makes it to the WSL kernel, NONE of the resume functions can/will ever trigger... as they are not triggering now.

superm1 commented 1 year ago

WSL2 by default doesn't have a /sys/power/state; it's not possible to put the WSL2 kernel into suspend without that. TBH - I'm a bit surprised that Modern Standby in the host isn't propagated to WSL2 guests.

jeffska commented 1 year ago

I'm just sitting here waiting to see if any of this has some applicability to #8696. Since Microsoft seems to be actively ignoring it.

cmullendore commented 1 year ago

@superm1 Interestingly, mine does (see attached), and the ACPI module is reporting that my machine (and presumably it) is capable of S0. It's just not making it there.

[Screenshot 2023-10-09 110556]

superm1 commented 1 year ago

huh... weird.

$ ls /sys/power/
ls: cannot access '/sys/power/': No such file or directory
$ uname -r
5.15.90.1-microsoft-standard-WSL2

ghost commented 1 year ago

@jeffska, I doubt it has anything to do with this.

@cmullendore, Preferably with systemd turned off (we have a bug in our log collection, and I'll be missing lines from dmesg). Can you:

  1. with your wsl instance running collect logs using wpr -start vmiccore.wprp -filemode
  2. wait 10 seconds
  3. sleep the host
  4. resume the host
  5. wait ten seconds
  6. wpr -stop log.etl and attach the log file here.
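
If scripting it is easier, the steps above can be sketched in Python roughly like this. The two wpr command lines are the ones from the steps; everything else (function names, the injectable run/wait/prompt parameters, the Enter prompt for the manual sleep and resume) is just an assumption of mine to keep the sketch testable. It will likely need an elevated prompt on the Windows host.

```python
import subprocess
import time


def trace_commands(profile="vmiccore.wprp", output="log.etl"):
    """Return the start and stop command lines from the steps above."""
    start = ["wpr", "-start", profile, "-filemode"]
    stop = ["wpr", "-stop", output]
    return start, stop


def collect_trace(run=subprocess.run, wait=time.sleep, prompt=input,
                  settle_seconds=10):
    """Start the trace, pause, let the user sleep/resume, pause, stop."""
    start, stop = trace_commands()
    run(start, check=True)
    wait(settle_seconds)
    prompt("Sleep the host, resume it, then press Enter... ")
    wait(settle_seconds)
    run(stop, check=True)
```

Calling collect_trace() with the defaults runs the real commands; passing your own run, wait and prompt callables gives a dry run.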

Also, can you give me the output of hcsdiag list. I'll need it to ID the wsl vm from some of the traces.

The wprp file is inside this zip, github wouldn't let me attach *.wprp. vmiccore.zip

mungojam commented 1 year ago

@jeffska, I doubt it has anything to do with this.

I can't find the reference now but I'm sure I was pointed at this issue from that one which is why I've been following this for so long. I can't think I'm the only one.

Good luck with the time offset thing, sounds like you are really close!

I'll struggle on with almost daily high CPU and wsl restarts, hope you can switch focus to that once this timing issue is sorted.

cmullendore commented 1 year ago

@pmartincic Hah... I just remembered that in the WSL kernel the flag indicating support for S0 is hard-set (https://github.com/microsoft/WSL2-Linux-Kernel/blob/fb467c712f32b8e7170899e629576cb2ab439e1a/drivers/acpi/sleep.c#L1072) to 1. It does no checking... It just says "yup". It may do some validation later as part of the init process but the fact that the logging shows S0 as supported is likely useless.

Doing a build now with basically everything power related enabled (just for kicks). Will do the trace and get it back to you. Can you provide an upload location?

cmullendore commented 1 year ago

@pmartincic Sending files as requested separately.

ghost commented 1 year ago

Thanks @cmullendore! Looks like it slept for about 3m30s? What I find most curious about those logs is that vmwp thinks it's still sending the timesync messages, which is the behavior I expect. At least I see the log lines indicating so. I'm very puzzled why you wouldn't be seeing them in the guest.

cmullendore commented 1 year ago

@pmartincic Yeah... I'm thinking I should leave it suspended for longer to verify that the vmbus channel closes and never re-opens. I'm stepping out for a few hours. I'll leave it suspended and see how it goes.

The problem with testing sleep states is that there doesn't appear to be a good way to induce them fully. This'll be my test. I'll run your trace for the duration as well, just for kicks.

cmullendore commented 1 year ago

@pmartincic So the good news is that S0 can be triggered from the host OS and recognized by the WSL kernel as long as the right CONFIG features are enabled. Modern Standby is rather specific about its needs. The bad news is that apparently several drivers in the current WSL kernel will crash the WSL instance if they hit modern standby.

In the current state options may be limited:

I'll push my current "S0 works" config-wsl file to github tomorrow after I clean up my branches/commits and remove unnecessary changes. Obvs. I'd encourage comparing my config-wsl with the current source version and making your own decisions about what might work. I may have some features enabled that are unnecessary... it was hard to find the combination that enabled S0 at all.

Bonus: Apparently MSFT has a utility specifically to help with power testing. :D https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/pwrtest

patricklangsonos commented 1 year ago

As a workaround - could Hyper-V lie and send S3 instead of S0 to the guest? Would that work around the crashing drivers?

cmullendore commented 1 year ago

@patricklangsonos I don’t know if that would be possible. HV only relays the current state the host is about to enter… it would require a change that I’m not sure I’d recommend to cause HV to in effect “lie” to guest OSs. The issue isn’t that Hyper-V is wrong… it’s that the kernel drivers can’t handle S0. It would be better to fix the problem than lie to get around it.

I’m wondering if there are simply lingering updates to the WSL version of the Linux kernel that we haven’t adopted… though I believe “Modern Standby” was introduced by MS. If it’s in the official ACPI spec it’s possible newer drivers have adopted it. If not, there’s little pressure for non-Windows platforms to adopt a windows-only feature. I’m not sure which is true.

And I should be more technically accurate… S0 is the “on” state. We’re talking about “S0 low power” mode. SOME drivers do appear to be adopting the capability or at least avoiding negative impacts of it… but not all. The virtio platform is core to the current Linux driver model. If that doesn’t do it (such as virtio-FS) then nothing will.

I’m NOT saying this can’t be addressed… but the answer may need to be bigger than just MSFT and require a bit more collaboration and support of other kernel contributors.

cmullendore commented 1 year ago

Okay… rethinking given the situation…

In my early attempts I came to the conclusion that the guest not getting the s2idle (S0 low power) state and handling it was what needed to be solved (and this would still be the ideal answer).

However… this was an attempt to trigger the existing code that would re-initiate the vmbus channel on resume. The real issue is simply the loss of the vmbus channel. Maybe the real fix is to simply make the kernel more resilient to vmbus channel loss: when the kernel loses the vmbus channel, it should simply keep attempting to re-establish it regardless. IF this were possible, then when the WSL instance came out of its pause state (notice I didn’t say suspend) the vmbus connection would be lost and it would simply try to re-establish it anyway. Again, my belief is that if the vmbus channel were reopened then the time sync would kick right back in again.

The “right” way would be for the guest kernel and drivers to properly support S0 suspend. But if that’s not possible in current-state, maybe compensating for it would be a workaround with minimal need for non-MS focused demands on the Linux kernel itself. Only HV vmbus would require modification.

Must also say again, I’m talking theory, not definite solution.

Net is temporarily down. Will look into this once it’s back up.

ghost commented 1 year ago

The intent for how this is supposed to work is different. For the time being I don't have any expectation of mirroring power states between the guest and the host. The intent of the current system is as follows: The host sends a message with a special flag indicating that time has jumped. The guest receives the message and does a hard sync. The current bug as I see it, is that the host isn't sending the message with that flag when resuming from S0.*/modern standby; It is actively being discussed how to address this internally.

If you see that the TimeSync messages stop being received on the guest, that is something I want to know about.


diff --git a/drivers/hv/hv_util.c b/drivers/hv/hv_util.c
index 835e6039c186..68acf2f6eb4d 100644
--- a/drivers/hv/hv_util.c
+++ b/drivers/hv/hv_util.c
@@ -383,6 +383,8 @@ static inline void adj_guesttime(u64 hosttime, u64 reftime, u8 adj_flags)

        spin_unlock_irqrestore(&host_ts.lock, flags);

+       pr_info("TimeSync flags %d\n", adj_flags);
+
        /* Schedule work to do do_settimeofday64() */
        if (adj_flags & ICTIMESYNCFLAG_SYNC)
                schedule_work(&adj_time_work);

ghost commented 1 year ago

@cmullendore,

If you build with the diff above and stop seeing the messages being received on the guest, let me know. Thanks!
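
To make the resulting log easier to read, here is a small Python sketch that tallies the "TimeSync flags" lines the diff prints. The flag values (SYNC = 1, SAMPLE = 2) are how I recall them from include/linux/hyperv.h, so treat them as an assumption and check your tree. The sample log lines are made up; only the "TimeSync flags %d" text comes from the diff.

```python
import re

# Assumed values, as recalled from include/linux/hyperv.h; verify locally.
ICTIMESYNCFLAG_SYNC = 1
ICTIMESYNCFLAG_SAMPLE = 2

FLAGS_LINE = re.compile(r"TimeSync flags (\d+)")


def summarize(lines):
    """Count TimeSync messages by whether they would trigger a hard sync."""
    counts = {"sync": 0, "sample": 0, "other": 0}
    for line in lines:
        match = FLAGS_LINE.search(line)
        if not match:
            continue
        flags = int(match.group(1))
        if flags & ICTIMESYNCFLAG_SYNC:
            counts["sync"] += 1
        elif flags & ICTIMESYNCFLAG_SAMPLE:
            counts["sample"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical dmesg excerpt, for illustration only.
example = [
    "[   10.000000] hv_utils: TimeSync flags 1",
    "[   15.000000] hv_utils: TimeSync flags 2",
    "[   20.000000] hv_utils: TimeSync flags 2",
    "[   21.000000] some unrelated kernel message",
]
print(summarize(example))
```

Feeding it the lines of dmesg output captured before and after a sleep should show whether messages keep arriving and whether any of them carry the sync flag.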

cmullendore commented 1 year ago

@pmartincic Totally get it. My only reason for pursuing modern standby was that the messages appeared to be stopping because the vmbus channel broke/closed, and the suspend/resume functions already know how to reestablish that channel, which would have likely solved the problem.

Redirecting now. Net is still down. Gimme a couple of hours.

patricklangsonos commented 1 year ago

@pmartincic - do you happen to have that timesync flag patch in a GitHub branch? It would be easier for me to build & test from there since I don't have the existing WSL kernel repo handy

ghost commented 1 year ago

@patricklangsonos https://github.com/pmartincic/WSL2-Linux-Kernel-1006/tree/bug Or I can attach my built vmlinux if that's easier.