microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License
17.29k stars 814 forks source link

cloud-init aborted by WSL 'watchdog' #11602

Open thielj opened 4 months ago

thielj commented 4 months ago

Windows Version

Microsoft Windows [Version 10.0.22631.3593]

WSL Version

2.2.4.0 (also tested with v2.1.5 and v2.0.5)

Are you using WSL 1 or WSL 2?

Kernel Version

5.15.153.1-microsoft-standard-WSL2

Distro Version

Ubuntu 24.04

Other Software

I uninstalled anything that could possibly interfer

Repro Steps

I made sure .wslconfig doesn't exist.

Create a minimalist .cloud-init/default.user-data as follows:

#cloud-config

runcmd:
  - echo RUNCMD - Going to sleep 60s
  - sleep 60s
  - echo RUNCMD - Wakey wakey 

Run wsl --install Ubuntu-24.04. When prompted for the username, pause for about 30 seconds. Then continue as usual.

Expected Behavior

I would expect that cloud-init would be able to complete an operation that takes longer than just a few seconds (e.g. installing a larger number of packages, which is how I noticed this first).

Actual Behavior

Some kind of watchdog kicks in 10s after the machine is started, logs in as root and executes a shutdown about 15s later, without letting cloud-init finish and leaving the system in an undefined state.

Diagnostic Logs

When logged in, run journalctl. You'll find entries like the following (some lines ~~~ omitted):

May 20 21:46:24 NEUROMANCER kernel: Linux version 5.15.153.1-microsoft-standard-WSL2 (root@941d701f84f1) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Fri Mar 29 23:14:13 UTC 2024
~~~
May 20 21:46:24 NEUROMANCER cloud-init[157]: Cloud-init v. 24.1.3-0ubuntu3 running 'init-local' at Mon, 20 May 2024 20:46:24 +0000. Up 11.57 seconds.
~~~
May 20 21:46:25 NEUROMANCER cloud-init[173]: Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Mon, 20 May 2024 20:46:25 +0000. Up 12.02 seconds.
~~~
May 20 21:46:27 NEUROMANCER cloud-init[379]: Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Mon, 20 May 2024 20:46:27 +0000. Up 14.05 seconds.
May 20 21:46:27 NEUROMANCER cloud-init[379]: RUNCMD - Going to sleep 60s
~~~
May 20 21:46:33 NEUROMANCER unknown: WSL (2) ERROR: WaitForBootProcess:3342: /sbin/init failed to start within 10000
May 20 21:46:33 NEUROMANCER unknown: ms
May 20 21:46:33 NEUROMANCER unknown: WSL (2): Creating login session for root
~~~
May 20 21:46:34 NEUROMANCER login[518]: ROOT LOGIN  on '/dev/pts/1'
~~~
May 20 21:46:49 NEUROMANCER unknown:
May 20 21:47:00 NEUROMANCER kernel: Linux version 5.15.153.1-microsoft-standard-WSL2 (root@941d701f84f1) (gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.37) #1 SMP Fri Mar 29 23:14:13 UTC 2024

The last two lines above are the last logged message and the first message once the machine is started again.

The shutdown (21:46:49) was most likely initiated by whatever the watchdog's root login was doing.

The start (21:48:00) was initiated interactively by entering the default user credentials.

github-actions[bot] commented 4 months ago

Logs are required for review from WSL team

If this a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise please attach logs by following the instructions below, your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs Download and execute [collect-wsl-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-wsl-logs.ps1) in an **administrative powershell prompt**: ``` Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1 Set-ExecutionPolicy Bypass -Scope Process -Force .\collect-wsl-logs.ps1 ``` The scipt will output the path of the log file once done. Once completed please upload the output files to this Github issue. [Click here for more info on logging](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#8-collect-wsl-logs-recommended-method) If you choose to email these logs instead of attaching to the bug, please send them to wsl-gh-logs@microsoft.com with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

thielj commented 4 months ago

The relevant log excerpts have already been included. Happy to provide additional information if necessary.

Maybe it's worth mentioning that setting vmIdleTimeout to -1 or a sufficient large value has no effect.

github-actions[bot] commented 4 months ago

Logs are required for review from WSL team

If this a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise please attach logs by following the instructions below, your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs Download and execute [collect-wsl-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-wsl-logs.ps1) in an **administrative powershell prompt**: ``` Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1 Set-ExecutionPolicy Bypass -Scope Process -Force .\collect-wsl-logs.ps1 ``` The scipt will output the path of the log file once done. Once completed please upload the output files to this Github issue. [Click here for more info on logging](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#8-collect-wsl-logs-recommended-method) If you choose to email these logs instead of attaching to the bug, please send them to wsl-gh-logs@microsoft.com with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.

View similar issues

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

thielj commented 4 months ago

WslLogs-2024-05-20_22-18-51.zip

github-actions[bot] commented 4 months ago
Diagnostic information ``` Detected appx version: 2.2.4.0 ```
OneBlue commented 4 months ago

Thank you @thielj.

Based on the symptoms, I wonder if the issue is that the ubuntu distribution installer simply shuts down the WSL distribution after creating the user.

Do you see the same symptoms with let's say debian ?

thielj commented 4 months ago

@OneBlue Hi. Please note that the shutdown is initiated by the '10000ms watchdog' before user credentials are entered (21:46:49). It's not automatically restarting after this.

Only when you provide the default user name and hit enter the instance/distro is brought up again (21:47:00). Just wait long enough (e.g. 30 seconds) when prompted to enter the default username and you will see. The default user is created after this, and the instance/distro is then shutdown a second time before the default user is logged in.

I naturally tried Debian but it comes without cloud-init. I haven't investigated how to roll my own cloud-init enabled WSL distro but would think that with the observed '10000ms watchdog' this would be a waste of time. Even the standard 'apt upgrade' module wouldn't be safe to run.

thielj commented 4 months ago

@OneBlue Maybe this helps to determine if cloud-init is still running?

https://cloudinit.readthedocs.io/en/latest/howto/status.html

OneBlue commented 4 months ago

@thielj: I wonder if what you're experiencing is the normal VM idle time out if there's no activity. Let's try something: Can you put:

[wsl2]
vmIdleTimeout=-1

in .wslconfig and see if that the solves the issue for you ?

thielj commented 4 months ago

I tried that already, and it's not. Sorry, I thought I had mentioned it.

On Tue, May 28, 2024, 19:48 Blue @.***> wrote:

@thielj https://github.com/thielj: I wonder if what you're experiencing is the normal VM idle time out if there's no activity. Let's try something: Can you put:

[wsl2] vmIdleTimeout=-1

in .wslconfig and see if that the solves the issue for you ?

— Reply to this email directly, view it on GitHub https://github.com/microsoft/WSL/issues/11602#issuecomment-2135902133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA6CUVLJLACGS4TSUIZ7OSLZETGQ3AVCNFSM6AAAAABIAMLEYOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZVHEYDEMJTGM . You are receiving this because you were mentioned.Message ID: @.***>

OneBlue commented 4 months ago

Ok interesting. Could you capture logs with that .wslconfig value set ?

/logs

thielj commented 4 months ago

@OneBlue It was mentioned here: https://github.com/microsoft/WSL/issues/11602#issuecomment-2121227086

Neither -1 or a much longer timeout made any difference at all. Neither did anything else that looked even remotely related in .wslconfig. It's also readily reproduceable on my laptop (with a 900000 timeout). I would be surprised to hear from anyone who can't reproduce this.

And it seems I just discovered a workaround: If I log into the distro from another terminal window, during the installation, as soon as this is possible, WaitForBootProcess still chimes in, the typical root login happens -- but the shutdown does not and cloud-init completes as expected. Just having an interactive shell open seems to prevent it.

Maybe the following is related. The actual shutdown in my case also happens almost exactly 15s after WaitForBootProcess comes in. https://github.com/microsoft/WSL/discussions/8659

The first timer is how long a WSL instance is idle before it is terminated. Think of this as a wsl --terminate that occurs when there are no more running processes in it other than init. This is non-configurable, but https://github.com/microsoft/WSL/issues/6782 is a feature request for the ability to configure it. As far as I can tell, this is hard-coded at about 15 seconds.

Have you looked what is triggering your WaitForBootProcess code? It is clearly setup with a fixed 10000ms timeout and then logs in as root. I don't see this error in the logs after the first/--install start.

If it's supposed to detect a running cloud-init: maybe it would be worth looking into that bit? And if not, my guess is that a service like cloud-init doesn't count as 'running process other than init'. A simple command -v cloud-init && cloud-init status --wait before initiating the shutdown would allow cloud-init to complete and solve the issue.

thielj commented 4 months ago

@OneBlue And another update regarding that workaround: Looking at the debug console for cloud-init output or counting to 15 and then starting wsl ~ cloud-init status --wait in another terminal does indeed work. Once you try to do something more than just sleeping, there's another problem though: /etc/resolv.conf disappears, roughly around the time cloud-init is running apt upgrade. It's probably triggered by your 10000ms watchdog. So sometimes the upgrade runs through, sometimes it fails. Additional packages installed after that always fail (unable to resolve archive.ubuntu.com). It's definitely no problem with my network and is also independent of networkingMode or dnsTunneling.

I can workaround this by replacing the symlink with an immutable copy of /mnt/wsl/resolv.conf very early in the boot process, like so:

#cloud-config

bootcmd: 
  - chattr -i /etc/resolv.conf
  - rm -f /etc/resolv.conf
  - cp /mnt/wsl/resolv.conf /etc/resolv.conf
  - chattr +i /etc/resolv.conf

package_update: true
package_upgrade: true
packages: [ btop ]

Your ConfigCreateResolvConfSymlink isn't happy about that, which makes me think it might also be involved with its disappearance:

[   25.822086] WSL (2) ERROR: WaitForBootProcess:3342: /sbin/init failed to start within 10000
[   25.823081] WSL (2) ERROR: ConfigCreateResolvConfSymlink:1741: symlink(/mnt/wsl/resolv.conf
[   25.833860] WSL (2) ERROR: ConfigCreateResolvConfSymlink:1741: symlink(/mnt/wsl/resolv.conf

I tried generateResolvConf=false to no avail. The symlink would still disappear. I suppose /etc/wsl.conf is read before the distro is launched?