microsoft / WSL


systemd >= 256 needs plain cgroupv2 support #11857

Open Vogtinator opened 1 month ago

Vogtinator commented 1 month ago

Windows Version

Microsoft Windows [Version 10.0.19045.4291] (11 is affected the same way)

WSL Version

2.2.4.0

Are you using WSL 1 or WSL 2?

Kernel Version

5.15.153.1-2

Distro Version

openSUSE Tumbleweed

Other Software

No response

Repro Steps

  1. Install any distro with systemd 256 on WSL 2
  2. wsl --shutdown
  3. wsl systemctl is-system-running

Expected Behavior

wsl systemctl is-system-running should immediately report running.

Actual Behavior

wsl systemctl is-system-running fails with an error like Failed to connect to bus: No such file or directory. Executing wsl systemctl is-system-running again will eventually succeed and return starting and running.

Diagnostic Logs

This is because of https://github.com/systemd/systemd/issues/32998. The host VM uses "hybrid" cgroupv1 support, which systemd >= 256 no longer supports (https://github.com/systemd/systemd/releases/tag/v256-rc3).

As systemd is running in a container here ("wsl"), it warns about this by showing a message for 30s before booting. This delay triggers the /sbin/init failed to start within a 10000ms timeout warning, so the command is executed before systemd is up.
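
As a stopgap on the caller's side, the race described above can be sidestepped by polling until systemd reports running. The following is a minimal Python sketch, not a fix for the underlying problem. The names `probe_wsl` and `wait_until_running` are hypothetical, and the attempt count and delay are arbitrary values chosen to outlast the 30s warning.

```python
import subprocess
import time
from typing import Callable


def probe_wsl() -> str:
    """Run `wsl systemctl is-system-running` and return its trimmed stdout.

    Returns an empty string if the command cannot be run or times out.
    """
    try:
        result = subprocess.run(
            ["wsl", "systemctl", "is-system-running"],
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout.strip()


def wait_until_running(
    probe: Callable[[], str] = probe_wsl,
    attempts: int = 20,
    delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns "running" or the attempts are used up.

    Any other result (an empty string, "starting", ...) counts as not ready.
    """
    for attempt in range(attempts):
        if probe() == "running":
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no sleep after the final attempt
    return False
```

Calling `wait_until_running()` with the defaults polls for up to roughly a minute. The `probe` and `sleep` parameters exist only so the loop can be exercised without a WSL installation.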

github-actions[bot] commented 1 month ago

Logs are required for review from WSL team

If this is a feature request, please reply with '/feature'. If this is a question, reply with '/question'. Otherwise please attach logs by following the instructions below; your issue will not be reviewed unless they are added. These logs will help us understand what is going on in your machine.

How to collect WSL logs

Download and execute [collect-wsl-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-wsl-logs.ps1) in an **administrative powershell prompt**:

```
Invoke-WebRequest -UseBasicParsing "https://raw.githubusercontent.com/microsoft/WSL/master/diagnostics/collect-wsl-logs.ps1" -OutFile collect-wsl-logs.ps1
Set-ExecutionPolicy Bypass -Scope Process -Force
.\collect-wsl-logs.ps1
```

The script will output the path of the log file once done.

If this is a networking issue, please use [collect-networking-logs.ps1](https://github.com/Microsoft/WSL/blob/master/diagnostics/collect-networking-logs.ps1), following the instructions [here](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues)

Once completed please upload the output files to this Github issue.

[Click here for more info on logging](https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#8-collect-wsl-logs-recommended-method)

If you choose to email these logs instead of attaching to the bug, please send them to wsl-gh-logs@microsoft.com with the number of the github issue in the subject, and in the message a link to your comment in the github issue and reply with '/emailed-logs'.


Vogtinator commented 1 month ago

I can't collect logs from my system right now, so for now I'll just refer to the ones from #11739 which has the same cause: https://github.com/user-attachments/files/16076419/WslLogs-2024-07-02_20-47-17.zip

github-actions[bot] commented 1 month ago

Diagnostic information

```
Detected appx version: 2.2.4.0
```

suiryc commented 1 month ago

Very interesting.

By enabling the debug console (debugConsole = true in .wslconfig) I can indeed see the warning:

Legacy cgroup v1 support selected. This is no longer supported. Will proceed anyway after 30s.

This matches the fact that I am seeing systemd do nothing during 30s, then finally really start its services and be visible.

There are some other interesting things to notice. By default, cgroup v1 is used, and we have this issue where systemd does not really start for 30s while WSL does its login job (too soon). Hence a lot of side effects, like /tmp being mounted later and messing with other things. Pointers that cgroup v1 is used in this case:

$ stat -fc %T /sys/fs/cgroup/
tmpfs
$ cat /sys/fs/cgroup/cgroup.controllers
cat: /sys/fs/cgroup/cgroup.controllers: No such file or directory

Now by enabling memory reclaim (in .wslconfig)

[experimental]
autoMemoryReclaim = gradual

Apparently the system is started with cgroup v2.

$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc

In this case, systemd starts immediately (without the warning), and everything works as it did previously (with cgroup v1) in systemd v255.
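
The two checks above (filesystem type and presence of cgroup.controllers) can be folded into a single helper. This is a rough Python sketch: `cgroup_mode` is a hypothetical name, and the check for a `unified` subdirectory assumes the layout systemd conventionally uses for hybrid mode, which was not verified on WSL here.

```python
from pathlib import Path


def cgroup_mode(root: str = "/sys/fs/cgroup") -> str:
    """Classify the cgroup setup found under `root`.

    Returns "unified" (pure cgroup v2), "hybrid", "legacy" (cgroup v1 only),
    or "unknown" if `root` is not a directory.
    """
    base = Path(root)
    if not base.is_dir():
        return "unknown"
    # A pure cgroup v2 mount exposes cgroup.controllers at its root.
    if (base / "cgroup.controllers").is_file():
        return "unified"
    # Assumption: hybrid setups mount cgroup v2 at <root>/unified.
    if (base / "unified" / "cgroup.controllers").is_file():
        return "hybrid"
    return "legacy"


if __name__ == "__main__":
    print(cgroup_mode())
```

With the default WSL configuration described above this would be expected to report a non-unified mode, and "unified" once either .wslconfig change is in place.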

Alternatively, as commented in issue https://github.com/microsoft/WSL/issues/6662#issuecomment-2002407717, setting the following kernel parameters (again in .wslconfig) also makes WSL start with cgroup v2:

[wsl2]
kernelCommandLine = cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1

So there are two ways to have WSL+systemd v256 work correctly by having WSL use cgroup v2. The question is whether those are to be considered proper solutions, or only workarounds until something more automatic is done in WSL to address this issue.
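
For anyone who wants to script the kernel command line workaround, here is a small Python sketch that merges the two parameters into existing .wslconfig text. `with_cgroup_v2` is a hypothetical helper name, and treating .wslconfig as something `configparser` can round-trip is an assumption: in particular, comments in the file are dropped when it is rewritten.

```python
import configparser
import io

# The two parameters quoted from the .wslconfig snippet above.
CGROUP_V2_ARGS = ["cgroup_no_v1=all", "systemd.unified_cgroup_hierarchy=1"]


def with_cgroup_v2(config_text: str) -> str:
    """Return `config_text` with the cgroup v2 parameters added to
    [wsl2] kernelCommandLine, keeping any parameters already present."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. kernelCommandLine
    parser.read_string(config_text)

    if not parser.has_section("wsl2"):
        parser.add_section("wsl2")

    existing = parser.get("wsl2", "kernelCommandLine", fallback="").split()
    merged = existing + [arg for arg in CGROUP_V2_ARGS if arg not in existing]
    parser.set("wsl2", "kernelCommandLine", " ".join(merged))

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

The function only transforms text; reading and writing the actual .wslconfig file, and running wsl --shutdown afterwards so the change takes effect, is left to the caller.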

Vogtinator commented 1 month ago

> So there are two ways to have WSL+systemd v256 work correctly by having WSL use cgroup v2. The question is whether those are to be considered proper solutions, or only workarounds until something more automatic is done in WSL to address this issue.

Yeah. If there's nothing or almost nothing left that requires cgroupv1, then just switching to cgroupv2 fully by default is the proper solution.

Otherwise it might need some new parameter in wsl.conf that specifies whether a distro needs cgroupv1 or v2. I'm not sure whether it's possible to have a cgroupv1 container next to a cgroupv2 container with cgroup namespaces though.

WH-2099 commented 3 weeks ago

I found something more. The systemd compatibility layer in the WSL2 kernel has some problems in determining the first boot. Adding either systemd.firstboot=false or systemd.condition-first-boot=false to the kernel command line did not prevent systemd-firstboot.service from starting. In fact, based on the results of systemd-analyze condition 'ConditionFirstBoot=true', the kernel doesn't seem to be handling the relevant parameters correctly.

Vogtinator commented 3 weeks ago

The Linux kernel doesn't do anything with systemd parameters. systemd just looks at /proc/cmdline.
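
To see which systemd parameters actually reached the command line, one can inspect /proc/cmdline directly. This is a simplified Python sketch: `systemd_cmdline_params` is a hypothetical name, and the whitespace split ignores quoting and does not reproduce systemd's own parsing rules.

```python
def systemd_cmdline_params(cmdline: str) -> dict[str, str]:
    """Extract `systemd.*` parameters from a kernel command line string.

    Parameters without `=value` map to an empty string.
    """
    params: dict[str, str] = {}
    for token in cmdline.split():
        if not token.startswith("systemd."):
            continue
        key, _, value = token.partition("=")
        params[key] = value
    return params


if __name__ == "__main__":
    with open("/proc/cmdline", encoding="utf-8") as handle:
        print(systemd_cmdline_params(handle.read()))
```

If a parameter set through kernelCommandLine in .wslconfig does not appear in this output, it never reached systemd in the first place.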

In any case, this is unrelated to the issue here.