microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

Kernel command line parameter kernelCommandLine=systemd.unified_cgroup_hierarchy=1 results in creation of cgroup V1 and V2 hierarchy. It is prohibited now, #6662

Open PavelSosin-320 opened 3 years ago

PavelSosin-320 commented 3 years ago

Environment

Microsoft Windows [Version 10.0.21327.1010]
(c) Microsoft Corporation. All rights reserved.
Fedora 34 self-installed, but Ubuntu behaves the same way. This is a kernel issue.
WSL2
WSL Kernel Linux MSI-wsl 5.4.91-microsoft-standard-WSL2

Steps to reproduce

  1. Install a systemd-based distro such as Ubuntu or fedora33remix.
  2. Edit .wslconfig and add kernelCommandLine=systemd.unified_cgroup_hierarchy=1.
  3. Start and attach to the running distro using WT.
  4. Run ls /sys/fs/cgroup/. Both /sys/fs/cgroup/systemd (V1 hierarchy) and /sys/fs/cgroup/unified (V2 hierarchy) are present, and /sys/fs/cgroup/ is polluted with cgroup controllers. The systemd.unified_cgroup_hierarchy=1 parameter is misinterpreted.
  5. Install any recent OCI runtime version (runC, crun), the Docker 20.10 daemon, or Podman 3.
  6. Run podman info (or the equivalent for the other tools). The unified cgroup hierarchy is not recognized and cgroup V1 is reported because /sys/fs/cgroup/systemd is present.

Only the cgroup V2 hierarchy is built, because the "mixed" setup has been prohibited as a dead end. Recent runC (Docker 20.10) and crun releases have switched to supporting cgroup V2. It is necessary for rootless user mode, so it is important for WSL users. Conversion between mixed mode and cgroup V2 is no longer supported, for the reasons mentioned above.

WSL logs:

Expected behavior

Only cgroup V2 hierarchy is created: /sys/fs/cgroup/unified/ and all controllers are put into the correct place.

Actual behavior

/sys/fs/cgroup is polluted with random content such as controllers and a systemd folder:

$ ls /sys/fs/cgroup
blkio cpu,cpuacct cpuset freezer memory net_cls,net_prio perf_event rdma unified cpu cpuacct devices hugetlb net_cls net_prio pids systemd

Please correct this so that Docker and Podman can be upgraded to recent releases and work as a rootless user. This is also a security issue because the WSL root user has unlimited access to the Windows Program Files and ProgramData directories, i.e. can inject any malicious executable into Windows and run it as MyVirus.exe.

WSLUser commented 3 years ago

WSL root user has unlimited access to the Windows Program Files and ProgramData directories, i.e. can inject any malicious executable into Windows and run it as MyVirus.exe

Not true at all. The WSL root user has the same access as a normal Windows user. Go ahead and navigate to C:\Windows\System32 and try replacing one of the executables from within WSL2. It will fail.

benhillis commented 3 years ago

WSL does not use systemd so that setting is not being respected.

PavelSosin-320 commented 3 years ago

@benhillis The WSL kernel may not support systemd; it is a separate module that can be supported by third-party software. But the WSL kernel creates neither the cgroup V1 nor the V2 hierarchy correctly and fails with:

[ 1.666152] cgroup1: Need name or subsystem set
[ 1.675386] ERROR: Mount:2486: mount((null), /sys/fs/cgroup/memory, cgroup, 0x20000e, memory
[ 1.675389] ) failed 22

The distros never reach the running state. The cost is further networking issues.

WSLUser commented 3 years ago

If you enable systemd yourself through something like genie and set it up to boot with that running first, do you still experience the issue?

PavelSosin-320 commented 3 years ago

@WSLUser After all upgrades to kernel 5.10.16.3-microsoft-standard-WSL2 and genie, the issue is still not solved. Cgroup management and systemd are tightly coupled, and the kernel parameter is called systemd.unified_cgroup_hierarchy by the Linux kernel authors. If the WSL kernel doesn't support systemd by itself, then I assume the parameter must be called simply unified_cgroup_hierarchy and result in the creation of only the unified cgroup hierarchy, without polluting other filesystems. Unfortunately, it doesn't work. I'm afraid that the entire kernelCommandLine property of the .wslconfig file is ignored. I always see the same command line:

initrd=\initrd.img panic=-1 nr_cpus=2 swiotlb=force console=ttyS0,115200 debug pty.legacy_count=0

regardless of how I pass unified_cgroup_hierarchy: with or without systemd, with or without quotes, etc.

WSLUser commented 3 years ago

Ok, I'm going to ask you to do a couple things. First, set up systemd using https://github.com/shayne/wsl2-hacks and modify from script improvements shown in https://github.com/shayne/wsl2-hacks/issues/7. Then compile the 5.10 WSL2 kernel using https://github.com/microsoft/WSL2-Linux-Kernel/pull/245 for the config. Use https://wsl.dev/wsl2-kernel-zfs/ for steps. Once you restart your distro, do you still experience the original issue (docker unable to use cgroupsv2)?

benhillis commented 3 years ago

WSL doesn't run systemd, so expecting any of the systemd options to be honored will not work.

PavelSosin-320 commented 3 years ago

I'm working in a different context: running the latest released Podman version on top of the Fedora 34 remix distro built by Whitewater Foundry. Systemd functionality is provided by systemd-genie and works very reliably, including cgroup management, user management, and session management for both root and rootless users. I'm testing both scenarios side by side because Podman provides almost equal functionality, with some minor restrictions for rootless users, mainly in networking and volume binding.

I achieved almost full feature parity in both modes with one exception: when Podman detects the cgroup V1 hierarchy in rootless mode, it falls back to cgroupfs, because systemd doesn't allow the mixed, backward-compatibility mode and the systemd version used in Fedora 34 doesn't contain a converter. The systemd version 226 uses unified hierarchy by default. The current releases of Ubuntu, Fedora, Docker and Podman all use the unified hierarchy, so the cgroup V1 hierarchy is simply unexpected and misleads Podman. According to the error messages that I see in the ConsoleLog, it is also not created correctly; attempts to create symbolic links for controllers result in errors:

Failed to create symlink /sys/fs/cgroup/cpuacct: File exists
Failed to create symlink /sys/fs/cgroup/cpu: File exists
Failed to create symlink /sys/fs/cgroup/net_prio: File exists
Failed to create symlink /sys/fs/cgroup/net_cls: File exists

This is a very specific bug in the WSL kernel cgroup implementation. I don't believe that rebuilding the kernel without a bug fix will help. I'm surprised to see that the person who wrote this code for WSL is today working on systemd.

cerebrate commented 3 years ago

Your problem isn't with systemd not seeing the option, because it is passed through: your problem is that systemd isn't the first (and can't be, because of the above-all-distros namespace) init, so by the time systemd gets its hands on it, cgroups have long been initialized.

More specifically, if you set the kernel command-line option cgroup_no_v1=all to try and force it by disabling the controllers for cgroups v1, the following happens at the end of boot:

[    4.424900] Run /init as init process
[    4.431087]   with arguments:
[    4.436243]     /init
[    4.440956]   with environment:
[    4.446506]     HOME=/
[    4.450957]     TERM=linux
[    4.457219] cgroup: Need name or subsystem set
[    4.464697] ERROR: Mount:2513: mount((null), /sys/fs/cgroup/memory, cgroup, 0x20000e, memory
[    4.464700] ) failed 22
[    4.483575] kvm: exiting hardware virtualization
[    4.493039] ACPI: Preparing to enter system sleep state S5
[    4.502509] reboot: Power down
[    4.527725] acpi_power_off called

i.e., it looks like the Microsoft init is making use of v1 memory cgroups, so it doesn't look like you can get to a unified cgroups v2-only hierarchy unless and until that changes.

PavelSosin-320 commented 3 years ago

@cerebrate I totally agree with you, because the mixed hierarchy is created regardless of genie usage. All error messages appear before the distro's banner message and the first systemd message. Some parts of the WSL kernel code were written 2 years ago, long before the adoption of the unified hierarchy by Linux distros and OCI runtimes. There were no real consumers for the unified hierarchy. The backward-compatibility mode was required by runC and by Docker Desktop for Windows, which was based on the old Docker CE version. Docker needs cgroup V2 starting from the 20-xx version. Unfortunately, when I looked into the WSL kernel repository I found that the person who wrote the initial code version has migrated to systemd.io, the opposite camp.

PavelSosin-320 commented 3 years ago

@WSLUser After all upgrades to 5.10.16.3-microsoft-standard-WSL2 and genie 1.40 I am still stuck with this issue: although the parameter systemd.unified_cgroup_hierarchy is passed and accepted by the WSL kernel, the kernel insists on creating or re-populating the cgroup V1 hierarchy and creating the unified one as well. The log shows:

Failed to create symlink /sys/fs/cgroup/net_prio: File exists
Failed to create symlink /sys/fs/cgroup/net_cls: File exists
Failed to create symlink /sys/fs/cgroup/cpu: File exists
Failed to create symlink /sys/fs/cgroup/cpuacct: File exists

I'm now passing the parameter unified_cgroup_hierarchy without systemd. ... It looks like the entire kernel command line option is ignored:

[ 0.000000] Command line: initrd=\initrd.img panic=-1 nr_cpus=2 swiotlb=force console=ttyS0,115200 debug pty.legacy_count=0

With kernelCommandLine=cgroup_no_v1\=all no cgroup hierarchy is created, neither V1 nor V2. I suppose that when the kernel was built 2 years ago, without the FUSE support present in 5.10, MS had problems with mounting the unified cgroup FS. It is impossible today too unless mount.fuse is used, but this package is installed only as an OCI runtime dependency.

WSLUser commented 3 years ago

I would repurpose this issue to fix the proprietary init to support v2, as cerebrate has pointed out that this is what blocks cgroups v2 support. Of course it's also possible that systemd is eventually adopted instead of the init, but that's an already known feature request.

cerebrate commented 3 years ago

@PavelSosin-320

The kernel doesn't do anything with the systemd.* command-line parameters, though, because they aren't kernel parameters. (As you can see, they don't show up in /proc/sys/kernel or the output of sysctl -a.) If you check bootparam(7), what you will see is this:

Any remaining arguments that were not picked up by the kernel and were not interpreted as environment variables are then passed onto PID 1, which is usually the init(1) program. The most common argument that is passed to the init process is the word 'single' which instructs it to boot the computer in single user mode, and not launch all the usual daemons. Check the manual page for the version of init(1) installed on your system to see what arguments it accepts.

Those parameters only do anything because they're passed on to the init(1) launched by the kernel at the top-non-containerized-level, and require that it be systemd to do anything. Which it isn't, so they don't.

(Now, if someone had a lot of time on their hands, they could modify genie so that it pulled the initial kernel command line out of the /proc/cmdline file, parsed out all the systemd.* parameters, and passed them on to the systemd it spawns inside the bottle.

That wouldn't solve this particular issue, since the cgroup hierarchy is already established by the time genie can start its containerized systemd, and the rest of the potential use cases are obscure enough that it's down on my dogwash-priority list. But hey, if anyone wants to implement it and PR me, they can go right ahead.)
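As a rough illustration of the parsing step described above (not genie's actual code), here is a minimal Python sketch that pulls the systemd.* options out of a kernel command line string. The function name systemd_params is hypothetical, and the sketch assumes plain whitespace-separated tokens, so quoted values containing spaces are not handled.

```python
def systemd_params(cmdline: str) -> dict:
    """Return the systemd.* options found on a kernel command line.

    Hypothetical helper for illustration only. Tokens are split on
    whitespace; quoted values with embedded spaces are not supported.
    """
    params = {}
    for token in cmdline.split():
        key, _sep, value = token.partition("=")
        if key.startswith("systemd."):
            # A bare switch (no "=") is stored with an empty value.
            params[key] = value
    return params


# Command line taken from the thread, with the two extra options appended.
example = (
    "initrd=\\initrd.img panic=-1 nr_cpus=2 swiotlb=force "
    "console=ttyS0,115200 debug pty.legacy_count=0 "
    "systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all"
)
print(systemd_params(example))
# prints {'systemd.unified_cgroup_hierarchy': '1'}
```

On a running system the input would come from reading /proc/cmdline. Note that cgroup_no_v1=all is deliberately not returned, since it is a kernel parameter rather than a systemd one.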

lightmelodies commented 3 years ago

After a few tries, I got cgroup v2 working with the following steps:

  1. Add kernelCommandLine=systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all in .wslconfig.
  2. Add cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0 in fstab.
  3. Run sudo mount -a (This step is important before you start systemd.)
  4. Now start systemd (I am using subsystemctl, but genie should work similarly).

dmesg will show some errors since we set cgroup_no_v1=all; just ignore them.

[    7.414936] cgroup: Disabled controller 'cpuset'
[    7.414942] init: (1) ERROR: ConfigInitializeCgroups:1685: mount cgroup cpuset failed 22
[    7.414949] cgroup: Disabled controller 'cpu'
[    7.414953] init: (1) ERROR: ConfigInitializeCgroups:1685: mount cgroup cpu failed 22
[    7.414959] cgroup: Disabled controller 'cpuacct' 

Run ls /sys/fs/cgroup. The cgroup v2 controllers should be correctly created by systemd.

cgroup.controllers
cgroup.max.depth
cgroup.max.descendants
cgroup.procs
cgroup.stat
cgroup.subtree_control
cgroup.threads
cpuset.cpus.effective
cpuset.mems.effective
cpu.stat
dev-hugepages.mount
dev-mqueue.mount
init.scope
io.stat
memory.stat
sys-fs-fuse-connections.mount
sys-kernel-debug.mount
sys-kernel-tracing.mount
system.slice
user.slice

Also check with docker info

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8c906ff108ac28da23f69cc7b74f8e7a470d1df0.m
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.0-sukasuka-kernel+
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 11.7GiB
 Name: canoe
 ID: ILRY:4MYO:R7F5:2KLA:7TQG:A3PR:D4HL:SY37:5Z7I:JE26:BGPK:HS6E
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

cerebrate commented 3 years ago

@lightmelodies Interesting.

I have tried that my own self, but all I get is the crash at the end of kernel boot mentioned above (https://github.com/microsoft/WSL/issues/6662#issuecomment-833874337).

Can I ask what Windows build you're on, and whether you're using a custom kernel? (And if so, please send .config file?)

cerebrate commented 3 years ago

Oh, wait.

Add kernelCommandLine=systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all in wsl.config.

You put this in /etc/wsl.conf inside the distro, not ~\.wslconfig in Windows? That would be a no-op, since kernelCommandLine only exists as an option in the latter, and would explain why you don't get the crash.

And, curiously enough, I can duplicate your results by just mounting the cgroup2 hierarchy over /sys/fs/cgroup before systemd starts. This doesn't disable cgroups v1 in the kernel (as you can confirm by firing up a second distribution and looking at its /sys/fs/cgroup) or stop its hierarchy from being created/used by earlier processes, but mounting the cgroup2 hierarchy over the hybrid cgroup hierarchy does convince the bottle-container systemd and its children that they should operate in unified mode, not hybrid mode.

I'll leave it up to someone with more cgroups knowledge than me to say whether or not this is actually useful in non-cosmetic ways (or whether it solves @PavelSosin-320's problem). I'm not averse to adding a "unified-cgroups" option to genie to enable this automagically, but I'd prefer to know if it's actually useful first.

cerebrate commented 3 years ago

As a side note, in retrospect, having both wsl.conf and .wslconfig existing with disparate functions seems like a bit of a naming oops, what?

lightmelodies commented 3 years ago

As a side note, in retrospect, having both wsl.conf and .wslconfig existing with disparate functions seems like a bit of a naming oops, what?

Sorry for the mistyping; just set kernelCommandLine in .wslconfig. I am still using Windows build 19402 with a custom 5.12 kernel, but the default 5.4.72 kernel also works. Maybe some change in the Insider builds added a check in the init process that fails when cgroup v1 is disabled.

cerebrate commented 3 years ago

@lightmelodies Ah, right. Guess so, then, since on the current dev build cgroup_no_v1 reliably breaks init with both the stock and my custom kernel.

I am curious, though - if you don't set the kernelCommandLine, but you do mount the cgroup2 fs, does it behave any differently?

That seems to get systemd et al. running in unified mode for me even without the kernel part.

lightmelodies commented 3 years ago

@lightmelodies Ah, right. Guess so, then, since on the current dev build cgroup_no_v1 reliably breaks init with both the stock and my custom kernel.

I am curious, though - if you don't set the kernelCommandLine, but you do mount the cgroup2 fs, does it behave any differently?

That seems to get systemd et al. running in unified mode for me even without the kernel part.

If I don't set the cgroup_no_v1=all , I can still mount cgroup v2 over /sys/fs/cgroup using fstab, but mount -l will show cgroup v1 mount as well.

cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)  

docker info also shows Cgroup Version: 2, but with the following warnings, so I think it does not really work.

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support
WARNING: No cpuset support
WARNING: No io.weight support
WARNING: No io.weight (per device) support
WARNING: No io.max (rbps) support
WARNING: No io.max (wbps) support
WARNING: No io.max (riops) support
WARNING: No io.max (wiops) support    

systemd.unified_cgroup_hierarchy=1 seems to be unnecessary in both cases.
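To make the two situations above easier to tell apart, here is a minimal Python sketch that classifies the cgroup layout from mount output. The function name cgroup_layout and the labels it returns are made up for illustration; the logic only looks at what is mounted on /sys/fs/cgroup last and whether any v1 (type cgroup) mounts remain underneath.

```python
import re

# Matches lines such as:
#   cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,...)
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)\s*$")


def cgroup_layout(mount_output: str) -> str:
    """Classify the cgroup layout seen in `mount` output (illustrative labels)."""
    v1_mounts = []
    root_types = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match:
            continue
        _source, target, fstype, _options = match.groups()
        if target == "/sys/fs/cgroup":
            root_types.append(fstype)
        elif target.startswith("/sys/fs/cgroup/") and fstype == "cgroup":
            v1_mounts.append(target)
    # The last filesystem mounted on /sys/fs/cgroup is the visible one.
    top = root_types[-1] if root_types else None
    if top == "cgroup2":
        return "unified on top, v1 still mounted" if v1_mounts else "unified"
    if v1_mounts:
        return "hybrid or legacy (v1)"
    return "unknown"


# Shortened version of the mount -l output quoted above.
sample = """\
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
"""
print(cgroup_layout(sample))
# prints: unified on top, v1 still mounted
```

A result of "unified on top, v1 still mounted" corresponds to the case where docker info reports Cgroup Version: 2 but warns about missing controller support.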

PavelSosin-320 commented 3 years ago

I tested the systemd boot process by executing systemctl daemon-reload and systemctl daemon-reexec and found that the result is absolutely stable, exactly like initialization of the distro using genie. Since daemon-reload and daemon-reexec don't involve systemd-genie, I don't see any reason to change anything in genie. Both the code of the cgroup_v1 package in the Microsoft WSL2 kernel and the cgroup redesign and re-implementation are the result of the efforts of the same person, https://github.com/htejun, a few months apart. The problem is purely political: after releasing WSL2 more than 1 year ago, MS insists that systemd is redundant, but cgroup v2 is useless without systemd. Some WSL distros published and sold via the MS Store are useless as WSL1 (BASH) distros with systemd. Actually, all Linux systemd-based distros published in the MS Store as LTS or dev releases are junk that isn't worth a penny unless systemd is somehow added. Now MS is trying to avoid using the Linux kernel standard that is governed by Red Hat, exactly as it did with JScript some years ago. Everybody knows what happened later. What is kernel 5.10.16.3-microsoft-standard-WSL2? Is it 5.10 or not? Is it 5.10 with unpublished restrictions and known bugs? Where is the "Microsoft Linux kernel standard" published?

cerebrate commented 3 years ago

I checked. There is exactly one difference between cgroup-v1.c in the latest Microsoft WSL kernel and the canonical straight-from-Linus's-own-repo 5.10 release, and that difference is a non-Microsoft patch that was added to said canonical kernel in 5.11-rc3.

I am as eager as the next chap to see all these things made to work, but before we go making allegations, please remember that diff is your friend.

cerebrate commented 3 years ago

@lightmelodies Makes sense. Seems like there's not a lot of point in adding support for that way of doing things, then.

Thanks for testing it for me.

nunix commented 3 years ago

oh a rootless discussion 😄

I managed to get NERDctl fully working with ContainerD in rootless mode (writing the blog now); however, it works with cgroup v1 and this "hybrid" mode.

Also, please note that .wslconfig impacts the underlying VM for WSL2 itself. When you look at the boot process, our distros are run on top of it (with kvm it seems). And not all Linux distros use SystemD (I'm not defending anything here, just laying out facts 😉 ), so what would be neat is to have an extra "Kernel" option inside wsl.conf.

I will try the cgroup_no_v1=all setting, as I had already the systemd.unified_cgroup_hierarchy=1 and as @benhillis said, it's not honored.

PS: I'm switching to kernel 5.13, but for a "nice" rootless experience, you might want to jump to 5.11 at least, as it's where fuse-overlayfs is implemented (and 5.12 has the rootless mount capabilities).

Looking forward to your tests 😄

nunix commented 3 years ago

So, after some testing, I could get cgroups V2 working somehow (see screenshot below with podman)

There are still manual steps to perform, but here is, in a nutshell, what I've done:

While this should work, the cgroup mount generates errors when we try to write inside it, so for podman I set the pids_limit to 0 (as explained here: https://access.redhat.com/solutions/5913671)

[screenshot of podman output]

Hope this provides some additional hints

lightmelodies commented 3 years ago

The kernel doc says

All controllers which support v2 and are not bound to a v1 hierarchy are automatically bound to the v2 hierarchy and show up at the root.

So while we can manually umount cgroup v1 and then mount cgroup v2 to make systemd work in unified mode, no v2 controllers are available because they are already bound to v1. That's why docker shows such warnings. Unfortunately I cannot find a way to disable v1 controllers dynamically without cgroup_no_v1=all.
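One way to see which controllers are held by v1 is the /proc/cgroups table. The following is a minimal Python sketch under the assumption, taken from cgroups(7), that a hierarchy ID of 0 means the controller is not mounted on a v1 hierarchy (it is either on the unified hierarchy, unbound, or disabled). The function name and the sample rows are illustrative, not output captured from WSL.

```python
def controllers_bound_to_v1(proc_cgroups: str) -> list:
    """List controllers whose hierarchy ID in /proc/cgroups is non-zero.

    Hypothetical helper: a non-zero ID means the controller is attached to
    a cgroup v1 hierarchy and therefore cannot show up in cgroup v2.
    """
    bound = []
    for line in proc_cgroups.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        name, hierarchy, _num_cgroups, _enabled = line.split()
        if int(hierarchy) != 0:
            bound.append(name)
    return bound


# Illustrative content in the /proc/cgroups format (values are made up).
sample = """\
#subsys_name\thierarchy\tnum_cgroups\tenabled
cpuset\t1\t1\t1
cpu\t2\t1\t1
memory\t0\t1\t1
"""
print(controllers_bound_to_v1(sample))
# prints ['cpuset', 'cpu']
```

On a live system the input would be the text of /proc/cgroups. An empty result is what you would hope to see before expecting controllers in /sys/fs/cgroup/cgroup.controllers.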

PavelSosin-320 commented 3 years ago

@nunix I use Arkane Systems' systemd-genie, which offers almost 100% of systemd functionality, including systemd-user, with only one dependency (Dotnet 5.0), and exists for all popular distros. So, on the one hand, the home-brewed Didleddan's script hardly satisfies me, and on the other hand genie is able to support cgroup V2 via systemd-root and systemd-user. The problem is only in the kernel, which is called 5.10 but lacks the cgroup V2 module. Once, systemd had a feature to convert V1 to V2, but today, as you mentioned, the mix of versions is not functional.

cerebrate commented 3 years ago

The cgroup v2 support is there in the WSL 2 kernel:

https://github.com/microsoft/WSL2-Linux-Kernel/blob/linux-msft-wsl-5.10.y/kernel/cgroup/cgroup.c .

It's even exactly the same commit as the equivalent-version (which is to say, 5.10) file in the canonical Linux kernel:

https://github.com/torvalds/linux/blob/v5.10/kernel/cgroup/cgroup.c .

And is enabled:

https://github.com/microsoft/WSL2-Linux-Kernel/blob/a571dc8cedc8e0e56487c0dc93243e0b5db8960a/Microsoft/config-wsl#L148

The problem, as has been said, is that cgroups v1 is initialized (and is required by the Microsoft init, since WSL fails to initialize properly if it is disabled) before any user distro starts up, and having been initialized, cannot be uninitialized. The only way to change this is in that init.

nunix commented 3 years ago

@PavelSosin-320 Genie is great. I have used the Didleddan script from the beginning, and right now, with the boot command, there is "more" to configure for no dependencies. But again, Genie is really great, and having different choices for getting SystemD inside WSL2 is very good overall.

For my examples above, as @lightmelodies stated, we can get a "cgroup2 root", however when I try to add controllers to the subtree (io cpu memory pids) then I do get an error. We might be "near", but I guess the last roadblock will stay as @cerebrate said until a change in the init process is done.

AdsonCicilioti commented 2 years ago

WSL is great, but working with Podman on WSL needs cgroups V2.

codebam commented 2 years ago

All I had to do was set cgroup_no_v1=named and podman shows CgroupsV2

sarim commented 2 years ago

All I had to do was set cgroup_no_v1=named and podman shows CgroupsV2

Is it cosmetic or does it actually work? For example, can you limit CPU / memory with podman while the container runs? Does podman stats show correct CPU/memory usage, etc.? Also, what is your winver?

EDIT:

I'm in Windows Build 19044.1706 and adding kernelCommandLine=cgroup_no_v1=all and the mount line to /etc/fstab successfully enables cgroup v2 with podman. I can see podman container stats.

Domest0s commented 2 years ago

I tried adding cgroup_no_v1=named to %UserProfile%\.wslconfig file. It had no effect on "cgroup v1" and "cgroup v2". How do I make cgroup v2 working for WSL2? I am running Ubuntu 22.04. Please help.
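When a .wslconfig setting appears to have no effect, one thing worth ruling out is where the key sits in the file. Microsoft's documentation places kernelCommandLine under a [wsl2] section header. The Python sketch below uses the standard configparser module as an approximation of that format (WSL's own parser may differ) and assumes that a key outside the [wsl2] section is not picked up. The function name kernel_command_line is hypothetical.

```python
import configparser


def kernel_command_line(wslconfig_text: str):
    """Return the kernelCommandLine value under [wsl2], or None if absent.

    Hypothetical helper; configparser only approximates the .wslconfig
    format, so treat the result as a sanity check rather than proof.
    """
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(wslconfig_text)
    except configparser.MissingSectionHeaderError:
        # The file has keys but no [section] header at all.
        return None
    return parser.get("wsl2", "kernelCommandLine", fallback=None)


with_section = "[wsl2]\nkernelCommandLine=cgroup_no_v1=all\n"
without_section = "kernelCommandLine=cgroup_no_v1=all\n"

print(kernel_command_line(with_section))     # prints cgroup_no_v1=all
print(kernel_command_line(without_section))  # prints None
```

Whatever the file says, the command line the kernel actually received can be confirmed inside the distro from the "Command line:" entry in dmesg, as quoted earlier in the thread.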

sarim commented 2 years ago

@Domest0s What is your Windows version / WSL version? cgroup_no_v1=all used to work for me in Windows 10. But after upgrading to Windows 11, cgroup_no_v1=all crashes the WSL VM as discussed above. cgroup_no_v1=named doesn't have the desired effect.

UPDATE:

In Windows 11 it seems that Microsoft's "init" process has changed and is no longer able to boot with cgroup_no_v1=all. As far as I understand, this effectively kills proper cgroups v2 support. Mounting cgroups v2 manually tricks podman into showing v2, but no controller is delegated, so you can't use podman or another container engine to limit pids, cpus, memory, io, etc.

I wonder if the WSL version can be downgraded in Windows 11. Since WSL2 was decoupled and put into the MS Store, does anyone have an older package?

Domest0s commented 2 years ago

@sarim I am running Windows 10 build 19044. Also, the stock Ubuntu 22.04 from Microsoft Store is Linux 5.10 based. I don't know whether the settings we apply from .wslconfig file are sensitive to kernel version. Do you expect different behavior from other distros (new Fedora, etc.)?

sarim commented 2 years ago

@Domest0s In 19044 it should work. Can you post all the steps you've taken? First make sure you have kernelCommandLine=cgroup_no_v1=all in c:\Users\USERNAME\.wslconfig. Then make sure you have cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0 in your /etc/fstab. Restart WSL (wsl --shutdown). Then use genie to start a user session. Now run cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers to confirm.
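The final check reads a space-separated list of controller names, so an empty file means nothing was delegated to the user session. A minimal Python sketch of that check follows; the helper names and the default set of wanted controllers (cpu, io, memory, pids, the ones mentioned earlier in the thread) are assumptions for illustration.

```python
def user_controllers_path(uid: int) -> str:
    """Build the path of the per-user cgroup.controllers file used above."""
    return (
        f"/sys/fs/cgroup/user.slice/user-{uid}.slice/"
        f"user@{uid}.service/cgroup.controllers"
    )


def missing_controllers(file_text: str, wanted=("cpu", "io", "memory", "pids")) -> list:
    """Return the wanted controllers that are absent from cgroup.controllers."""
    delegated = set(file_text.split())
    return [name for name in wanted if name not in delegated]


print(user_controllers_path(1000))
# prints /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers

print(missing_controllers("cpuset cpu io memory pids\n"))  # prints []
print(missing_controllers(""))  # prints ['cpu', 'io', 'memory', 'pids']
```

On a real system the text would be read from the path returned by user_controllers_path(os.getuid()). An empty file yields the full wanted list, which matches the "no controller is delegated" situation described above.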

Domest0s commented 2 years ago

@sarim Please forgive my limited Linux knowledge. I did the following steps:

  1. created file C:\Users\<my_user_name>\.wslconfig with content kernelCommandLine=cgroup_no_v1=all.
  2. added cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0 to /etc/fstab file.
    $ cat /etc/fstab
    LABEL=cloudimg-rootfs   /        ext4   discard,errors=remount-ro       0 1
    cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
  3. changed default systemd boot target to "multi-user.target" (recommended by genie)
    $ systemctl set-default multi-user.target
  4. create genie configuration file:
    $ sudo touch /etc/genie.ini
    $ ....
    $ cat /etc/genie.ini
    [genie]
    systemd-timeout=240
    clone-env=WSL_DISTRO_NAME,WSL_INTEROP,WSLENV,DISPLAY,WAYLAND_DISPLAY,PULSE_SERVER
    secure-path=/lib/systemd:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    clone-path=false
    target-warning=true
    update-hostname=true
    update-hostname-suffix=-wsl
    resolved-stub=false

    (this content is in sync with the installation guide)

  5. install genie
  6. shutdown WSL instance > wsl --shutdown
  7. start genie: > wsl genie --verbose --bash

    
    John C:\Users\Ivan Bereziuk>wsl genie --verbose --shell
    genie: starting shell
    genie: starting bottle
    genie: generating new hostname
    genie: external hostname is DESKTOP-VPJ0EM5
    genie: setting new hostname to DESKTOP-VPJ0EM5-wsl
    genie: updating hosts file
    genie: unmounting binfmt_misc filesystem before proceeding
    genie: AppArmor not available in kernel; attempting to continue without AppArmor namespace
    genie: starting systemd with command line:
    daemonize /usr/bin/unshare -fp --propagation shared --mount-proc -- systemd
    Waiting for systemd....!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    genie: systemd did not enter running state (degraded) after 240 seconds
    genie: this may be due to a problem with your systemd configuration
    genie: information on problematic units is available at https://github.com/arkane-systems/genie/wiki/Systemd-units-known-to-be-problematic-under-WSL
    genie: a list of failed units follows:
    
    UNIT                       LOAD   ACTIVE SUB    DESCRIPTION
    ● docker.service             loaded failed failed Docker Application Container Engine
    ● ssh.service                loaded failed failed OpenBSD Secure Shell server
    ● systemd-remount-fs.service loaded failed failed Remount Root and Kernel File Systems
    ● systemd-sysusers.service   loaded failed failed Create System Users
    ● docker.socket              loaded failed failed Docker Socket for the API

    LOAD   = Reflects whether the unit definition was properly loaded.
    ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
    SUB    = The low-level unit activation state, values depend on unit type.
    5 loaded units listed.
    genie: WARNING: systemd is in degraded state, issues may occur!

    (not everything went well)

  8. check content of the /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers file (the file is empty):
    $ ls -l /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
    -r--r--r-- 1 root root 0 Jul 20 15:51 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.controllers

sarim commented 2 years ago

What is the output of your mount command in a genie shell (wsl genie --shell)?

Also, you might want to set systemd-timeout=10 if you don't like to wait for 240 seconds :P

Domest0s commented 2 years ago

Thanks for the tip! Here is the output:

john22@DESKTOP-VPJ0EM5-wsl:~$ mount
/dev/sdb on / type ext4 (rw,relatime,discard,errors=remount-ro,data=ordered)
tools on /init type 9p (ro,relatime,dirsync,aname=tools;fmask=022,loose,access=client,trans=fd,rfd=6,wfd=6)
none on /dev type devtmpfs (rw,nosuid,relatime,size=12806548k,nr_inodes=3201637,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,noatime,gid=5,mode=620,ptmxmode=000)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,noatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
proc on /proc type proc (rw,nosuid,nodev,noexec,noatime)
none on /run type tmpfs (rw,nosuid,noexec,noatime,mode=755)
none on /run/lock type tmpfs (rw,nosuid,nodev,noexec,noatime)
none on /run/shm type tmpfs (rw,nosuid,nodev,noatime)
none on /run/user type tmpfs (rw,nosuid,nodev,noexec,noatime,mode=755)
drivers on /usr/lib/wsl/drivers type 9p (ro,nosuid,nodev,noatime,dirsync,aname=drivers;fmask=222;dmask=222,mmap,access=client,msize=65536,trans=fd,rfd=4,wfd=4)
lib on /usr/lib/wsl/lib type 9p (ro,nosuid,nodev,noatime,dirsync,aname=lib;fmask=222;dmask=222,mmap,access=client,msize=65536,trans=fd,rfd=4,wfd=4)
tmpfs on /mnt/wsl type tmpfs (rw,relatime)
C:\ on /mnt/c type 9p (rw,noatime,dirsync,aname=drvfs;path=C:\;uid=1000;gid=1000;symlinkroot=/mnt/,mmap,access=client,msize=65536,trans=fd,rfd=8,wfd=8)
D:\ on /mnt/d type 9p (rw,noatime,dirsync,aname=drvfs;path=D:\;uid=1000;gid=1000;symlinkroot=/mnt/,mmap,access=client,msize=65536,trans=fd,rfd=8,wfd=8)
none on /etc/hostname type tmpfs (rw,nosuid,noexec,noatime,mode=755)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=16529)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,nosuid,nodev,noexec,relatime)
tracefs on /sys/kernel/tracing type tracefs (rw,nosuid,nodev,noexec,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,nosuid,nodev,noexec,relatime)
/var/lib/snapd/snaps/core20_1494.snap on /snap/core20/1494 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/lxd_22923.snap on /snap/lxd/22923 type squashfs (ro,nodev,relatime,x-gdu.hide)
/var/lib/snapd/snaps/snapd_15904.snap on /snap/snapd/15904 type squashfs (ro,nodev,relatime,x-gdu.hide)
snapfuse on /snap/core20/1518 type fuse.snapfuse (ro,nodev,relatime,user_id=0,group_id=0,allow_other)
snapfuse on /snap/snapd/16292 type fuse.snapfuse (ro,nodev,relatime,user_id=0,group_id=0,allow_other)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,nosuid,nodev,noexec,relatime)
none on /run/snapd/ns type tmpfs (rw,nosuid,noexec,noatime,mode=755)
nsfs on /run/snapd/ns/lxd.mnt type nsfs (rw)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=2561724k,nr_inodes=640431,mode=700,uid=1000,gid=1000)
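
The listing above contains cgroup (v1) mounts alongside cgroup2 mounts. A rough Python sketch for spotting that in `mount` output follows. The regular expression assumes the usual "SRC on TARGET type FSTYPE (OPTS)" line shape, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Assumed line shape: "<source> on <target> type <fstype> (<options>)"
MOUNT_LINE = re.compile(
    r"^(?P<src>.+?) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<opts>[^)]*)\)$"
)


def cgroup_mounts(mount_output: str) -> Dict[str, List[str]]:
    """Group mount points by filesystem type: "cgroup" is v1, "cgroup2" is v2."""
    found: Dict[str, List[str]] = {"cgroup": [], "cgroup2": []}
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match and match.group("fstype") in found:
            found[match.group("fstype")].append(match.group("target"))
    return found


def describe(found: Dict[str, List[str]]) -> str:
    """Summarise which cgroup hierarchies are mounted."""
    if found["cgroup"] and found["cgroup2"]:
        return "mixed (v1 and v2 both mounted)"
    if found["cgroup2"]:
        return "v2 only"
    if found["cgroup"]:
        return "v1 only"
    return "no cgroup mounts"
```

Feeding it the text printed by `mount` and calling `describe(cgroup_mounts(text))` gives a one-line summary instead of a list to read by eye.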

sarim commented 2 years ago

Try the code from "configuring delegation" section. https://github.com/opencontainers/runc/blob/main/docs/cgroup-v2.md#configuring-delegation

Then do wsl --shutdown and then check again (cat the cgroup.controllers file).

Domest0s commented 2 years ago

Did the steps from "configuring delegation" section:

# mkdir -p /etc/systemd/system/user@.service.d
# cat > /etc/systemd/system/user@.service.d/delegate.conf << EOF
[Service]
Delegate=cpu cpuset io memory pids
EOF
# systemctl daemon-reload
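
To compare what the drop-in requests with what the kernel actually exposes, something like the following Python sketch may help. It treats the drop-in as plain INI, which is an assumption (systemd unit syntax is richer than that), and both function names are invented for this example.

```python
import configparser
from typing import Iterable, List

# Same content as the delegate.conf written above.
DROP_IN = """\
[Service]
Delegate=cpu cpuset io memory pids
"""


def delegated_controllers(unit_text: str) -> List[str]:
    """Read the Delegate= list from a drop-in, treating it as simple INI."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep "Delegate" as written; systemd keys are case-sensitive
    parser.read_string(unit_text)
    return parser.get("Service", "Delegate", fallback="").split()


def missing_controllers(requested: Iterable[str], available: Iterable[str]) -> List[str]:
    """Controllers that were requested but do not show up in cgroup.controllers."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]
```

With an empty cgroup.controllers file, `missing_controllers(delegated_controllers(DROP_IN), [])` returns all five names, which matches the symptom reported below.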

Then I "rebooted" WSL with wsl.exe --shutdown followed by wsl.exe genie --shell. No change. The errors are the same. The cgroup.controllers file is still empty :(

john22@DESKTOP-VPJ0EM5-wsl:~$ cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers
john22@DESKTOP-VPJ0EM5-wsl:~$

sarim commented 2 years ago

I've been able to "partially" enable cgroups v2 on Windows 11 22621.232. As discussed above, on Windows 11 cgroup_no_v1=all doesn't work because WSL's init tries to mount (or use?) the memory cgroup. I'm guessing this has to do with wslg, since with every distro instance a "system" instance of wslg is created, according to the README:

Every WSL 2 user distro is paired with its own instance of the system distro. The system distro runs partially isolated from the user distro to which it is paired, in it's own NS/PID/UTS namespace but shares other namespaces such as IPC, to allow for shared memory optimization across the boundary.

I tried guiApplications=false to disable wslg, but this doesn't have any effect on init. Maybe init always mounts the memory controller without checking the guiApplications setting, or it's used in more places than just wslg. Maybe Microsoft can use the cgroup v2 API for these tasks? Pretty please :3

So what I did is set cgroup_no_v1=cpuset,cpu,cpuacct,io,devices,freezer,net_cls,perf_event,net_prio,hugetlb,pids,rdma, basically everything except memory. Then I mount cgroup2, and delegation to the user works. Without the memory controller, of course.
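
As a rough illustration, the parameter can be assembled like this in Python. The controller list is copied from the comment above. Placing it in kernelCommandLine under [wsl2] in .wslconfig follows the rest of this thread, and the helper names are hypothetical.

```python
from typing import Iterable

# Copied from the comment above; "memory" is deliberately left out
# because WSL's init still needs the v1 memory controller.
V1_CONTROLLERS_TO_DISABLE = [
    "cpuset", "cpu", "cpuacct", "io", "devices", "freezer",
    "net_cls", "perf_event", "net_prio", "hugetlb", "pids", "rdma",
]


def kernel_command_line(controllers: Iterable[str]) -> str:
    """Build the cgroup_no_v1= kernel argument from a controller list."""
    return "cgroup_no_v1=" + ",".join(controllers)


def wslconfig_text(controllers: Iterable[str]) -> str:
    """Render a minimal .wslconfig carrying that argument under [wsl2]."""
    return "[wsl2]\nkernelCommandLine=" + kernel_command_line(controllers) + "\n"


if __name__ == "__main__":
    print(wslconfig_text(V1_CONTROLLERS_TO_DISABLE), end="")
```

The printed text goes into %UserProfile%\.wslconfig, and a wsl --shutdown is needed before the new kernel command line takes effect.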

podman info  | head -n 10
host:
  arch: amd64
  buildahVersion: 1.23.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - pids
  cgroupManager: systemd
  cgroupVersion: v2

The critical thing is pids. With it available podman can run rootless containers. No need for pids_limit=0 workaround in https://github.com/microsoft/WSL/issues/6662#issuecomment-840672235

The CPU limit for containers also works and podman stats works, but the memory fields are empty, as expected. Trying to limit a container's memory results in an error, as podman doesn't find the memory controller in cgroup2.

But I'm happy with it, I can run containers without any special config, and can see running containers with podman stats. Tested with both Ubuntu 20.04 and 22.04.

@Domest0s Umm, then I'm out of ideas. Can you post the output of dmesg? Also, what is your kernel version?

Domest0s commented 2 years ago

@sarim sure, here is my dmesg (the output lost the colors, so I prefixed entries with originally red text in them with a hat (^) sign):

john22@DESKTOP-VPJ0EM5-wsl:~$ dmesg
[    0.000000] Linux version 5.10.16.3-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Apr 2 22:23:49 UTC 2021
[    0.000000] Command line: initrd=\initrd.img panic=-1 pty.legacy_count=0 nr_cpus=16
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Centaur CentaurHauls
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[5]:  832, xstate_sizes[5]:   64
[    0.000000] x86/fpu: xstate_offset[6]:  896, xstate_sizes[6]:  512
[    0.000000] x86/fpu: xstate_offset[7]: 1408, xstate_sizes[7]: 1024
[    0.000000] x86/fpu: Enabled xstate features 0xe7, context size is 2432 bytes, using 'compacted' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000e0fff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000001fffff] ACPI data
[    0.000000] BIOS-e820: [mem 0x0000000000200000-0x00000000f7ffffff] usable
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000006459fffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI not present or invalid.
[    0.000000] Hypervisor detected: Microsoft Hyper-V
[    0.000000] Hyper-V: features 0xae7f, privilege high: 0x3b8030, hints 0x20e24, misc 0x20bed7b2
[    0.000000] Hyper-V Host Build:19041-10.0-1-0.1826
[    0.000000] Hyper-V: LAPIC Timer Frequency: 0x1e8480
[    0.000000] Hyper-V: Using hypercall for remote TLB flush
[    0.000000] clocksource: hyperv_clocksource_tsc_page: mask: 0xffffffffffffffff max_cycles: 0x24e6a1710, max_idle_ns: 440795202120 ns
[    0.000001] tsc: Detected 2496.000 MHz processor
[    0.000006] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000008] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000010] last_pfn = 0x645a00 max_arch_pfn = 0x400000000
[    0.000023] MTRR default type: uncachable
[    0.000023] MTRR fixed ranges enabled:
[    0.000024]   00000-3FFFF write-back
[    0.000024]   40000-7FFFF uncachable
[    0.000025]   80000-8FFFF write-back
[    0.000025]   90000-FFFFF uncachable
[    0.000025] MTRR variable ranges enabled:
[    0.000026]   0 base 0000000000 mask 7F00000000 write-back
[    0.000027]   1 base 0100000000 mask 7000000000 write-back
[    0.000027]   2 disabled
[    0.000027]   3 disabled
[    0.000028]   4 disabled
[    0.000028]   5 disabled
[    0.000028]   6 disabled
[    0.000028]   7 disabled
[    0.000033] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
[    0.000040] last_pfn = 0xf8000 max_arch_pfn = 0x400000000
[    0.000048] Using GB pages for direct mapping
[    0.000285] RAMDISK: [mem 0x03035000-0x03044fff]
[    0.000288] ACPI: Early table checksum verification disabled
[    0.000299] ACPI: RSDP 0x00000000000E0000 000024 (v02 VRTUAL)
[    0.000301] ACPI: XSDT 0x0000000000100000 000044 (v01 VRTUAL MICROSFT 00000001 MSFT 00000001)
[    0.000305] ACPI: FACP 0x0000000000101000 000114 (v06 VRTUAL MICROSFT 00000001 MSFT 00000001)
[    0.000308] ACPI: DSDT 0x00000000001011B8 01E184 (v02 MSFTVM DSDT01   00000001 MSFT 05000000)
[    0.000310] ACPI: FACS 0x0000000000101114 000040
[    0.000312] ACPI: OEM0 0x0000000000101154 000064 (v01 VRTUAL MICROSFT 00000001 MSFT 00000001)
[    0.000314] ACPI: SRAT 0x000000000011F33C 000310 (v02 VRTUAL MICROSFT 00000001 MSFT 00000001)
[    0.000316] ACPI: APIC 0x000000000011F64C 0000C8 (v04 VRTUAL MICROSFT 00000001 MSFT 00000001)
[    0.000321] ACPI: Local APIC address 0xfee00000
[    0.000461] Zone ranges:
[    0.000462]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000463]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.000463]   Normal   [mem 0x0000000100000000-0x00000006459fffff]
[    0.000464]   Device   empty
[    0.000465] Movable zone start for each node
[    0.000465] Early memory node ranges
[    0.000466]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.000466]   node   0: [mem 0x0000000000200000-0x00000000f7ffffff]
[    0.000467]   node   0: [mem 0x0000000100000000-0x00000006459fffff]
[    0.000777] Zeroed struct page in unavailable ranges: 10081 pages
[    0.000778] Initmem setup node 0 [mem 0x0000000000001000-0x00000006459fffff]
[    0.000779] On node 0 totalpages: 6543519
[    0.000779]   DMA zone: 59 pages used for memmap
[    0.000780]   DMA zone: 22 pages reserved
[    0.000780]   DMA zone: 3743 pages, LIFO batch:0
[    0.000797]   DMA32 zone: 16320 pages used for memmap
[    0.000798]   DMA32 zone: 1011712 pages, LIFO batch:63
[    0.010620]   Normal zone: 86376 pages used for memmap
[    0.010624]   Normal zone: 5528064 pages, LIFO batch:63
[    0.010969] ACPI: Local APIC address 0xfee00000
[    0.010982] ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
[    0.011139] IOAPIC[0]: apic_id 16, version 17, address 0xfec00000, GSI 0-23
[    0.011141] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.011143] ACPI: IRQ9 used by override.
[    0.011145] Using ACPI (MADT) for SMP configuration information
[    0.011162] smpboot: Allowing 16 CPUs, 0 hotplug CPUs
[    0.011168] [mem 0xf8000000-0xffffffff] available for PCI devices
[    0.011169] Booting paravirtualized kernel on Hyper-V
[    0.011171] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.015106] setup_percpu: NR_CPUS:256 nr_cpumask_bits:256 nr_cpu_ids:16 nr_node_ids:1
[    0.015784] percpu: Embedded 52 pages/cpu s173272 r8192 d31528 u262144
[    0.015789] pcpu-alloc: s173272 r8192 d31528 u262144 alloc=1*2097152
[    0.015790] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 [0] 08 09 10 11 12 13 14 15
[    0.015809] Built 1 zonelists, mobility grouping on.  Total pages: 6440742
[    0.015811] Kernel command line: initrd=\initrd.img panic=-1 pty.legacy_count=0 nr_cpus=16
[    0.019007] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes, linear)
[    0.020463] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes, linear)
[    0.020597] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.032857] Memory: 4094124K/26174076K available (16403K kernel code, 2459K rwdata, 3464K rodata, 1444K init, 1164K bss, 561032K reserved, 0K cma-reserved)
[    0.032881] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[    0.032885] ftrace: allocating 49613 entries in 194 pages
[    0.042573] ftrace: allocated 194 pages with 3 groups
[    0.042757] rcu: Hierarchical RCU implementation.
[    0.042758] rcu:     RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=16.
[    0.042759]  Rude variant of Tasks RCU enabled.
[    0.042759]  Tracing variant of Tasks RCU enabled.
[    0.042760] rcu: RCU calculated value of scheduler-enlistment delay is 10 jiffies.
[    0.042760] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=16
[    0.044614] Using NULL legacy PIC
[    0.044615] NR_IRQS: 16640, nr_irqs: 552, preallocated irqs: 0
[    0.044884] random: crng done (trusting CPU's manufacturer)
[    0.044901] Console: colour dummy device 80x25
[    0.044906] printk: console [tty0] enabled
[    0.044909] ACPI: Core revision 20200925
[    0.044962] Failed to register legacy timer interrupt
[    0.044962] APIC: Switch to symmetric I/O mode setup
[    0.044963] Switched APIC routing to physical flat.
[    0.045093] Hyper-V: Using IPI hypercalls
[    0.045118] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
[    0.045120] Calibrating delay loop (skipped), value calculated using timer frequency.. 4992.00 BogoMIPS (lpj=24960000)
[    0.045121] pid_max: default: 32768 minimum: 301
[    0.045129] LSM: Security Framework initializing
[    0.045165] Mount-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.045199] Mountpoint-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.045365] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.045376] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.045377] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.045379] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.045380] Spectre V2 : Mitigation: Enhanced IBRS
[    0.045381] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.045382] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.045382] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.045501] Freeing SMP alternatives memory: 52K
[    0.045539] smpboot: CPU0: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz (family: 0x6, model: 0x8d, stepping: 0x1)
[    0.045592] Performance Events: unsupported p6 CPU model 141 no PMU driver, software events only.
[    0.045606] rcu: Hierarchical SRCU implementation.
[    0.045881] smp: Bringing up secondary CPUs ...
[    0.046045] x86: Booting SMP configuration:
[    0.046046] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15
[    0.046297] smp: Brought up 1 node, 16 CPUs
[    0.046297] smpboot: Max logical packages: 1
[    0.046297] smpboot: Total of 16 processors activated (79872.00 BogoMIPS)
[    0.075137] node 0 deferred pages initialised in 30ms
[    0.075876] devtmpfs: initialized
[    0.075876] x86/mm: Memory block size: 128MB
[    0.075876] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.075876] futex hash table entries: 4096 (order: 6, 262144 bytes, linear)
[    0.075902] NET: Registered protocol family 16
[    0.075963] thermal_sys: Registered thermal governor 'step_wise'
[    0.075976] cpuidle: using governor menu
[    0.075976] ACPI: bus type PCI registered
[^   0.075976] PCI: Fatal: No config space access function found
[    0.075976] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.075976] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.075976] raid6: skip pq benchmark and using algorithm avx512x4
[    0.075976] raid6: using avx512x2 recovery algorithm
[    0.075976] ACPI: Added _OSI(Module Device)
[    0.075976] ACPI: Added _OSI(Processor Device)
[    0.075976] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.075976] ACPI: Added _OSI(Processor Aggregator Device)
[    0.075976] ACPI: Added _OSI(Linux-Dell-Video)
[    0.075976] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    0.075976] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.077779] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    0.078327] ACPI: Interpreter enabled
[    0.078329] ACPI: (supports S0 S5)
[    0.078329] ACPI: Using IOAPIC for interrupt routing
[    0.078334] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.078422] ACPI: Enabled 2 GPEs in block 00 to 0F
[    0.079139] iommu: Default domain type: Translated
[    0.079179] SCSI subsystem initialized
[    0.085285] hv_vmbus: Vmbus version:5.2
[    0.085285] hv_vmbus: Unknown GUID: c376c1c3-d276-48d2-90a9-c04748072c60
[    0.085285] hv_vmbus: Unknown GUID: 6e382d18-3336-4f4b-acc4-2b7703d4df4a
[    0.085285] hv_vmbus: Unknown GUID: dde9cbc0-5060-4436-9448-ea1254a5d177
[    0.085537] PCI: Using ACPI for IRQ routing
[    0.085538] PCI: System does not support PCI
[    0.085914] clocksource: Switched to clocksource tsc-early
[    0.147402] VFS: Disk quotas dquot_6.6.0
[    0.147409] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.147418] FS-Cache: Loaded
[    0.147430] pnp: PnP ACPI init
[    0.147516] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
[    0.147547] pnp: PnP ACPI: found 1 devices
[    0.150668] NET: Registered protocol family 2
[    0.150982] tcp_listen_portaddr_hash hash table entries: 16384 (order: 6, 262144 bytes, linear)
[    0.151002] TCP established hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    0.151284] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes, linear)
[    0.151495] TCP: Hash tables configured (established 262144 bind 65536)
[    0.151520] UDP hash table entries: 16384 (order: 7, 524288 bytes, linear)
[    0.151558] UDP-Lite hash table entries: 16384 (order: 7, 524288 bytes, linear)
[    0.151615] NET: Registered protocol family 1
[    0.152149] RPC: Registered named UNIX socket transport module.
[    0.152150] RPC: Registered udp transport module.
[    0.152150] RPC: Registered tcp transport module.
[    0.152151] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.152152] PCI: CLS 0 bytes, default 64
[    0.152183] Trying to unpack rootfs image as initramfs...
[    0.152443] Freeing initrd memory: 64K
[    0.152445] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.152446] software IO TLB: mapped [mem 0x00000000f4000000-0x00000000f8000000] (64MB)
[^  0.152478] kvm: no hardware support
[    0.152479] has_svm: not amd or hygon
[^  0.152479] kvm: no hardware support
[    0.155175] Initialise system trusted keyrings
[    0.155239] workingset: timestamp_bits=46 max_order=23 bucket_order=0
[    0.155926] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.156074] NFS: Registering the id_resolver key type
[    0.156078] Key type id_resolver registered
[    0.156078] Key type id_legacy registered
[    0.156079] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[    0.156464] Key type cifs.idmap registered
[    0.156499] fuse: init (API version 7.32)
[    0.156581] SGI XFS with ACLs, security attributes, realtime, scrub, repair, quota, no debug enabled
[    0.156794] 9p: Installing v9fs 9p2000 file system support
[    0.156799] FS-Cache: Netfs '9p' registered for caching
[    0.156825] FS-Cache: Netfs 'ceph' registered for caching
[    0.156827] ceph: loaded (mds proto 32)
[    0.163069] NET: Registered protocol family 38
[    0.163070] xor: automatically using best checksumming function   avx
[    0.163070] Key type asymmetric registered
[    0.163071] Asymmetric key parser 'x509' registered
[    0.163073] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 250)
[    0.163689] hv_vmbus: registering driver hv_pci
[    0.163935] hv_pci 293024b9-759d-43c7-95d2-2ed48c022f23: PCI VMBus probing: Using version 0x10003
[    0.164484] hv_pci 293024b9-759d-43c7-95d2-2ed48c022f23: PCI host bridge to bus 759d:00
[    0.164684] pci 759d:00:00.0: [1414:008e] type 00 class 0x030200
[    0.167028] ACPI: AC Adapter [AC1] (on-line)
[    0.167381] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.167770] Non-volatile memory driver v1.3
[    0.167796] battery: ACPI: Battery Slot [BAT1] (battery present)
[    0.169720] brd: module loaded
[    0.170290] loop: module loaded
[    0.170470] hv_vmbus: registering driver hv_storvsc
[    0.170823] wireguard: WireGuard 1.0.0 loaded. See www.wireguard.com for information.
[    0.170823] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[    0.170831] tun: Universal TUN/TAP device driver, 1.6
[    0.170916] PPP generic driver version 2.4.2
[    0.170987] PPP BSD Compression module registered
[    0.170988] PPP Deflate Compression module registered
[    0.170989] PPP MPPE Compression module registered
[    0.170990] NET: Registered protocol family 24
[    0.170993] hv_vmbus: registering driver hv_netvsc
[    0.171634] scsi host0: storvsc_host_t
[    0.178800] VFIO - User Level meta-driver version: 0.3
[    0.178895] hv_vmbus: registering driver hyperv_keyboard
[    0.178991] rtc_cmos 00:00: RTC can wake from S4
[    0.179779] rtc_cmos 00:00: registered as rtc0
[    0.179999] rtc_cmos 00:00: setting system clock to 2022-07-21T07:21:05 UTC (1658388065)
[    0.180005] rtc_cmos 00:00: alarms up to one month, 114 bytes nvram
[    0.180117] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@redhat.com
[    0.180184] device-mapper: raid: Loading target version 1.15.1
[    0.180221] hv_utils: Registering HyperV Utility Driver
[    0.180222] hv_vmbus: registering driver hv_utils
[    0.180236] hv_vmbus: registering driver hv_balloon
[    0.180240] hv_vmbus: registering driver dxgkrnl
[^   0.180251] hv_utils: cannot register PTP clock: 0
[    0.180251] (NULL device *): dxgk: dxg_drv_init  Version: 2103
[    0.180424] drop_monitor: Initializing network drop monitor service
[    0.180464] hv_utils: TimeSync IC version 4.0
[    0.180579] hv_balloon: Using Dynamic Memory protocol version 2.0
[    0.180724] (NULL device *): dxgk: mmio allocated c00000000  200000000 c00000000 dffffffff
[    0.181057] Free page reporting enabled
[    0.181058] hv_balloon: Cold memory discard hint enabled
[    0.181543] Mirror/redirect action on
[    0.181829] IPVS: Registered protocols (TCP, UDP)
[    0.181840] IPVS: Connection hash table configured (size=4096, memory=64Kbytes)
[    0.181858] IPVS: ipvs loaded.
[    0.181858] IPVS: [rr] scheduler registered.
[    0.181859] IPVS: [wrr] scheduler registered.
[    0.181859] IPVS: [sh] scheduler registered.
[    0.181881] ipip: IPv4 and MPLS over IPv4 tunneling driver
[    0.182900] ipt_CLUSTERIP: ClusterIP Version 0.8 loaded successfully
[    0.183203] Initializing XFRM netlink socket
[    0.183235] NET: Registered protocol family 10
[    0.183428] Segment Routing with IPv6
[    0.184256] sit: IPv6, IPv4 and MPLS over IPv4 tunneling driver
[    0.184303] NET: Registered protocol family 17
[    0.184312] Bridge firewalling registered
[    0.184316] 8021q: 802.1Q VLAN Support v1.8
[    0.184329] sctp: Hash tables configured (bind 512/512)
[    0.184364] 9pnet: Installing 9P2000 support
[    0.184372] Key type dns_resolver registered
[    0.184376] Key type ceph registered
[    0.184442] libceph: loaded (mon/osd proto 15/24)
[    0.184481] NET: Registered protocol family 40
[    0.184481] hv_vmbus: registering driver hv_sock
[    0.184495] IPI shorthand broadcast: enabled
[    0.184500] sched_clock: Marking stable (184159371, 314600)->(191140700, -6666729)
[    0.184732] registered taskstats version 1
[    0.184737] Loading compiled-in X.509 certificates
[    0.184856] Btrfs loaded, crc32c=crc32c-generic
[    0.188313] Freeing unused kernel image (initmem) memory: 1444K
[    0.255480] Write protecting the kernel read-only data: 22528k
[    0.256767] Freeing unused kernel image (text/rodata gap) memory: 2028K
[    0.257408] Freeing unused kernel image (rodata/data gap) memory: 632K
[    0.257414] Run /init as init process
[    0.257415]   with arguments:
[    0.257415]     /init
[    0.257416]   with environment:
[    0.257417]     HOME=/
[    0.257418]     TERM=linux
[    0.969019] hv_vmbus: Unknown GUID: 6e382d18-3336-4f4b-acc4-2b7703d4df4a
[    0.969353] hv_pci 6baf6c74-b2fb-4340-b77a-249769d80f63: PCI VMBus probing: Using version 0x10003
[    0.970201] hv_pci 6baf6c74-b2fb-4340-b77a-249769d80f63: PCI host bridge to bus b2fb:00
[    0.970394] pci b2fb:00:00.0: [1414:008e] type 00 class 0x030200
[    1.075191] scsi 0:0:0:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
[    1.075604] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    1.076873] sd 0:0:0:0: [sda] 536870912 512-byte logical blocks: (275 GB/256 GiB)
[    1.076882] sd 0:0:0:0: [sda] 4096-byte physical blocks
[    1.077005] sd 0:0:0:0: [sda] Write Protect is off
[    1.077006] sd 0:0:0:0: [sda] Mode Sense: 0f 00 00 00
[    1.077228] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.165318] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x23fa772cf26, max_idle_ns: 440795269835 ns
[    1.165536] clocksource: Switched to clocksource tsc
[    1.175354] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[    1.625553] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.630803] EXT4-fs (sda): mounted filesystem with ordered data mode. Opts: discard,errors=remount-ro,data=ordered[    2.985364] Adding 7340032k swap on /swap/file.  Priority:-2 extents:4 across:7364608k
[    3.505766] scsi 0:0:0:1: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
[    3.507004] sd 0:0:0:1: Attached scsi generic sg1 type 0
[    3.509002] sd 0:0:0:1: [sdb] 536870912 512-byte logical blocks: (275 GB/256 GiB)
[    3.509004] sd 0:0:0:1: [sdb] 4096-byte physical blocks
[    3.509432] sd 0:0:0:1: [sdb] Write Protect is off
[    3.509435] sd 0:0:0:1: [sdb] Mode Sense: 0f 00 00 00
[    3.509873] sd 0:0:0:1: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    3.511811] sd 0:0:0:1: [sdb] Attached SCSI disk
[^   3.518746] EXT4-fs (sdb): mounted filesystem with ordered data mode. Opts: discard,errors=remount-ro,data=ordered[    3.533272] FS-Cache: Duplicate cookie detected
[^    3.533273] FS-Cache: O-cookie c=00000000c88d8678 [p=00000000c32bf5d4 fl=222 nc=0 na=1]
[^    3.533274] FS-Cache: O-cookie d=00000000623112bb n=00000000bd5f5f23
[^    3.533274] FS-Cache: O-key=[10] '34323934393337363435'
[^    3.533277] FS-Cache: N-cookie c=0000000068f67e5f [p=00000000c32bf5d4 fl=2 nc=0 na=1]
[^    3.533277] FS-Cache: N-cookie d=00000000623112bb n=000000002e436180
[^    3.533278] FS-Cache: N-key=[10] '34323934393337363435'
[    3.985929] init: (1) WARNING: /etc/resolv.conf updating disabled in /etc/wsl.conf
[    3.993959] init: (1) WARNING: /etc/resolv.conf updating disabled in /etc/wsl.conf
[    4.331903] systemd-journald[48]: Received client request to flush runtime journal.
[    4.341622] systemd-journald[48]: File /var/log/journal/4be97f23f3014a5091fa250bec19f3c0/system.journal corrupted or uncleanly shut down, renaming and replacing.
[   14.839925] systemd-journald[48]: File /var/log/journal/4be97f23f3014a5091fa250bec19f3c0/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[   49.031847] hv_balloon: Max. dynamic memory size: 25562 MB
[   62.078600] WSL2: Performing memory compaction.
[  243.099437] WSL2: Performing memory compaction.

sarim commented 2 years ago

@Domest0s From a quick glance I don't see any cgroup-related line. Can you double-check your .wslconfig?
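
The check described here, done mechanically, could look like the Python sketch below. It assumes dmesg prints a "Kernel command line:" line as in the log above; the function names are made up for illustration.

```python
import re
from typing import List, Optional


def extract_kernel_command_line(dmesg_text: str) -> Optional[str]:
    """Return the arguments from the first "Kernel command line:" entry, if any."""
    for line in dmesg_text.splitlines():
        match = re.search(r"\bKernel command line: (.*)$", line)
        if match:
            return match.group(1).strip()
    return None


def cgroup_arguments(cmdline: str) -> List[str]:
    """Pick out the arguments that mention cgroups."""
    return [arg for arg in cmdline.split() if "cgroup" in arg]
```

For the log posted above, `cgroup_arguments(...)` comes back empty, which is the hint that the .wslconfig setting never reached the kernel.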

Domest0s commented 2 years ago

@sarim Thanks, you are right. I made a mistake in the .wslconfig formatting: I didn't have the [wsl2] section label at the top of the file, above kernelCommandLine=... After the correction, I got everything working. Thank you very much. Microsoft's documentation for the .wslconfig file explicitly says:

If the file is missing or malformed (improper markup formatting), WSL will continue to launch as normal without the configuration settings applied.

I would be thankful if Microsoft added a "your .wslconfig file is malformed" warning message at WSL startup in this case.
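
Until WSL warns about this itself, a rough pre-flight check can be approximated with Python's configparser. This is only a sketch: WSL's own parser is not configparser, so the two may disagree on edge cases, and check_wslconfig is a name invented here.

```python
import configparser


def check_wslconfig(text: str) -> str:
    """Report the most likely problem with a .wslconfig body, or "ok"."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. kernelCommandLine
    try:
        parser.read_string(text)
    except configparser.MissingSectionHeaderError:
        # The mistake described above: a key with no [wsl2] line before it.
        return "malformed: settings appear before any [section] header"
    except configparser.Error as exc:
        return f"malformed: {exc}"
    if not parser.has_option("wsl2", "kernelCommandLine"):
        return "kernelCommandLine not set under [wsl2]"
    return "ok"
```

Passing the file's contents, for example `check_wslconfig(open(path).read())`, returns a short message that can be printed before restarting WSL.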

cerebrate commented 2 years ago

Also, you might want to set systemd-timeout=10 if you don't like to wait for 240 seconds :P

Or, y'know, you could fix the actual underlying issues that stop systemd from starting cleanly in a short time?

hypeitnow commented 2 years ago

Hi Sarim,

How exactly were you able to mount cgroups v2 so that podman sees it? Your advice on using cgroup_no_v1=cpuset,cpu,cpuacct,io,devices,freezer,net_cls,perf_event,net_prio,hugetlb,pids,rdma definitely helps to disable cgroups v1 under Windows 11, but even after adding cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0 to the fstab, podman still does not use the cgroups v2 controllers :(

I am using WSL2 on Windows 11 (Kali distribution) with distrod, which should achieve the same result as genie. Moreover, the same setup (cgroup_no_v1=all and the fstab entry) works perfectly in Ubuntu 22.10 under Windows 10. Podman sees cgroups v2:

  cgroupControllers:
  cgroupManager: cgroupfs
  cgroupVersion: v2

Do you run the mount command every time you log in?
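One way to tell whether the fstab entry was actually applied is to look at what is mounted rather than at what fstab says. The sketch below parses text in /proc/mounts (or fstab) format and reports the cgroup mounts; as far as I understand, tools treat the system as v2 only when /sys/fs/cgroup itself is a cgroup2 mount, so that is what is_unified checks. Both function names are illustrative only.

```python
def cgroup_mounts(mounts_text):
    """Return (mount_point, fs_type) pairs for cgroup and cgroup2 entries
    in text formatted like /proc/mounts or /etc/fstab."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and lines too short to be an entry.
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        mount_point, fs_type = fields[1], fields[2]
        if fs_type in ("cgroup", "cgroup2"):
            found.append((mount_point, fs_type))
    return found


def is_unified(mounts_text, root="/sys/fs/cgroup"):
    """True if cgroup2 is mounted directly at root."""
    return (root, "cgroup2") in cgroup_mounts(mounts_text)


if __name__ == "__main__":
    # The fstab line quoted in the comment above, used here as sample input.
    sample = ("cgroup2 /sys/fs/cgroup cgroup2 "
              "rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0\n")
    print(cgroup_mounts(sample))
    print("unified:", is_unified(sample))
```

Feeding it the contents of /proc/mounts from inside the distro shows whether v1 controllers are still mounted next to the cgroup2 entry.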

Thanks in advance

sarim commented 2 years ago

@hypeitnow distrod (and genie too) will take care of mounting everything from fstab. Also, you need to have a systemd user session running for podman to work. genie already does this because it uses the machinectl command. For testing you can use sudo login -f YOURUSERNAME. For details see this issue: https://github.com/nullpo-head/wsl-distrod/issues/13 . Check my comment and the others to understand what is going on. I actually use https://github.com/sarim/gbash , it's a wrapper around bash I made to launch a systemd user session.
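To see whether the current shell is inside a systemd user session at all, a rough check is to look for the things a user session normally provides: XDG_RUNTIME_DIR, the D-Bus socket at $XDG_RUNTIME_DIR/bus, and the $XDG_RUNTIME_DIR/systemd directory. This is only a heuristic sketch with a made-up function name, not something distrod, genie or podman ship.

```python
import os
from pathlib import Path


def user_session_hints(env=None, uid=None):
    """Collect signs that a systemd user session exists for this user."""
    env = os.environ if env is None else env
    uid = os.getuid() if uid is None else uid
    # Fall back to the conventional location when the variable is unset.
    runtime_dir = env.get("XDG_RUNTIME_DIR") or f"/run/user/{uid}"
    return {
        "runtime_dir": runtime_dir,
        "xdg_runtime_dir_set": bool(env.get("XDG_RUNTIME_DIR")),
        "bus_socket": Path(runtime_dir, "bus").exists(),
        "systemd_dir": Path(runtime_dir, "systemd").is_dir(),
    }


if __name__ == "__main__":
    for key, value in user_session_hints().items():
        print(f"{key}: {value}")
```

If xdg_runtime_dir_set or bus_socket comes back False in a plain wsl.exe shell but True after sudo login -f YOURUSERNAME, the missing user session is the likely culprit.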

hypeitnow commented 1 year ago

@sarim thank you for your response. If I understand correctly, I should:

  1. use kernel parameter - cgroup_no_v1=cpuset,cpu,cpuacct,io,devices,freezer,net_cls,perf_event,net_prio,hugetlb,pids,rdma
  2. add fstab entry
  3. use your hack :)
  4. Probably set cgroup_manager="systemd" in containers.conf. Right?

sarim commented 1 year ago

@hypeitnow I don't think you need to edit containers.conf, but the rest is okay. Though you might want to try genie first. If successful, then try distrod and my hack :)