Closed ulidtko closed 1 year ago
Further, test fails with:
Can't open memoery.peak: 2 - ENOENT
Edit: on cgroups v2, latest master 643110b
Thanks for the report.
I'll check the header.
What system are you trying to run cgmemtime on?
Does /sys/fs/cgroup/cgroup.subtree_control list the memory controller?
FWIW, P_PIDFD is available from sys/wait.h since glibc 2.36.
Thus, it looks like your system comes with an older version.
Your error message seems to indicate that something in your system's cgroup setup is unexpected, such as the memory controller being disabled.
Feel free to add more details and reopen this issue.
Hi again, thanks for responding. Just rechecked latest master on 2 systems.
Kernel 5.15.79-1-lts, glibc 2.36
55a0520 compiles; ./cgmemtime -t ./testa x 10 gives clone3 failed: 13 - EACCES, forced with sudo gives Can't create /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/cgmt-l5nbg4 (errno: 2 - ENOENT).
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
The memory controller is enabled, but specifically the memory.peak interface file is missing:
$ ls -l /sys/fs/cgroup/system.slice/memory.*
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.current
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.events
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.events.local
.rw-r--r-- 0 root 12 Dec 22:49 /sys/fs/cgroup/system.slice/memory.high
.rw-r--r-- 0 root 12 Dec 22:49 /sys/fs/cgroup/system.slice/memory.low
.rw-r--r-- 0 root 12 Dec 22:49 /sys/fs/cgroup/system.slice/memory.max
.rw-r--r-- 0 root 12 Dec 22:49 /sys/fs/cgroup/system.slice/memory.min
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.numa_stat
.rw-r--r-- 0 root 12 Dec 22:49 /sys/fs/cgroup/system.slice/memory.oom.group
.rw-r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.pressure
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.stat
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.swap.current
.r--r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.swap.events
.rw-r--r-- 0 root 12 Dec 23:03 /sys/fs/cgroup/system.slice/memory.swap.high
.rw-r--r-- 0 root 12 Dec 22:49 /sys/fs/cgroup/system.slice/memory.swap.max
$
$ find /sys/fs/cgroup -name memory.peak
$
Kernel 5.15.0-56-generic, glibc 2.35 — you'd deduced correctly, this is pre glibc 2.36 indeed
55a0520 does not compile, signal-related definitions missing:
cc -g -Og -Wall -Wextra -Wno-missing-field-initializers -Wno-missing-braces -Wmissing-prototypes -Wfloat-equal -Wwrite-strings -Wpointer-arith -Wcast-align -Wnull-dereference -Werror=multichar -Werror=sizeof-pointer-memaccess -Werror=return-type -fstrict-aliasing cgmemtime.c -o cgmemtime
cgmemtime.c:36:23: error: unknown type name ‘idtype_t’
36 | static int raw_waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options,
| ^~~~~~~~
cgmemtime.c:36:49: error: unknown type name ‘siginfo_t’
36 | static int raw_waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options,
| ^~~~~~~~~
cgmemtime.c: In function ‘execute’:
cgmemtime.c:350:24: error: ‘SIGCHLD’ undeclared (first use in this function)
350 | .exit_signal = SIGCHLD,
| ^~~~~~~
[…]
I had to include both sys/wait.h and linux/wait.h (in that order) for compile to pass:
#include <sys/syscall.h> // SYS_*, ...
#if __GLIBC__ == 2 && __GLIBC_MINOR__ < 36
+ #include <sys/wait.h> // waitid macros, ...
#include <linux/wait.h> // waitid macros, ...
#else
#include <sys/wait.h> // waitid macros, ...
Test results exactly the same however: ./cgmemtime -t ./testa x 10 gives clone3 failed: 13 - EACCES; with sudo it gives Can't create /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/cgmt-ApAePe (errno: 2 - ENOENT)
And similarly, the memory controller is enabled (this isn't default, I took steps to enable it), but specifically the memory.peak interface file is nowhere to be found:
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory hugetlb pids rdma misc
$
$ find /sys/fs/cgroup/ -name memory.peak
$
Doesn't seem possible to reopen the issue in this repository, I can only comment or open a new one.
Ok, the memory.peak feature was only introduced in Linux kernel version 5.19.
That means that you need to upgrade to 5.19 or later.
An enterprise distribution like RHEL might backport this change to their stabilized kernel, but I haven't looked at that.
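Since backports are possible, probing for the file itself may be more reliable than comparing version numbers. A minimal sketch, assuming the caller already knows the directory of a non-root cgroup with the memory controller enabled (memory.peak does not exist in the root cgroup); has_memory_peak is a hypothetical helper name, not something cgmemtime provides:

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if <cgroup_dir>/memory.peak exists and is readable, 0 otherwise.
 * cgroup_dir is assumed to be a non-root cgroup v2 directory. */
int has_memory_peak(const char *cgroup_dir)
{
    char path[4096];

    assert(cgroup_dir != NULL);
    int n = snprintf(path, sizeof path, "%s/memory.peak", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0;                      /* path did not fit into the buffer */
    return access(path, R_OK) == 0;    /* file missing: kernel lacks the feature */
}
```

On the 5.15 systems reported above this would return 0 for every cgroup directory, which could be turned into a clearer error message than ENOENT.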
I've changed the guard such that linux/wait.h is always included with older glibc versions.
Another thing that concerns me is the access errors.
When I run the following test command under Fedora 37
$ strace -y -o c.log ./cgmemtime -t ./testa x 10
I get the following output:
$ grep '/sys\|clone' c.log
openat(AT_FDCWD</home/juser/program/cgroup>, "/sys/fs/cgroup/cgroup.subtree_control", O_RDONLY) = 3</sys/fs/cgroup/cgroup.subtree_control>
read(3</sys/fs/cgroup/cgroup.subtree_control>, "memory pids\n", 1023) = 12
close(3</sys/fs/cgroup/cgroup.subtree_control>) = 0
mkdir("/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL", 0700) = 0
openat(AT_FDCWD</home/juser/program/cgroup>, "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL", O_RDONLY|O_PATH) = 3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>
mkdirat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "leaf", 0700) = 0
openat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "cgroup.subtree_control", O_WRONLY) = 4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/cgroup.subtree_control>
write(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/cgroup.subtree_control>, "+memory", 7) = 7
close(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/cgroup.subtree_control>) = 0
openat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "leaf", O_RDONLY|O_PATH) = 4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf>
clone3({flags=CLONE_PIDFD|CLONE_VFORK|CLONE_INTO_CGROUP, pidfd=0x7fffcd5b54bc, exit_signal=SIGCHLD, stack=NULL, stack_size=0, cgroup=4} => {pidfd=[5<anon_inode:[pidfd]>]}, 88) = 36110
openat(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf>, "memory.peak", O_RDONLY) = 6</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf/memory.peak>
read(6</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf/memory.peak>, "10768384\n", 20) = 9
close(6</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf/memory.peak>) = 0
close(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf>) = 0
unlinkat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "leaf", AT_REMOVEDIR) = 0
close(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>) = 0
rmdir("/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL") = 0
So clone3() is called such that the child process is directly created into the temporary /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf cgroup (which has file descriptor 4 in that trace and is opened directly before the clone3 call).
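For readers unfamiliar with that call, here is a rough sketch of what such a clone3() invocation can look like. It mirrors the flags visible in the trace but is not the actual cgmemtime code; spawn_into_cgroup is a hypothetical name, and it assumes kernel headers recent enough (5.7 or later) to provide the cgroup field of struct clone_args:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/sched.h>   /* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>        /* SIGCHLD */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>   /* SYS_clone3 */
#include <unistd.h>        /* syscall, execvp, _exit */

/* Starts argv[0] as a child that is placed into the cgroup referred to by
 * cgroup_fd (a directory fd opened with O_RDONLY|O_PATH) from its very first
 * instruction. Returns the child's PID, or -1 with errno set. */
long spawn_into_cgroup(int cgroup_fd, char *const argv[], int *pidfd_out)
{
    int pidfd = -1;
    struct clone_args args;

    assert(argv != NULL && argv[0] != NULL);
    memset(&args, 0, sizeof args);
    args.flags = CLONE_PIDFD | CLONE_VFORK | CLONE_INTO_CGROUP;
    args.pidfd = (uint64_t)(uintptr_t)&pidfd;  /* kernel stores the pidfd here */
    args.exit_signal = SIGCHLD;
    args.cgroup = (uint64_t)cgroup_fd;         /* target cgroup directory fd */

    long pid = syscall(SYS_clone3, &args, sizeof args);
    if (pid == 0) {                            /* child: replace the image */
        execvp(argv[0], argv);
        _exit(127);                            /* exec failed */
    }
    if (pid > 0 && pidfd_out != NULL)
        *pidfd_out = pidfd;
    return pid;                                /* -1 on error, e.g. EACCES */
}
```

The kernel checks the cgroup migration permissions as part of this call, which is why a delegation problem shows up as clone3 failing rather than as a later write error.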
It looks like on your system clone3() fails with EACCES due to some cgroup restriction that isn't present on my system.
Perhaps later kernels don't have this restriction or it might even be a side effect of how systemd sets up the hierarchy in older versions.
So perhaps this is also simply fixed by using Kernel 5.19 or later.
If you have the chance to retest it with a newer kernel and it still doesn't work, please share such a strace log.
The sudo error is caused by my perhaps too simplistic scheme for finding the right sysfs directory for the temporary cgroup:
/sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/cgmt-$RANDOM
That means with sudo the $UID changes to 0 but the cgroup stays the same, i.e. you could check that with something like:
sudo bash -c 'cat /proc/$$/cgroup'
You can work around that by explicitly specifying a temporary directory, e.g.:
sudo ./cgmemtime -c /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/cgmt-$RANDOM ./testa x 10
Perhaps I should change my code such that it simply looks at the current cgroup the cgmemtime process is running under and derives the appropriate prefix (e.g. /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/) from that instead of using the above UID based template.
For example, on Fedora 37, one would run into a similar issue when running cgmemtime from a su - session (when not specifying -c) because su also doesn't change the cgroup.
Hey, thanks for the explanation, makes sense! Also sensible to create the temporary cgroup under the "current" cgroup, without assuming any particular template for its path; that'll fix the sudo issue, and might enable many more uses for the tool (e.g. in containers or whatnot). /proc/self/cgroup seems to provide the path.
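To illustrate the idea, a small sketch of turning one line of /proc/self/cgroup into a directory below the cgroup2 mount. It assumes a pure cgroup v2 setup (a single line starting with "0::") and the usual /sys/fs/cgroup mount point; cgroup_dir_from_line is a made-up helper name, not the tool's API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Turns one cgroup v2 line from /proc/self/cgroup, e.g.
 * "0::/user.slice/user-1000.slice/session-3.scope\n", into the matching
 * directory below /sys/fs/cgroup. Returns 0 on success, -1 otherwise. */
int cgroup_dir_from_line(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL);
    if (strncmp(line, "0::/", 4) != 0)
        return -1;                        /* not a cgroup v2 entry */
    const char *path = line + 3;          /* keep the leading '/' */
    size_t len = strcspn(path, "\n");     /* drop the trailing newline */
    int n = snprintf(out, out_size, "/sys/fs/cgroup%.*s", (int)len, path);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

A caller would read the line with fgets() from /proc/self/cgroup and create the temporary cgroup below the resulting directory (or one of its ancestors, depending on the delegation rules).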
I do have the memory.peak file on kernel 6.0.9, so I'd say the kernel version requirement is worth a runtime check or warning. Somehow I overlooked the 5.19 requirement being mentioned in the readme :pensive:
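Such a runtime check could look roughly like the following sketch. It only compares the major and minor numbers of a uname-style release string, so a distribution kernel with a backported memory.peak would be flagged incorrectly; kernel_at_least is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if a uname-style release string such as "5.15.79-1-lts" is at
 * least major.minor, 0 if it is older, and -1 if it cannot be parsed. */
int kernel_at_least(const char *release, int major, int minor)
{
    int maj = 0, min = 0;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return -1;                 /* unexpected format */
    if (maj != major)
        return maj > major;        /* major number decides */
    return min >= minor;           /* same major: compare minor */
}
```

The release string would typically come from uname(2), i.e. the release member of struct utsname, and the tool could print a warning when kernel_at_least(release, 5, 19) returns 0.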
The strace of EACCES:
openat(AT_FDCWD, "/sys/fs/cgroup/cgroup.subtree_control", O_RDONLY) = 3
read(3, "cpuset cpu io memory pids\n", 1023) = 26
close(3) = 0
mkdir("/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-2OqzzQ", 0700) = 0
openat(AT_FDCWD, "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-2OqzzQ", O_RDONLY|O_PATH) = 3
mkdirat(3, "leaf", 0700) = 0
openat(3, "cgroup.subtree_control", O_WRONLY) = 4
write(4, "+memory", 7) = 7
close(4) = 0
openat(3, "leaf", O_RDONLY|O_PATH) = 4
clone3({flags=CLONE_PIDFD|CLONE_VFORK|CLONE_INTO_CGROUP, pidfd=0x7ffe6f20c13c, exit_signal=SIGCHLD, stack=NULL, stack_size=0, cgroup=4}, 88) = -1 EACCES (Permission denied)
write(2, "clone3 failed: 13 - EACCES\n", 27clone3 failed: 13 - EACCES
) = 27
Compile on Ubuntu fixed :+1: Same EACCES issue there, clone3() throws -1
Ok, perhaps this is triggered by the Cgroup delegation containment rules.
I'm curious what cat /proc/self/cgroup prints in the terminal where you get the EACCES error with cgmemtime.
Can you check that your user has write permissions for cgroup.procs in the 'nearest common ancestor' cgroup?
I'll try to reproduce the issue on an Ubuntu system tomorrow.
Ok, I installed Ubuntu 22.04.1 LTS in a VM and when I connect via ssh the default cgroup is:
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-3.scope
which is similar to what I get when I ssh into a Fedora 36 system:
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-525.scope
and which is different from what I get in a local Gnome Shell terminal session:
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-kitty-3842.scope
Since cgmemtime tries to clone into /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/cgmt-$RANDOM, the nearest common ancestor is /sys/fs/cgroup/user.slice/user-$UID.slice, for which the user doesn't have write permissions on its cgroup.procs file, e.g.:
$ ls /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs -l
-rw-r--r-- 1 root root 0 Dec 14 21:53 /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs
So this explains the EACCES error.
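To make the 'nearest common ancestor' notion concrete, here is a small sketch that computes it for two cgroup paths (the caller's current cgroup and the clone target). It is plain string handling under the assumption that both paths are absolute and normalized; cgroup_common_ancestor is a hypothetical name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the deepest cgroup path that is an ancestor of (or equal to) both
 * a and b into out. Paths are absolute, e.g. "/user.slice/user-1000.slice".
 * Returns 0 on success, -1 if out is too small. */
int cgroup_common_ancestor(const char *a, const char *b,
                           char *out, size_t out_size)
{
    size_t i, last = 0;

    assert(a != NULL && b != NULL && out != NULL);
    for (i = 0;; i++) {
        int a_end = (a[i] == '\0' || a[i] == '/');
        int b_end = (b[i] == '\0' || b[i] == '/');
        if (a_end && b_end)
            last = i;              /* a full shared component ends here */
        if (a[i] == '\0' || b[i] == '\0' || a[i] != b[i])
            break;                 /* end of the shared prefix */
    }
    int n = (last == 0) ? snprintf(out, out_size, "/")   /* only root shared */
                        : snprintf(out, out_size, "%.*s", (int)last, a);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For the ssh case above, the session scope and the cgmt-$RANDOM target only share /user.slice/user-1000.slice, whose cgroup.procs belongs to root.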
The problem is that even /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs isn't writable by user 1000, while the one under user@1000.service (the original target) is, also on Ubuntu, e.g.:
ubuntu@ubuntu:~$ ls /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs -l
-rw-r--r-- 1 root root 0 Dec 14 21:45 /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs
ubuntu@ubuntu:~$ ls /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.procs -l
-rw-r--r-- 1 ubuntu ubuntu 0 Dec 14 21:44 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.procs
I'll check if there is a way around this in such an environment (e.g. when being remotely logged in, via ssh) - i.e. a way that doesn't require root permissions.
One way to make this work (as a normal user), in an ssh session, is to tell systemd to launch cgmemtime in a cgroup under /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service:
ubuntu@ubuntu:~$ systemd-run --user --scope ./cgmemtime uname -a
Running scope as unit: run-r12549b5eeab14723ba8875fac3e7b997.scope
Linux ubuntu 6.0.0-1008-oem #8-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 16 17:31:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
user: 0.001 s
sys: 0.000 s
wall: 0.001 s
child_RSS_high: 2044 KiB
group_mem_high: 276 KiB
This works because:
$ systemd-run --user --scope bash -c 'cat /proc/self/cgroup'
Running scope as unit: run-rdcefefbcb8d1480188e3dee34542d942.scope
0::/user.slice/user-1000.slice/user@1000.service/app.slice/run-rdcefefbcb8d1480188e3dee34542d942.scope
So I changed cgmemtime such that it inspects its cgroup via /proc/self/cgroup and re-execs itself via systemd-run --user ... if it detects that it runs outside of a .../user@$UID.service cgroup.
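The detection step can be pictured roughly as follows. This is only a guess at the logic, not the code that was committed; in_user_service is a hypothetical name and the path layout is the systemd one shown in this thread:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if cgroup_path (the part after "0::" in /proc/self/cgroup) lies
 * at or below /user.slice/user-<uid>.slice/user@<uid>.service, else 0. */
int in_user_service(const char *cgroup_path, unsigned uid)
{
    char prefix[128];

    assert(cgroup_path != NULL);
    int n = snprintf(prefix, sizeof prefix,
                     "/user.slice/user-%u.slice/user@%u.service", uid, uid);
    if (n < 0 || (size_t)n >= sizeof prefix)
        return 0;
    if (strncmp(cgroup_path, prefix, (size_t)n) != 0)
        return 0;                  /* different subtree, e.g. session-3.scope */
    /* Require a component boundary so "user@1000.servicex" does not match. */
    return cgroup_path[n] == '/' || cgroup_path[n] == '\0';
}
```

When this returns 0, as it would for the ssh session scope, re-executing under systemd-run --user --scope moves the process below user@$UID.service first.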
One can disable this auto-magic via -Z.
Now something like cgmemtime uname -a also works inside an ssh session, out-of-the-box.
Note that running something like sudo cgmemtime ... from ssh fails because systemd-run --user can't find dbus. The same goes for running cgmemtime inside a su - session inside an ssh session.
Alternatively, when remotely connected via ssh, one can run cgmemtime like this:
cgmemtime sudo mycommand ...
or from inside a machinectl shell root@.host session.
NB: running cgmemtime from a su - session that was started from a Gnome Shell terminal session does work, because then the process already is in an appropriate cgroup hierarchy.
While at it I added some hints to a few error messages.
Hi! Tried to compile, failed like this:
I had to add
#include <linux/wait.h>
for compile to pass. HTH