gsauthof / cgmemtime

cgmemtime measures the high-water RSS+CACHE memory usage of a process and its descendant processes.

Compile error: P_PIDFD undeclared #5

Closed. ulidtko closed this issue 1 year ago.

ulidtko commented 1 year ago

Hi! Tried to compile, failed like this:

cgmemtime.c: In function ‘execute’:
cgmemtime.c:340:24: error: ‘P_PIDFD’ undeclared (first use in this function); did you mean ‘P_PID’?
  340 |         if (raw_waitid(P_PIDFD, pid_fd, &info, WEXITED, &usg) == -1) {
      |                        ^~~~~~~
      |                        P_PID

I had to add #include <linux/wait.h> for compile to pass. HTH

ulidtko commented 1 year ago

Further, test fails with:

Can't open memoery.peak: 2 - ENOENT

Edit: on cgroups v2, latest master 643110b

gsauthof commented 1 year ago

Thanks for the report.

I'll check the header.

What system are you trying to run cgmemtime on?

Does /sys/fs/cgroup/cgroup.subtree_control list the memory controller?

gsauthof commented 1 year ago

FWIW, P_PIDFD is available from sys/wait.h since glibc 2.36. Thus, it looks like your system comes with an older version.

Your error message seems to indicate that something in your system's cgroup setup is unexpected - such as the memory controller being disabled.

Feel free to add more details and reopen this issue.

ulidtko commented 1 year ago

Hi again, thanks for responding. Just rechecked latest master on 2 systems.

Arch Linux

Kernel 5.15.79-1-lts, glibc 2.36

55a0520 compiles; ./cgmemtime -t ./testa x 10 gives clone3 failed: 13 - EACCES; forcing it with sudo gives Can't create /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/cgmt-l5nbg4 (errno: 2 - ENOENT).

$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma misc
$
$ cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids

The memory controller is enabled, but specifically the memory.peak interface file is missing:

$ ls -l /sys/fs/cgroup/system.slice/memory.*
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.current
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.events
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.events.local
.rw-r--r-- 0 root 12 Dec 22:49  /sys/fs/cgroup/system.slice/memory.high
.rw-r--r-- 0 root 12 Dec 22:49  /sys/fs/cgroup/system.slice/memory.low
.rw-r--r-- 0 root 12 Dec 22:49  /sys/fs/cgroup/system.slice/memory.max
.rw-r--r-- 0 root 12 Dec 22:49  /sys/fs/cgroup/system.slice/memory.min
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.numa_stat
.rw-r--r-- 0 root 12 Dec 22:49  /sys/fs/cgroup/system.slice/memory.oom.group
.rw-r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.pressure
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.stat
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.swap.current
.r--r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.swap.events
.rw-r--r-- 0 root 12 Dec 23:03  /sys/fs/cgroup/system.slice/memory.swap.high
.rw-r--r-- 0 root 12 Dec 22:49  /sys/fs/cgroup/system.slice/memory.swap.max
$
$ find /sys/fs/cgroup -name memory.peak
$

Ubuntu

Kernel 5.15.0-56-generic, glibc 2.35. You deduced correctly: this is indeed pre-glibc 2.36.

55a0520 does not compile; signal-related definitions are missing:

cc -g -Og -Wall -Wextra -Wno-missing-field-initializers -Wno-missing-braces -Wmissing-prototypes -Wfloat-equal -Wwrite-strings -Wpointer-arith -Wcast-align -Wnull-dereference -Werror=multichar -Werror=sizeof-pointer-memaccess -Werror=return-type -fstrict-aliasing    cgmemtime.c   -o cgmemtime
cgmemtime.c:36:23: error: unknown type name ‘idtype_t’
   36 | static int raw_waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options,
      |                       ^~~~~~~~
cgmemtime.c:36:49: error: unknown type name ‘siginfo_t’
   36 | static int raw_waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options,
      |                                                 ^~~~~~~~~
cgmemtime.c: In function ‘execute’:
cgmemtime.c:350:24: error: ‘SIGCHLD’ undeclared (first use in this function)
  350 |         .exit_signal = SIGCHLD,
      |                        ^~~~~~~
[…]

I had to include both sys/wait.h and linux/wait.h (in that order) for the compile to pass:

 #include <sys/syscall.h>  // SYS_*, ...
 #if __GLIBC__ == 2 && __GLIBC_MINOR__ < 36
+    #include <sys/wait.h>     // waitid macros, ...
     #include <linux/wait.h>   // waitid macros, ...
 #else
     #include <sys/wait.h>     // waitid macros, ...

The test results are exactly the same, however: ./cgmemtime -t ./testa x 10 gives clone3 failed: 13 - EACCES; with sudo it gives Can't create /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/cgmt-ApAePe (errno: 2 - ENOENT).

And similarly, the memory controller is enabled (this isn't the default; I took steps to enable it), but the memory.peak interface file specifically is nowhere to be found:

$ cat /sys/fs/cgroup/cgroup.controllers 
cpuset cpu io memory hugetlb pids rdma misc
$ cat /sys/fs/cgroup/cgroup.subtree_control 
cpuset cpu io memory hugetlb pids rdma misc
$
$ find /sys/fs/cgroup/ -name memory.peak
$

ulidtko commented 1 year ago

It doesn't seem possible to reopen the issue in this repository; I can only comment or open a new one.

gsauthof commented 1 year ago

Ok, the memory.peak feature was only introduced in Linux kernel version 5.19. That means you need to upgrade to 5.19 or later. An enterprise distribution like RHEL might backport this change to their stabilized kernel, but I haven't looked into that.


I've changed the guard such that linux/wait.h is always included with older glibc versions.
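
For reference, the guard now has roughly this shape (a sketch of the intended logic, not a verbatim copy of cgmemtime.c):

#include <sys/syscall.h>   /* SYS_* syscall numbers */
#include <sys/wait.h>      /* waitid types; P_PIDFD with glibc >= 2.36 */
#if defined(__GLIBC__) && __GLIBC__ == 2 && __GLIBC_MINOR__ < 36
#include <linux/wait.h>    /* P_PIDFD on older glibc */
#endif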


Another thing that concerns me are the access errors.

When I run the following test command under Fedora 37

$ strace -y -o c.log ./cgmemtime -t ./testa x 10

I get the following output:

$ grep '/sys\|clone' c.log
openat(AT_FDCWD</home/juser/program/cgroup>, "/sys/fs/cgroup/cgroup.subtree_control", O_RDONLY) = 3</sys/fs/cgroup/cgroup.subtree_control>
read(3</sys/fs/cgroup/cgroup.subtree_control>, "memory pids\n", 1023) = 12
close(3</sys/fs/cgroup/cgroup.subtree_control>) = 0
mkdir("/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL", 0700) = 0
openat(AT_FDCWD</home/juser/program/cgroup>, "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL", O_RDONLY|O_PATH) = 3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>
mkdirat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "leaf", 0700) = 0
openat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "cgroup.subtree_control", O_WRONLY) = 4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/cgroup.subtree_control>
write(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/cgroup.subtree_control>, "+memory", 7) = 7
close(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/cgroup.subtree_control>) = 0
openat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "leaf", O_RDONLY|O_PATH) = 4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf>
clone3({flags=CLONE_PIDFD|CLONE_VFORK|CLONE_INTO_CGROUP, pidfd=0x7fffcd5b54bc, exit_signal=SIGCHLD, stack=NULL, stack_size=0, cgroup=4} => {pidfd=[5<anon_inode:[pidfd]>]}, 88) = 36110
openat(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf>, "memory.peak", O_RDONLY) = 6</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf/memory.peak>
read(6</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf/memory.peak>, "10768384\n", 20) = 9
close(6</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf/memory.peak>) = 0
close(4</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf>) = 0
unlinkat(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>, "leaf", AT_REMOVEDIR) = 0
close(3</sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL>) = 0
rmdir("/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL") = 0

So clone3() is called such that the child process is directly created into the temporary /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-piXzEL/leaf cgroup (which has file descriptor 4 in that trace and is opened directly before the clone3 call).
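
For context, here is a minimal standalone sketch of that clone3() pattern (not cgmemtime's actual code; the cgroup path below is only a placeholder):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>   /* struct clone_args, CLONE_PIDFD, CLONE_INTO_CGROUP */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create a child directly inside an already opened cgroup v2 directory.
 * CLONE_INTO_CGROUP requires Linux >= 5.7. */
static pid_t spawn_into_cgroup(int cgroup_fd, int *pidfd)
{
    struct clone_args args;
    memset(&args, 0, sizeof args);
    args.flags       = CLONE_PIDFD | CLONE_INTO_CGROUP;
    args.pidfd       = (uint64_t)(uintptr_t)pidfd;  /* kernel stores the pidfd here */
    args.exit_signal = SIGCHLD;
    args.cgroup      = (uint64_t)cgroup_fd;         /* fd of the target cgroup dir */
    /* clone3 has no glibc wrapper, so call it via syscall(2). */
    long ret = syscall(SYS_clone3, &args, sizeof args);
    if (ret == -1)
        perror("clone3");    /* EACCES here points at cgroup delegation rules */
    return (pid_t)ret;       /* 0 in the child, child PID in the parent */
}

int main(void)
{
    /* placeholder path: a leaf cgroup the caller has created and may write to */
    int cg = open("/sys/fs/cgroup/.../leaf", O_RDONLY | O_PATH | O_CLOEXEC);
    if (cg == -1) { perror("open"); return 1; }
    int pidfd = -1;
    pid_t pid = spawn_into_cgroup(cg, &pidfd);
    if (pid == 0) { execlp("true", "true", (char *)0); _exit(127); }
    return pid == -1;
}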

It looks like on your system clone3() fails with EACCES due to some cgroup restriction that isn't present on my system. Perhaps later kernels don't have this restriction, or it might even be a side effect of how systemd sets up the hierarchy in older versions.

So perhaps this is also simply fixed by using Kernel 5.19 or later.

If you have the chance to retest it with a newer Kernel and it still doesn't work please share such a strace log.


The sudo error is caused by my perhaps too simplistic scheme for finding the right sysfs directory for the temporary cgroup:

/sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/cgmt-$RANDOM
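
In code that amounts to something like this (a paraphrase of the scheme, not the literal implementation):

#include <stdio.h>
#include <unistd.h>   /* getuid() */

static void make_cgroup_template(char *dir, size_t n)
{
    uid_t uid = getuid();   /* under sudo this becomes 0, but the shell's cgroup is unchanged */
    snprintf(dir, n,
             "/sys/fs/cgroup/user.slice/user-%u.slice/user@%u.service/cgmt-XXXXXX",
             (unsigned)uid, (unsigned)uid);   /* XXXXXX stands for the random suffix */
}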

That means that with sudo the $UID changes to 0 while the cgroup stays the same; you can check that with something like:

sudo bash -c 'cat /proc/$$/cgroup'

You can work around that by explicitly specifying a temporary directory, e.g.:

sudo ./cgmemtime -c /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/cgmt-$RANDOM ./testa x 10

Perhaps I should change my code such that it simply looks at the current cgroup the cgmemtime process is running under and derives the appropriate prefix (e.g. /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/) from that, instead of using the above UID-based template.
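
A sketch of that approach (just the idea, not an actual implementation) could look like this:

#include <stdio.h>
#include <string.h>

/* Derive the sysfs directory of the cgroup this process runs in by reading
 * the unified (v2) entry "0::<path>" from /proc/self/cgroup.
 * Fills buf and returns 0 on success, -1 otherwise. */
static int current_cgroup_dir(char *buf, size_t n)
{
    FILE *f = fopen("/proc/self/cgroup", "r");
    if (!f)
        return -1;
    char line[4096];
    int rc = -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "0::", 3))
            continue;                          /* skip cgroup v1 controller lines */
        line[strcspn(line, "\n")] = 0;         /* drop the trailing newline */
        if ((size_t)snprintf(buf, n, "/sys/fs/cgroup%s", line + 3) < n)
            rc = 0;                            /* path fits -> success */
        break;
    }
    fclose(f);
    return rc;
}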

For example, on Fedora 37, one would run into a similar issue when running cgmemtime from a su - session (when not specifying -c) because su also doesn't change the cgroup.

ulidtko commented 1 year ago

Hey, thanks for the explanation, it makes sense! It also seems sensible to create the temporary cgroup under the "current" cgroup, without assuming any particular template for its path; that would fix the sudo issue and might enable many more uses for the tool (e.g. in containers or whatnot). /proc/self/cgroup seems to provide the path.

I do have the memory.peak file on kernel 6.0.9, so I'd say the kernel version requirement is worth a runtime check or warning. Somehow I overlooked the 5.19 requirement being mentioned in the readme :pensive:
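
Something like this could do, purely as a sketch (the function name and where it would be called are made up):

#include <stdio.h>
#include <unistd.h>   /* faccessat() */

/* After creating the leaf cgroup, check that memory.peak is there and hint
 * at the kernel requirement if it isn't. */
static int check_memory_peak(int leaf_dir_fd)
{
    if (faccessat(leaf_dir_fd, "memory.peak", R_OK, 0) == 0)
        return 0;
    fprintf(stderr, "memory.peak not available in this cgroup - "
                    "it requires Linux >= 5.19\n");
    return -1;
}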


The strace of EACCES:

openat(AT_FDCWD, "/sys/fs/cgroup/cgroup.subtree_control", O_RDONLY) = 3
read(3, "cpuset cpu io memory pids\n", 1023) = 26
close(3)                                = 0
mkdir("/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-2OqzzQ", 0700) = 0
openat(AT_FDCWD, "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgmt-2OqzzQ", O_RDONLY|O_PATH) = 3
mkdirat(3, "leaf", 0700)                = 0
openat(3, "cgroup.subtree_control", O_WRONLY) = 4
write(4, "+memory", 7)                  = 7
close(4)                                = 0
openat(3, "leaf", O_RDONLY|O_PATH)      = 4
clone3({flags=CLONE_PIDFD|CLONE_VFORK|CLONE_INTO_CGROUP, pidfd=0x7ffe6f20c13c, exit_signal=SIGCHLD, stack=NULL, stack_size=0, cgroup=4}, 88) = -1 EACCES (Permission denied)
write(2, "clone3 failed: 13 - EACCES\n", 27clone3 failed: 13 - EACCES
) = 27

Compile on Ubuntu fixed :+1: Same EACCES issue there; clone3() returns -1.

gsauthof commented 1 year ago

Ok, perhaps this is triggered by the cgroup delegation containment rules.

I'm curious what cat /proc/self/cgroup prints in the terminal where you get the EACCES error with cgmemtime.

Can you check that your user has write permissions for cgroup.procs in the 'nearest common ancestor' cgroup?

I'll try to reproduce the issue on a Ubuntu system, tomorrow.
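
To check that quickly, a throwaway helper like the following (just a sketch, not part of cgmemtime) prints whether cgroup.procs is writable in each ancestor directory:

#include <stdio.h>
#include <string.h>
#include <unistd.h>    /* access() */

/* Walk up from a cgroup directory and report, for each ancestor, whether
 * cgroup.procs is writable by the current user. */
int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <directory under /sys/fs/cgroup>\n", argv[0]);
        return 2;
    }
    char path[4096];
    snprintf(path, sizeof path, "%s", argv[1]);
    while (strncmp(path, "/sys/fs/cgroup", strlen("/sys/fs/cgroup")) == 0) {
        char probe[4200];
        snprintf(probe, sizeof probe, "%s/cgroup.procs", path);
        printf("%-12s %s\n",
               access(probe, W_OK) == 0 ? "writable" : "not-writable", probe);
        if (strcmp(path, "/sys/fs/cgroup") == 0)
            break;
        char *slash = strrchr(path, '/');
        if (!slash || slash == path)
            break;
        *slash = '\0';                  /* strip the last path component */
    }
    return 0;
}

You'd point it at the path from /proc/self/cgroup, prefixed with /sys/fs/cgroup.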

gsauthof commented 1 year ago

Ok, I installed Ubuntu 22.04.1 LTS in a VM and when I connect via ssh the default cgroup is:

$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/session-3.scope

which is similar to what I get when I ssh into a Fedora 36 system:

$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/session-525.scope

and which is different from what I get in a local Gnome Shell terminal session:

$ cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/user@1000.service/app.slice/app-gnome-kitty-3842.scope

Since cgmemtime tries to clone into

/sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/cgmt-$RANDOM

the nearest common ancestor is /sys/fs/cgroup/user.slice/user-$UID.slice, for which the user doesn't have write permissions on its cgroup.procs file, e.g.:

$ ls /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs -l
-rw-r--r-- 1 root root 0 Dec 14 21:53 /sys/fs/cgroup/user.slice/user-1000.slice/cgroup.procs

So this explains the EACCES error.


The problem is that even /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs isn't writable by user 1000, while the user@1000.service one is - also on Ubuntu, e.g.:

ubuntu@ubuntu:~$ ls /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs -l
-rw-r--r-- 1 root root 0 Dec 14 21:45 /sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/cgroup.procs
ubuntu@ubuntu:~$ ls /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.procs  -l
-rw-r--r-- 1 ubuntu ubuntu 0 Dec 14 21:44 /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/cgroup.procs

I'll check if there is a way around this in such an environment (e.g. when being remotely logged in, via ssh) - i.e. a way that doesn't require root permissions.

gsauthof commented 1 year ago

One way to make this work (as a normal user), in an ssh session, is to tell systemd to launch cgmemtime in a cgroup under /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service:

ubuntu@ubuntu:~$ systemd-run --user --scope ./cgmemtime uname -a
Running scope as unit: run-r12549b5eeab14723ba8875fac3e7b997.scope
Linux ubuntu 6.0.0-1008-oem #8-Ubuntu SMP PREEMPT_DYNAMIC Wed Nov 16 17:31:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

user:   0.001 s
sys:    0.000 s
wall:   0.001 s
child_RSS_high:       2044 KiB
group_mem_high:        276 KiB

This works because:

$ systemd-run --user --scope bash -c 'cat /proc/self/cgroup'
Running scope as unit: run-rdcefefbcb8d1480188e3dee34542d942.scope
0::/user.slice/user-1000.slice/user@1000.service/app.slice/run-rdcefefbcb8d1480188e3dee34542d942.scope
gsauthof commented 1 year ago

So I changed cgmemtime such that it inspects its cgroup via /proc/self/cgroup and re-execs itself via systemd-run --user ... if it detects that it runs outside of a .../user@$UID.service cgroup.

One can disable this auto-magic via -Z.

Now something like cgmemtime uname -a also works inside an ssh session, out-of-the-box.
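
The shape of that change is roughly the following (a simplified sketch, not the committed code; the real version also has to honor -Z and do more error handling):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* If the process doesn't run below a .../user@$UID.service cgroup, re-exec it
 * under systemd-run --user --scope so that the temporary cgroup can be created
 * in a delegated part of the hierarchy. */
static void maybe_reexec_via_systemd_run(int argc, char **argv)
{
    char buf[4096];
    FILE *f = fopen("/proc/self/cgroup", "r");
    if (!f || !fgets(buf, sizeof buf, f)) {   /* cgroup v2: a single "0::<path>" line */
        if (f) fclose(f);
        return;                               /* can't tell; continue as before */
    }
    fclose(f);
    if (strstr(buf, "/user@"))
        return;                               /* already in a delegated user cgroup */
    /* Build: systemd-run --user --scope <original argv...> */
    char **nargv = calloc(argc + 4, sizeof *nargv);
    if (!nargv)
        return;
    int i = 0;
    nargv[i++] = "systemd-run";
    nargv[i++] = "--user";
    nargv[i++] = "--scope";
    for (int j = 0; j < argc; ++j)
        nargv[i++] = argv[j];
    nargv[i] = NULL;
    execvp("systemd-run", nargv);             /* only returns on failure */
    free(nargv);
}

Called at the top of main(), before any cgroup setup, the strstr() check also ensures the re-exec'ed instance doesn't loop, since it then runs below user@$UID.service.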

Note that running something like sudo cgmemtime ... from ssh fails because systemd-run --user can't find dbus. The same goes for running cgmemtime inside a su - session inside an ssh session.

Alternatively, when being remotely connected via ssh, one can run cgmemtime like this:

NB: running cgmemtime from a su - session that was started from a Gnome Shell terminal session does work because then the process already is in an appropriate cgroup hierarchy.

While at it, I added some hints to a few error messages.