Open morfikov opened 1 month ago
@morfikov Thanks for reporting the issue, does adding the following help?
*:sshd cpu,io,memory,pids,rdma,misc morfikownia/user/ssh/
No, why should it? I'm using ssh from my client machine to connect to the remote SSH server. I'm trying to configure cgroups on the client to filter OUTPUT packets. SSHD is a server, and it's not even started on the client.
The ssh was an example, the same thing happens to ping, curl, and other terminal apps.
@morfikov Ah, my bad. I misread it.
@kamalesh-babulal Andrei Borzenkov from the systemd-devel mailing list suggested that the problem may lie in the way nftables checks things related to cgroups:
Not really. nftables checks the socket cgroup, not the process cgroup. The socket may have been created while process was in the old cgroup.
That would explain the weird behavior.
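To compare what the kernel reports for the process with what nftables matched, a small helper along these lines can print the process's cgroup v2 path. This is only a sketch: `cgroup2_path` and the `PROC_ROOT` override are hypothetical names introduced here, and it relies on the unified hierarchy appearing as a `0::/path` line in `/proc/<pid>/cgroup`.

```shell
# Print the cgroup v2 path of a process as recorded in /proc/<pid>/cgroup.
# On the unified (v2) hierarchy the relevant line has the form "0::/path".
# PROC_ROOT is a hypothetical override so the helper can be tried on a copy.
cgroup2_path() {
    cg_pid=$1
    cg_proc_root=${PROC_ROOT:-/proc}
    sed -n 's/^0:://p' "$cg_proc_root/$cg_pid/cgroup"
}
```

For example, `cgroup2_path "$(pgrep -n ssh)"` shows where the newest ssh process currently sits. Note that this is the process cgroup only. If the socket was created before the process was moved, nftables may still see the old cgroup even though this path looks correct.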
Can this be fixed in libcgroup, or should I ask the kernel developers about this issue?
@morfikov AFAIK, it's kernel behavior: the socket is not migrated when its task is migrated to another cgroup. One idea is to use the equivalent of a transient systemd scope, which is reusable and is called a delegated scope:
cgcreate -c -gcpu,io,memory,pids,rdma,misc:morfikownia.slice/user.scope
cgexec -gcpu:morfikownia.slice/user.scope <command>
Use cgexec + ssh instead of ssh, as mentioned above. It will create the sockets in the expected cgroup, so the race window should not be seen, which makes cgrules.conf redundant. Also, change /etc/cgrules.conf for existing pids:
*:sshfs cpu,io,memory,pids,rdma,misc morfikownia.slice/user.scope
*:ssh cpu,io,memory,pids,rdma,misc morfikownia.slice/user.scope/
I do not know why user.scope is not enabling controllers other than CPU. I will try debugging it.
Yes, that worked. I just tested with ssh and ping:
chain OUTPUT {
....
socket cgroupv2 level 1 "morfikownia.slice/" counter jump check-cgroup-morfikownia-user-slice
chain check-cgroup-morfikownia-user-slice {
socket cgroupv2 level 2 "morfikownia.slice/user.scope/" meta l4proto tcp counter accept
socket cgroupv2 level 2 "morfikownia.slice/user.scope/" meta l4proto icmp counter accept
}
...
Also added corresponding entries to the cgrules.conf file and tried to exec ssh and ping via cgexec in a loop a few times:
# nft list chain inet filter check-cgroup-morfikownia-user-slice
table inet filter {
chain check-cgroup-morfikownia-user-slice {
socket cgroupv2 level 2 "morfikownia.slice/user.scope" meta l4proto tcp counter packets 100 bytes 6000 accept
socket cgroupv2 level 2 "morfikownia.slice/user.scope" meta l4proto icmp counter packets 100 bytes 8400 accept
}
}
So now it catches every single time, and there are no drops.
So how to make it work using only the cgrules.conf file?
The second question is, what do these warnings mean?
cgrulesengd[13882]: Warning: cgroup_attach_task_pid failed: 50001
cgrulesengd[13882]: Warning: failed to apply the rule. Error was: 50001
cgrulesengd[13882]: Cgroup change for PID: 15280, UID: 1000, GID: 1000, PROCNAME: /usr/bin/ssh FAILED! (Error Code: 50001)
This error is likely coming from here. I'm guessing that your cgroup.subtree_control file is empty, and thus the above error. libcgroup is (perhaps erroneously) expecting you to have enabled at least one controller.
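A quick way to test that guess is to check whether a given cgroup has any controllers enabled for its children. The helper below is a minimal sketch. The function name is hypothetical, and which directory to check (for example the parent of the rule's destination cgroup) is an assumption.

```shell
# Succeed when the cgroup directory $1 has at least one controller listed
# in cgroup.subtree_control, i.e. enabled for its child cgroups.
controllers_enabled() {
    [ -n "$(cat "$1/cgroup.subtree_control" 2>/dev/null)" ]
}
```

Usage could look like `controllers_enabled /sys/fs/cgroup/morfikownia.slice || echo "no controllers enabled"`.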
So how to make it work using only the cgrules.conf file?
I'm not sure that you can on a cgroup v2 system. systemd owns the entire cgroup hierarchy, so technically they are "right" in this case. I'd guess that libcgroup and the kernel/systemd are in a race condition for the placement of the process in a cgroup. Sometimes you win, sometimes you lose.
If you want to be certain that your process is running in the correct cgroup, you'll likely want to do something like the following:
Having cgrules move a process violates systemd's single writer rule. On a v1 system it was pretty easy to get away with such a solution, but it's much harder to do it safely on a v2 system. (In fact, it may not be possible.)
This error is likely coming from here. I'm guessing that your cgroup.subtree_control file is empty, and thus the above error. libcgroup is (perhaps erroneously) expecting you to have enabled at least one controller.
Yes, that was the case.
Now basically it works only when I use cgexec:
# cgexec ping wp.pl -c 4
Found cgroup option cpuset, count 0
Found cgroup option cpu, count 1
Found cgroup option io, count 2
Found cgroup option memory, count 3
Found cgroup option pids, count 4
Found cgroup option rdma, count 5
Found cgroup option misc, count 6
Found cgroup option cgroup, count 7
Unable to read /var/run/libcgroup/systemd , continuing without systemd default cgroup.
My euid and egid is: 0,0
Not using cached rules for PID 4945.
Parsing configuration file /etc/cgrules.conf.
Added rule * (UID: -2, GID: -2) -> morfikownia/user/iputils/ for controllers: cpu memory pids
Parsing of configuration file complete.
Found matching rule * for PID: 4945, UID: 0, GID: 0
Executing rule * for PID 4945... Will move pid 4945 to cgroup 'morfikownia/user/iputils/'
Adding controller cpu
Adding controller memory
Adding controller pids
cgroup build procs path: /sys/fs/cgroup//morfikownia/user/iputils/cgroup.procs
cgroup build procs path: /sys/fs/cgroup//morfikownia/user/iputils/cgroup.procs
cgroup build procs path: /sys/fs/cgroup//morfikownia/user/iputils/cgroup.procs
OK!
PING wp.pl (212.77.98.9) 56(84) bytes of data.
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=1 ttl=51 time=32.8 ms
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=2 ttl=51 time=31.4 ms
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=3 ttl=51 time=26.5 ms
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=4 ttl=51 time=25.0 ms
--- wp.pl ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 24.971/28.910/32.763/3.262 ms
@morfikov The Linux kernel does not offer much help with tracking sockets and migrating them along with the tasks that opened them. Another bash hack would be to set an alias for ssh, such as alias ssh cgexec....., given that the cgroup is predictable. In the case of a daemon, like the ssh server, it would work without cgexec: when the server is placed in the right cgroup, all its spawned threads are in the same cgroup as sshd. But here it is a one-time command.
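The alias idea could also be written as shell functions, which pass arguments through cleanly. The following is a sketch only: `in_cgroup`, `CG_TARGET` and the `CGEXEC` override are hypothetical names, and the target cgroup is the one used earlier in this thread.

```shell
# Target controller:cgroup pair, taken from the cgexec example in this thread.
CG_TARGET="cpu:morfikownia.slice/user.scope"

# Start a command through cgexec so that it begins life (and creates its
# sockets) inside the target cgroup. CGEXEC may be overridden for a dry run.
in_cgroup() {
    ${CGEXEC:-cgexec} -g "$CG_TARGET" "$@"
}

# Shadow the plain commands. cgexec executes the real binary from PATH,
# so these wrappers do not call themselves recursively.
ssh()  { in_cgroup ssh "$@"; }
ping() { in_cgroup ping "$@"; }
```

Setting `CGEXEC=echo` before calling `ssh example.org` prints the command line that would be run instead of executing it.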
I thought about aliases, but I'll probably get rid of systemd.
I can't figure one thing out. I'm trying to set cgroup path as: /sys/fs/cgroup/${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}/${USERAPP}
The following command works:
# cgcreate -S -c -g cpu,io,memory,pids:${CG_SLICE}/${CG_SCOPE}
The next command would be:
# cgcreate -g cpu,io,memory,pids:${CG_USER_DIR}
But this doesn't work. What worked was:
# cgcreate -g cpu,pids:${CG_USER_DIR}
It looks like, only the two controllers can be used, i.e. cpu and pids, why not all four?
The next command would be:
# cgcreate -g cpu,pids:${CG_USER_DIR}/${USERAPP}
But this one doesn't work, and basically no controllers can be specified.
Am I missing something?
@morfikov The kernel enforces the rule of not enabling a controller if you have a task running in that cgroup. One idea is to create ${CG_USER_DIR} under the .scope, move the idle task created by libcgroup from ${CG_SLICE}/${CG_SCOPE} to ${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}, and then enable the controllers in ${CG_SCOPE}. It can be achieved using the following:
cgcreate -c -g cpu,io,memory,pids:${CG_SLICE}/${CG_SCOPE}
cgcreate -g:${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}
pid=$(cgget -n -v -r cgroup.procs ${CG_SLICE}/${CG_SCOPE})
cgset -r cgroup.procs="$pid" ${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}
cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}
But a word of caution: ensure that one task is always alive (not necessarily running) under ${CG_SLICE}/${CG_SCOPE} or under any child cgroup of the scope. Otherwise ${CG_SCOPE} will get removed. In this case, libcgroup_systemd_idle_thread will also be running under ${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}.
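To keep an eye on that condition, one could count the tasks listed anywhere under the scope. This is a rough sketch with a hypothetical helper name. It simply reads every cgroup.procs file below the given directory.

```shell
# Count the pids listed in all cgroup.procs files at or below directory $1.
# A result of 0 means no task is keeping the scope alive.
count_procs() {
    find "$1" -name cgroup.procs -exec cat {} + 2>/dev/null | grep -c .
}
```

For instance, `count_procs "/sys/fs/cgroup/${CG_SLICE}/${CG_SCOPE}"` should report at least 1 while the idle thread is alive.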
@kamalesh-babulal
Yes, that works fine, but there's one thing: what if I want to have multiple dirs under ${CG_SLICE}/${CG_SCOPE}? For instance:
${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}
${CG_SLICE}/${CG_SCOPE}/${CG_SYS_DIR}
In such a case, the first path works just fine, but for the second path there's no way to add controllers.
@morfikov I am sorry, I do not understand the question. I am assuming you want to enable controllers for all new child cgroups. How about something like below:
# create the slice and scope
cgcreate -c -g cpu,io,memory,pids:${CG_SLICE}/${CG_SCOPE}
# create tmp cgroup
cgcreate -g:${CG_SLICE}/${CG_SCOPE}/_tmp
# move the idle task to tmp cgroup
pid=$(cgget -n -v -r cgroup.procs ${CG_SLICE}/${CG_SCOPE})
cgset -r cgroup.procs="$pid" ${CG_SLICE}/${CG_SCOPE}/_tmp
# Enable the cgroup controllers to the scope
cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}
# create user dir
cgcreate -g:${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}
# enable controllers on user dir
cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}/${CG_USER_DIR}
# create sys dir
cgcreate -g:${CG_SLICE}/${CG_SCOPE}/${CG_SYS_DIR}
# enable controllers on sys dir
cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}/${CG_SYS_DIR}
Yes, this is it.
@morfikov Can we close this issue?
I think yes. But there's one thing. Those commands work well when run from a script in a terminal. But when I want to make it work at boot, I get the following error:
cgcreate: can't create cgroup morfikownia.slice/libcgroup.scope: Cgroup operation failed
Error: failed to open the system bus: 2
I'm trying to run my script via the following systemd service:
[Unit]
Description=Control Group configuration service
ConditionDirectoryNotEmpty=/sys/fs/cgroup/
ConditionFileIsExecutable=/opt/skrypty/cgstart
DefaultDependencies=no
Requires=cgrulesengd.service
Before=sysinit.target nftables.service network-pre.target umount.target shutdown.target
After=cgrulesengd.service
Conflicts=umount.target shutdown.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/opt/skrypty/cgstart
OOMScoreAdjust=-800
[Install]
WantedBy=sysinit.target
@morfikov I am not an expert on systemd service files. Does adding After=dbus.service cgrulesengd.service and removing sysinit.target from Before= help?
Before=nftables.service network-pre.target umount.target shutdown.target
After=dbus.service cgrulesengd.service
Yes, that helped:
Requires=cgrulesengd.service dbus.service
After=cgrulesengd.service dbus.service
Before=nftables.service network-pre.target umount.target shutdown.target
@kamalesh-babulal One more question.
To make it work for regular users (members of some group), I need to allow adding pids to the cgroup.procs file. I tried to create the cgroup paths using cgcreate -a root:root -t root:cgroups... but all files are still owned by root:root. So how to make it work?
@morfikov This is a little trickier than on cgroup v1, where changing the permissions or the group ownership of the cgroup's tasks file is sufficient. cgroup v2 enforces a rule that the user writing a pid into the destination cgroup must have write permission on the nearest common ancestor of both the source and destination cgroups. This prevents tasks from being moved from a delegated subtree to a non-delegated subtree. The suggested approach is to have the root user move the user's parent or first task into the delegated subtree, so that all tasks forked by that first task start under the delegated subtree and can be moved freely between its child cgroups.
The "Cgroup delegation containment rules" section of https://man7.org/linux/man-pages/man7/cgroups.7.html outlines the rules for delegation.
that says the user writing pid into destination cgroup should have permission to write on the nearest common ancestor of both source and destination cgroup.
Does this mean that in the case of systemd, the common ancestor would be the root path, i.e. /sys/fs/cgroup/ ? If so there's no way to make it work for regular users?
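The nearest common ancestor can be worked out from the two cgroup paths. The helper below is an illustrative sketch. The function name is hypothetical, and the session path in the usage note is an assumed example of where a login shell might live under systemd.

```shell
# Print the nearest common ancestor of two cgroup paths, both given
# relative to the cgroup2 mount point (e.g. "/a.slice/b.scope").
common_ancestor() {
    ca_a=${1#/}; ca_b=${2#/}; ca_out=""
    while [ -n "$ca_a" ] && [ -n "$ca_b" ]; do
        # Compare the first remaining component of each path.
        ca_x=${ca_a%%/*}; ca_y=${ca_b%%/*}
        [ "$ca_x" = "$ca_y" ] || break
        ca_out="$ca_out/$ca_x"
        # Drop the matched component; a path without "/" is exhausted.
        case $ca_a in */*) ca_a=${ca_a#*/} ;; *) ca_a="" ;; esac
        case $ca_b in */*) ca_b=${ca_b#*/} ;; *) ca_b="" ;; esac
    done
    printf '%s\n' "${ca_out:-/}"
}
```

With a source such as `/user.slice/user-1000.slice/session-2.scope` (assumed) and the destination `/morfikownia.slice/libcgroup.scope/apps-user/iputils`, the helper prints `/`, meaning the writer would need write access to cgroup.procs at the root of the hierarchy. Two cgroups inside the same delegated scope share that scope as their ancestor instead.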
@kamalesh-babulal
I made it work:
# chown root:cgroups /usr/bin/cgexec
# chmod 2750 /usr/bin/cgexec
# chown root:cgroups /sys/fs/cgroup/cgroup.procs
# chown root:cgroups /sys/fs/cgroup/cgroup.threads
# chmod 660 /sys/fs/cgroup/cgroup.procs
# chmod 660 /sys/fs/cgroup/cgroup.threads
# find /sys/fs/cgroup/morfikownia.slice -name cgroup.procs | while read -r f; do chown root:cgroups "$f"; chmod 660 "$f"; done
# find /sys/fs/cgroup/morfikownia.slice -name cgroup.threads | while read -r f; do chown root:cgroups "$f"; chmod 660 "$f"; done
Now it works as a regular user:
$ cgexec ping wp.pl -c 4
Found cgroup option cpuset, count 0
Found cgroup option cpu, count 1
Found cgroup option io, count 2
Found cgroup option memory, count 3
Found cgroup option pids, count 4
Found cgroup option rdma, count 5
Found cgroup option misc, count 6
Found cgroup option cgroup, count 7
Unable to read /var/run/libcgroup/systemd , continuing without systemd default cgroup.
My euid and egid is: 1000,5060
Not using cached rules for PID 11165.
Parsing configuration file /etc/cgrules.conf.
Added rule * (UID: -2, GID: -2) -> morfikownia.slice/libcgroup.scope/apps-user/iputils/ for controllers: cpu cpuset io memory pids
Parsing of configuration file complete.
Found matching rule * for PID: 11165, UID: 1000, GID: 1000
Executing rule * for PID 11165... Will move pid 11165 to cgroup 'morfikownia.slice/libcgroup.scope/apps-user/iputils/'
Adding controller cpu
Adding controller cpuset
Adding controller io
Adding controller memory
Adding controller pids
cgroup build procs path: /sys/fs/cgroup//morfikownia.slice/libcgroup.scope/apps-user/iputils/cgroup.procs
cgroup build procs path: /sys/fs/cgroup//morfikownia.slice/libcgroup.scope/apps-user/iputils/cgroup.procs
cgroup build procs path: /sys/fs/cgroup//morfikownia.slice/libcgroup.scope/apps-user/iputils/cgroup.procs
cgroup build procs path: /sys/fs/cgroup//morfikownia.slice/libcgroup.scope/apps-user/iputils/cgroup.procs
cgroup build procs path: /sys/fs/cgroup//morfikownia.slice/libcgroup.scope/apps-user/iputils/cgroup.procs
OK!
PING wp.pl (212.77.98.9) 56(84) bytes of data.
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=1 ttl=51 time=178 ms
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=2 ttl=51 time=28.1 ms
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=3 ttl=51 time=31.0 ms
64 bytes from www.wp.pl (212.77.98.9): icmp_seq=4 ttl=51 time=50.1 ms
--- wp.pl ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 28.054/71.672/177.514/61.690 ms
@morfikov Super cool workaround! Thanks for sharing it.
I'm trying to filter OUTPUT packets of all internet apps using cgroupsv2 and nftables. This was working fine with cgroupsv1, but systemd wants to remove support for v1, so I had to switch to v2, and it looks like it doesn't work as it should.
Basically everything works well for GUI apps, for instance:
But there's a problem with terminal tools like ping or ssh -- sometimes they work, and sometimes they don't. Take a look at the following example.
The following logs are from 3 attempts when I try to connect to the remote SSH host.
The first try:
It worked:
The second try:
It also worked.
The third try:
But the third attempt didn't work. The packets were dropped in nftables. The question is why? The NFTABLES:cgroup-systemd label indicates that the packets didn't go where they should in nftables:
So they should go to the check-cgroup chain, and in the first two attempts they did, but in the third attempt they went to the check-cgroup-systemd chain, and since there are no rules there for the SSH client, they were dropped. Why does this happen? In the GUI apps, everything works well each time.
When I try to connect to the remote SSH server and the connection is successful, I can see that the pids were added in the right place:
When I try to connect to remote SSH server, and the connection fails, the pids also are added to the right place:
So I can't figure this out -- why does it work sometimes, and sometimes it doesn't?
The second question is, what do these warnings mean?
It looks like it happens all the time on my system, no matter whether it works or not:
So what's going on?