Closed seirl closed 4 months ago
For tracking purposes, dumping more docs:
https://systemd.io/CGROUP_DELEGATION
https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=
Apparently this is a WONTFIX and can only be resolved by moving to cgroupsv2: https://lists.freedesktop.org/archives/systemd-devel/2019-May/042558.html
I don't know where this issue fits in the roadmap of isolate, but I think it should be prioritized a bit. In fact, Arch Linux disabled cgroups v1 by default, so isolate stopped working there. They can still be enabled manually by adding systemd.unified_cgroup_hierarchy=0 to the kernel parameters (link), but that's not a long-term solution.
May you update us on it? Thanks!
This is currently the top item on my TODO list.
A rudimentary implementation of the move to cgroup v2 is in the cg2 branch.
First of all, I had to solve integration with systemd. It can delegate some types of cgroups to other managers, but apparently the only way to make the delegated cgroup persistent is to keep a process in it. I therefore wrote a simple daemon called isolate-cg-keeper, which sets up the cgroup and then sleeps forever. See systemd/isolate.service and systemd/isolate.scope for the relevant systemd configuration.
Isolate's config file now contains the path to a master cgroup, under which cgroups for individual sandboxes will be created. If you are using systemd, this is the cgroup maintained by isolate.service. With other service managers, you have to create it yourself and configure isolate to use it.
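On the systemd side, the delegation setup described above boils down to a service unit that uses Delegate=. Below is a hedged sketch only, with illustrative field values; the authoritative files are systemd/isolate.service and systemd/isolate.scope in the repository.

```ini
# Illustrative sketch, not the real unit file from the repository.
# Delegate= asks systemd to hand a cgroup subtree to the service;
# the keeper process keeps that subtree alive.
[Unit]
Description=A trivial daemon to keep Isolate's control group hierarchy

[Service]
ExecStart=/usr/local/sbin/isolate-cg-keeper
Delegate=yes
Slice=isolate.slice

[Install]
WantedBy=multi-user.target
```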
The good news is that the switch to cgroup v2 simplified isolate a lot.
The bad news is that I failed to find a way to measure maximum memory usage: there is nothing like memory.max_usage_in_bytes in cgroup v2.
The code is still almost untested, has plenty of rough edges, and close to no documentation. However, if you want to get your feet wet, I will be glad for any feedback.
The bad news is that I failed to find a way to measure maximum memory usage: there is nothing like memory.max_usage_in_bytes in cgroup v2.
There is now memory.peak, which seems to correspond to memory.max_usage_in_bytes from v1. It was introduced with this commit.
There's no memory.swap.peak to replace memory.memsw.max_usage_in_bytes, though, so I guess there's still no way of correctly measuring memory usage with swap enabled (EDIT, to clarify: apart from the legacy memory.memsw.max_usage_in_bytes accounting).
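For anyone wiring this up themselves, reading memory.peak is a one-line parse. The sketch below is an illustration, not Isolate's actual code; it runs against a stand-in directory, since the real file lives under the sandbox's cgroup somewhere below /sys/fs/cgroup.

```python
import os
import tempfile

def read_peak_memory(cgroup_dir: str) -> int:
    """Return peak memory usage in bytes from a cgroup v2 directory.

    Reads memory.peak, which plays the role that
    memory.max_usage_in_bytes played in cgroup v1.
    """
    with open(os.path.join(cgroup_dir, "memory.peak")) as f:
        return int(f.read().strip())

# Demonstration against a stand-in directory; a real caller would
# point at the sandbox's cgroup directory instead.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "memory.peak"), "w") as f:
        f.write("1048576\n")
    print(read_peak_memory(d))  # → 1048576
```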
I'm testing the cg2 branch of isolate on Ubuntu 22.04. This version of Ubuntu only has cgroups v2 available, see:
The CMS test suite fails with this new version of isolate; it seems that the --init step fails. Do you know if I'm doing something wrong? @gollux
This is the work-in-progress PR over at the CMS repo: cms-dev/cms#1222
2022-12-18 15:46:29,255 - ERROR [Worker,2 88 Worker::execute_job_group] Worker failed: Failed to initialize sandbox.
Traceback (most recent call last):
File "/home/cmsuser/cms/cms/grading/Sandbox.py", line 1416, in initialize_isolate
subprocess.check_call(init_cmd)
File "/usr/local/lib/python3.8/dist-packages/gevent/subprocess.py", line 316, in check_call
raise CalledProcessError(retcode, cmd) # pylint:disable=undefined-variable
subprocess.CalledProcessError: Command '['isolate', '--cg', '--box-id=31', '--init']' returned non-zero exit status 2.
OK, I realized I have to install the systemd configuration files 😅 I'm now getting some more interesting errors.
When I try to start isolate.service I see this:
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: Started A trivial daemon to keep Isolate's control group hierarchy.
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: isolate.service: Main process exited, code=exited, status=1/FAILURE
Dec 18 16:49:22 f8df5033d1b0 isolate-cg-keeper[3780]: Cannot create subgroup /sys/fs/cgroup/isolate.slice/isolate.service/daemon: No such file or directory
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: isolate.service: Failed with result 'exit-code'.
But if I check the status of isolate.slice I see that it was created in a different folder, seemingly related to Docker:
# systemctl status isolate.slice
* isolate.slice - Slice for Isolate's sandboxes
Loaded: loaded (/etc/systemd/system/isolate.slice; static)
Active: active since Sun 2022-12-18 16:49:22 UTC; 7s ago
Tasks: 0
Memory: 0B
CGroup: /docker/f8df5033d1b02ee218e750be331e5ebe073c46d37f9a63ca5cf78d1c96c56f5f/isolate.slice
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: Created slice Slice for Isolate's sandboxes.
Dec 18 16:49:22 f8df5033d1b0 isolate-cg-keeper[3780]: Cannot create subgroup /sys/fs/cgroup/isolate.slice/isolate.service/daemon: No such file or directory
I can find the folder under /sys/fs/cgroup/systemd; the full path is /sys/fs/cgroup/systemd/docker/f8df5033d1b02ee218e750be331e5ebe073c46d37f9a63ca5cf78d1c96c56f5f/isolate.slice/
Maybe this would work without docker?
Could you please try it without Docker first?
I consider the cgroup v2 code almost ready now.
Among other things, the name of the cgroup is no longer hard-coded in the configuration file. Instead, isolate-cg-keeper finds out in which cgroup it was started and passes the name to isolate via /run/isolate/cgroup. Besides simplifying configuration, this should also help with running Isolate in containers.
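For reference, a process can discover its own cgroup by parsing /proc/self/cgroup, which on a pure cgroup v2 system contains a single line of the form 0::/path. The sketch below shows one way to do that parse; it is an illustration, not the keeper's actual code.

```python
def own_cgroup_path(proc_cgroup_text: str) -> str:
    """Extract the cgroup v2 path from /proc/self/cgroup contents.

    On a pure cgroup v2 system the file holds one line, '0::/some/path';
    the part after the second colon names the cgroup relative to the
    mount point (usually /sys/fs/cgroup).
    """
    for line in proc_cgroup_text.splitlines():
        hier_id, controllers, path = line.split(":", 2)
        if hier_id == "0" and controllers == "":
            return path
    raise RuntimeError("no cgroup v2 entry found")

# On a live Linux system you would feed it the real file:
#   with open("/proc/self/cgroup") as f:
#       print(own_cgroup_path(f.read()))
print(own_cgroup_path("0::/isolate.slice/isolate.service\n"))
# → /isolate.slice/isolate.service
```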
Also, I implemented proper locking of sandboxes, so different users cannot stomp on each other's sandboxes. It also prevents --run --cg if the sandbox was not initialized with --cg and vice versa.
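A per-sandbox lock of this kind can be built on flock(2), which refuses a second exclusive lock on the same file even across processes. The sketch below illustrates the general technique only; the lock-file naming is made up and this is not Isolate's actual locking code.

```python
import fcntl
import os
import tempfile

def try_lock_box(lock_dir: str, box_id: int):
    """Try to take an exclusive, non-blocking lock on a per-box file.

    Returns the open file object on success (the lock is held as long
    as the file stays open) or None if someone else holds the lock.
    """
    f = open(os.path.join(lock_dir, f"box-{box_id}.lock"), "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None

with tempfile.TemporaryDirectory() as d:
    first = try_lock_box(d, 0)   # first lock on box 0 succeeds
    second = try_lock_box(d, 0)  # second attempt is refused
    print(first is not None, second is None)  # → True True
    first.close()
```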
There is some support for having a system-wide daemon which manages access to sandboxes. The daemon itself is not ready yet, but some rudiments can be found in the daemon branch.
I removed the --cg-timing option. We use CG-based timing whenever --cg is active. (This was the default behavior anyway, so I expect nobody was really using the option.)
In the daemon branch, you will find my first attempt to create a daemon for managing sandboxes. Local users can connect to the daemon via a UNIX socket and they are given fresh sandboxes to use. This allows isolate to be used by multiple programs running in parallel, possibly belonging to different system users.
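To get a feel for the plumbing involved, here is a minimal AF_UNIX request/response round trip. The "init"/"box-id" exchange is invented purely for illustration; the real protocol is whatever daemon.py implements.

```python
import os
import socket
import tempfile
import threading

def serve_one(srv: socket.socket) -> None:
    """Accept one client; reply to a hypothetical 'init' request."""
    conn, _ = srv.accept()
    if conn.recv(64) == b"init":
        conn.sendall(b"box-id 0")  # pretend we allocated sandbox 0
    conn.close()

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "isolate.sock")
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    t = threading.Thread(target=serve_one, args=(srv,))
    t.start()

    cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    cli.connect(path)
    cli.sendall(b"init")
    print(cli.recv(64).decode())  # → box-id 0
    cli.close()
    t.join()
    srv.close()
```

A real daemon would of course loop over accept(), check client credentials (e.g. via SO_PEERCRED), and hand out distinct box ids.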
You will find a sketch of documentation at the top of daemon.py. Run the daemon as root.
I will be glad for any feedback.
Hi @gollux, is it compatible with cgroup v2?
The version in the cg2 branch supports only cgroup v2; the version in master supports only v1.
I plan to deprecate v1 and merge cg2 into master.
Thanks @gollux for the confirmation.
Hi @gollux,
I am running your cg2 branch with judge0. I removed the older isolate and added isolate v2. Below I have attached the log. Do you have any idea why this is happening?
isolate --cg -s -b 32 -M /var/local/lib/isolate/32/metadata.txt --stderr-to-stdout -i /dev/null -t 15.0 -x 0 -w 20.0 -k 128000 -p120 --cg-timing --cg-mem=512000 -f 4096 -E HOME=/tmp -E PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -E LANG -E LANGUAGE -E LC_ALL -E JUDGE0_HOMEPAGE -E JUDGE0_SOURCE_CODE -E JUDGE0_MAINTAINER -E JUDGE0_VERSION -d /etc:noexec --run -- /bin/bash compile > /var/local/lib/isolate/32/compile_output.txt
Cannot write /sys/fs/cgroup/memory/box-32/tasks: No such file or directory
Hi @gollux!
First of all, thanks for the great library that has been very useful for us in implementing infrastructure for sandboxed live coding environments.
At the moment we're trying to decide whether to migrate to cgroup v2 or not. Do you think v2 is ready for production now?
Thanks.
I'm already running it in production and I plan to release it soon. The only missing thing is a bit of documentation.
Any update on this?
Hi, has anyone been able to run this under Docker? I realize I need to start the service, I'm just not sure how to do that.
Here is the simple command I'm trying to run:
# isolate --run --cg python
Cannot open /run/isolate/cgroup: No such file or directory
and without --cg it won't work unless I'm running Docker with privileges
# isolate --run python
Cannot run proxy, clone failed: Operation not permitted
Also running the check I get the following:
# isolate-check-environment
Checking for cgroup support for memory ... CAUTION
WARNING: the memory is not present. isolate --cg cannot be used.
Checking for cgroup support for cpuacct ... CAUTION
WARNING: the cpuacct is not present. isolate --cg cannot be used.
Checking for cgroup support for cpuset ... CAUTION
WARNING: the cpuset is not present. isolate --cg cannot be used.
Checking for swap ... FAIL
WARNING: swap is enabled, but swap accounting is not. isolate will not be able to enforce memory limits.
swapoff -a
Checking for CPU frequency scaling ... SKIPPED (not detected)
Checking for Intel frequency boost ... SKIPPED (not detected)
Checking for general frequency boost ... SKIPPED (not detected)
Checking for kernel address space randomisation ... FAIL
WARNING: address space randomisation is enabled.
echo 0 > /proc/sys/kernel/randomize_va_space
Checking for transparent hugepage support ... FAIL
WARNING: transparent hugepages are enabled.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
WARNING: transparent hugepage defrag is enabled.
echo never > /sys/kernel/mm/transparent_hugepage/defrag
WARNING: khugepaged defrag is enabled.
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag
Are those failures and warnings normal? It seems like I'm doing something wrong. I'm using the cg2 branch.
If I manually run isolate-cg-keeper, this is what happens:
/usr/local/sbin/isolate-cg-keeper
Cannot create subgroup /sys/fs/cgroup//daemon: Read-only file system
# isolate --cg --run python
Control group root does not exist
Thanks!
isolate-check-environment wasn't updated for the cg2 branch yet. I hope to do it soon -- besides some documentation, it's the only roadblock on the way to merging cg2.
You probably need a privileged container (I'm not sure as I don't use Docker myself).
You certainly need systemctl start isolate.service.
@gollux, thanks for the quick response. I'm ignoring the check for now, but --cg won't even work with a privileged container.
Running with privilege I can run successfully the following command:
# isolate --run -- /usr/local/bin/python
however,
# isolate --cg --run -- /usr/local/bin/python
fails with
Control group root does not exist
I think it's related to the service; I can't run the service in Docker, and running the keeper manually throws:
/usr/local/sbin/isolate-cg-keeper
Cannot write to /sys/fs/cgroup//cgroup.subtree_control: Device or resource busy
Not sure what that means. I'll keep experimenting and will write an update if I find something useful.
Thanks
You need to have systemd running inside the container.
Thanks, that won't work in my environment. I thought there might be a way around it; perhaps there still is, but I'll have to investigate more. I'm trying to run an app that executes untrusted user code in AWS, and I thought I could spin it up as a microservice in Fargate, but I don't have much control over how Docker spins up. Technically they do support cgroups v2, I just can't run the keeper as a service.
Worst case I can deploy it to a virtual machine, but that's painful to maintain for a one-man operation hehe.
Is there a reason why you set up a new process with the keeper and not directly as part of the isolate one?
Is there a reason why you set up a new process with the keeper and not directly as part of the isolate one?
Isolate needs its own subtree in the cgroup hierarchy. On systems with systemd, we can ask systemd to delegate such a subtree to a service (and there must be a process running in the service to keep the subtree alive ... this is what the keeper process does). If you can obtain a subtree delegation in a different way, you can let Isolate use it by putting the path to the subtree in Isolate's config file.
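As an illustration of the "obtain a delegation in a different way" route, the sketch below creates a per-sandbox subgroup under an already-delegated subtree and enables controllers for its children. It runs against a temporary stand-in directory; the file names (cgroup.subtree_control) match cgroup v2, but the paths and controller list are example assumptions.

```python
import os
import tempfile

def make_box_group(delegated_root: str, box_id: int,
                   controllers=("memory", "cpu")) -> str:
    """Create a per-sandbox subgroup under a delegated cgroup subtree.

    delegated_root is assumed to be a cgroup v2 directory handed to us
    by whoever owns the hierarchy (systemd, a container runtime, or a
    manual mount); we only ever write below it.
    """
    # Enable the desired controllers for children of the root.
    with open(os.path.join(delegated_root, "cgroup.subtree_control"), "w") as f:
        f.write(" ".join("+" + c for c in controllers))
    # In real cgroupfs, mkdir creates a new subgroup.
    box = os.path.join(delegated_root, f"box-{box_id}")
    os.mkdir(box)
    return box

# Mock demonstration; a real delegated root would live under
# /sys/fs/cgroup (exact path depends on your setup).
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "cgroup.subtree_control"), "w").close()
    print(os.path.basename(make_box_group(root, 5)))  # → box-5
```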
Can't isolate use cgroupfs instead of systemd for cgroupv2?
Hi @gollux, I am running your cg2 branch with judge0. I removed the older isolate and added isolate v2. Below I have attached the log. Do you have any idea why this is happening?
isolate --cg -s -b 32 -M /var/local/lib/isolate/32/metadata.txt --stderr-to-stdout -i /dev/null -t 15.0 -x 0 -w 20.0 -k 128000 -p120 --cg-timing --cg-mem=512000 -f 4096 -E HOME=/tmp -E PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -E LANG -E LANGUAGE -E LC_ALL -E JUDGE0_HOMEPAGE -E JUDGE0_SOURCE_CODE -E JUDGE0_MAINTAINER -E JUDGE0_VERSION -d /etc:noexec --run -- /bin/bash compile > /var/local/lib/isolate/32/compile_output.txt
Cannot write /sys/fs/cgroup/memory/box-32/tasks: No such file or directory
Hi, make sure you initialized your sandbox with the --cg flag before running it with --cg.
Finally merged.
When running a container with systemd-nspawn, systemd remounts /sys/fs/cgroup read-only. This prevents isolate from creating its own cgroup inside /sys.
Apparently this is intended: isolate shouldn't create its own cgroup in the root, but in a subgroup of the one provided by systemd: https://lists.freedesktop.org/archives/systemd-devel/2017-November/039736.html
I'm completely unfamiliar with the cgroup/Delegate API of systemd, so I'm not sure what a proper fix should look like. I'll try to investigate, but if anyone already knows what a good fix would be, don't hesitate to tell me :-P