ioi / isolate

Sandbox for securely executing untrusted programs

Migrate to CGroupv2 #78

Closed: seirl closed this issue 4 months ago

seirl commented 5 years ago

When running a container with systemd-nspawn, systemd remounts /sys/fs/cgroup read-only. This prevents isolate from creating its own cgroup inside /sys.

Apparently, this is intended: isolate shouldn't create its own cgroup at the root, but rather in a subgroup of the one provided by systemd: https://lists.freedesktop.org/archives/systemd-devel/2017-November/039736.html

I'm completely unfamiliar with the cgroup/Delegate API of systemd, so I'm not sure what a proper fix should look like. I'll try to investigate, but if anyone already knows what a good fix would be, don't hesitate to tell me :-P

seirl commented 5 years ago

For tracking purposes, dumping more docs:

https://systemd.io/CGROUP_DELEGATION

https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=

seirl commented 5 years ago

Apparently this is a WONTFIX and can only be resolved by moving to cgroups v2: https://lists.freedesktop.org/archives/systemd-devel/2019-May/042558.html

edomora97 commented 2 years ago

I don't know where this issue fits in the roadmap of isolate, but I think it should be prioritized a bit. In fact, Arch Linux disabled cgroups v1 by default, so isolate stopped working there. They can still be enabled manually by adding systemd.unified_cgroup_hierarchy=0 to the kernel parameters (link), but that's not a long-term solution.

Could you update us on it? Thanks!

gollux commented 2 years ago

This is currently the top item on my TODO list.

gollux commented 2 years ago

A rudimentary implementation of the move to cgroup v2 is in the cg2 branch.

First of all, I had to solve integration with systemd. It can delegate some types of cgroups to other managers, but apparently the only way to make the cgroup persistent is to keep a process in it. I therefore wrote a simple daemon called isolate-cg-keeper, which sets up the cgroup and then sleeps forever. See systemd/isolate.service and systemd/isolate.scope for the relevant systemd configuration.
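The heart of the delegation is systemd's Delegate= directive. Roughly (a sketch of the idea, not necessarily the exact shipped unit):

[Unit]
Description=A trivial daemon to keep Isolate's control group hierarchy

[Service]
ExecStart=/usr/local/sbin/isolate-cg-keeper
Delegate=yes
Slice=isolate.slice

With Delegate=yes, systemd hands the service's cgroup subtree over to the keeper and promises not to manage anything below it.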

Isolate's config file now contains the path to a master cgroup, under which cgroups for individual sandboxes will be created. If you are using systemd, this is the cgroup maintained by isolate.service. With other service managers, you have to create it yourself and configure isolate to use it.
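For instance (a sketch, assuming the option keeps the cg_root name known from the v1 config file; the path matches the systemd setup above):

# in isolate's config file
cg_root = /sys/fs/cgroup/isolate.slice/isolate.service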

The good news is that the switch to cgroup v2 simplified isolate a lot.

The bad news is that I failed to find a way to measure maximum memory usage: there is nothing like memory.max_usage_in_bytes in cgroup v2.

The code is still almost untested, has plenty of rough edges, and comes with close to no documentation. However, if you want to get your feet wet, I will be glad for any feedback.

magula commented 1 year ago

The bad news is that I failed to find a way to measure maximum memory usage: there is nothing like memory.max_usage_in_bytes in cgroup v2.

There is now memory.peak, which seems to correspond to memory.max_usage_in_bytes from v1. It was introduced with this commit.
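For example (the box path is illustrative; memory.peak needs a sufficiently recent kernel, since that commit landed in 5.19):

# high-water mark of memory usage, in bytes, for a given cgroup
cat /sys/fs/cgroup/isolate.slice/isolate.service/box-0/memory.peak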

There's no memory.swap.peak to replace memory.memsw.max_usage_in_bytes, though, so I guess there's still no way of correctly measuring memory usage with swap enabled (EDIT, to clarify: apart from the legacy memory.memsw.max_usage_in_bytes accounting).

wil93 commented 1 year ago

I'm testing the cg2 branch of isolate on Ubuntu 22.04. This version of Ubuntu only has cgroups v2 available, see:

[screenshot of mount output showing that only cgroup v2 is available]

The CMS test suite fails with this new version of isolate; it seems that the --init step fails. Do you know if I'm doing something wrong? @gollux

This is the work-in-progress PR over at the CMS repo: cms-dev/cms#1222

2022-12-18 15:46:29,255 - ERROR [Worker,2 88 Worker::execute_job_group] Worker failed: Failed to initialize sandbox.
Traceback (most recent call last):
  File "/home/cmsuser/cms/cms/grading/Sandbox.py", line 1416, in initialize_isolate
    subprocess.check_call(init_cmd)
  File "/usr/local/lib/python3.8/dist-packages/gevent/subprocess.py", line 316, in check_call
    raise CalledProcessError(retcode, cmd) # pylint:disable=undefined-variable
subprocess.CalledProcessError: Command '['isolate', '--cg', '--box-id=31', '--init']' returned non-zero exit status 2.
wil93 commented 1 year ago

OK, I realized I have to install the systemd configuration files 😅 I'm now getting some more interesting errors.

When I try to start the isolate.service I see this:

Dec 18 16:49:22 f8df5033d1b0 systemd[1]: Started A trivial daemon to keep Isolate's control group hierarchy.
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: isolate.service: Main process exited, code=exited, status=1/FAILURE
Dec 18 16:49:22 f8df5033d1b0 isolate-cg-keeper[3780]: Cannot create subgroup /sys/fs/cgroup/isolate.slice/isolate.service/daemon: No such file or directory
Dec 18 16:49:22 f8df5033d1b0 systemd[1]: isolate.service: Failed with result 'exit-code'.

But if I check the status of isolate.slice I see that it was created in a different folder, seemingly related to Docker:

# systemctl status isolate.slice
* isolate.slice - Slice for Isolate's sandboxes
     Loaded: loaded (/etc/systemd/system/isolate.slice; static)
     Active: active since Sun 2022-12-18 16:49:22 UTC; 7s ago
      Tasks: 0
     Memory: 0B
     CGroup: /docker/f8df5033d1b02ee218e750be331e5ebe073c46d37f9a63ca5cf78d1c96c56f5f/isolate.slice

Dec 18 16:49:22 f8df5033d1b0 systemd[1]: Created slice Slice for Isolate's sandboxes.
Dec 18 16:49:22 f8df5033d1b0 isolate-cg-keeper[3780]: Cannot create subgroup /sys/fs/cgroup/isolate.slice/isolate.service/daemon: No such file or directory

I can find the folder under /sys/fs/cgroup/systemd, the full path is: /sys/fs/cgroup/systemd/docker/f8df5033d1b02ee218e750be331e5ebe073c46d37f9a63ca5cf78d1c96c56f5f/isolate.slice/

Maybe this would work without Docker?

gollux commented 1 year ago

Could you please try it without Docker first?

gollux commented 1 year ago

I consider the cgroup v2 code almost ready now.

Among other things, the name of the cgroup is no longer hard-coded in the configuration file. Instead, isolate-cg-keeper finds out which cgroup it was started in and passes the name to isolate via /run/isolate/cgroup. Besides simplifying configuration, this should also help with running Isolate in containers.
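A quick way to check what the keeper recorded:

cat /run/isolate/cgroup
# compare with the cgroup the keeper itself runs in
cat /proc/$(pidof isolate-cg-keeper)/cgroup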

Also, I implemented proper locking of sandboxes, so different users cannot stomp on each other's sandboxes. It also prevents --run --cg if the sandbox was not initialized with --cg and vice versa.

There is some support for having a system-wide daemon which manages access to sandboxes. The daemon itself is not ready yet, but some rudiments can be found in the daemon branch.

I removed the --cg-timing option. We use CG-based timing whenever --cg is active. (This was the default behavior anyway, so I expect nobody was really using the option.)

gollux commented 1 year ago

In the daemon branch, you find my first attempt to create a daemon for managing sandboxes. Local users can connect to the daemon via a UNIX socket and they are given fresh sandboxes for use. This allows isolate to be used by multiple programs running in parallel, possibly belonging to different system users.

You will find a sketch of documentation at the top of daemon.py. Run the daemon as root.
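To poke at it (the socket path here is an assumption for illustration; the actual interface is described at the top of daemon.py):

sudo python3 daemon.py &
# hypothetical socket path -- check daemon.py for the real one
socat - UNIX-CONNECT:/run/isolate/daemon.socket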

I will be glad for any feedback.

BhautikChudasama commented 11 months ago

Hi @gollux, is it compatible with cgroup v2?

gollux commented 11 months ago

The version in the cg2 branch supports only cgroup v2; the version in master supports only v1.

I plan to deprecate v1 and merge cg2 into master.

BhautikChudasama commented 11 months ago

Thanks @gollux for confirmation.

Bhautik0110 commented 11 months ago

Hi @gollux sir, I am running your cg2 branch with judge0. I removed the older isolate and added isolate v2. Below I have attached the log. Do you have any idea why this is happening?

isolate --cg -s -b 32 -M /var/local/lib/isolate/32/metadata.txt --stderr-to-stdout -i /dev/null -t 15.0 -x 0 -w 20.0 -k 128000 -p120 --cg-timing --cg-mem=512000 -f 4096 -E HOME=/tmp -E PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -E LANG -E LANGUAGE -E LC_ALL -E JUDGE0_HOMEPAGE -E JUDGE0_SOURCE_CODE -E JUDGE0_MAINTAINER -E JUDGE0_VERSION -d /etc:noexec --run -- /bin/bash compile > /var/local/lib/isolate/32/compile_output.txt 
Cannot write /sys/fs/cgroup/memory/box-32/tasks: No such file or directory
maxkt commented 8 months ago

Hi @gollux!

First of all, thanks for the great library that has been very useful for us in implementing infrastructure for sandboxed live coding environments.

At the moment we're trying to decide whether to migrate to cgroup v2 or not. Do you think v2 is ready for production now?

Thanks.

gollux commented 8 months ago

I'm already running it in production and I plan to release it soon. The only missing thing is a bit of documentation.

jwd-dev commented 8 months ago

Any update on this?

bajcmartinez commented 7 months ago

Hi, has anyone been able to run this under Docker? I realize I need to start the service, I'm just not sure how to do that.

Here is the simple command I'm trying to run:

# isolate --run --cg python
Cannot open /run/isolate/cgroup: No such file or directory

and without --cg it won't work unless I run Docker in privileged mode:

# isolate --run python
Cannot run proxy, clone failed: Operation not permitted

Also, running the check I get the following:

# isolate-check-environment
Checking for cgroup support for memory ... CAUTION
WARNING: the memory is not present. isolate --cg cannot be used.
Checking for cgroup support for cpuacct ... CAUTION
WARNING: the cpuacct is not present. isolate --cg cannot be used.
Checking for cgroup support for cpuset ... CAUTION
WARNING: the cpuset is not present. isolate --cg cannot be used.
Checking for swap ... FAIL
WARNING: swap is enabled, but swap accounting is not. isolate will not be able to enforce memory limits.
swapoff -a
Checking for CPU frequency scaling ... SKIPPED (not detected)
Checking for Intel frequency boost ... SKIPPED (not detected)
Checking for general frequency boost ... SKIPPED (not detected)
Checking for kernel address space randomisation ... FAIL
WARNING: address space randomisation is enabled.
echo 0 > /proc/sys/kernel/randomize_va_space
Checking for transparent hugepage support ... FAIL
WARNING: transparent hugepages are enabled.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
WARNING: transparent hugepage defrag is enabled.
echo never > /sys/kernel/mm/transparent_hugepage/defrag
WARNING: khugepaged defrag is enabled.
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/defrag

Are those failures and warnings normal? It seems like I'm doing something wrong. I'm using the cg2 branch.


UPDATE:

If I manually run isolate-cg-keeper, this is what happens:

/usr/local/sbin/isolate-cg-keeper
Cannot create subgroup /sys/fs/cgroup//daemon: Read-only file system
# isolate --cg --run python
Control group root  does not exist

Thanks!

gollux commented 7 months ago

isolate-check-environment wasn't updated for the cg2 branch yet. I hope to do it soon -- besides some documentation, it's the only roadblock on the way to merging cg2.

You probably need a privileged container (I'm not sure as I don't use Docker myself).

You certainly need systemctl start isolate.service.
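Roughly (assuming the unit files from the systemd/ directory are installed):

sudo systemctl daemon-reload
sudo systemctl enable --now isolate.service
systemctl status isolate.service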

bajcmartinez commented 7 months ago

@gollux , thanks for the quick response. I'm ignoring the check for now, but --cg won't even work with a privileged container.

Running privileged, I can successfully run the following command:

# isolate --run -- /usr/local/bin/python

however,

# isolate --cg --run -- /usr/local/bin/python

fails with

Control group root does not exist

I think it's related to the service: I can't run the service on Docker, and running the keeper manually throws:

/usr/local/sbin/isolate-cg-keeper
Cannot write to /sys/fs/cgroup//cgroup.subtree_control: Device or resource busy

I'm not sure what that means. I'll keep experimenting and will post an update if I find something useful.

Thanks

gollux commented 7 months ago

You need to have systemd running inside the container.

bajcmartinez commented 7 months ago

Thanks, but that won't work in my environment. I thought there might be a way around it; perhaps there still is, so I'll investigate more. I'm trying to run an app that executes untrusted user code in AWS, and I thought I could spin it up as a microservice in Fargate, but I don't have much control over how Docker spins up. Technically they do support cgroups v2, I just can't run the keeper as a service.

Worst case, I can deploy it to a virtual machine, but that's painful to maintain for a one-man operation hehe.

Is there a reason why you set up the keeper as a separate process rather than making it part of isolate itself?

gollux commented 7 months ago

Is there a reason why you set up the keeper as a separate process rather than making it part of isolate itself?

Isolate needs its own subtree in the cgroup hierarchy. On systems with systemd, we can ask systemd to delegate such a subtree to a service (and there must be a process running in the service to keep the subtree alive ... this is what the keeper process does). If you can obtain a subtree delegation in a different way, you can let Isolate use it by putting the path to the subtree in Isolate's config file.
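A rough sketch of obtaining a subtree by hand, without systemd (assuming a privileged container with cgroup2 mounted read-write at /sys/fs/cgroup; cgroup v2 refuses to enable controllers in a cgroup that still has member processes, which is where the "Device or resource busy" error above comes from):

mkdir /sys/fs/cgroup/init /sys/fs/cgroup/isolate
# move every process out of the top-level cgroup first
for p in $(cat /sys/fs/cgroup/cgroup.procs); do
    echo "$p" > /sys/fs/cgroup/init/cgroup.procs
done
# now controllers can be enabled for children
echo "+cpuset +memory" > /sys/fs/cgroup/cgroup.subtree_control
# ... and Isolate's config can point at /sys/fs/cgroup/isolate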

Emru1 commented 7 months ago

Can't isolate use cgroupfs instead of systemd for cgroup v2?

yahya-abdul-majeed commented 6 months ago

Hi @gollux sir, I am running your cg2 branch with judge0. I removed the older isolate and added isolate v2. Below I have attached the log. Do you have any idea why this is happening?

isolate --cg -s -b 32 -M /var/local/lib/isolate/32/metadata.txt --stderr-to-stdout -i /dev/null -t 15.0 -x 0 -w 20.0 -k 128000 -p120 --cg-timing --cg-mem=512000 -f 4096 -E HOME=/tmp -E PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -E LANG -E LANGUAGE -E LC_ALL -E JUDGE0_HOMEPAGE -E JUDGE0_SOURCE_CODE -E JUDGE0_MAINTAINER -E JUDGE0_VERSION -d /etc:noexec --run -- /bin/bash compile > /var/local/lib/isolate/32/compile_output.txt 
Cannot write /sys/fs/cgroup/memory/box-32/tasks: No such file or directory

Hi, make sure you initialized your sandbox with the --cg flag before running it with --cg.
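For example (box ID taken from the log above; /bin/true is just a placeholder command):

isolate --cg -b 32 --init
isolate --cg -b 32 --run -- /bin/true
isolate --cg -b 32 --cleanup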

gollux commented 4 months ago

Finally merged.