opencontainers / runtime-spec

OCI Runtime Specification
http://www.opencontainers.org
Apache License 2.0

Default “defaultErrnoRet” breaks the ability for hosts to run more recent container images. #1266

Open ghadi-rahme opened 1 month ago

ghadi-rahme commented 1 month ago

Currently the spec defines the default defaultErrnoRet to be EPERM. This is troublesome and causes issues when running newer containers on hosts with an older kernel/userspace, where the libseccomp version might not be aware of some of the syscalls used by the container. There have been many reports of unavailable syscalls returning EPERM instead of ENOSYS and breaking the user-space inside the container. Below is an example of such reports:

runc does have some hacky code to try to figure out whether a syscall is supported, but the method is not always reliable, and at Canonical we have seen reports of Ubuntu Noble containers breaking under Ubuntu Jammy hosts on ARM as well as PPC. This issue currently affects all fixed release distros, which also happen to be the most popular distros for running containers. For these distros, updating libseccomp for every new syscall poses an unnecessary risk of regression and defeats the whole purpose of a fixed release distro.

I understand that changing the current defaultErrnoRet to ENOSYS may also cause regressions, however it also needs to be acknowledged that the EPERM default value was an oversight and that the OCI spec is fundamentally not compatible with fixed release distros which are the most popular distros for running containers. It also violates one of the most fundamental rules/expectations of containers which is to be able to run any version of a user-space whether it is older or newer than the version on that host.

Having said that, I believe there is a way to satisfy both camps. A list of the currently available syscalls (up to kernel 6.10 as of the writing of this post) can be compiled and be manually set as EPERM for those who were relying on the default EPERM return value while having the others unchanged in the seccomp profiles. This means that when changing defaultErrnoRet to be ENOSYS all previously available syscalls will still return EPERM while newer added ones or even older ones that are defined in the seccomp profile but not known by the libseccomp package on the host will return the correct ENOSYS. As an example, this can be expressed as the following in the runtime spec:

This means that anyone currently using the spec will see no change to their containers, since they all use syscalls from Linux 6.10 and below. It also means that newer containers using post-6.10 syscalls will get the expected ENOSYS error, limiting the issue.
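To make the proposal concrete, here is a minimal sketch of what such a profile could look like. It is not the example from the original post. It assumes the runtime-spec seccomp fields `defaultAction`, `defaultErrnoRet`, `syscalls[].names`, `syscalls[].action` and `syscalls[].errnoRet`; the `KNOWN_SYSCALLS_6_10` list and the `build_profile` helper are hypothetical, and the list is heavily truncated for illustration.

```python
import errno
import json

# Hypothetical, heavily truncated stand-in for "every syscall that exists
# as of Linux 6.10". A real list would contain hundreds of names.
KNOWN_SYSCALLS_6_10 = ["read", "write", "openat", "clone", "unshare", "mount"]


def build_profile(allowed, known=KNOWN_SYSCALLS_6_10):
    """Build an OCI-style seccomp section (hypothetical helper).

    Syscalls in `allowed` are permitted. Syscalls that are known but not
    allowed get an explicit EPERM rule. Everything else falls through to
    the ENOSYS default.
    """
    allowed_set = set(allowed)
    denied_known = [name for name in known if name not in allowed_set]
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "defaultErrnoRet": errno.ENOSYS,
        "syscalls": [
            {"names": sorted(allowed_set), "action": "SCMP_ACT_ALLOW"},
            {
                "names": denied_known,
                "action": "SCMP_ACT_ERRNO",
                "errnoRet": errno.EPERM,
            },
        ],
    }


profile = build_profile(["read", "write", "openat"])
print(json.dumps(profile, indent=2))
```

This sketch only covers syscalls that are denied unconditionally. Syscalls that are allowed for some argument values and denied for others are not handled here.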

Also the spec should define the behavior to follow if the syscall name is not known to the host. I believe the spec should explicitly define ENOSYS for such syscalls, and I am planning on working on a kernel driver that would expose to user-space the list of supported syscalls by the kernel, making it easier to determine the return value of each syscall.

cyphar commented 1 month ago

tl;dr: It would be nice to fix this in the spec, however solving this is more complicated than you expect and libseccomp is missing necessary features (not to mention this should be a problem solved by libseccomp itself IMHO). The simplest solution to the immediate problem is to fix Docker's profile so that it uses -ENOSYS (like Podman does).


We discussed switching to -ENOSYS in general and just doing -EPERM for a limited subset of syscalls in a specific kernel version before (my proposal was Linux 3.0, but I suspect now that it's been a few years it might make sense to pick Linux 4.0 or 5.0 as a baseline).

The roadblock we ran into is that libseccomp didn't provide enough facilities to make this work seamlessly. https://github.com/seccomp/libseccomp/issues/11 and https://github.com/seccomp/libseccomp/issues/286 are the upstream issues where this topic was discussed. In short, there are a few key issues (these are also outlined in the runc commit that added the patchbpf code):

  1. libseccomp has no mechanism to create a rule like if nr > X && no_other_rules_matched { return -ENOSYS }. This means that there is no trivial way to create a filter with libseccomp that returns -EPERM by default for syscalls that were available in kernel version X, and -ENOSYS otherwise.

  2. You might think that we could then just create a rule for every syscall that was not mentioned in the filter. This works for simple syscalls where the filter allows the syscall for all arguments, but for syscalls where only some arguments are permitted and otherwise they are blocked (such as clone, unshare, or the fairly complicated socket rules Podman has), you would need to construct inverse rules to return -EPERM (this is true for all syscalls regardless of age -- it wouldn't make sense to return -ENOSYS for a syscall that you permit some flags for). Unfortunately libseccomp does not allow this in general:

    • libseccomp lacks the ability to do an inverse mask check. See https://github.com/seccomp/libseccomp/issues/310. This is needed for very common seccomp rules. You can work around this by generating an exponential number of rules to cover all possible cases, but that won't work in general.
    • libseccomp does not have the ability to create the complicated boolean expressions you need to create an inverse rule in general. I can't remember which exact boolean expression was the issue, but from memory you couldn't create an inverse rule for a rule that has multiple checks for the same syscall (so you couldn't take a rule like (arg0 > FOO && arg1 > BAR) || (arg0 < BAZ && arg2 > ABCD) which can be constructed by libseccomp and construct its inverse (arg0 <= FOO || arg1 <= BAR) && (arg0 >= BAZ || arg2 <= ABCD)).
    • I'm a little fuzzy on the exact details, but from memory there was also an issue where libseccomp would consolidate rules in a way that would break inverted rules. I can't quite remember the details though, this might just be part of the previous issue with inverse rules.
  3. (Minor) libseccomp used to not provide information about the set of syscalls available for a given kernel version. They have now added this information in csv files (in response to this issue) but there is still no API to get this information. We could just copy these csvs into every runtime project, but it would be nice if there was an API for this.
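As a sanity check on the inverse-rule example in point 2, the following sketch brute-forces a small argument range to confirm that the stated inverse really is the negation of the original rule, and then expands the inverse into an OR-of-ANDs form. The threshold values for `FOO`, `BAR`, `BAZ` and `ABCD` are arbitrary assumptions, and the function names are hypothetical. Nothing here calls libseccomp, and the thread does not confirm which exact expression shape libseccomp rejects.

```python
from itertools import product

# Arbitrary illustrative thresholds; the thread does not give real values.
FOO, BAR, BAZ, ABCD = 2, 1, 1, 3


def rule(arg0, arg1, arg2):
    # The original "allow" condition: an OR of two AND groups.
    return (arg0 > FOO and arg1 > BAR) or (arg0 < BAZ and arg2 > ABCD)


def inverse(arg0, arg1, arg2):
    # The inverse from the text, obtained with De Morgan's laws.
    return (arg0 <= FOO or arg1 <= BAR) and (arg0 >= BAZ or arg2 <= ABCD)


def inverse_dnf(arg0, arg1, arg2):
    # The same inverse distributed into an OR of AND groups.
    # Note that the first group compares arg0 twice.
    return (
        (arg0 <= FOO and arg0 >= BAZ)
        or (arg0 <= FOO and arg2 <= ABCD)
        or (arg1 <= BAR and arg0 >= BAZ)
        or (arg1 <= BAR and arg2 <= ABCD)
    )


samples = list(product(range(6), repeat=3))

# The inverse must disagree with the rule on every input.
mismatches = [args for args in samples if inverse(*args) == rule(*args)]

# The expanded form must agree with the compact inverse on every input.
dnf_agrees = all(inverse_dnf(*args) == inverse(*args) for args in samples)

print("inputs where inverse fails to negate rule:", len(mismatches))
print("expanded form matches inverse:", dnf_agrees)
```

Two AND groups in the original rule already become four in the expanded inverse, which gives a feel for how quickly hand-built inverse rules grow.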

> runc does have some hacky code in order to try and figure out if a syscall is supported or not but the method is not always reliable and we have seen at Canonical reports of Ubuntu Noble containers breaking under Ubuntu Jammy hosts on ARM as well as PPC.

When working on the "hacky" runc code to work around these problems, I tried many approaches and rewrote the code several times. I tried to implement a minimum-kernel-version setup several times (including coming up with very complicated mechanisms to generate inverse rules) and came to the conclusion that it is not possible to implement this with libseccomp currently, and that this needs to be implemented by libseccomp itself. I also tried to come up with a different patching system that just patched return statements, but that also didn't work in general and required modifying too much of the libseccomp filter (which could lead to bugs as well).

The best solution we have at the moment is for runtimes to use -ENOSYS for defaultErrnoRet so that we don't need to do these awful workarounds. Docker was supposed to switch to this but it seems that never happened after this patch by @rata. However, podman/cri-o do use -ENOSYS.

Regarding the ARM and PPC issues you mentioned -- we did fix a bug related to ppc64le recently that might also have affected ARM. Please verify whether that patch fixes the issue you mentioned, and if not please submit a bug report to runc directly.

While I don't like that code, having a maximum kernel version would require very similar code unless we add support for all of this to libseccomp directly. We implemented it in runc without pushing it to the runtime-spec because the libseccomp folks said they were working on it, so we opted to wait for libseccomp to be ready before defining the right behaviour in the spec.

> I am planning on working on a kernel driver that would expose to user-space the list of supported syscalls by the kernel, making it easier to determine the return value of each syscall.

In theory you can already get this information from BTF if you really want to. However, you don't actually need the current set of syscalls on the running kernel to solve this problem, you need historical data so you can set a minimum kernel version.

What we need is for libseccomp to have the ability to specify a minimum kernel version which will cause libseccomp to replace its final generic return defaultErrnoRet rule with if !syscalls_for_version_X.contains(nr) { return -ENOSYS } else { return defaultErrnoRet }. This way you would be able to use defaultErrnoRet = EPERM to create complicated deny rules while also getting -ENOSYS for new syscalls.
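The fallback described above can be modelled in a few lines. This is only a simulation of the decision the filter's final rule would make, not libseccomp code, and libseccomp does not currently offer such an option. `SYSCALLS_FOR_VERSION_X` and `final_return` are hypothetical names, and the syscall-number range is made up for illustration.

```python
import errno

# Hypothetical set of syscall numbers that existed in the baseline kernel.
# Real tables are per-architecture and are not a contiguous range.
SYSCALLS_FOR_VERSION_X = frozenset(range(0, 300))


def final_return(nr, default_errno_ret=errno.EPERM,
                 known=SYSCALLS_FOR_VERSION_X):
    """Model the filter's last rule, reached when no earlier rule matched.

    Syscalls newer than the baseline kernel get -ENOSYS. Everything else
    gets the profile's configured default errno.
    """
    if nr not in known:
        return -errno.ENOSYS
    return -default_errno_ret


# A syscall that existed in the baseline kernel keeps the EPERM default.
old_syscall_result = final_return(10)

# A syscall added after the baseline kernel reports "not implemented".
new_syscall_result = final_return(450)

print(old_syscall_result, new_syscall_result)
```

With this shape, the deny rules for argument-filtered syscalls can keep relying on the EPERM default, and only syscall numbers outside the baseline set change behaviour.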