ManaSugi commented 3 years ago

OCI Security Context

Summary

- The existing high-level container runtimes (e.g., containerd) offer default Seccomp profiles, which are allowlists of system calls, to make containers secure.
  - However, the runtime default profiles still include many system calls that containers do not actually use, because the profiles drop only potentially dangerous system calls.
- Recently, new system call analysis techniques have been proposed in research papers.
  - Using these techniques, container image developers can generate default profiles for a container image that are more accurate than the runtime default profiles.
- In this issue, we propose defining the new securitycontext media type in the Image Media Types and adding SecurityContext as a new field to the config section of the Image Configuration.
  - The goal of this proposal is to allow users to choose the image default security context, including the default Seccomp profiles, from container orchestration software such as Kubernetes.
  - To use this feature from existing orchestration software, a new setting such as ImageDefault must be added to the orchestrator's configuration as extra work.
- There is no formal definition for backward-compatible changes in this new feature.

Background

Containers offer weaker isolation than virtual machines because all containers running on the same host share the same OS kernel. Therefore, it is important to reduce the kernel attack surface exposed to containers. Secure computing mode (Seccomp) reduces this attack surface by restricting the system calls available to each container. Additionally, Linux capabilities and Mandatory Access Control (MAC) mechanisms such as SELinux and AppArmor provide defense in depth.

The existing high-level container runtimes such as containerd and CRI-O apply their default Seccomp profiles when the user requests them in the Kubernetes configuration as follows.

securityContext:
  seccompProfile:
    type: RuntimeDefault

The default Seccomp profiles are allowlists that drop potentially dangerous system calls such as pivot_root and ptrace. Thanks to the default profiles, users can easily enforce Seccomp on containers without analyzing the system calls that the containers use.

However, the profiles still include many system calls that the containers do not actually use. If users want to deny those system calls, they need to inspect the containers and identify the system calls the containers require using DockerSlim [1] or other dynamic analysis tools [2] [3]. Unfortunately, dynamic analysis tools are not perfect because they cannot catch code paths that are executed rarely, such as error handling routines. To identify system calls correctly, a static analysis strategy is necessary, but correctly inspecting system calls inside containerized applications poses many challenges.
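To make the limitation of dynamic analysis concrete, here is a minimal sketch that collects system call names from strace-style output. The function name `observed_syscalls` and the sample trace are hypothetical, made up for illustration, and the regular expression only covers the common line shapes (an optional `[pid N]` or `N` prefix followed by `name(`). Only calls that were actually executed during the traced run appear in the result, which is why rarely executed paths are missed.

```python
import re

# Matches an optional "[pid N] " or "N " prefix followed by "name(".
# Lines such as "<... read resumed>", signals ("--- ... ---") and exit
# markers ("+++ ... +++") do not match and are skipped.
SYSCALL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def observed_syscalls(trace_lines):
    """Return the sorted list of syscall names seen in strace-style lines."""
    names = set()
    for line in trace_lines:
        match = SYSCALL_LINE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Hypothetical trace excerpt, not taken from a real container.
sample_trace = [
    'execve("/bin/true", ["true"], 0x7ffc0 /* 10 vars */) = 0',
    '[pid   42] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    '[pid   42] read(3, <unfinished ...>',
    '[pid   42] <... read resumed>"\\177ELF", 832) = 832',
    '--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---',
    '+++ exited with 0 +++',
]

syscalls = observed_syscalls(sample_trace)
```

A system call used only in an error handler would be absent from `syscalls` unless the traced run happened to hit that error.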

Motivation

Recently, various state-of-the-art system call analysis techniques have been proposed in research papers to tackle the above issues. Typical examples include Confine [4] and Sysfilter [5].

Confine is a new static analysis-based system for automatically extracting and enforcing system call policies on containers. Confine inspects containerized applications and all their dependencies, identifies the superset of system calls required for the correct operation of the containers, and generates corresponding Seccomp system call policies that can be readily enforced while loading the containers. Compared to the existing system call analysis tools, Confine can extract system calls more correctly by analyzing containers statically. The results of Confine's evaluation by the authors with 150 publicly available Docker images show that Confine can successfully reduce their attack surface by disabling 145 or more system calls for more than half of the containers, neutralizing 51 disclosed kernel vulnerabilities.

If container image developers can use Confine or other new static analysis-based systems to extract the system calls that their container images use, they can generate default profiles for the container image that are more accurate than the runtime default profiles. Because an image default profile only has to cover what that image actually runs, it can drop more system calls than a general-purpose runtime default profile. As a result, the attack surface is typically much smaller, so there are fewer opportunities to attack and compromise the containers.

Proposal

The goal of this proposal is to allow users to choose the image default security context, including the default Seccomp profiles and Capability settings, from container orchestration software such as Kubernetes. This proposal can make containers more secure and save users time and effort when configuring container security. To achieve this, we propose defining a security context media type in the OCI Image Media Types and adding a security context field to the OCI Image Configuration.

The reason for naming the media type securitycontext is to allow security information such as Capability to be added in the future. Recently, various techniques that measure Linux container security have been proposed in research papers [6] [7]. If image developers can accurately measure the Capabilities used by applications in container images by leveraging those tools, they can set the default Capabilities in the image config. Considering this, we think it is better to add general security settings to the Image Configuration, not limited to Seccomp.

Each change is described below.

Image Media Type

We propose defining the new securitycontext media type in the Image Media Types.

application/vnd.oci.image.securitycontext.v1+json

This contains information about the security context, which includes Seccomp and Linux Capability settings. We expect container image developers to create this information. For example, the image developer analyzes a container image in advance using system call analysis tools such as Confine and writes the Seccomp profiles into this securitycontext JSON file.

The high-level container runtimes pass this information to the corresponding sections of the OCI runtime specification. Hence, all the contents of the securitycontext follow the runtime specification configurations.

Here is an example:

application/vnd.oci.image.securitycontext.v1+json

{
    "seccomp": {
        "defaultAction": "SCMP_ACT_ALLOW",
        "architectures": [
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "swapoff",
                    "pivot_root",
            ...
                ],
                "action": "SCMP_ACT_ERRNO"
            }
        ]
    },
    "capabilities": {
        "bounding": [
            "CAP_AUDIT_WRITE",
            "CAP_KILL",
            "CAP_NET_BIND_SERVICE"
        ],
    ...
    }
}
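As a rough sketch of how a high-level runtime might pass this information into the runtime configuration, the snippet below copies the two sections into the places where the OCI runtime specification keeps them: Seccomp under `linux.seccomp` and capabilities under `process.capabilities`. The function name `apply_security_context` is hypothetical, the key is spelled `capabilities` as in the runtime specification, and the choice to overwrite any existing value is an assumption; the proposal does not define merge or conflict rules.

```python
import copy


def apply_security_context(runtime_config, security_context):
    """Return a copy of runtime_config with the image security context applied.

    Assumption: image-provided values simply replace existing ones.
    """
    merged = copy.deepcopy(runtime_config)
    if "seccomp" in security_context:
        # The runtime specification keeps Seccomp settings under "linux".
        merged.setdefault("linux", {})["seccomp"] = copy.deepcopy(
            security_context["seccomp"]
        )
    if "capabilities" in security_context:
        # The runtime specification keeps capabilities under "process".
        merged.setdefault("process", {})["capabilities"] = copy.deepcopy(
            security_context["capabilities"]
        )
    return merged


# Hypothetical inputs for illustration only.
runtime_config = {"ociVersion": "1.0.2", "process": {"args": ["sh"]}}
security_context = {
    "seccomp": {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []},
    "capabilities": {"bounding": ["CAP_KILL"]},
}
merged_config = apply_security_context(runtime_config, security_context)
```

The input configuration is left untouched, so a runtime could still fall back to it if the image-provided settings are rejected.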

Image Configuration

We propose adding SecurityContext as a new field to the config section of the Image Configuration. This field points to a specific security context that includes information about security configurations. SecurityContext includes a set of descriptor properties.

Here is an example:

application/vnd.oci.image.config.v1+json

"config": {
    "User": "alice",
    ...
    "SecurityContext": {
        "mediaType": "application/vnd.oci.image.securitycontext.v1+json",
        "size": 200,
        "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270"
    }
},
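The descriptor values above could be derived from the security context file as in the following sketch. In an OCI descriptor, `size` is the byte length of the referenced blob and `digest` is its content hash, here SHA-256. The helper name `describe_blob` and the sample content are hypothetical, and serializing with sorted keys is just one way to get stable bytes, not something the proposal requires.

```python
import hashlib
import json

MEDIA_TYPE = "application/vnd.oci.image.securitycontext.v1+json"


def describe_blob(blob: bytes) -> dict:
    """Build descriptor properties (mediaType, size, digest) for a blob."""
    return {
        "mediaType": MEDIA_TYPE,
        "size": len(blob),
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
    }


# Hypothetical security context content for illustration only.
security_context = {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
    }
}

# Serialize once and hash exactly the bytes that will be stored as the blob.
blob = json.dumps(security_context, sort_keys=True, separators=(",", ":")).encode("utf-8")
descriptor = describe_blob(blob)
```

The digest must be computed over the exact bytes pushed to the registry; re-serializing the JSON later may change the bytes and therefore the digest.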

Expected Use Cases

User Side:

The following example shows how a Kubernetes user would select the default Seccomp profile shipped with a container image.

spec:
  securityContext:
    seccompProfile:
      type: ImageDefault

With the above configuration, Kubernetes enforces the image default profile on the container.
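One way a runtime could resolve the requested profile type is sketched below. Everything here is hypothetical: the function name `resolve_seccomp_profile`, its parameters, and in particular the decision to raise an error when `ImageDefault` is requested for an image that carries no Seccomp profile. The proposal does not specify that behavior, and falling back to the runtime default would be another option.

```python
def resolve_seccomp_profile(profile_type, image_security_context, runtime_default_profile):
    """Pick the Seccomp profile to enforce for the requested profile type.

    Assumption: a missing image profile is treated as an error rather than
    silently falling back to the runtime default.
    """
    if profile_type == "RuntimeDefault":
        return runtime_default_profile
    if profile_type == "ImageDefault":
        if not image_security_context or "seccomp" not in image_security_context:
            raise ValueError("image does not provide a default Seccomp profile")
        return image_security_context["seccomp"]
    raise ValueError(f"unsupported Seccomp profile type: {profile_type}")


# Hypothetical profiles for illustration only.
runtime_default = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}
image_context = {"seccomp": {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}}

chosen = resolve_seccomp_profile("ImageDefault", image_context, runtime_default)
```

Failing loudly keeps the user from believing that a tighter image profile is in force when it is not.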

Image Developer Side:

An example workflow for image developers is described below.

Analyze a container image using system call analysis tools such as Confine.
Add information about Seccomp to the OCI image configuration.
- Create a security context file in accordance with application/vnd.oci.image.securitycontext.v1+json and add its descriptor to the SecurityContext field in the Image Configuration.
Push the container image to a public or private registry.
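The second step could look roughly like the sketch below, which turns a list of required system calls (for example, the output of an analysis tool) into an allowlist Seccomp section in the runtime specification format. The function name `build_allowlist_profile` and the sample system call list are hypothetical; the action and architecture constants are the ones defined by the runtime specification. Note that this produces an allowlist (default deny), whereas the example earlier in this issue uses a default of SCMP_ACT_ALLOW with an explicit deny list.

```python
def build_allowlist_profile(required_syscalls, architectures=("SCMP_ARCH_X86_64",)):
    """Build a Seccomp section that denies everything except required_syscalls."""
    return {
        # Any system call not listed below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [
            {
                # Deduplicate and sort so the generated file is stable.
                "names": sorted(set(required_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical analysis result for illustration only.
required = ["read", "write", "openat", "close", "read", "exit_group"]
security_context = {"seccomp": build_allowlist_profile(required)}
```

The resulting `security_context` dictionary is what would be serialized into the security context file referenced from the Image Configuration.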

Limitations

This default security context only provides default settings for a container image that the image developer analyzed in advance. Therefore, if the user adds binaries to the image, the user cannot rely on the default security context, because it does not account for the system calls those binaries use.
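A user who adds a binary to the image could check whether the image default profile still covers it with something like the following sketch. The function name `denied_syscalls` and the sample data are hypothetical, and the matching is deliberately simplified: it ignores argument conditions, lets the first rule that names a system call decide, and treats every action other than SCMP_ACT_ALLOW as a denial.

```python
def denied_syscalls(profile, needed):
    """Return the needed syscalls that the profile would not plainly allow.

    Simplified model: first rule naming a syscall wins, argument filters are
    ignored, and only SCMP_ACT_ALLOW counts as allowed.
    """
    actions = {}
    for rule in profile.get("syscalls", []):
        for name in rule["names"]:
            actions.setdefault(name, rule["action"])
    default = profile["defaultAction"]
    return sorted(
        name for name in set(needed) if actions.get(name, default) != "SCMP_ACT_ALLOW"
    )


# Hypothetical image default profile and syscalls of an added binary.
image_profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": ["read", "write", "close"], "action": "SCMP_ACT_ALLOW"}],
}
added_binary_needs = ["read", "socket", "connect"]

missing = denied_syscalls(image_profile, added_binary_needs)
```

A non-empty result suggests the image default profile would break the added binary, so the user would need a different profile for the modified image.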

Future Work

Currently, we plan to develop a tool that allows image developers to easily analyze containerized applications inside an image using Confine and create an OCI image configuration that includes the image default Seccomp profiles. We are also considering adding Kubernetes support to Confine, because the current implementation of Confine can extract system calls only from Docker containers. Additionally, we need to add a new Seccomp type, ImageDefault, to the security context of Kubernetes and modify the high-level container runtimes such as containerd to extract the Seccomp profiles from the Image Configuration when users choose the image default Seccomp profiles.

Backward Compatibility

There is no formal definition for backward-compatible changes in this new feature.

References

[1] DockerSlim. https://dockersl.im
[2] strace. https://strace.io
[3] oci-seccomp-bpf-hook. https://github.com/containers/oci-seccomp-bpf-hook
[4] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated System Call Policy Generation for Container Attack Surface Reduction. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[5] Nicholas DeMarinis, Kent Williams-King, Di Jin, Rodrigo Fonseca, and Vasileios P. Kemerlis. sysfilter: Automated System Call Filtering for Commodity Software. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[6] J. Criswell, J. Zhou, S. Gravani, and X. Hu. PrivAnalyzer: Measuring the Efficacy of Linux Privilege Use. In 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019.
[7] Xin Lin, Lingguang Lei, Yuewu Wang, Jiwu Jing, Kun Sun, and Quan Zhou. A measurement study on Linux container security: Attacks and countermeasures. In Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC), 2018.

kailun-qin commented 3 years ago

Thanks for the proposal!

Some general questions below:
1) Is this new securityContext image media type a reflection of the container-level K8s security context (https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)? Say, a subset, the full set, or a superset of it?
2) Why does this abstraction have to be bound to images/image-spec? An image can run with different security contexts, and we have to follow the runtime spec configs for this image.securitycontext anyway. Letting a higher level handle this looks fine (it can still leverage tools like Confine in its DevOps pipeline). What's the benefit of an image default one?
3) How is the conflict handled if it is specified differently in runtime-spec, the CRI runtime, or elsewhere?
4) Tools like Confine seem to work only for syscall analysis. What about other properties in the security context? It looks like a combination of utilities is needed?
5) Is any burden placed on developers? Though the media type can be optional, it brings no benefit then. I wonder how demanding/practical this feature is without mature tooling, as asked in 3).