opencontainers / image-spec

OCI Image Format
https://www.opencontainers.org/
Apache License 2.0

Proposal: Add Security Context #867

Open ManaSugi opened 3 years ago

ManaSugi commented 3 years ago

OCI Security Context

Summary

Background

Containers offer weaker isolation than virtual machines because all containers running on the same host share the same OS kernel. It is therefore important to reduce the kernel attack surface exposed to containers. Secure computing mode (Seccomp) reduces that surface by restricting the system calls available to each container. In addition, Linux capabilities and Mandatory Access Control (MAC) systems such as SELinux and AppArmor provide defense in depth.

Existing high-level container runtimes such as containerd and CRI-O ship default Seccomp profiles, which are applied when the user requests them in the Kubernetes configuration as follows.

securityContext:
  seccompProfile:
    type: RuntimeDefault 

The default Seccomp profiles are allowlists that exclude potentially dangerous system calls such as pivot_root and ptrace. Thanks to these defaults, users can easily enforce Seccomp on containers without analyzing which system calls the containers use.

However, the default profiles still allow many system calls that the containers never actually use. Users who want to deny those system calls must inspect the containers and identify the required system calls using DockerSlim [1] or other dynamic analysis tools [2] [3]. Unfortunately, dynamic analysis is imperfect because it cannot observe code paths that are rarely executed, such as error handling routines. Identifying system calls reliably requires static analysis, but correctly inspecting system calls inside containerized applications poses many challenges.
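To make the allowlist idea concrete, here is a minimal sketch of how a profile in the runtime-spec seccomp layout decides what happens to a system call. The function name `action_for` and the tiny example profile are hypothetical, and the sketch deliberately ignores architectures and argument filters, which real profiles also use.

```python
def action_for(profile: dict, syscall: str) -> str:
    """Return the action a simplified seccomp profile assigns to a syscall.

    Simplification (assumption): only syscall names are matched; argument
    conditions and architecture checks are ignored.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule names the syscall, so the profile-wide default applies.
    return profile["defaultAction"]


# Hypothetical allowlist: everything is rejected unless explicitly allowed.
allowlist = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
    ],
}
```

With this profile, `action_for(allowlist, "read")` yields `SCMP_ACT_ALLOW`, while any syscall that is not listed, such as `pivot_root`, falls through to the default `SCMP_ACT_ERRNO`.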

Motivation

Recently, various state-of-the-art system call analysis techniques have been proposed in research papers to tackle the above issues. Typical examples include Confine [4] and Sysfilter [5].

Confine is a static analysis-based system for automatically extracting and enforcing system call policies on containers. Confine inspects containerized applications and all their dependencies, identifies the superset of system calls required for the correct operation of the containers, and generates corresponding Seccomp system call policies that can be readily enforced while loading the containers. Because it analyzes containers statically, Confine can extract system calls more accurately than the existing system call analysis tools. In the authors' evaluation with 150 publicly available Docker images, Confine reduced the attack surface by disabling 145 or more system calls for more than half of the containers, neutralizing 51 disclosed kernel vulnerabilities.

If container image developers can use Confine or other static analysis-based systems to extract the system calls that a container image uses, they can generate default profiles for the image that are more accurate than the runtime default profiles. Such image default profiles can deny more system calls than the runtime defaults. As a result, the attack surface is typically much smaller than with a general-purpose profile, leaving fewer opportunities to attack and compromise the containers.

Proposal

The goal of this proposal is to allow users to select the image default security context, including the default Seccomp profiles and Capability settings, from container orchestration software such as Kubernetes. This makes containers more secure and saves users the time and effort of configuring container security themselves. To achieve this, we propose defining a security context media type in the OCI Image Media Types and adding a security context field to the OCI Image Configuration.

The media type is named securitycontext so that further security information, such as Capabilities, can be added in the future. Recently, various techniques that measure Linux container security have been proposed in research papers [6] [7]. If image developers can use those tools to accurately measure the Capabilities used by applications in container images, they can set the default Capabilities in the image config. With this in mind, we think it is better to add general security settings to the Image Configuration rather than limiting them to Seccomp.

Each change is described below.

Image Media Type

We propose defining the new securitycontext media type in the Image Media Types.

This contains information about security context that includes Seccomp and Linux Capability. We expect that the information is created by container image developers. For example, the image developer analyzes a container image in advance using system call analysis tools such as Confine and writes the seccomp profiles into this securitycontext JSON file.

The high-level container runtimes pass this information to the corresponding sections of the OCI runtime specification configuration. Hence, all the contents of the securitycontext follow the runtime specification's configuration format.

Here is an example:

application/vnd.oci.image.securitycontext.v1+json

{
    "seccomp": {
        "defaultAction": "SCMP_ACT_ALLOW",
        "architectures": [
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "swapoff",
                    "pivot_root",
            ...
                ],
                "action": "SCMP_ACT_ERRNO"
            }
        ]
    },
    "capabilities": {
        "bounding": [
            "CAP_AUDIT_WRITE",
            "CAP_KILL",
            "CAP_NET_BIND_SERVICE"
        ],
    ...
    }
}
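The hand-off from the security context to the runtime configuration could look roughly like the following sketch. It assumes the security context uses the keys `seccomp` and `capabilities`, and it places them at `linux.seccomp` and `process.capabilities`, where the OCI runtime specification defines them. The function name `apply_security_context` is hypothetical and is not part of any existing runtime.

```python
import copy


def apply_security_context(runtime_config: dict, security_context: dict) -> dict:
    """Return a copy of a runtime-spec config with image defaults filled in.

    Assumption: the security context blob carries "seccomp" and
    "capabilities" objects that already follow the runtime-spec schema.
    """
    config = copy.deepcopy(runtime_config)
    if "seccomp" in security_context:
        # Seccomp lives under the Linux-specific section of config.json.
        config.setdefault("linux", {})["seccomp"] = copy.deepcopy(
            security_context["seccomp"]
        )
    if "capabilities" in security_context:
        # Capabilities are a property of the container process.
        config.setdefault("process", {})["capabilities"] = copy.deepcopy(
            security_context["capabilities"]
        )
    return config
```

The input configuration is copied rather than mutated so that a runtime could still compare the result against what the user originally requested.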

Image Configuration

We propose adding SecurityContext as a new field to the config section of the Image Configuration. This field points to a specific security context that includes information about security configurations. SecurityContext includes a set of descriptor properties.

Here is an example:

application/vnd.oci.image.config.v1+json

"config": {
    "User": "alice",
    ...
    "SecurityContext": {
        "mediaType": "application/vnd.oci.image.securitycontext.v1+json",
        "size": 200,
        "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270"
    }
},
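As an illustration of the descriptor properties used above, the following sketch derives `size` and `digest` from the serialized security context and checks a blob against a descriptor. The helper names `make_descriptor` and `verify_descriptor` are hypothetical; only the SHA-256 digest format and the three descriptor fields come from the example.

```python
import hashlib

MEDIA_TYPE = "application/vnd.oci.image.securitycontext.v1+json"


def make_descriptor(blob: bytes) -> dict:
    """Build a descriptor for a serialized security context blob."""
    return {
        "mediaType": MEDIA_TYPE,
        "size": len(blob),  # size in bytes of the exact serialized content
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
    }


def verify_descriptor(descriptor: dict, blob: bytes) -> bool:
    """Check that a fetched blob matches the size and digest in a descriptor."""
    expected = make_descriptor(blob)
    return (
        descriptor.get("size") == expected["size"]
        and descriptor.get("digest") == expected["digest"]
    )
```

Because the digest covers the exact bytes, any change to the security context file, including whitespace, produces a different descriptor.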

Expected Use Cases

User Side:

An example for Kubernetes users is described below. The user selects the image default Seccomp profile for a container.

spec:
  securityContext:
    seccompProfile:
      type: ImageDefault

With the above configuration, Kubernetes applies the image default profile to the container.
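How a high-level runtime might choose a profile for each type can be sketched as follows. `RuntimeDefault`, `Localhost`, and `Unconfined` are existing Kubernetes seccomp profile types, while `ImageDefault` is the type proposed here. The function `select_seccomp_profile` is hypothetical, and falling back to the runtime default when an image carries no security context is an assumption of this sketch, not something the proposal specifies.

```python
from typing import Optional


def select_seccomp_profile(
    profile_type: str,
    image_default: Optional[dict],
    runtime_default: dict,
    localhost_profile: Optional[dict] = None,
) -> Optional[dict]:
    """Pick the seccomp profile to enforce; None means no filtering."""
    if profile_type == "Unconfined":
        return None
    if profile_type == "Localhost":
        if localhost_profile is None:
            raise ValueError("Localhost requires a profile loaded from the node")
        return localhost_profile
    if profile_type == "ImageDefault":
        # Assumption: images without a security context fall back to the
        # runtime default rather than running unconfined.
        return image_default if image_default is not None else runtime_default
    if profile_type == "RuntimeDefault":
        return runtime_default
    raise ValueError(f"unknown seccomp profile type: {profile_type}")
```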

Image Developer Side:

An example for image developers is described below.

  1. Analyze a container image using system call analysis tools such as Confine.
  2. Add information about Seccomp to the OCI image configuration.
    • Create a security context file in accordance with application/vnd.oci.image.securitycontext.v1+json and add its descriptor to the SecurityContext field in the Image Configuration.
  3. Push the container image to a public or private registry.
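The second step above can be sketched as follows, assuming the analysis tool has produced a plain set of required syscall names. The function names `build_security_context` and `serialize`, the allowlist layout, and the default architecture are illustrative assumptions; this is not Confine's actual output format.

```python
import json
from typing import Iterable, List, Optional


def build_security_context(
    required_syscalls: Iterable[str],
    architectures: Optional[List[str]] = None,
) -> dict:
    """Build an allowlist-style security context from analysis results."""
    return {
        "seccomp": {
            # Anything the analysis did not report is rejected.
            "defaultAction": "SCMP_ACT_ERRNO",
            "architectures": architectures or ["SCMP_ARCH_X86_64"],
            "syscalls": [
                {
                    # Sorted and de-duplicated for a stable file.
                    "names": sorted(set(required_syscalls)),
                    "action": "SCMP_ACT_ALLOW",
                }
            ],
        }
    }


def serialize(security_context: dict) -> bytes:
    """Serialize deterministically so repeated builds yield the same bytes."""
    return json.dumps(
        security_context, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
```

Serializing deterministically matters because the Image Configuration refers to the file by digest, so identical analysis results should produce identical bytes.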

Limitations

The default security context only covers the container image as the image developer analyzed it in advance. Therefore, if the user adds binaries to the image, the default security context can no longer be relied on because it does not account for the system calls those binaries use.

Future Work

We plan to develop a tool that allows image developers to easily analyze the containerized applications inside an image using Confine and create an OCI image configuration that includes the image default Seccomp profiles. We are also considering adding Kubernetes support to Confine because the current implementation can extract system calls only from Docker containers. Additionally, a new Seccomp type, ImageDefault, needs to be added to the Kubernetes security context, and high-level container runtimes such as containerd need to be modified to extract the Seccomp profiles from the Image Configuration when users choose the image default Seccomp profiles.

Backward Compatibility

There is no formal definition for backward-compatible changes in this new feature.

References

[1] DockerSlim. https://dockersl.im
[2] strace. https://strace.io
[3] oci-seccomp-bpf-hook. https://github.com/containers/oci-seccomp-bpf-hook
[4] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated System Call Policy Generation for Container Attack Surface Reduction. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[5] Nicholas DeMarinis, Kent Williams-King, Di Jin, Rodrigo Fonseca, and Vasileios P. Kemerlis. sysfilter: Automated System Call Filtering for Commodity Software. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[6] J. Criswell, J. Zhou, S. Gravani, and X. Hu. PrivAnalyzer: Measuring the Efficacy of Linux Privilege Use. In 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019.
[7] Xin Lin, Lingguang Lei, Yuewu Wang, Jiwu Jing, Kun Sun, and Quan Zhou. A measurement study on Linux container security: Attacks and countermeasures. In Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC), 2018.

kailun-qin commented 3 years ago

Thanks for the proposal!

Some general questions below:

  1. Is this new securityContext image media type a reflection of the container-level K8s security context (https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)? Say, a subset, the full set, or a superset of it?
  2. Why does this abstraction have to be bound to images/image-spec? An image can run with different security contexts, and we have to follow the runtime spec configs for this image.securitycontext anyway. Letting a higher level handle this looks fine (it can still leverage tools like Confine in its DevOps pipeline). What's the benefit of an image default one?
  3. How is a conflict handled if the settings are specified differently in runtime-spec, the CRI runtime, or elsewhere?
  4. Tools like Confine seem to work only for syscall analysis. What about the other properties in the security context? It looks like a combination of utilities is needed?
  5. Is any burden placed on developers? The media type can be optional, but then it brings no benefit. I wonder how demanding/practical this feature is without mature tooling, as asked in 3).

ManaSugi commented 3 years ago

@kailun-qin I apologize for the late reply. Thank you for your valuable comments and questions!

  1. Yes, this new securityContext is a subset of the container-level K8s security context or runtime-spec. The new media type allows users to set a default security context that is more secure than high-level container runtimes' default seccomp profiles.

  2. The main benefit of putting this in image-spec is that users can apply the default security context to their containers transparently and without extra work. The image.securitycontext is the default security context of an image, created by the image developer. For it to be applied transparently, the security context should be part of image-spec because high-level container runtimes such as containerd create the runtime-spec config file (config.json) from the image-spec and the K8s config. If we let a higher level such as a DevOps pipeline handle this, we have to run analysis tools like Confine ourselves to extract default seccomp profiles and apply them to the containers. That is fine for users who modify existing images or manage images in their own registry. However, it is tedious for users who run images from a public registry without modifying them. Therefore, from the perspective of demarcation of responsibility, the image.securitycontext should be created by the image developers and stored in the image spec so that high-level runtimes can extract it.

  3. This is a default configuration, so if users have already set a security configuration in the CRI runtime or elsewhere, that configuration should override this image.securitycontext.

  4. Yes, Confine works only for syscall analysis, so if image developers want to set default Capabilities, they have to use other tools. As I mentioned above, various techniques that measure Linux container security have recently been proposed in research papers [6] [7]. If image developers can use those techniques to accurately measure the Capabilities used by applications in container images, they can set the default Capabilities in the image config. Therefore, I'd like this proposal to cover not only seccomp profiles but also Capabilities in the future.

  5. The only burden for image developers is to run a tool such as Confine and store the resulting information in the image spec. As you said, without mature tools such as Confine, this feature will not be useful. However, as of now, Confine is the most practical tool that meets the requirements for this new feature, and we have confirmed that it works properly, although we needed to apply a few patches to Confine to run it on newer kernel versions.

Thank you.