Open ManaSugi opened 3 years ago
Thanks for the proposal!
Some general questions below:
1) Is this new securityContext image media type a reflection of the container-level K8s security context (https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)? For example, is it a subset, the full set, or a superset of it?
2) Why does this abstraction have to be bound to images/image-spec? An image can run with different security contexts, and we have to follow the runtime-spec configs for this image.securitycontext anyway. Letting a higher level handle this looks fine (it can still leverage tools like Confine in its DevOps pipeline). What is the benefit of an image default one?
3) How is the conflict handled if specified differently in runtime-spec, CRI runtime or elsewhere?
4) Tools like Confine seem to work only for syscall analysis. What about the other properties in the security context? It looks like a combination of utilities would be needed.
5) Does this bring any burden to developers? Although the media type can be optional, it brings no benefit in that case. I wonder how demanding or practical this feature is without mature tooling, as asked in 3).
@kailun-qin I apologize for the late reply. Thank you for your valuable comments and questions!
Yes, this new securityContext is a subset of the container-level K8s security context or runtime-spec. The new media type allows users to set a default security context that is more secure than high-level container runtimes' default seccomp profiles.
The main benefit of putting this in image-spec is that users can apply the default security context to their containers transparently, without extra burden. The image.securitycontext is a default security context for an image, created by the image developer. For users to get the default security context transparently, it should be part of image-spec, because high-level container runtimes such as containerd create the runtime-spec config file (config.json) based on the image-spec and the K8s config.
If we let a higher level such as a DevOps pipeline handle this, we have to run analysis tools like Confine to extract default seccomp profiles and apply them to containers ourselves. This is fine for users who modify existing images or manage images in their own registry. However, it is tedious for users who use images from a public registry without modifying them. Therefore, from the perspective of demarcation of responsibility, the image.securitycontext should be created by the image developers and stored in the image spec so that high-level runtimes can extract it.
This is a default configuration, so if users have already set the security configuration in the CRI runtime or elsewhere, that configuration should override this image.securitycontext.
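The precedence described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an existing runtime API: the function name resolve_seccomp_profile, its two inputs, and the "source" key used to tell the sample profiles apart are all hypothetical.

```python
from typing import Optional


def resolve_seccomp_profile(
    user_profile: Optional[dict],
    image_default: Optional[dict],
) -> Optional[dict]:
    """Pick the seccomp profile to enforce (hypothetical helper).

    Assumed precedence: an explicit user configuration always wins,
    and the image default is used only when the user set nothing.
    """
    if user_profile is not None:
        return user_profile
    return image_default


# "source" is not a real profile field; it only labels the samples.
image = {"defaultAction": "SCMP_ACT_ERRNO", "source": "image"}
user = {"defaultAction": "SCMP_ACT_ERRNO", "source": "user"}

print(resolve_seccomp_profile(user, image)["source"])
print(resolve_seccomp_profile(None, image)["source"])
```

With both profiles present the user profile is selected. With no user profile the image default is used.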
Yes, Confine works only for syscall analysis, so if image developers want to set default Capabilities, they have to use other tools. As I mentioned above, various techniques that measure Linux container security have recently been proposed in research papers [6] [7]. If image developers can accurately measure the Capabilities used by applications in container images by leveraging those techniques, they can set the default Capabilities in the image config. Therefore, I'd like this proposal to cover not only seccomp profiles but also Capabilities in the future.
The burden for image developers is just to run a tool such as Confine and store the information in the image spec. As you said, without mature tools such as Confine, this feature will not be useful. However, as of now, Confine is the most practical tool that meets the requirements for this new feature, and we have confirmed that it works properly, though we need to apply a few patches to Confine to run it on newer kernel versions.
Thank you.
OCI Security Context
Summary
High-level container runtimes (such as containerd) offer their default Seccomp profiles that are allowlists of system calls to make containers secure.
We propose defining a securitycontext media type in the Image Media Types and adding SecurityContext as a new field to the config section of the Image Configuration.
[...] Kubernetes.
[...] ImageDefault to the orchestration's configuration as extra work.
Background
Containers offer weaker isolation than virtual machines because all containers running on the same host share the same OS kernel. Therefore, it is important to reduce the attack surface of the kernel exposed to containers. The attack surface can be reduced with Secure Computing mode (Seccomp), which restricts the system calls available to each container. Additionally, Linux Capabilities and Mandatory Access Control (MAC) mechanisms such as SELinux and AppArmor provide defense in depth.
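To make the idea of restricting system calls concrete, the following is a simplified Python model of how a Seccomp profile in the OCI runtime-spec layout (a defaultAction plus a list of syscalls rules) decides what happens to a system call. It is only an illustration: the real filter is enforced in the kernel, and this sketch ignores argument conditions and architectures. The profile contents and the function name action_for are made up for the example.

```python
def action_for(profile: dict, syscall: str) -> str:
    """Return the action a simplified seccomp profile assigns to a syscall."""
    for rule in profile.get("syscalls", []):
        if syscall in rule["names"]:
            return rule["action"]
    # Anything not listed falls through to the default action.
    return profile["defaultAction"]


# Hypothetical allowlist: deny by default, allow a few named syscalls.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
    ],
}

print(action_for(profile, "read"))    # SCMP_ACT_ALLOW
print(action_for(profile, "ptrace"))  # SCMP_ACT_ERRNO
```

The shorter the allowlist, the smaller the part of the kernel a compromised container can reach.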
The existing high-level container runtimes such as containerd and CRI-O offer their default Seccomp profiles when the user selects them in the Kubernetes configuration, for example by setting the Seccomp profile type to RuntimeDefault in the security context.
The default Seccomp profiles are allowlists that drop potentially dangerous system calls such as pivot_root and ptrace. Thanks to the default profiles, users can easily enforce Seccomp on containers without analyzing the system calls the containers use.
However, the profiles still include many system calls that the containers do not actually use. If users want to deny those system calls, they need to inspect the containers and identify the system calls the containers require using DockerSlim [1] or other dynamic analysis tools [2] [3]. Unfortunately, dynamic analysis tools are not perfect because they cannot catch code paths that are executed rarely, such as error handling routines. To identify system calls correctly, a static analysis strategy is necessary, but correctly inspecting system calls inside containerized applications poses many challenges.
Motivation
Recently, various state-of-the-art system call analysis techniques have been proposed in research papers to tackle the above issues. Typical examples include Confine [4] and Sysfilter [5]. Confine is a new static analysis-based system for automatically extracting and enforcing system call policies on containers. Confine inspects containerized applications and all their dependencies, identifies the superset of system calls required for the correct operation of the containers, and generates corresponding Seccomp system call policies that can be readily enforced while loading the containers. Compared to the existing system call analysis tools, Confine can extract system calls more accurately by analyzing containers statically. The authors' evaluation of Confine with 150 publicly available Docker images shows that Confine can successfully reduce their attack surface by disabling 145 or more system calls for more than half of the containers, neutralizing 51 disclosed kernel vulnerabilities.
If container image developers can use Confine or other new static analysis-based systems to extract the system calls used by container images, they can generate default profiles for the container image that are more accurate than the runtime default profiles. The image default profiles can drop more system calls in the containers, disabling functionality the containers do not need. As a result, the attack surface is typically much smaller than it would be with general-purpose containers, so there are fewer opportunities to attack and compromise the containers.
Proposal
The goal of this proposal is to allow users to choose the image default security context, including the default Seccomp profiles and Capability settings, from container orchestration software such as Kubernetes. This makes containers more secure and saves users the time and effort of configuring container security. To achieve this, we propose defining a security context media type in the OCI Image Media Types and adding a security context field to the OCI Image Configuration.
The media type is named securitycontext so that security information such as Capabilities can be added in the future. Recently, various techniques that measure Linux container security have been proposed in research papers [6] [7]. If image developers can accurately measure the Capabilities used by applications in container images by leveraging those tools, they can set the default Capabilities in the image config. Considering this, we think it is better to add general security settings to the Image Configuration, not limited to Seccomp.
Each change is described below.
Image Media Type
We propose defining the new securitycontext media type in the Image Media Types.
application/vnd.oci.image.securitycontext.v1+json
This contains information about the security context, including Seccomp and Linux Capabilities. We expect the information to be created by container image developers. For example, the image developer analyzes a container image in advance using a system call analysis tool such as Confine and writes the Seccomp profiles into this securitycontext JSON file.
The high-level container runtimes pass the information to the corresponding sections of the OCI runtime specification. Hence, all the contents of the securitycontext follow the runtime specification configurations.
Here is an example:
application/vnd.oci.image.securitycontext.v1+json
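A document of this media type might look like the output of the following Python sketch. The seccomp and capabilities objects use field names from the OCI runtime specification (defaultAction, architectures, syscalls; bounding, effective, permitted), but the top-level layout of the securitycontext document and the specific system calls and capabilities listed are assumptions made for illustration, not the schema defined by this proposal.

```python
import json

# Hypothetical securitycontext document; the top-level keys are assumed.
security_context = {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                # Illustrative allowlist, e.g. as produced by an analysis tool.
                "names": ["read", "write", "openat", "close", "exit_group"],
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    },
    "capabilities": {
        "bounding": ["CAP_NET_BIND_SERVICE"],
        "effective": ["CAP_NET_BIND_SERVICE"],
        "permitted": ["CAP_NET_BIND_SERVICE"],
    },
}

# Serialize deterministically so the blob has a stable digest.
blob = json.dumps(security_context, indent=2, sort_keys=True)
print(blob)
```

Because the nested objects reuse runtime-spec field names, a runtime could copy them into the matching sections of config.json without translation.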
Image Configuration
We propose adding SecurityContext as a new field to the config section of the Image Configuration. This field points to a specific security context that includes information about security configurations. SecurityContext includes a set of descriptor properties.
Here is an example:
application/vnd.oci.image.config.v1+json
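As a rough illustration, the Python sketch below builds an abbreviated image configuration whose SecurityContext field carries the standard OCI descriptor properties (mediaType, digest, size) for a securitycontext blob. The helper name make_descriptor, the sample blob, and the exact placement of the descriptor are assumptions. Required image configuration fields such as rootfs are omitted for brevity.

```python
import hashlib
import json

MEDIA_TYPE = "application/vnd.oci.image.securitycontext.v1+json"


def make_descriptor(blob: bytes) -> dict:
    """Build OCI descriptor properties for a blob (hypothetical helper)."""
    return {
        "mediaType": MEDIA_TYPE,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


# Stand-in for the serialized securitycontext document.
blob = json.dumps({"seccomp": {"defaultAction": "SCMP_ACT_ERRNO"}}).encode("utf-8")

# Abbreviated image configuration; only the fields needed for the example.
image_config = {
    "architecture": "amd64",
    "os": "linux",
    "config": {
        "Cmd": ["/bin/app"],
        "SecurityContext": make_descriptor(blob),
    },
}

print(json.dumps(image_config, indent=2))
```

A runtime that understands the field could fetch the blob by digest, verify its size and hash, and then apply its contents when generating the runtime configuration.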
Expected Use Cases
User Side:
An example of Seccomp for Kubernetes users is described below. Set the image default Seccomp profiles for a container by selecting the proposed ImageDefault Seccomp type in the Kubernetes security context.
With this configuration, Kubernetes enforces the image default profiles on the container.
Image Developer Side:
An example for image developers is described below.
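Analyze the container image in advance using a system call analysis tool such as Confine.
Write the resulting Seccomp profiles as application/vnd.oci.image.securitycontext.v1+json and add the information to the SecurityContext in the Image Configuration.
Limitations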
This default security context is just default settings for a container image that was analyzed by the image developer in advance. Therefore, if the user puts additional binaries into the default image, the user cannot use the default security context because it does not consider system calls used by the binaries.
Future Work
Currently, we plan to develop a tool that allows image developers to easily analyze containerized applications inside an image using Confine and create an OCI image configuration that includes the image default Seccomp profiles. We're also thinking about adding Kubernetes support to Confine because the current implementation of Confine can extract system calls only from Docker containers. Additionally, we need to add a new Seccomp type, ImageDefault, to the security context of Kubernetes and modify high-level container runtimes such as containerd to extract the Seccomp profiles from the Image Configuration when users choose the image default Seccomp profiles.
Backward Compatibility
There is no formal definition for backward-compatible changes in this new feature.
References
[1] DockerSlim. https://dockersl.im
[2] strace. https://strace.io
[3] oci-seccomp-bpf-hook. https://github.com/containers/oci-seccomp-bpf-hook
[4] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated System Call Policy Generation for Container Attack Surface Reduction. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[5] Nicholas DeMarinis, Kent Williams-King, Di Jin, Rodrigo Fonseca, and Vasileios P. Kemerlis. sysfilter: Automated System Call Filtering for Commodity Software. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[6] J. Criswell, J. Zhou, S. Gravani, and X. Hu. PrivAnalyzer: Measuring the Efficacy of Linux Privilege Use. In 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019.
[7] Xin Lin, Lingguang Lei, Yuewu Wang, Jiwu Jing, Kun Sun, and Quan Zhou. A measurement study on Linux container security: Attacks and countermeasures. In Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC), 2018.