
Kata Containers is an open source project and community working to build a standard implementation of lightweight Virtual Machines (VMs) that feel and perform like containers, but provide the workload isolation and security advantages of VMs. https://katacontainers.io/

Possible issue with NET_BIND_SERVICE behavior. #2650

Open jlclx opened 2 years ago

jlclx commented 2 years ago

# Meta details

Running `kata-collect-data.sh` version `2.2.0 (commit )` at `2021-09-16.01:44:53.313740980-0500`.

---

# Runtime

Runtime is `/usr/local/bin/kata-runtime`.

# `kata-env`

/usr/local/bin/kata-runtime kata-env

```toml
[Kernel]
Path = "/opt/kata/share/kata-containers/vmlinux-5.10.25-85"
Parameters = "systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket"

[Meta]
Version = "1.0.25"

[Image]
Path = "/opt/kata/share/kata-containers/kata-clearlinux-latest.image"

[Initrd]
Path = ""

[Agent]
TraceMode = ""
TraceType = ""
Debug = false
Trace = false

[Hypervisor]
MachineType = "q35"
Version = "cloud-hypervisor v17.0.0"
Path = "/opt/kata/bin/cloud-hypervisor"
BlockDeviceDriver = "virtio-blk"
EntropySource = "/dev/urandom"
SharedFS = "virtio-fs"
VirtioFSDaemon = "/opt/kata/libexec/kata-qemu/virtiofsd"
Msize9p = 8192
MemorySlots = 10
PCIeRootPort = 0
HotplugVFIOOnRootBus = false
Debug = false

[Netmon]
Path = "/opt/kata/libexec/kata-containers/kata-netmon"
Debug = false
Enable = false

[Netmon.Version]
Semver = "2.2.0"
Commit = "<>"
Major = 2
Minor = 2
Patch = 0

[Runtime]
Path = "/opt/kata/bin/kata-runtime"
Debug = false
Trace = false
DisableGuestSeccomp = true
DisableNewNetNs = false
SandboxCgroupOnly = false

[Runtime.Config]
Path = "/etc/kata-containers/configuration.toml"

[Runtime.Version]
OCI = "1.0.2-dev"

[Runtime.Version.Version]
Semver = "2.2.0"
Commit = ""
Major = 2
Minor = 2
Patch = 0

[Host]
Kernel = "5.4.0-84-generic"
Architecture = "amd64"
VMContainerCapable = true
SupportVSocks = true

[Host.Distro]
Name = "Ubuntu"
Version = "20.04"

[Host.CPU]
Vendor = "GenuineIntel"
Model = "Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz"
CPUs = 32

[Host.Memory]
Total = 131971608
Free = 98931540
Available = 106923220
```

---

# Runtime config files

## Runtime default config files

```
/etc/kata-containers/configuration.toml
/opt/kata/share/defaults/kata-containers/configuration.toml
```

## Runtime config file contents

cat "/etc/kata-containers/configuration.toml"

```toml
# Copyright (c) 2019 Ericsson Eurolab Deutschland GmbH
#
# SPDX-License-Identifier: Apache-2.0
#
# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "cli/config/configuration-clh.toml.in"
# XXX: Project:
# XXX:   Name: Kata Containers
# XXX:   Type: kata

[hypervisor.clh]
path = "/opt/kata/bin/cloud-hypervisor"
kernel = "/opt/kata/share/kata-containers/vmlinux.container"
image = "/opt/kata/share/kata-containers/kata-containers.img"

# List of valid annotation names for the hypervisor
# Each member of the list is a regular expression, which is the base name
# of the annotation, e.g. "path" for io.katacontainers.config.hypervisor.path"
enable_annotations = []

# List of valid annotations values for the hypervisor
# Each member of the list is a path pattern as described by glob(3).
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: ["/opt/kata/bin/cloud-hypervisor"]
valid_hypervisor_paths = ["/opt/kata/bin/cloud-hypervisor"]

# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""

# Default number of vCPUs per SB/VM:
# unspecified or 0                --> will be set to 1
# < 0                             --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores
default_vcpus = 1

# Default maximum number of vCPUs per SB/VM:
# unspecified or == 0             --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# WARNING: Depending of the architecture, the maximum number of vCPUs supported by KVM is used when
# the actual number of physical cores is greater than it.
# WARNING: Be aware that this value impacts the virtual machine's memory footprint and CPU
# the hotplug functionality. For example, `default_maxvcpus = 240` specifies that until 240 vCPUs
# can be added to a SB/VM, but the memory footprint will be big. Another example, with
# `default_maxvcpus = 8` the memory footprint will be small, but 8 will be the maximum number of
# vCPUs supported by the SB/VM. In general, we recommend that you do not edit this variable,
# unless you know what are you doing.
default_maxvcpus = 0

# Default memory size in MiB for SB/VM.
# If unspecified then it will be set 2048 MiB.
default_memory = 2048

# Default memory slots per SB/VM.
# If unspecified then it will be set 10.
# This is will determine the times that memory will be hotadded to sandbox/VM.
#memory_slots = 10

# Path to vhost-user-fs daemon.
virtio_fs_daemon = "/opt/kata/libexec/kata-qemu/virtiofsd"

# List of valid annotations values for the virtiofs daemon
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: ["/opt/kata/libexec/kata-qemu/virtiofsd"]
valid_virtio_fs_daemon_paths = ["/opt/kata/libexec/kata-qemu/virtiofsd"]

# Default size of DAX cache in MiB
virtio_fs_cache_size = 0

# Extra args for virtiofsd daemon
#
# Format example:
#   ["-o", "arg1=xxx,arg2", "-o", "hello world", "--arg3=yyy"]
#
# see `virtiofsd -h` for possible options.
virtio_fs_extra_args = ["--thread-pool-size=1"]

# Cache mode:
#
#  - none
#    Metadata, data, and pathname lookup are not cached in guest. They are
#    always fetched from host and any changes are immediately pushed to host.
#
#  - auto
#    Metadata and pathname lookup cache expires after a configured amount of
#    time (default is 1 second). Data is cached while the file is open (close
#    to open consistency).
#
#  - always
#    Metadata, data, and pathname lookup are cached in guest and never expire.
virtio_fs_cache = "auto"

# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is virtio-scsi, virtio-blk
# or nvdimm.
block_device_driver = "virtio-blk"

# This option changes the default hypervisor and kernel parameters
# to enable debug output where available.
#
# Default false
#enable_debug = true

# Path to OCI hook binaries in the *guest rootfs*.
# This does not affect host-side hooks which must instead be added to
# the OCI spec passed to the runtime.
#
# You can create a rootfs with hooks by customizing the osbuilder scripts:
# https://github.com/kata-containers/kata-containers/tree/main/tools/osbuilder
#
# Hooks must be stored in a subdirectory of guest_hook_path according to their
# hook type, i.e. "guest_hook_path/{prestart,poststart,poststop}".
# The agent will scan these directories for executable files and add them, in
# lexicographical order, to the lifecycle of the guest container.
# Hooks are executed in the runtime namespace of the guest. See the official documentation:
# https://github.com/opencontainers/runtime-spec/blob/v1.0.1/config.md#posix-platform-hooks
# Warnings will be logged if any error is encountered while scanning for hooks,
# but it will not abort container execution.
#guest_hook_path = "/usr/share/oci/hooks"
#

[agent.kata]
# If enabled, make the agent display debug-level messages.
# (default: disabled)
#enable_debug = true

# Enable agent tracing.
#
# If enabled, the default trace mode is "dynamic" and the
# default trace type is "isolated". The trace mode and type are set
# explicity with the `trace_type=` and `trace_mode=` options.
#
# Notes:
#
# - Tracing is ONLY enabled when `enable_tracing` is set: explicitly
#   setting `trace_mode=` and/or `trace_type=` without setting `enable_tracing`
#   will NOT activate agent tracing.
#
# - See https://github.com/kata-containers/agent/blob/master/TRACING.md for
#   full details.
#
# (default: disabled)
#enable_tracing = true
#
#trace_mode = "dynamic"
#trace_type = "isolated"

# Enable debug console.
# If enabled, user can connect guest OS running inside hypervisor
# through "kata-runtime exec " command
#debug_console_enabled = true

# Agent connection dialing timeout value in seconds
# (default: 30)
#dial_timeout = 30

[netmon]
# If enabled, the network monitoring process gets started when the
# sandbox is created. This allows for the detection of some additional
# network being added to the existing network namespace, after the
# sandbox has been created.
# (default: disabled)
#enable_netmon = true

# Specify the path to the netmon binary.
path = "/opt/kata/libexec/kata-containers/kata-netmon"

# If enabled, netmon messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to the
# the container network interface
# Options:
#
#   - bridged (Deprecated)
#     Uses a linux bridge to interconnect the container interface to
#     the VM. Works for most cases except macvlan and ipvlan.
#     ***NOTE: This feature has been deprecated with plans to remove this
#     feature in the future. Please use other network models listed below.
#
#   - macvtap
#     Used when the Container network interface can be bridged using
#     macvtap.
#
#   - none
#     Used when customize network. Only creates a tap device. No veth pair.
#
#   - tcfilter
#     Uses tc filter rules to redirect traffic from the network interface
#     provided by plugin to a tap interface connected to the VM.
#
internetworking_model="tcfilter"

# disable guest seccomp
# Determines whether container seccomp profiles are passed to the virtual
# machine and applied by the kata agent. If set to true, seccomp is not applied
# within the guest
# (default: true)
disable_guest_seccomp=true

# If enabled, the runtime will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
# (default: disabled)
#enable_tracing = true

# Set the full url to the Jaeger HTTP Thrift collector.
# The default if not set will be "http://localhost:14268/api/traces"
#jaeger_endpoint = ""

# Sets the username to be used if basic auth is required for Jaeger.
#jaeger_user = ""

# Sets the password to be used if basic auth is required for Jaeger.
#jaeger_password = ""

# If enabled, the runtime will not create a network namespace for shim and hypervisor processes.
# This option may have some potential impacts to your host. It should only be used when you know what you're doing.
# `disable_new_netns` conflicts with `enable_netmon`
# `disable_new_netns` conflicts with `internetworking_model=bridged` and `internetworking_model=macvtap`. It works only
# with `internetworking_model=none`. The tap device will be in the host network namespace and can connect to a bridge
# (like OVS) directly.
# If you are using docker, `disable_new_netns` only works with `docker run --net=none`
# (default: false)
#disable_new_netns = true

# if enabled, the runtime will add all the kata processes inside one dedicated cgroup.
# The container cgroups in the host are not created, just one single cgroup per sandbox.
# The runtime caller is free to restrict or collect cgroup stats of the overall Kata sandbox.
# The sandbox cgroup path is the parent cgroup of a container with the PodSandbox annotation.
# The sandbox cgroup is constrained if there is no container type annotation.
# See: https://godoc.org/github.com/kata-containers/runtime/virtcontainers#ContainerType
sandbox_cgroup_only=false

# If specified, sandbox_bind_mounts identifieds host paths to be mounted (ro) into the sandboxes shared path.
# This is only valid if filesystem sharing is utilized. The provided path(s) will be bindmounted into the shared fs directory.
# If defaults are utilized, these mounts should be available in the guest at `/run/kata-containers/shared/containers/sandbox-mounts`
# These will not be exposed to the container workloads, and are only provided for potential guest services.
sandbox_bind_mounts=[]

# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,
# they may break compatibility, and are prepared for a big version bump.
# Supported experimental features:
# (default: [])
experimental=[]

# If enabled, user can run pprof tools with shim v2 process through kata-monitor.
# (default: false)
# enable_pprof = true
```

cat "/opt/kata/share/defaults/kata-containers/configuration.toml"

```toml
# Copyright (c) 2017-2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "cli/config/configuration-qemu.toml.in"
# XXX: Project:
# XXX:   Name: Kata Containers
# XXX:   Type: kata

[hypervisor.qemu]
path = "/opt/kata/bin/qemu-system-x86_64"
kernel = "/opt/kata/share/kata-containers/vmlinux.container"
image = "/opt/kata/share/kata-containers/kata-containers.img"
machine_type = "q35"

# Enable confidential guest support.
# Toggling that setting may trigger different hardware features, ranging
# from memory encryption to both memory and CPU-state encryption and integrity.
# The Kata Containers runtime dynamically detects the available feature set and
# aims at enabling the largest possible one.
# Default false
# confidential_guest = true

# List of valid annotation names for the hypervisor
# Each member of the list is a regular expression, which is the base name
# of the annotation, e.g. "path" for io.katacontainers.config.hypervisor.path"
enable_annotations = []

# List of valid annotations values for the hypervisor
# Each member of the list is a path pattern as described by glob(3).
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: ["/opt/kata/bin/qemu-system-x86_64"]
valid_hypervisor_paths = ["/opt/kata/bin/qemu-system-x86_64"]

# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""

# Path to the firmware.
# If you want that qemu uses the default firmware leave this option empty
firmware = ""

# Machine accelerators
# comma-separated list of machine accelerators to pass to the hypervisor.
# For example, `machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"`
machine_accelerators=""

# CPU features
# comma-separated list of cpu features to pass to the cpu
# For example, `cpu_features = "pmu=off,vmx=off"
cpu_features="pmu=off"

# Default number of vCPUs per SB/VM:
# unspecified or 0                --> will be set to 1
# < 0                             --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores
default_vcpus = 1

# Default maximum number of vCPUs per SB/VM:
# unspecified or == 0             --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# WARNING: Depending of the architecture, the maximum number of vCPUs supported by KVM is used when
# the actual number of physical cores is greater than it.
# WARNING: Be aware that this value impacts the virtual machine's memory footprint and CPU
# the hotplug functionality. For example, `default_maxvcpus = 240` specifies that until 240 vCPUs
# can be added to a SB/VM, but the memory footprint will be big. Another example, with
# `default_maxvcpus = 8` the memory footprint will be small, but 8 will be the maximum number of
# vCPUs supported by the SB/VM. In general, we recommend that you do not edit this variable,
# unless you know what are you doing.
# NOTICE: on arm platform with gicv2 interrupt controller, set it to 8.
default_maxvcpus = 0

# Bridges can be used to hot plug devices.
# Limitations:
# * Currently only pci bridges are supported
# * Until 30 devices per bridge can be hot plugged.
# * Until 5 PCI bridges can be cold plugged per VM.
#   This limitation could be a bug in qemu or in the kernel
# Default number of bridges per SB/VM:
# unspecified or 0   --> will be set to 1
# > 1 <= 5           --> will be set to the specified number
# > 5                --> will be set to 5
default_bridges = 1

# Default memory size in MiB for SB/VM.
# If unspecified then it will be set 2048 MiB.
default_memory = 2048
#
# Default memory slots per SB/VM.
# If unspecified then it will be set 10.
# This is will determine the times that memory will be hotadded to sandbox/VM.
#memory_slots = 10

# The size in MiB will be plused to max memory of hypervisor.
# It is the memory address space for the NVDIMM devie.
# If set block storage driver (block_device_driver) to "nvdimm",
# should set memory_offset to the size of block device.
# Default 0
#memory_offset = 0

# Specifies virtio-mem will be enabled or not.
# Please note that this option should be used with the command
# "echo 1 > /proc/sys/vm/overcommit_memory".
# Default false
#enable_virtio_mem = true

# Disable block device from being used for a container's rootfs.
# In case of a storage driver like devicemapper where a container's
# root file system is backed by a block device, the block device is passed
# directly to the hypervisor for performance reasons.
# This flag prevents the block device from being passed to the hypervisor,
# 9pfs is used instead to pass the rootfs.
disable_block_device_use = false

# Shared file system type:
#   - virtio-fs (default)
#   - virtio-9p
shared_fs = "virtio-fs"

# Path to vhost-user-fs daemon.
virtio_fs_daemon = "/opt/kata/libexec/kata-qemu/virtiofsd"

# List of valid annotations values for the virtiofs daemon
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: ["/opt/kata/libexec/kata-qemu/virtiofsd"]
valid_virtio_fs_daemon_paths = ["/opt/kata/libexec/kata-qemu/virtiofsd"]

# Default size of DAX cache in MiB
virtio_fs_cache_size = 0

# Extra args for virtiofsd daemon
#
# Format example:
#   ["-o", "arg1=xxx,arg2", "-o", "hello world", "--arg3=yyy"]
#
# see `virtiofsd -h` for possible options.
virtio_fs_extra_args = ["--thread-pool-size=1"]

# Cache mode:
#
#  - none
#    Metadata, data, and pathname lookup are not cached in guest. They are
#    always fetched from host and any changes are immediately pushed to host.
#
#  - auto
#    Metadata and pathname lookup cache expires after a configured amount of
#    time (default is 1 second). Data is cached while the file is open (close
#    to open consistency).
#
#  - always
#    Metadata, data, and pathname lookup are cached in guest and never expire.
virtio_fs_cache = "auto"

# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is virtio-scsi, virtio-blk
# or nvdimm.
block_device_driver = "virtio-scsi"

# Specifies cache-related options will be set to block devices or not.
# Default false
#block_device_cache_set = true

# Specifies cache-related options for block devices.
# Denotes whether use of O_DIRECT (bypass the host page cache) is enabled.
# Default false
#block_device_cache_direct = true

# Specifies cache-related options for block devices.
# Denotes whether flush requests for the device are ignored.
# Default false
#block_device_cache_noflush = true

# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently only implemented
# for SCSI.
#
enable_iothreads = false

# Enable pre allocation of VM RAM, default false
# Enabling this will result in lower container density
# as all of the memory will be allocated and locked
# This is useful when you want to reserve all the memory
# upfront or in the cases where you want memory latencies
# to be very predictable
# Default false
#enable_mem_prealloc = true

# Enable huge pages for VM RAM, default false
# Enabling this will result in the VM memory
# being allocated using huge pages.
# This is useful when you want to use vhost-user network
# stacks within the container. This will automatically
# result in memory pre allocation
#enable_hugepages = true

# Enable vhost-user storage device, default false
# Enabling this will result in some Linux reserved block type
# major range 240-254 being chosen to represent vhost-user devices.
enable_vhost_user_store = false

# The base directory specifically used for vhost-user devices.
# Its sub-path "block" is used for block devices; "block/sockets" is
# where we expect vhost-user sockets to live; "block/devices" is where
# simulated block device nodes for vhost-user devices to live.
vhost_user_store_path = "/var/run/kata-containers/vhost-user"

# Enable vIOMMU, default false
# Enabling this will result in the VM having a vIOMMU device
# This will also add the following options to the kernel's
# command line: intel_iommu=on,iommu=pt
#enable_iommu = true

# Enable IOMMU_PLATFORM, default false
# Enabling this will result in the VM device having iommu_platform=on set
#enable_iommu_platform = true

# List of valid annotations values for the vhost user store path
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: ["/var/run/kata-containers/vhost-user"]
valid_vhost_user_store_paths = ["/var/run/kata-containers/vhost-user"]

# Enable file based guest memory support. The default is an empty string which
# will disable this feature. In the case of virtio-fs, this is enabled
# automatically and '/dev/shm' is used as the backing folder.
# This option will be ignored if VM templating is enabled.
#file_mem_backend = ""

# List of valid annotations values for the file_mem_backend annotation
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: [""]
valid_file_mem_backends = [""]

# Enable swap of vm memory. Default false.
# The behaviour is undefined if mem_prealloc is also set to true
#enable_swap = true

# -pflash can add image file to VM. The arguments of it should be in format
# of ["/path/to/flash0.img", "/path/to/flash1.img"]
pflashes = []

# This option changes the default hypervisor and kernel parameters
# to enable debug output where available.
#
# Default false
#enable_debug = true

# Disable the customizations done in the runtime when it detects
# that it is running on top a VMM. This will result in the runtime
# behaving as it would when running on bare metal.
#
#disable_nesting_checks = true

# This is the msize used for 9p shares. It is the number of bytes
# used for 9p packet payload.
#msize_9p = 8192

# If false and nvdimm is supported, use nvdimm device to plug guest image.
# Otherwise virtio-block device is used.
# Default is false
#disable_image_nvdimm = true

# VFIO devices are hotplugged on a bridge by default.
# Enable hotplugging on root bus. This may be required for devices with
# a large PCI bar, as this is a current limitation with hotplugging on
# a bridge.
# Default false
#hotplug_vfio_on_root_bus = true

# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
# Use this parameter when using some large PCI bar devices, such as Nvidia GPU
# The value means the number of pcie_root_port
# This value is valid when hotplug_vfio_on_root_bus is true and machine_type is "q35"
# Default 0
#pcie_root_port = 2

# If vhost-net backend for virtio-net is not desired, set to true. Default is false, which trades off
# security (vhost-net runs ring0) for network I/O performance.
#disable_vhost_net = true
#
# Default entropy source.
# The path to a host source of entropy (including a real hardware RNG)
# /dev/urandom and /dev/random are two main options.
# Be aware that /dev/random is a blocking source of entropy. If the host
# runs out of entropy, the VMs boot time will increase leading to get startup
# timeouts.
# The source of entropy /dev/urandom is non-blocking and provides a
# generally acceptable source of entropy. It should work well for pretty much
# all practical purposes.
#entropy_source= "/dev/urandom"

# List of valid annotations values for entropy_source
# The default if not set is empty (all annotations rejected.)
# Your distribution recommends: ["/dev/urandom","/dev/random",""]
valid_entropy_sources = ["/dev/urandom","/dev/random",""]

# Path to OCI hook binaries in the *guest rootfs*.
# This does not affect host-side hooks which must instead be added to
# the OCI spec passed to the runtime.
#
# You can create a rootfs with hooks by customizing the osbuilder scripts:
# https://github.com/kata-containers/kata-containers/tree/main/tools/osbuilder
#
# Hooks must be stored in a subdirectory of guest_hook_path according to their
# hook type, i.e. "guest_hook_path/{prestart,poststart,poststop}".
# The agent will scan these directories for executable files and add them, in
# lexicographical order, to the lifecycle of the guest container.
# Hooks are executed in the runtime namespace of the guest. See the official documentation:
# https://github.com/opencontainers/runtime-spec/blob/v1.0.1/config.md#posix-platform-hooks
# Warnings will be logged if any error is encountered while scanning for hooks,
# but it will not abort container execution.
#guest_hook_path = "/usr/share/oci/hooks"
#
# Use rx Rate Limiter to control network I/O inbound bandwidth(size in bits/sec for SB/VM).
# In Qemu, we use classful qdiscs HTB(Hierarchy Token Bucket) to discipline traffic.
# Default 0-sized value means unlimited rate.
#rx_rate_limiter_max_rate = 0

# Use tx Rate Limiter to control network I/O outbound bandwidth(size in bits/sec for SB/VM).
# In Qemu, we use classful qdiscs HTB(Hierarchy Token Bucket) and ifb(Intermediate Functional Block)
# to discipline traffic.
# Default 0-sized value means unlimited rate.
#tx_rate_limiter_max_rate = 0

# Set where to save the guest memory dump file.
# If set, when GUEST_PANICKED event occurred,
# guest memeory will be dumped to host filesystem under guest_memory_dump_path,
# This directory will be created automatically if it does not exist.
#
# The dumped file(also called vmcore) can be processed with crash or gdb.
#
# WARNING:
#   Dump guest's memory can take very long depending on the amount of guest memory
#   and use much disk space.
#guest_memory_dump_path="/var/crash/kata"

# If enable paging.
# Basically, if you want to use "gdb" rather than "crash",
# or need the guest-virtual addresses in the ELF vmcore,
# then you should enable paging.
#
# See: https://www.qemu.org/docs/master/qemu-qmp-ref.html#Dump-guest-memory for details
#guest_memory_dump_paging=false

# Enable swap in the guest. Default false.
# When enable_guest_swap is enabled, insert a raw file to the guest as the swap device
# if the swappiness of a container (set by annotation "io.katacontainers.container.resource.swappiness")
# is bigger than 0.
# The size of the swap device should be
# swap_in_bytes (set by annotation "io.katacontainers.container.resource.swap_in_bytes") - memory_limit_in_bytes.
# If swap_in_bytes is not set, the size should be memory_limit_in_bytes.
# If swap_in_bytes and memory_limit_in_bytes is not set, the size should
# be default_memory.
#enable_guest_swap = true

[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel, initramfs and
# agent memory by mapping it readonly. It helps speeding up new container
# creation and saves a lot of memory if there are many kata containers running
# on the same host.
#
# When disabled, new VMs are created from scratch.
#
# Note: Requires "initrd=" to be set ("image=" is not supported).
#
# Default false
#enable_template = true

# Specifies the path of template.
#
# Default "/run/vc/vm/template"
#template_path = "/run/vc/vm/template"

# The number of caches of VMCache:
# unspecified or == 0 --> VMCache is disabled
# > 0                 --> will be set to the specified number
#
# VMCache is a function that creates VMs as caches before using it.
# It helps speed up new container creation.
# The function consists of a server and some clients communicating
# through Unix socket. The protocol is gRPC in protocols/cache/cache.proto.
# The VMCache server will create some VMs and cache them by factory cache.
# It will convert the VM to gRPC format and transport it when gets
# requestion from clients.
# Factory grpccache is the VMCache client. It will request gRPC format
# VM and convert it back to a VM. If VMCache function is enabled,
# kata-runtime will request VM from factory grpccache when it creates
# a new sandbox.
#
# Default 0
#vm_cache_number = 0

# Specify the address of the Unix socket that is used by VMCache.
#
# Default /var/run/kata-containers/cache.sock
#vm_cache_endpoint = "/var/run/kata-containers/cache.sock"

[agent.kata]
# If enabled, make the agent display debug-level messages.
# (default: disabled)
#enable_debug = true

# Enable agent tracing.
#
# If enabled, the default trace mode is "dynamic" and the
# default trace type is "isolated". The trace mode and type are set
# explicity with the `trace_type=` and `trace_mode=` options.
#
# Notes:
#
# - Tracing is ONLY enabled when `enable_tracing` is set: explicitly
#   setting `trace_mode=` and/or `trace_type=` without setting `enable_tracing`
#   will NOT activate agent tracing.
#
# - See https://github.com/kata-containers/agent/blob/master/TRACING.md for
#   full details.
#
# (default: disabled)
#enable_tracing = true
#
#trace_mode = "dynamic"
#trace_type = "isolated"

# Comma separated list of kernel modules and their parameters.
# These modules will be loaded in the guest kernel using modprobe(8).
# The following example can be used to load two kernel modules with parameters
#  - kernel_modules=["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915 enable_ppgtt=0"]
# The first word is considered as the module name and the rest as its parameters.
# Container will not be started when:
#  * A kernel module is specified and the modprobe command is not installed in the guest
#    or it fails loading the module.
#  * The module is not available in the guest or it doesn't met the guest kernel
#    requirements, like architecture and version.
#
kernel_modules=[]

# Enable debug console.
# If enabled, user can connect guest OS running inside hypervisor
# through "kata-runtime exec " command
#debug_console_enabled = true

# Agent connection dialing timeout value in seconds
# (default: 30)
#dial_timeout = 30

[netmon]
# If enabled, the network monitoring process gets started when the
# sandbox is created. This allows for the detection of some additional
# network being added to the existing network namespace, after the
# sandbox has been created.
# (default: disabled)
#enable_netmon = true

# Specify the path to the netmon binary.
path = "/opt/kata/libexec/kata-containers/kata-netmon"

# If enabled, netmon messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to the
# the container network interface
# Options:
#
#   - macvtap
#     Used when the Container network interface can be bridged using
#     macvtap.
#
#   - none
#     Used when customize network. Only creates a tap device. No veth pair.
#
#   - tcfilter
#     Uses tc filter rules to redirect traffic from the network interface
#     provided by plugin to a tap interface connected to the VM.
#
internetworking_model="tcfilter"

# disable guest seccomp
# Determines whether container seccomp profiles are passed to the virtual
# machine and applied by the kata agent. If set to true, seccomp is not applied
# within the guest
# (default: true)
disable_guest_seccomp=true

# If enabled, the runtime will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
# (default: disabled)
#enable_tracing = true

# Set the full url to the Jaeger HTTP Thrift collector.
# The default if not set will be "http://localhost:14268/api/traces"
#jaeger_endpoint = ""

# Sets the username to be used if basic auth is required for Jaeger.
#jaeger_user = ""

# Sets the password to be used if basic auth is required for Jaeger.
#jaeger_password = ""

# If enabled, the runtime will not create a network namespace for shim and hypervisor processes.
# This option may have some potential impacts to your host. It should only be used when you know what you're doing.
# `disable_new_netns` conflicts with `enable_netmon`
# `disable_new_netns` conflicts with `internetworking_model=tcfilter` and `internetworking_model=macvtap`. It works only
# with `internetworking_model=none`. The tap device will be in the host network namespace and can connect to a bridge
# (like OVS) directly.
# If you are using docker, `disable_new_netns` only works with `docker run --net=none`
# (default: false)
#disable_new_netns = true

# if enabled, the runtime will add all the kata processes inside one dedicated cgroup.
# The container cgroups in the host are not created, just one single cgroup per sandbox.
# The runtime caller is free to restrict or collect cgroup stats of the overall Kata sandbox.
# The sandbox cgroup path is the parent cgroup of a container with the PodSandbox annotation.
# The sandbox cgroup is constrained if there is no container type annotation.
# See: https://godoc.org/github.com/kata-containers/runtime/virtcontainers#ContainerType
sandbox_cgroup_only=false

# If specified, sandbox_bind_mounts identifieds host paths to be mounted (ro) into the sandboxes shared path.
# This is only valid if filesystem sharing is utilized. The provided path(s) will be bindmounted into the shared fs directory.
# If defaults are utilized, these mounts should be available in the guest at `/run/kata-containers/shared/containers/sandbox-mounts`
# These will not be exposed to the container workloads, and are only provided for potential guest services.
sandbox_bind_mounts=[]

# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,
# they may break compatibility, and are prepared for a big version bump.
# Supported experimental features:
# (default: [])
experimental=[]

# If enabled, user can run pprof tools with shim v2 process through kata-monitor.
# (default: false)
# enable_pprof = true

# WARNING: All the options in the following section have not been implemented yet.
# This section was added as a placeholder. DO NOT USE IT!
[image]
# Container image service.
#
# Offload the CRI image management service to the Kata agent.
# (default: false)
#service_offload = true

# Container image decryption keys provisioning.
# Applies only if service_offload is true.
# Keys can be provisioned locally (e.g. through a special command or
# a local file) or remotely (usually after the guest is remotely attested).
# The provision setting is a complete URL that lets the Kata agent decide
# which method to use in order to fetch the keys.
#
# Keys can be stored in a local file, in a measured and attested initrd:
#provision=data:///local/key/file
#
# Keys could be fetched through a special command or binary from the
# initrd (guest) image, e.g. a firmware call:
#provision=file:///path/to/bin/fetcher/in/guest
#
# Keys can be remotely provisioned. The Kata agent fetches them from e.g.
# a HTTPS URL:
#provision=https://my-key-broker.foo/tenant/
```

Config file `/usr/share/defaults/kata-containers/configuration.toml` not found

---

# Containerd shim v2

Containerd shim v2 is `/usr/local/bin/containerd-shim-kata-v2`.

containerd-shim-kata-v2 --version

```
Kata Containers containerd shim: id: "io.containerd.kata.v2", version: 2.2.0, commit:
```

---

# KSM throttler

## version

## systemd service

# Image details

```yaml
---
osbuilder:
  url: "https://github.com/kata-containers/kata-containers/tools/osbuilder"
  version: "2.2.0-caafd0f9525dfdd26b9e0fd22930b9995f1399bb"
rootfs-creation-time: "2021-08-31T22:52:28.939000501+0000Z"
description: "osbuilder rootfs"
file-format-version: "0.0.2"
architecture: "x86_64"
base-distro:
  name: "Clear"
  version: "35000"
  packages:
    default:
      - "chrony"
      - "iptables-bin"
      - "kmod-bin"
      - "libudev0-shim"
      - "systemd"
      - "util-linux-bin"
    extra:
agent:
  url: "https://github.com/kata-containers/kata-containers"
  name: "kata-agent"
  version: "2.2.0"
  agent-is-init-daemon: "no"
```

---

# Initrd details

No initrd

---

# Logfiles

## Runtime logs

Recent runtime problems found in system journal:

```
```

## Throttler logs

No recent throttler problems found in system journal.

## Kata Containerd Shim v2 logs
Recent problems found in system journal:

```
```

---

# Container manager details

## Kubernetes

kubectl version

```
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4+k3s1", GitCommit:"3e250fdbab72d88f7e6aae57446023a0567ffc97", GitTreeState:"clean", BuildDate:"2021-08-19T19:09:53Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.4+k3s1", GitCommit:"3e250fdbab72d88f7e6aae57446023a0567ffc97", GitTreeState:"clean", BuildDate:"2021-08-19T19:09:53Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
```

kubectl config view

```yaml
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://127.0.0.1:6443
  name: default
contexts:
- context:
    cluster: default
    user: default
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: default
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
```

systemctl show kubelet

```
LoadError=org.freedesktop.systemd1.NoSuchUnit "Unit kubelet.service not found."
```

---

# Packages

Have `dpkg`

dpkg -l|egrep "(cc-oci-runtime|cc-runtime|runv|kata-runtime|kata-ksm-throttler|kata-containers-image|linux-container|qemu-)"

```
ii  ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu4 all PXE boot firmware - Compat EFI ROM images for qemu
ii  qemu-block-extra:amd64 1:4.2-3ubuntu6.17 amd64 extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm 1:4.2-3ubuntu6.17 amd64 QEMU Full virtualization on x86 hardware
ii  qemu-system-common 1:4.2-3ubuntu6.17 amd64 QEMU full system emulation binaries (common files)
ii  qemu-system-data 1:4.2-3ubuntu6.17 all QEMU full system emulation (data files)
ii  qemu-system-gui:amd64 1:4.2-3ubuntu6.17 amd64 QEMU full system emulation binaries (user interface and audio support)
ii  qemu-system-x86 1:4.2-3ubuntu6.17 amd64 QEMU full system emulation binaries (x86)
ii  qemu-utils 1:4.2-3ubuntu6.17 amd64 QEMU utilities
```

No `rpm`

---

# Kata Monitor

Kata Monitor `kata-monitor`.

kata-monitor --version

```
/usr/local/bin/kata-collect-data.sh: line 218: kata-monitor: command not found
```

---

Description of problem

Attempting to run the kubernetes/ingress-nginx Helm chart, version 3.11.1 (https://github.com/kubernetes/ingress-nginx/tree/ingress-nginx-3.11.1), with Kata 2.2 as the runtime class does not work for me.

Specifically, the non-root user (www-data) inside the nginx controller pod's container seems to lack the permissions or capabilities needed to execute the application despite being configured with a securityContext that works when using runC.

Expected result

The container launches and executes as normal.

Actual result

The container application fails to bind to port 80 with the error

[emerg] 65#65: bind() to 0.0.0.0:80 failed (13: Permission denied) 

when running in the default configuration.

The container application fails to access parts of the filesystem with the error

open /etc/ingress-controller/ssl/default-fake-certificate.pem: permission denied 

when attempting to run as UID 0.

Further information

Modifying https://github.com/kubernetes/ingress-nginx/blob/ingress-nginx-3.11.1/charts/ingress-nginx/templates/controller-deployment.yaml#L119 by setting both securityContext.capabilities.add to [ALL] and securityContext.runAsUser to 0 allows the container application to execute properly.

This issue may be worked around when/if kubernetes/ingress-nginx accepts https://github.com/kubernetes/ingress-nginx/pull/7533, but I raised it here because I was unsure whether this is intended and/or known behavior of the NET_BIND_SERVICE capability under Kata.

If there is a documented list of differences in securityContext capability support/behavior between runC and Kata that I was unable to find, or a similar existing issue that I missed, I apologize.

This kata-containers deployment was installed by manually unpacking a release tarball.

Thank you for your time and effort.

c3d commented 2 years ago

Assigned both area/networking and area/storage since there seems to be both a networking permission failure and a storage permission issue.

@jlclx Could you give us the exact method you used to run your workload? I'm notably looking for details on whether it runs "privileged", and about the storage configuration. I don't think this is captured by kata-collect-data.sh. Thanks.

jlclx commented 2 years ago

Thank you @c3d!

Regarding the privileged option for the securityContext

The fully rendered securityContext from the Helm template appears to be:

securityContext:
  allowPrivilegeEscalation: true
  capabilities:
    add:
    - NET_BIND_SERVICE
    drop:
    - ALL
  runAsUser: 101

The privileged option does not seem to be in play here.

Regarding the storage configuration on the machine

The primary storageClass on the system is https://github.com/rancher/local-path-provisioner, but no PVCs are used in this configuration. As for containerd's general storage, it is on an ext4 filesystem with default mount options. The container root is mounted via virtiofs:

kataShared on / type virtiofs (rw,relatime)

Further workload information

k8s.gcr.io/ingress-nginx/controller:v0.41.2@sha256:1f4f402b9c14f3ae92b11ada1dfe9893a88f0faeb0b2f4b903e2c67a0c3bf0de seems to be the image in use, which looks to be built from https://github.com/kubernetes/ingress-nginx/blob/controller-v0.41.2/rootfs/Dockerfile. From what I can tell, this Dockerfile sets the correct capabilities on the binaries in question:

RUN apk add --no-cache libcap \
  && setcap    cap_net_bind_service=+ep /nginx-ingress-controller \
  && setcap -v cap_net_bind_service=+ep /nginx-ingress-controller \
  && setcap    cap_net_bind_service=+ep /usr/local/nginx/sbin/nginx \
  && setcap -v cap_net_bind_service=+ep /usr/local/nginx/sbin/nginx \
  && apk del libcap
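
As background (my summary, not from the Dockerfile itself): setcap stores file capabilities in the `security.capability` extended attribute, so they are only visible to processes if the filesystem serving the binary preserves xattrs. A minimal sketch on an ordinary Linux host, using a hypothetical `./demo-binary` and the libcap/attr tools:

```sh
# Hypothetical demo: file capabilities are just an xattr on the file.
cp /bin/true ./demo-binary
sudo setcap cap_net_bind_service=+ep ./demo-binary
getcap ./demo-binary            # should list cap_net_bind_service
# The capability lives in the security.capability extended attribute:
getfattr -n security.capability -e hex ./demo-binary
```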

but when inspecting the container at run time, there is a difference in the applied permitted and effective capabilities.

I've emphasized these differences below with an added `<---`

kata-containers

bash-5.0$ ps aux
PID   USER     TIME  COMMAND
    1 www-data  0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=my-apps/test-nginx-ingress-nginx
    2 www-data  0:00 /nginx-ingress-controller --publish-service=my-apps/test-nginx-ingress-nginx-controller --election
   21 www-data  0:00 nginx: master process /usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
   29 www-data  0:00 nginx: worker process
   31 www-data  0:00 bash
   60 www-data  0:00 ps aux
bash-5.0$ cat /proc/1/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000400
CapAmb: 0000000000000000
bash-5.0$ cat /proc/2/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000000       <---
CapEff: 0000000000000000       <---
CapBnd: 0000000000000400
CapAmb: 0000000000000000
bash-5.0$ cat /proc/21/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000000       <---
CapEff: 0000000000000000       <---
CapBnd: 0000000000000400
CapAmb: 0000000000000000

runC

bash-5.0$ ps aux
PID   USER     TIME  COMMAND
    1 www-data  0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=my-apps/test-nginx-ingress-nginx
    7 www-data  0:01 /nginx-ingress-controller --publish-service=my-apps/test-nginx-ingress-nginx-controller --election
   50 www-data  0:00 nginx: master process /usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
   60 www-data  0:00 nginx: worker process
   61 www-data  0:00 nginx: worker process
bash-5.0$ cat /proc/1/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000400
CapAmb: 0000000000000000
bash-5.0$ cat /proc/7/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000400       <---
CapEff: 0000000000000400       <---
CapBnd: 0000000000000400
CapAmb: 0000000000000000
bash-5.0$ cat /proc/50/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000400       <---
CapEff: 0000000000000400       <---
CapBnd: 0000000000000400
CapAmb: 0000000000000000
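
As an aside for anyone decoding the bitmasks above: 0x400 is bit 10, i.e. CAP_NET_BIND_SERVICE, so under Kata the capability stays inheritable and bounding but never becomes permitted or effective. libcap's capsh can decode such values (illustrative command, not part of my original test):

```sh
# Decode a capability bitmask; capsh ships with the libcap tools.
capsh --decode=0000000000000400
# 0x0000000000000400=cap_net_bind_service
```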

Attempting to query the file capabilities of a binary also fails:

kata-containers

bash-5.0# getcap /nginx-ingress-controller
Failed to get capabilities of file `/nginx-ingress-controller' (Not supported)

runC

bash-5.0# getcap /nginx-ingress-controller
/nginx-ingress-controller = cap_net_bind_service+ep

Please let me know if you need any further information.

jlclx commented 2 years ago

I believe I've discovered the issue, @c3d. The default Cloud Hypervisor configuration passes the following extra flags to virtiofsd:

virtio_fs_extra_args = ["--thread-pool-size=1"]

virtiofsd's documentation at https://qemu.readthedocs.io/en/latest/tools/virtiofsd.html indicates that the default behavior is to strip extended attributes rather than pass them through, which would include capability information:

xattr|no_xattr - Enable/disable extended attributes (xattr) on files and directories. The default is no_xattr.
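
To make the same point outside of Kata (a sketch with hypothetical socket and source paths, using the same C virtiofsd binary Kata ships): with the default no_xattr behavior the daemon never forwards security.capability, so xattr passthrough has to be requested explicitly:

```sh
# Hypothetical standalone invocation of the virtiofsd that Kata ships;
# "-o xattr" enables extended-attribute passthrough. Without it,
# security.capability (file capabilities) never reaches the guest.
/opt/kata/libexec/kata-qemu/virtiofsd \
    --socket-path=/tmp/virtiofsd.sock \
    -o source=/var/lib/demo-share \
    -o xattr \
    --thread-pool-size=1
```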

The following configuration change resolves the issue for the container:

virtio_fs_extra_args = ["--thread-pool-size=1", "-o", "xattr"]
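
A quick way to confirm the flag actually reached the daemon (a generic check, not from my original notes) is to look at the running virtiofsd command line on the host after recreating the pod:

```sh
# The host-side virtiofsd for the sandbox should now show "-o xattr"
# among its arguments.
pgrep -af virtiofsd
```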

As you can see, the expected permitted and effective capabilities are now applied when using kata-containers:

PID   USER     TIME  COMMAND
    1 www-data  0:00 /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=my-apps/test-nginx-ingress-nginx
    2 www-data  0:00 /nginx-ingress-controller --publish-service=my-apps/test-nginx-ingress-nginx-controller --election
   20 www-data  0:00 nginx: master process /usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
   32 www-data  0:00 nginx: worker process
   33 www-data  0:00 nginx: cache manager process
   34 www-data  0:00 nginx: cache loader process
   67 www-data  0:00 bash
   74 www-data  0:00 ps aux
bash-5.0$ cat /proc/2/status | grep Cap
CapInh: 0000000000000400
CapPrm: 0000000000000400
CapEff: 0000000000000400
CapBnd: 0000000000000400
CapAmb: 0000000000000000
bash-5.0$ uname -a
Linux test-nginx-ingress-nginx-controller-695d7b6f98-l4cvj 5.10.25 #1 SMP Fri Jun 11 20:36:25 UTC 2021 x86_64 Linux
bash-5.0$ mount
kataShared on / type virtiofs (rw,relatime)
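
With xattr passthrough enabled, I would expect the earlier getcap failure to go away inside the container as well (expected output sketched here to mirror the runC result above, not re-captured in this run):

```sh
bash-5.0$ getcap /nginx-ingress-controller
/nginx-ingress-controller = cap_net_bind_service+ep
```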

Should this be made the default, since it otherwise breaks expected capability behavior in container images that do not run as root?