kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

elasticsearch container fails to start due to `vm.max_map_count` too low #1342

Closed: zeigerpuppy closed this issue 3 years ago

zeigerpuppy commented 5 years ago

Description of problem

When starting an elasticsearch container using the image `docker.elastic.co/elasticsearch/elasticsearch-oss:6.1.3`, it fails to start with the error:

[1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

This setting is normally changed on the Docker host with something like:

sysctl -w vm.max_map_count=524288

and made permanent in /etc/sysctl.conf
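For reference, a typical persistent setup on a plain Docker host (standard sysctl usage, shown with the value from the command above):

# append to /etc/sysctl.conf (or a drop-in file under /etc/sysctl.d/)
vm.max_map_count=524288

# apply without rebooting
sysctl -p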

However, changing this setting on the kata-containers host does not have the desired effect (presumably because the change needs to be made inside the Kata KVM guest that hosts the container).
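A quick way to confirm this is to read the value from inside a Kata container, which reflects the guest VM kernel rather than the host (a rough sketch, using the runtime name from this setup):

sysctl vm.max_map_count
# host reports the raised value, e.g. vm.max_map_count = 524288

docker run --runtime kata-runtime --rm busybox sysctl vm.max_map_count
# guest still reports the default: vm.max_map_count = 65530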

Expected result

It should be possible to set this variable in the kata config, or the default image should ship with a larger value of vm.max_map_count. The elasticsearch instance would then start without error.

Actual result

The container fails to start.


Meta details

Running kata-collect-data.sh version 1.3.1 (commit 258eae0) at 2019-03-08.13:23:31.638048064+1100.


Runtime is /usr/bin/kata-runtime.

kata-env

Output of "/usr/bin/kata-runtime kata-env":

[Meta]
  Version = "1.0.18"

[Runtime]
  Debug = false
  Path = "/usr/bin/kata-runtime"
  [Runtime.Version]
    Semver = "1.3.1"
    Commit = "258eae0"
    OCI = "1.0.1"
  [Runtime.Config]
    Path = "/usr/share/defaults/kata-containers/configuration.toml"

[Hypervisor]
  MachineType = "pc"
  Version = "QEMU emulator version 2.11.0\nCopyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers"
  Path = "/usr/bin/qemu-lite-system-x86_64"
  BlockDeviceDriver = "virtio-scsi"
  EntropySource = "/dev/urandom"
  Msize9p = 8192
  MemorySlots = 10
  Debug = false
  UseVSock = false

[Image]
  Path = "/usr/share/kata-containers/kata-containers-image_clearlinux_1.3.1_agent_c7fdd324cda.img"

[Kernel]
  Path = "/usr/share/kata-containers/vmlinuz-4.14.67.16-139.container"
  Parameters = ""

[Initrd]
  Path = ""

[Proxy]
  Type = "kataProxy"
  Version = "kata-proxy version 1.3.1-d364b2e"
  Path = "/usr/libexec/kata-containers/kata-proxy"
  Debug = false

[Shim]
  Type = "kataShim"
  Version = "kata-shim version 1.3.1-58f757d"
  Path = "/usr/libexec/kata-containers/kata-shim"
  Debug = false

[Agent]
  Type = "kata"

[Host]
  Kernel = "4.9.0-8-amd64"
  Architecture = "amd64"
  VMContainerCapable = true
  SupportVSocks = false
  [Host.Distro]
    Name = "Debian GNU/Linux"
    Version = "9"
  [Host.CPU]
    Vendor = "GenuineIntel"
    Model = "Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz"

[Netmon]
  Version = "kata-netmon version 1.3.1"
  Path = "/usr/libexec/kata-containers/kata-netmon"
  Debug = false
  Enable = false

Runtime config files

Runtime default config files

/etc/kata-containers/configuration.toml
/usr/share/defaults/kata-containers/configuration.toml

Runtime config file contents

Config file /etc/kata-containers/configuration.toml not found.

Output of "cat "/usr/share/defaults/kata-containers/configuration.toml"":

# Copyright (c) 2017-2018 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#

# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "cli/config/configuration.toml.in"
# XXX: Project:
# XXX:   Name: Kata Containers
# XXX:   Type: kata

[hypervisor.qemu]
path = "/usr/bin/qemu-lite-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinuz.container"
image = "/usr/share/kata-containers/kata-containers.img"
machine_type = "pc"

# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""

# Path to the firmware.
# If you want qemu to use the default firmware, leave this option empty.
firmware = ""

# Machine accelerators
# comma-separated list of machine accelerators to pass to the hypervisor.
# For example, `machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"`
machine_accelerators=""

# Default number of vCPUs per SB/VM:
# unspecified or 0                --> will be set to 1
# < 0                             --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores
default_vcpus = 1

# Default maximum number of vCPUs per SB/VM:
# unspecified or == 0             --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# WARNING: Depending on the architecture, the maximum number of vCPUs supported by KVM is used when
# the actual number of physical cores is greater than it.
# WARNING: Be aware that this value impacts the virtual machine's memory footprint and CPU
# hotplug functionality. For example, `default_maxvcpus = 240` specifies that up to 240 vCPUs
# can be added to a SB/VM, but the memory footprint will be big. Another example: with
# `default_maxvcpus = 8` the memory footprint will be small, but 8 will be the maximum number of
# vCPUs supported by the SB/VM. In general, we recommend that you do not edit this variable
# unless you know what you are doing.
default_maxvcpus = 0

# Bridges can be used to hot plug devices.
# Limitations:
# * Currently only pci bridges are supported
# * Up to 30 devices per bridge can be hot plugged.
# * Up to 5 PCI bridges can be cold plugged per VM.
#   This limitation could be a bug in qemu or in the kernel
# Default number of bridges per SB/VM:
# unspecified or 0   --> will be set to 1
# > 1 <= 5           --> will be set to the specified number
# > 5                --> will be set to 5
default_bridges = 1

# Default memory size in MiB for SB/VM.
# If unspecified, it will be set to 2048 MiB.
#default_memory = 2048
#
# Default memory slots per SB/VM.
# If unspecified, it will be set to 10.
# This determines how many times memory can be hot-added to the sandbox/VM.
#memory_slots = 10

# Disable block device from being used for a container's rootfs.
# In case of a storage driver like devicemapper where a container's
# root file system is backed by a block device, the block device is passed
# directly to the hypervisor for performance reasons.
# This flag prevents the block device from being passed to the hypervisor;
# 9pfs is used instead to pass the rootfs.
disable_block_device_use = false

# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is either virtio-scsi or
# virtio-blk.
block_device_driver = "virtio-scsi"

# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently only implemented
# for SCSI.
#
enable_iothreads = false

# Enable pre allocation of VM RAM, default false
# Enabling this will result in lower container density
# as all of the memory will be allocated and locked
# This is useful when you want to reserve all the memory
# upfront or in the cases where you want memory latencies
# to be very predictable
# Default false
#enable_mem_prealloc = true

# Enable huge pages for VM RAM, default false
# Enabling this will result in the VM memory
# being allocated using huge pages.
# This is useful when you want to use vhost-user network
# stacks within the container. This will automatically
# result in memory pre allocation
#enable_hugepages = true

# Enable swap of vm memory. Default false.
# The behaviour is undefined if mem_prealloc is also set to true
#enable_swap = true

# This option changes the default hypervisor and kernel parameters
# to enable debug output where available. This extra output is added
# to the proxy logs, but only when proxy debug is also enabled.
#
# Default false
#enable_debug = true

# Disable the customizations done in the runtime when it detects
# that it is running on top of a VMM. This will result in the runtime
# behaving as it would when running on bare metal.
#
#disable_nesting_checks = true

# This is the msize used for 9p shares. It is the number of bytes
# used for 9p packet payload.
#msize_9p = 8192

# If true and vsocks are supported, use vsocks to communicate directly
# with the agent and no proxy is started, otherwise use unix
# sockets and start a proxy to communicate with the agent.
# Default false
#use_vsock = true

# VFIO devices are hotplugged on a bridge by default.
# Enable hotplugging on root bus. This may be required for devices with
# a large PCI bar, as this is a current limitation with hotplugging on
# a bridge. This value is valid for "pc" machine type.
# Default false
#hotplug_vfio_on_root_bus = true

# If the host doesn't support vhost_net, set this to true; then we won't create vhost fds for nics.
# Default false
#disable_vhost_net = true
#
# Default entropy source.
# The path to a host source of entropy (including a real hardware RNG);
# /dev/urandom and /dev/random are the two main options.
# Be aware that /dev/random is a blocking source of entropy. If the host
# runs out of entropy, the VM's boot time will increase, possibly leading to
# startup timeouts.
# The source of entropy /dev/urandom is non-blocking and provides a
# generally acceptable source of entropy. It should work well for pretty much
# all practical purposes.
#entropy_source= "/dev/urandom"

[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel, initramfs and
# agent memory by mapping it readonly. It helps speed up new container
# creation and saves a lot of memory if there are many kata containers running
# on the same host.
#
# When disabled, new VMs are created from scratch.
#
# Default false
#enable_template = true

[proxy.kata]
path = "/usr/libexec/kata-containers/kata-proxy"

# If enabled, proxy messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[shim.kata]
path = "/usr/libexec/kata-containers/kata-shim"

# If enabled, shim messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[agent.kata]
# There is no field for this section. The goal is only to be able to
# specify which type of agent the user wants to use.

[netmon]
# If enabled, the network monitoring process gets started when the
# sandbox is created. This allows for the detection of additional
# networks being added to the existing network namespace after the
# sandbox has been created.
# (default: disabled)
#enable_netmon = true

# Specify the path to the netmon binary.
path = "/usr/libexec/kata-containers/kata-netmon"

# If enabled, netmon messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to the
# container network interface
# Options:
#
#   - bridged
#     Uses a linux bridge to interconnect the container interface to
#     the VM. Works for most cases except macvlan and ipvlan.
#
#   - macvtap
#     Used when the Container network interface can be bridged using
#     macvtap.
internetworking_model="macvtap"

# If enabled, the runtime will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
# (default: disabled)
#enable_tracing = true

KSM throttler

version

Output of "/usr/libexec/cc-ksm-throttler/cc-ksm-throttler --version":

cc-ksm-throttler version 0.0.1

Output of "/usr/libexec/kata-ksm-throttler/kata-ksm-throttler --version":

kata-ksm-throttler version 1.3.0-6e903fb

systemd service

Image details

---
osbuilder:
  url: "https://github.com/kata-containers/osbuilder"
  version: "unknown"
rootfs-creation-time: "2018-10-22T21:13:25.475975441+0000Z"
description: "osbuilder rootfs"
file-format-version: "0.0.2"
architecture: "x86_64"
base-distro:
  name: "Clear"
  version: "25740"
  packages:
    default:
      - "iptables-bin"
      - "libudev0-shim"
      - "systemd"
    extra:

agent:
  url: "https://github.com/kata-containers/agent"
  name: "kata-agent"
  version: "1.3.1-c7fdd324cda8e2ef01203a86d97b03a392e6eb39"
  agent-is-init-daemon: "no"

Initrd details

No initrd


Logfiles

Runtime logs

/usr/bin/kata-collect-data.sh: line 244: journalctl: command not found
No recent runtime problems found in system journal.

Proxy logs

/usr/bin/kata-collect-data.sh: line 244: journalctl: command not found
No recent proxy problems found in system journal.

Shim logs

/usr/bin/kata-collect-data.sh: line 244: journalctl: command not found
No recent shim problems found in system journal.

Throttler logs

/usr/bin/kata-collect-data.sh: line 244: journalctl: command not found
No recent throttler problems found in system journal.


Container manager details

Have docker

Docker

Output of "docker version":

Client:
 Version:   17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:11:19 2017
 OS/Arch:   linux/amd64

Server:
 Engine:
  Version:  17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:    Wed Dec 27 20:09:54 2017
  OS/Arch:  linux/amd64
  Experimental: false

Output of "docker info":

Containers: 38
 Running: 25
 Paused: 0
 Stopped: 13
Images: 551
Server Version: 17.12.0-ce
Storage Driver: devicemapper
 Pool Name: docker--vg-docker--pool
 Pool Blocksize: 524.3kB
 Base Device Size: 42.95GB
 Backing Filesystem: ext4
 Udev Sync Supported: true
 Data Space Used: 61.96GB
 Data Space Total: 322.1GB
 Data Space Available: 260.1GB
 Metadata Space Used: 19.37MB
 Metadata Space Total: 109.1MB
 Metadata Space Available: 89.68MB
 Thin Pool Minimum Free Space: 32.2GB
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.137 (2016-11-30)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc cc-runtime docker_runc kata-runtime
Default Runtime: kata-runtime
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: 258eae0 (expected: b2567b37d7b75eb4cf325b77297b140ea686ce8f)
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-8-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 376.6GiB
Name: <REDACTED>
ID: <REDACTED>
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 195
 Goroutines: 171
 System Time: 2019-03-08T13:23:32.436277005+11:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Registry Mirrors:
 http://docker-registry:5000/
Live Restore Enabled: false

WARNING: No swap limit support

Output of "systemctl show docker":

/usr/bin/kata-collect-data.sh: line 168: systemctl: command not found

No kubectl


Packages

Have dpkg

Output of "dpkg -l|egrep "(cc-oci-runtimecc-runtimerunv|kata-proxy|kata-runtime|kata-shim|kata-ksm-throttler|kata-containers-image|linux-container|qemu-)"":

ii  kata-containers-image                   1.3.1-36                          amd64        Kata containers image
ii  kata-ksm-throttler                      1.3.1.git+6e903fb-37              amd64
ii  kata-linux-container                    4.14.67.16-139                    amd64        linux kernel optimised for container-like workloads.
ii  kata-proxy                              1.3.1+git.d364b2e-36              amd64
ii  kata-runtime                            1.3.1+git.258eae0-51              amd64
ii  kata-shim                               1.3.1+git.58f757d-37              amd64
ii  linux-container                         4.14.22-86                        amd64        linux kernel optimised for container-like workloads.
ii  qemu-kvm                                1:2.8+dfsg-6+deb9u5               amd64        QEMU Full virtualization on x86 hardware
ii  qemu-lite                               2.11.0+git.f886228056-52          amd64        linux kernel optimised for container-like workloads.
ii  qemu-system-common                      1:2.8+dfsg-6+deb9u5               amd64        QEMU full system emulation binaries (common files)
ii  qemu-system-x86                         1:2.8+dfsg-6+deb9u5               amd64        QEMU full system emulation binaries (x86)
ii  qemu-utils                              1:2.8+dfsg-6+deb9u5               amd64        QEMU utilities
ii  qemu-vanilla                            2.11.2+git.0982a56a55-50          amd64        linux kernel optimised for container-like workloads.

Have rpm

Output of "rpm -qa|egrep "(cc-oci-runtimecc-runtimerunv|kata-proxy|kata-runtime|kata-shim|kata-ksm-throttler|kata-containers-image|linux-container|qemu-)"":


caoruidong commented 5 years ago

See https://github.com/kata-containers/documentation/blob/master/Limitations.md#docker-run-and-sysctl. But we are working on it.

zeigerpuppy commented 5 years ago

Thanks for the pointer, although I think this issue may be a little different, as it's necessary to set vm.max_map_count on the host, not in the guest container. In this case the "host" means the KVM image, so modifying /usr/share/kata-containers/kata-containers-image_clearlinux_1.6.0-rc1_agent_a2037c08531.img should accomplish the fix without having to hook into the docker sysctl issues.

Out of interest, is there a simple way to loop mount and modify the .img file? I have been struggling to mount it.

zeigerpuppy commented 5 years ago

Answering my own question regarding mounting the image:

cd /usr/share/kata-containers
mkdir -p mount
# map the image's partitions as /dev/mapper/loopNpM device nodes
kpartx -av kata-containers-image_clearlinux_1.6.0-rc1_agent_a2037c08531.img
mount /dev/mapper/loop0p1 ./mount/
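and to undo the mapping when finished:

umount ./mount
kpartx -dv kata-containers-image_clearlinux_1.6.0-rc1_agent_a2037c08531.img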
jodh-intel commented 5 years ago

Or use losetup as kata-collect-data.sh does to read the image metadata:
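For example (a sketch of the losetup approach; the loop device name is an assumption, check losetup -l for the one actually allocated):

# -f picks the first free loop device, -P maps partitions as /dev/loopNpM
sudo losetup -fP kata-containers-image_clearlinux_1.6.0-rc1_agent_a2037c08531.img
losetup -l                      # confirm which loop device was allocated
sudo mount /dev/loop0p1 /mnt    # assumes loop0 was assigned
# ... inspect or edit the rootfs under /mnt ...
sudo umount /mnt
sudo losetup -d /dev/loop0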

grahamwhaley commented 5 years ago

I use losetup - makes it much much easier.

zeigerpuppy commented 5 years ago

I fixed this by modifying the kata-containers clearlinux image.

I added the following to the image in /usr/lib/sysctl.d/50-default.conf:

# fix for elasticsearch
vm.max_map_count=262144
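After unmounting the image, a new container picks up the value, which can be checked with something like:

docker run --runtime kata-runtime --rm busybox sysctl vm.max_map_count
# vm.max_map_count = 262144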
grahamwhaley commented 5 years ago

@zeigerpuppy - nice work :-) Yeah, looks like docker sysctl only supports the namespaced sysctls, and vm.* is not one of them. I don't believe we have any other method in place with Kata right now to have per-container 'Host VM' side modifications like this sysctl. Hmm, I wonder if any of the callback hooks are run at that level? I can't think of a nice way to add a method either, or at least not one that won't potentially introduce a big security hole ;-)
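To illustrate (a sketch; the exact error text varies by Docker version):

docker run --rm --sysctl kernel.shm_rmid_forced=1 busybox true   # namespaced: accepted
docker run --rm --sysctl vm.max_map_count=262144 busybox true    # vm.* is not namespaced: docker rejects it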

Now, for pods at least, we do have the ability to have a 'kernel-per-pod', that is, to specify which kernel is used for which pod (and use the default if none is specified). I'm not sure if we also support per-pod images (for the rootfs). If we don't, maybe that is a solution for the k8s side at least?

amshinde commented 5 years ago

@zeigerpuppy In case you are using k8s, you can set the sysctl using a privileged init container; your app containers can then run as non-privileged. Something like:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-kata 
spec:
  runtimeClassName: kata-qemu
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
  containers:
  - name: first-test-container
    image: debian
    command:
        - sleep
        - "3000"
  initContainers:
  - name: init-mysys
    securityContext:
      privileged: true
    image: busybox
    command: ['sh', '-c', 'echo "64000" > /proc/sys/vm/max_map_count']

Using this pattern, you can set non-namespaced sysctls for Kata without affecting the host or other pods. I started a document explaining this yesterday.
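For example (assuming the manifest above is saved as busybox-kata.yaml):

kubectl apply -f busybox-kata.yaml
kubectl exec busybox-kata -c first-test-container -- cat /proc/sys/vm/max_map_count
# 64000, the value written by the init container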