hashicorp / nomad-driver-lxc

HashiCorp Nomad LXC driver plugin
Mozilla Public License 2.0

nomad-driver-lxc compatibility with LXC 4.0 #30

Open kwianeck opened 3 years ago

kwianeck commented 3 years ago

According to https://discuss.linuxcontainers.org/t/lxc-4-0-lts-has-been-released/7182, the cgroup layout for a container and its monitor process have been separated.

My observation is that the driver does not ask LXC to fully clean up a container. When a task is stopped, LXC removes the container's cgroup but leaves the lxc.monitor cgroups behind under /sys/fs/cgroup//. Over time this means hundreds of leftover lxc.monitor entries can pile up for containers that were removed long ago.

ex.

nomad job definition:
job "alpine2" {
  datacenters = ["DC1"]
  type = "service"

  group "lxc-alpine2" {
    count = 1

    task "lxc-alpine2" {
      driver = "lxc"
      config {
        log_level = "trace"
        verbosity = "verbose"
        template = "/usr/share/lxc/templates/lxc-alpine"
      }
      resources {
        cpu      = 500
        memory   = 256
      }
    }
  }
}

Output after alpine2 removal (nomad job stop alpine2)

nomad-lxc-client:/sys/fs/cgroup/devices# lxc-ls
alpine1
nomad-lxc-client:/sys/fs/cgroup/devices# ls | grep lxc.
lxc.monitor.alpine1
lxc.monitor.lxc-alpine2-b88164f4-99ce-c0d8-d8a1-d68df8762bab
lxc.monitor.lxc-container-58c0cd07-beae-5638-317f-34bde2622e06
lxc.monitor.lxc-container-992811d2-3ec6-995f-69f6-08bbfc4d1521
lxc.payload.alpine1
lxc.pivot
nomad-lxc-client:/sys/fs/cgroup/devices# 

as you can see, the lxc.payload directory is gone (the container's own cgroups), but the lxc.monitor cgroups remain

using LXC version 4.0.6
...
--- Control groups ---
Cgroups: enabled

Cgroup v1 mount points: 
/sys/fs/cgroup/systemd
/sys/fs/cgroup/pids
/sys/fs/cgroup/blkio
/sys/fs/cgroup/perf_event
/sys/fs/cgroup/cpu,cpuacct
/sys/fs/cgroup/net_cls,net_prio
/sys/fs/cgroup/freezer
/sys/fs/cgroup/cpuset
/sys/fs/cgroup/memory
/sys/fs/cgroup/rdma
/sys/fs/cgroup/devices
/sys/fs/cgroup/hugetlb

Cgroup v2 mount points: 
/sys/fs/cgroup/unified

Cgroup v1 clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled
...
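Until the driver cleans these up itself, a hedged workaround sketch is to remove lxc.monitor.* cgroup directories that no longer contain any processes. The controller path below mirrors the devices hierarchy listed above; this is a standalone helper, not part of the driver:

// monitor_cgroup_cleanup.go - a standalone workaround sketch, not driver code.
// Removes empty lxc.monitor.* cgroup directories left behind after containers
// were destroyed. Assumes the cgroup v1 layout shown in the listing above.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	base := "/sys/fs/cgroup/devices" // one v1 controller; repeat for the other mount points if needed

	entries, err := os.ReadDir(base)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	for _, e := range entries {
		if !e.IsDir() || !strings.HasPrefix(e.Name(), "lxc.monitor.") {
			continue
		}
		dir := filepath.Join(base, e.Name())

		// cgroup.procs lists the PIDs still attached to this cgroup;
		// only touch directories with no remaining monitor process.
		procs, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
		if err != nil || strings.TrimSpace(string(procs)) != "" {
			continue
		}

		// An empty cgroup directory can be removed with a plain rmdir.
		if err := os.Remove(dir); err != nil {
			fmt.Fprintf(os.Stderr, "could not remove %s: %v\n", dir, err)
			continue
		}
		fmt.Println("removed stale monitor cgroup:", dir)
	}
}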
kwianeck commented 3 years ago

From what I can see, nomad-driver-lxc does not release its handle to the most recently created container, so its /sys/fs/cgroup//lxc.monitor... cgroup stays occupied, as shown below:

├─lxc.monitor.b1-b74fc4d5-5dc6-715f-f4ab-9024b01ba7a6
│ ├─25364 /opt/nomad/data/plugins/nomad-driver-lxc
│ └─28310 [lxc monitor] /var/lib/lxc b1-b74fc4d5-5dc6-715f-f4ab-9024b01ba7a6
└─lxc.monitor.b2-c0881597-2453-9836-a739-c362bb2dd990
  └─27746 [lxc monitor] /var/lib/lxc b2-c0881597-2453-9836-a739-c362bb2dd990

b1 was created after b2. As you can see, b2 is no longer occupied by the driver's process. To properly clean up a container (remove all of its artifacts from the Nomad client) I need to either restart Nomad, which also restarts the driver's process, or create another container so that the driver releases its handle to the container that will be removed later. I am not a programmer, so I cannot see how this could be fixed in the driver's code.

I used Nomad 1.1.3 and 1.1.4, and recompiled the driver for both versions against two different gopkg.in/lxc/go-lxc.v2 versions (one from 2018, one from 2021). No difference, so I guess the issue is in the driver's code and not in a dependency.
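For whoever picks this up, a minimal sketch of the cleanup order that would let the handle go, based on the observation above. This is an assumption about a possible direction for a fix, not the driver's actual code; destroyTask is a hypothetical helper name:

// destroy_sketch.go - not the plugin's actual code; a sketch of a cleanup
// order (stop, destroy, release) that drops the go-lxc container handle.
package main

import (
	"log"
	"os"

	lxc "gopkg.in/lxc/go-lxc.v2"
)

func destroyTask(c *lxc.Container) error {
	if c.Running() {
		if err := c.Stop(); err != nil {
			return err
		}
	}
	if err := c.Destroy(); err != nil {
		return err
	}
	// Drop the liblxc reference the plugin still holds for this container.
	// Per the observation above, today the handle is only released when the
	// plugin restarts or another container is created.
	lxc.Release(c)
	return nil
}

func main() {
	// hypothetical usage: destroy-sketch <container-name>
	c, err := lxc.NewContainer(os.Args[1], lxc.DefaultConfigPath())
	if err != nil {
		log.Fatal(err)
	}
	if err := destroyTask(c); err != nil {
		log.Fatal(err)
	}
}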

kwianeck commented 3 years ago

hey,

is there anyone who has a similar issue and has found a solution?

mccaddon commented 3 years ago

@kwianeck are you using CentOS/RedHat/VzLinux? I had encountered recurring issues with LXC/LXD and cgroups on CentOS (using the LXD snap package), but they went away on Ubuntu 20.04.

We are using LXD/LXC 4.0.7 on Ubuntu 20.04, and during my testing I encountered issues that I couldn't resolve. The error complains about the network type configuration; I am guessing it should use the default LXC profile by default? I've included my default profile and other details below. I hope it helps resolve issues with this plugin, as I'd love to start using Nomad for all my LXD/LXC containers!

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"

$ lxc --version
4.0.7

$ lxc profile show default
config: {}
description: Default LXD profile
devices:
  root:
    path: /
    pool: default
    type: disk
name: default
used_by: []

$ cat test.nomad
job "example-lxc" {
  datacenters = ["dc1"]
  type        = "service"

  group "example" {
    task "example" {
      driver = "lxc"

      config {
        log_level = "info"
        verbosity = "verbose"
        template  = "/usr/share/lxc/templates/lxc-busybox"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

# nomad agent -dev -bind 0.0.0.0 -log-level INFO -plugin-dir /opt/nomad/data/plugins
    2021-10-14T17:47:59.345Z [INFO]  client.driver_mgr.nomad-driver-lxc: starting lxc task: driver=lxc @module=lxc driver_cfg="{Template:/usr/share/lxc/templates/lxc-busybox Distro: Release: Arch: ImageVariant: ImageServer: GPGKeyID: GPGKeyServer: DisableGPGValidation:false FlushCache:false ForceCache:false TemplateArgs:[] LogLevel:info Verbosity:verbose Volumes:[]}" timestamp=2021-10-14T17:47:59.345Z
    2021-10-14T17:47:59.691Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=073fb260-00f9-3ae7-704b-c9a3f7a777d9 task=example error="rpc error: code = Unknown desc = error setting network type configuration: setting config item for the container failed"
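For what it's worth, "error setting network type configuration" points at a SetConfigItem call for the network type. One plausible cause on an LXC 4.x host is the legacy lxc.network.* key family, which LXC 3.0 removed in favour of lxc.net.[i].*. A sketch of the difference using go-lxc follows; this is an assumption about the failing call, not the plugin's confirmed code path:

// network_key_sketch.go - a guess at the failing call, not the plugin's
// confirmed source. LXC 3.0 removed the legacy lxc.network.* keys, so
// setting them on an LXC 4.x host fails with
// "setting config item for the container failed".
package main

import (
	"log"

	lxc "gopkg.in/lxc/go-lxc.v2"
)

func main() {
	c, err := lxc.NewContainer("example", lxc.DefaultConfigPath())
	if err != nil {
		log.Fatal(err)
	}
	defer lxc.Release(c)

	// Legacy key, rejected on LXC 3.0+:
	//   c.SetConfigItem("lxc.network.type", "veth")

	// Current key family (lxc.net.[i].*, available since LXC 2.1):
	if err := c.SetConfigItem("lxc.net.0.type", "veth"); err != nil {
		log.Fatalf("error setting network type configuration: %v", err)
	}
}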

Thank you.

h0tw1r3 commented 2 years ago

It looks like go-lxc only supports cgroups v1 (not cgroups v2). I found a few other incompatibilities and bugs while testing LXC 4 support. I will create a merge request and tag this issue "soon".
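For anyone trying to reproduce this, a quick way to tell which hierarchy a host is on (a generic check of the mount layout, independent of the driver):

// cgroup_mode_check.go - a generic check of the host's cgroup layout.
// On a pure cgroup v2 host, /sys/fs/cgroup is a cgroup2 mount and exposes
// cgroup.controllers at its root; on v1/hybrid hosts it does not.
package main

import (
	"fmt"
	"os"
)

func main() {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("unified cgroup v2 hierarchy (v1-only code paths will not work)")
	} else {
		fmt.Println("legacy or hybrid cgroup v1 hierarchy")
	}
}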

h0tw1r3 commented 2 years ago

#37 was enough to bring up containers, but there were a few remaining problems, addressed in #38.