hashicorp / nomad-driver-podman

A nomad task driver plugin for sandboxing workloads in podman containers
https://developer.hashicorp.com/nomad/plugins/drivers/podman
Mozilla Public License 2.0

systemd slice received as cgroup parent when using cgroupfs: invalid argument #274

Closed: dylrich closed this issue 10 months ago

dylrich commented 11 months ago

I get the following error with version 0.5.0 of this driver, but not 0.4.2:

rpc error: code = Unknown desc = failed to start task, could not create container: cannot create container, status code: 500: {"cause":"invalid argument","message":"systemd slice received as cgroup parent when using cgroupfs: invalid argument","response":500} 

Here is a sample job that doesn't work for me on the new version:

job "example-1" {
  datacenters = ["dc1"]

  group "hello" {
    task "hello" {
      driver = "podman"

      config {
        image   = "redis:3.2"
        command = "/bin/bash"
        args    = ["-c", "echo 'Hello World!' && sleep infinity"]
      }

      resources {
        cpu    = 2
        memory = 20
      }
    }
  }
}

My client nodes are running Podman 4.5.1 on Alpine 3.18, using cgroups v2. I tried running a patched build, but my in-driver logging didn't give enough detail and I got lost after the call to ContainerCreate(). Speaking of which, I think this Request struct may be unused? https://github.com/hashicorp/nomad-driver-podman/blob/2f2353c4e76f5342a97e113645730dfedbdf54fe/api/container_create.go#L49-L75

I was a little suspicious of this commit: https://github.com/hashicorp/nomad-driver-podman/commit/c7c9e97c514b4ec80eadc50f731472326d3941c5, but I haven't bisected or tried everything before it just yet. I am happy to test any patches for this issue! Fortunately, I only wanted to upgrade for the new tls_verify support, and I can do without that for now.
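
In case it helps anyone trying to reproduce this, the quickest way I know to confirm what Podman thinks it is using on a client node is the check below (plain shell; GNU stat shown, busybox stat output may differ slightly):

# Podman's view of the cgroup manager: "cgroupfs" on my Alpine nodes, "systemd" on systemd hosts
podman info --format '{{.Host.CgroupManager}}'

# Filesystem type of the cgroup mount; "cgroup2fs" indicates cgroups v2
stat -fc %T /sys/fs/cgroup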

Procsiab commented 10 months ago

Hello there, I am able to reproduce the same issue on an Alpine VM and on an ARM Alpine device, both with the following environment:

jdoss commented 10 months ago

For what it is worth, I cannot reproduce this in my environment:

$ nomad job run ../../gh-274.nomad 
==> 2023-08-27T10:33:01-05:00: Monitoring evaluation "798b785e"
    2023-08-27T10:33:01-05:00: Evaluation triggered by job "gh-274"
    2023-08-27T10:33:02-05:00: Evaluation within deployment: "dd9dd10f"
    2023-08-27T10:33:02-05:00: Allocation "811551fc" created: node "103810f9", group "hello"
    2023-08-27T10:33:02-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-08-27T10:33:02-05:00: Evaluation "798b785e" finished with status "complete"
==> 2023-08-27T10:33:02-05:00: Monitoring deployment "dd9dd10f"
  ✓ Deployment "dd9dd10f" successful

    2023-08-27T10:33:17-05:00
    ID          = dd9dd10f
    Job ID      = gh-274
    Job Version = 0
    Status      = successful
    Description = Deployment completed successfully

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
    hello       1        1       1        0          2023-08-27T15:43:16Z
$ nomad job status gh-274
ID            = gh-274
Name          = gh-274
Submit Date   = 2023-08-27T10:33:01-05:00
Type          = service
Priority      = 50
Datacenters   = testing
Namespace     = default
Node Pool     = testing
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
hello       0       0         1        0       0         0     0

Latest Deployment
ID          = dd9dd10f
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
hello       1        1       1        0          2023-08-27T15:43:16Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
811551fc  103810f9  hello       0        run      running  6m55s ago  6m41s ago

Looking at the running container, it is using the nomad.slice parent correctly:

# podman inspect 91fe87aa411d |jq .[].State
{
  "OciVersion": "1.1.0-rc.1",
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 1338,
  "ConmonPid": 1336,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2023-08-27T15:33:05.990065987Z",
  "FinishedAt": "0001-01-01T00:00:00Z",
  "Health": {
    "Status": "",
    "FailingStreak": 0,
    "Log": null
  },
  "CgroupPath": "/nomad.slice/libpod-91fe87aa411d1448b2089ee92bbbe2b23a4218c73801f37801558560205f12b1.scope",
  "CheckpointedAt": "0001-01-01T00:00:00Z",
  "RestoredAt": "0001-01-01T00:00:00Z"
}
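
For context, that /nomad.slice parent is what I would expect from the Nomad client's cgroup_parent setting. A minimal sketch of the relevant client config, assuming the option name and the nomad.slice default for cgroups v2 hosts still apply on your Nomad version:

client {
  enabled = true

  # Parent cgroup for task cgroups. On a systemd + cgroups v2 host this is a
  # slice, which matches the CgroupPath in the inspect output above.
  cgroup_parent = "nomad.slice"
}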

I don't think this makes a difference, but I had to modify the example job file that @dylrich provided to get it working on my cluster.

job "gh-274" {
  datacenters = ["testing"]
  node_pool   = "testing"

  group "hello" {
    task "hello" {
      driver = "podman"

      config {
        image   = "docker.io/library/redis:7"
        command = "/bin/bash"
        args    = ["-c", "echo 'Hello World!' && sleep infinity"]
      }

      resources {
        cpu    = 2
        memory = 20
      }
    }
  }
}

Procsiab commented 10 months ago

The difference between my and @dylrich's test environments and the one @jdoss detailed above is that on Alpine Linux the systemd component is missing altogether and we are using cgroupfs instead, while on CoreOS systemd is bundled with the OS. It is reasonable to conclude that the driver expects to work in an "official" systemd environment; however, may I ask whether you plan to support this scenario?
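
For reference, the manager Podman uses on those Alpine hosts comes from the engine configuration (or its built-in fallback when systemd is absent). A minimal containers.conf sketch of that setting, assuming the usual /etc/containers/containers.conf location:

# /etc/containers/containers.conf
[engine]
# No systemd on Alpine, so the engine runs with the cgroupfs manager; this is
# the manager in effect when the driver passes a cgroup parent.
cgroup_manager = "cgroupfs"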

jdoss commented 10 months ago

Here's my systemd unit that I use for Nomad on FCOS:

# systemctl cat --no-pager nomad.service 
# /etc/systemd/system/nomad.service
[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/docs/
Before=nomad-drain.service
After=network-online.target step-issue-cert@nomad.service podman.socket podman.service
Wants=network-online.target step-issue-cert@nomad.service podman.socket podman.service

[Service]
ExecStartPre=bash -c "mkdir -p /opt/nomad/{data,images,storage}"
ExecStartPre=podman rm --all
ExecStart=nomad agent \
  -config /etc/nomad/config \
  -data-dir /opt/nomad/data \
  -plugin-dir /etc/nomad/plugins
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGINT
LimitNOFILE=infinity
LimitNPROC=infinity
TasksMax=infinity
Restart=on-failure
RestartSec=2
OOMScoreAdjust=-1000

[Install]
WantedBy=multi-user.target

# /usr/lib/systemd/system/service.d/10-timeout-abort.conf
# This file is part of the systemd package.
# See https://fedoraproject.org/wiki/Changes/Shorter_Shutdown_Timer.
#
# To facilitate debugging when a service fails to stop cleanly,
# TimeoutStopFailureMode=abort is set to "crash" services that fail to stop in
# the time allotted. This will cause the service to be terminated with SIGABRT
# and a coredump to be generated.
#
# To undo this configuration change, create a mask file:
#   sudo mkdir -p /etc/systemd/system/service.d
#   sudo ln -sv /dev/null /etc/systemd/system/service.d/10-timeout-abort.conf

[Service]
TimeoutStopFailureMode=abort

Procsiab commented 10 months ago

An update on this thread: I tried to figure out the reasoning behind the commit @dylrich and I referenced above, and I came to the solution I propose in PR #280. In the same environment I described in my first comment, I can no longer reproduce the issue when using the driver compiled from that PR. Let me know if I am missing something with that change to the cgroup-parent condition, or whether it is acceptable in a situation where systemd is not present.
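
To make the intent concrete, the guard I have in mind is roughly the following. This is only an illustrative Go sketch with my own naming, not the actual code in the driver or the diff in that PR:

package main

import (
	"fmt"
	"strings"
)

// chooseCgroupParent is a hypothetical helper: only pass a systemd-style
// ".slice" parent through when the systemd cgroup manager is in use;
// otherwise drop it so Podman on cgroupfs picks its own default instead of
// rejecting the request with "invalid argument".
func chooseCgroupParent(cgroupManager, configuredParent string) string {
	usesSystemd := cgroupManager == "systemd"
	isSlice := strings.HasSuffix(configuredParent, ".slice")

	switch {
	case isSlice && usesSystemd:
		return configuredParent
	case isSlice && !usesSystemd:
		// The combination that produces the 500 error in this issue.
		return ""
	default:
		return configuredParent
	}
}

func main() {
	fmt.Printf("%q\n", chooseCgroupParent("cgroupfs", "nomad.slice")) // ""
	fmt.Printf("%q\n", chooseCgroupParent("systemd", "nomad.slice"))  // "nomad.slice"
}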

shoenig commented 10 months ago

The way cgroups are handled is being overhauled in Nomad 1.7 anyway; I'll be sure to follow up here when I get to fixing up the podman driver.