Hello there, I am able to reproduce the same issue on both an Alpine VM and an ARM Alpine device, both with the following environment:

- 3.18.2
- 2.38
- 4.6.1
- 1.6.1
- 0.5.1
Like dylrich, I am willing to help debug this behaviour; hopefully I will have some time to compile and test the driver around the commit he is referencing (c7c9e97c514b4ec80eadc50f731472326d3941c5).

For what it is worth, I cannot reproduce this on this environment:
$ nomad job run ../../gh-274.nomad
==> 2023-08-27T10:33:01-05:00: Monitoring evaluation "798b785e"
2023-08-27T10:33:01-05:00: Evaluation triggered by job "gh-274"
2023-08-27T10:33:02-05:00: Evaluation within deployment: "dd9dd10f"
2023-08-27T10:33:02-05:00: Allocation "811551fc" created: node "103810f9", group "hello"
2023-08-27T10:33:02-05:00: Evaluation status changed: "pending" -> "complete"
==> 2023-08-27T10:33:02-05:00: Evaluation "798b785e" finished with status "complete"
==> 2023-08-27T10:33:02-05:00: Monitoring deployment "dd9dd10f"
✓ Deployment "dd9dd10f" successful
2023-08-27T10:33:17-05:00
ID = dd9dd10f
Job ID = gh-274
Job Version = 0
Status = successful
Description = Deployment completed successfully
Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
hello       1        1       1        0          2023-08-27T15:43:16Z
$ nomad job status gh-274
ID = gh-274
Name = gh-274
Submit Date = 2023-08-27T10:33:01-05:00
Type = service
Priority = 50
Datacenters = testing
Namespace = default
Node Pool = testing
Status = running
Periodic = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
hello       0       0         1        0       0         0     0
Latest Deployment
ID = dd9dd10f
Status = successful
Description = Deployment completed successfully
Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
hello       1        1       1        0          2023-08-27T15:43:16Z
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
811551fc  103810f9  hello       0        run      running  6m55s ago  6m41s ago
Looking at the running container, it is using the nomad.slice parent correctly:
# podman inspect 91fe87aa411d | jq .[].State
{
  "OciVersion": "1.1.0-rc.1",
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 1338,
  "ConmonPid": 1336,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2023-08-27T15:33:05.990065987Z",
  "FinishedAt": "0001-01-01T00:00:00Z",
  "Health": {
    "Status": "",
    "FailingStreak": 0,
    "Log": null
  },
  "CgroupPath": "/nomad.slice/libpod-91fe87aa411d1448b2089ee92bbbe2b23a4218c73801f37801558560205f12b1.scope",
  "CheckpointedAt": "0001-01-01T00:00:00Z",
  "RestoredAt": "0001-01-01T00:00:00Z"
}
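As a cross-check of the CgroupPath reported above, the same placement can be read straight from the kernel's view of the container process. A minimal sketch in Go; the PID 1338 is taken from the inspect output above and is purely illustrative:

// cgroupcheck.go: print the cgroup a process lives in. On cgroups v2,
// /proc/<pid>/cgroup contains a single line of the form
// "0::/nomad.slice/libpod-<container-id>.scope".
package main

import (
	"fmt"
	"os"
)

func main() {
	// PID from the inspect output above; substitute your own.
	data, err := os.ReadFile("/proc/1338/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(string(data))
}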
I don't think this makes a difference, but I had to modify the example job file that @dylrich provided to get it working on my cluster:
job "gh-274" {
datacenters = ["testing"]
node_pool = "testing"
group "hello" {
task "hello" {
driver = "podman"
config {
image = "docker.io/library/redis:7"
command = "/bin/bash"
args = ["-c", "echo 'Hello World!' && sleep infinity"]
}
resources {
cpu = 2
memory = 20
}
}
}
}
The difference between my and @dylrich's test environments and the one @jdoss detailed above is that on Alpine Linux the systemd component is missing altogether and cgroupfs is used as the cgroup manager instead, while on CoreOS systemd is bundled with the OS. It is reasonable to conclude that the driver expects to run in an "official" systemd environment; may I ask if you plan to support this scenario?
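For context, a common way container tooling distinguishes the two setups is to probe for a running systemd before choosing a cgroup manager. A minimal sketch of that kind of check, using the conventional /run/systemd/system probe; this is illustrative, not the driver's actual code:

// systemdcheck.go: detect whether systemd is the running init system.
// systemd creates /run/systemd/system early at boot; on Alpine with
// OpenRC that directory does not exist, so cgroupfs must be used.
package main

import (
	"fmt"
	"os"
)

func systemdPresent() bool {
	info, err := os.Stat("/run/systemd/system")
	return err == nil && info.IsDir()
}

func main() {
	if systemdPresent() {
		fmt.Println("systemd cgroup manager available (slice units)")
	} else {
		fmt.Println("no systemd: falling back to cgroupfs paths")
	}
}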
Here's the systemd unit that I use for Nomad on FCOS:
# systemctl cat --no-pager nomad.service
# /etc/systemd/system/nomad.service
[Unit]
Description=Nomad
Documentation=https://www.nomadproject.io/docs/
Before=nomad-drain.service
After=network-online.target step-issue-cert@nomad.service podman.socket podman.service
Wants=network-online.target step-issue-cert@nomad.service podman.socket podman.service
[Service]
ExecStartPre=bash -c "mkdir -p /opt/nomad/{data,images,storage}"
ExecStartPre=podman rm --all
ExecStart=nomad agent \
    -config /etc/nomad/config \
    -data-dir /opt/nomad/data \
    -plugin-dir /etc/nomad/plugins
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGINT
LimitNOFILE=infinity
LimitNPROC=infinity
TasksMax=infinity
Restart=on-failure
RestartSec=2
OOMScoreAdjust=-1000
[Install]
WantedBy=multi-user.target
# /usr/lib/systemd/system/service.d/10-timeout-abort.conf
# This file is part of the systemd package.
# See https://fedoraproject.org/wiki/Changes/Shorter_Shutdown_Timer.
#
# To facilitate debugging when a service fails to stop cleanly,
# TimeoutStopFailureMode=abort is set to "crash" services that fail to stop in
# the time allotted. This will cause the service to be terminated with SIGABRT
# and a coredump to be generated.
#
# To undo this configuration change, create a mask file:
# sudo mkdir -p /etc/systemd/system/service.d
# sudo ln -sv /dev/null /etc/systemd/system/service.d/10-timeout-abort.conf
[Service]
TimeoutStopFailureMode=abort
An update on this thread: I tried to figure out the reasoning behind the commit @dylrich and I were referencing above, and I came to the solution I propose in PR #280. In the same environment I detailed in my first comment, I am no longer able to reproduce the issue using the driver compiled from that PR. Let me know if I am missing something with that change to the condition for passing the cgroup parent, or whether it is acceptable in a situation where systemd is not present.
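To make the shape of that change concrete: the idea is to hand Podman a systemd-style slice as the cgroup parent only when systemd is actually managing cgroups, and otherwise let Podman fall back to its cgroupfs default. A hedged sketch; the names below are made up for illustration and are not taken from PR #280 itself:

package driver

// chooseCgroupParent decides what cgroup_parent to send to Podman.
// With systemd present the parent can be a delegated slice unit;
// without it (e.g. Alpine/OpenRC), an empty value lets Podman apply
// its own cgroupfs default instead of a path it cannot resolve.
func chooseCgroupParent(systemdPresent bool) string {
	if systemdPresent {
		return "nomad.slice"
	}
	return ""
}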
The way cgroups are being handled is being overhauled in Nomad 1.7 anyway; I'll be sure to follow up here when I get to fixing up the podman driver.
I get the following error with version 0.5.0 of this driver, but not 0.4.2:
Here is a sample job that doesn't work for me on the new version:
My client nodes are running podman version 4.5.1 on alpine 3.18. I am using cgroups v2. I tried running a patched build, but my in-driver logging didn't give enough detail and I got lost after we call ContainerCreate(). Speaking of, I think this Request struct may be unused? https://github.com/hashicorp/nomad-driver-podman/blob/2f2353c4e76f5342a97e113645730dfedbdf54fe/api/container_create.go#L49-L75

I was a little suspicious of this commit: https://github.com/hashicorp/nomad-driver-podman/commit/c7c9e97c514b4ec80eadc50f731472326d3941c5, but I didn't bisect or try everything before that one just yet. I am happy to test out any patches to fix this issue! Fortunately, I only wanted to upgrade for the new tls_verify support, but I can do without that for now.
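For anyone else instrumenting the driver around that call, one low-effort way to see exactly what goes over the Podman API socket is to dump the create spec as JSON just before it is sent. A sketch under the assumption that you have the spec value in hand; dumpSpec is a made-up helper, not part of the driver:

// specdump.go: pretty-print an arbitrary create-spec value so the
// payload can be diffed between driver versions when bisecting.
package main

import (
	"encoding/json"
	"fmt"
)

func dumpSpec(spec any) {
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(out))
}

func main() {
	// Stand-in value; in the driver you would pass the real spec
	// immediately before ContainerCreate is called.
	dumpSpec(map[string]any{"cgroup_parent": "nomad.slice"})
}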