Closed jdoss closed 1 year ago
Some more thoughts on what is going on here. When you reboot a node without draining podman quits and leaves all the containers behind. The driver, on startup, doesn't remove old containers and I think the allocation isn't able to start the containers that are exited anymore. IIRC the driver handled this correctly in the past.
To work around this, I created the following systemd unit which starts after multi-user.target
. It will make sure the node is enabled on boot with a restart every 10s while nomad comes online. When a shutdown happens systemd shuts down units in reverse order so it will drain all of the jobs off of it so the podman driver has a chance to clean any old containers with a ExecStop=podman rm --all
to ensure that there are no left over containers present which would prevent Nomad from starting when it comes back online.
[Unit]
Description=Drain jobs off of Nomad before shutdown
After=multi-user.target
[Service]
Type=oneshot
RemainAfterExit=true
Restart=on-failure
RestartSec=10s
ExecStartPre=nomad node eligibility -enable --self
ExecStart=/bin/true
ExecStop=nomad node drain -enable -self -deadline 10m
ExecStop=podman rm --all
TimeoutStopSec=630
[Install]
WantedBy=multi-user.target
Special thanks and shout out to @dghubble for his blog post https://www.psdn.io/posts/systemd-shutdown-unit/ about systemd unit ordering which helped me out a ton with getting the above unit working correctly.
Have a look at the recover_stopped = false
driver option. It could help in your situation.
Thanks for the report @jdoss, indeed I can reproduce the Nomad agent not starting. In fact, just running podman ps -a
after a reboot hangs (!) if there was a podman task running.
Jun 13 14:18:53 ip-172-31-19-192 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 13 14:18:53 ip-172-31-19-192 systemd[1]: nomad.service: Failed with result 'exit-code'.
Jun 13 14:18:53 ip-172-31-19-192 systemd[1]: nomad.service: Unit process 6159 (nomad) remains running after unit stopped.
Jun 13 14:18:53 ip-172-31-19-192 systemd[1]: Stopped Nomad.
Jun 13 14:18:53 ip-172-31-19-192 systemd[1]: nomad.service: Consumed 3.686s CPU time.
-- Boot 943b4b56d1e54483a8768ca57b6653c1 --
Jun 13 14:20:49 ip-172-31-19-192 systemd[1]: Started Nomad.
Jun 13 14:20:49 ip-172-31-19-192 nomad[1366]: ==> WARNING: mTLS is not configured - Nomad is not secure without mTLS!
Jun 13 14:20:49 ip-172-31-19-192 nomad[1366]: ==> Loaded configuration from /etc/nomad.d/nomad.hcl
Jun 13 14:20:49 ip-172-31-19-192 nomad[1366]: ==> Starting Nomad agent...
ubuntu@ip-172-31-19-192:~$ sudo podman ps -a
<hangs>
This is on the main branch of Nomad and the podman driver, and Ubuntu 22.04 with
ubuntu@ip-172-31-24-55:~$ podman version
Version: 3.4.4
API Version: 3.4.4
Go Version: go1.17.3
Built: Thu Jan 1 00:00:00 1970
OS/Arch: linux/amd64
Just pulling a stack trace of the driver while hung, looks like the POST request of a ContainerStart
incurred during RecoverTask
. Wonder if I can reproduce without Nomad.
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: goroutine 42 [select, 10 minutes]:: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: runtime.gopark(0xc00039a7a0?, 0x6?, 0x60?, 0xa4?, 0xc00039a5b4?): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/runtime/proc.go:381 +0xd6 fp=0xc0002c0418 sp=0xc0002c03f8 pc=0x43c536: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: runtime.selectgo(0xc0002c07a0, 0xc00039a5a8, 0x10ba216?, 0x0, 0xc00039a590?, 0x1): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/runtime/select.go:327 +0x7be fp=0xc0002c0558 sp=0xc0002c0418 pc=0x44c1de: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.(*persistConn).roundTrip(0xc00035e360, 0xc00027d180): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/transport.go:2638 +0x994 fp=0xc0002c0810 sp=0xc0002c0558 pc=0x71a154: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.(*Transport).roundTrip(0xc000166dc0, 0xc00016ca00): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/transport.go:603 +0x7fa fp=0xc0002c0a38 sp=0xc0002c0810 pc=0x70e05a: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.(*Transport).RoundTrip(0x200?, 0x122af20?): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/roundtrip.go:17 +0x19 fp=0xc0002c0a58 sp=0xc0002c0a38 pc=0x7028b9: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.send(0xc00016ca00, {0x122af20, 0xc000166dc0}, {0x8?, 0x1089f80?, 0x0?}): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/client.go:252 +0x5f7 fp=0xc0002c0c50 sp=0xc0002c0a58 pc=0x6d83f7: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.(*Client).send(0xc0000c8840, 0xc00016ca00, {0xc0002cbc20?, 0xffffffffffffffff?, 0x0?}): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/client.go:176 +0x9b fp=0xc0002c0cc8 sp=0xc0002c0c50 pc=0x6d7c7b: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.(*Client).do(0xc0000c8840, 0xc00016ca00): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/client.go:716 +0x8fb fp=0xc0002c0ed0 sp=0xc0002c0cc8 pc=0x6d9f7b: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: net/http.(*Client).Do(...): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/net/http/client.go:582: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: github.com/hashicorp/nomad-driver-podman/api.(*API).Do(0x1001e60?, 0xc0002c9a10?): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /home/shoenig/Go/podman/api/api.go:82 +0x1d fp=0xc0002c0ef0 sp=0xc0002c0ed0 pc=0xe3ad9d: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: github.com/hashicorp/nomad-driver-podman/api.(*API).PostWithHeaders(0xc0000c8780, {0x1233b00, 0xc0004791d0}, {0xc000042cc0?, 0x1?}, {0x0, 0x0}, 0x4e2c20>
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /home/shoenig/Go/podman/api/api.go:115 +0x2af fp=0xc0002c0ff0 sp=0xc0002c0ef0 pc=0xe3b3af: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: github.com/hashicorp/nomad-driver-podman/api.(*API).Post(...): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /home/shoenig/Go/podman/api/api.go:103: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: github.com/hashicorp/nomad-driver-podman/api.(*API).ContainerStart(0xc0003ac0a0?, {0x1233b00, 0xc0004791d0}, {0xc0003a01c0, 0x40}): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /home/shoenig/Go/podman/api/container_start.go:17 +0x149 fp=0xc0002c1250 sp=0xc0002c0ff0 pc=0xe3d589: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: main.(*Driver).RecoverTask(0xc00022c1e0, 0xc0001752c0): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /home/shoenig/Go/podman/driver.go:363 +0x96e fp=0xc0002c19f0 sp=0xc0002c1250 pc=0xe48ece: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: github.com/hashicorp/nomad/plugins/drivers.(*driverPluginServer).RecoverTask(0xc000012198, {0xc000174c80?, 0x51af66?}, 0x0?): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.054Z [DEBUG] client.driver_mgr.nomad-driver-podman: /scratch/gocache/pkg/mod/github.com/hashicorp/nomad@v1.5.0-beta.1.0.20230412191336-74b16da272a6/plugins/drivers/server.go:102 +0x3e fp=0xc0002c1>
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: github.com/hashicorp/nomad/plugins/drivers/proto._Driver_RecoverTask_Handler({0x1066580?, 0xc000012198}, {0x1233ba8, 0xc0002660c0}, 0xc000226380, 0x0): >
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: /scratch/gocache/pkg/mod/github.com/hashicorp/nomad@v1.5.0-beta.1.0.20230412191336-74b16da272a6/plugins/drivers/proto/driver.pb.go:4462 +0x170 f>
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: google.golang.org/grpc.(*Server).processUnaryRPC(0xc000536000, {0x123ad60, 0xc000105380}, 0xc0001f6a20, 0xc000191650, 0x18e4fb0, 0x0): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: /scratch/gocache/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1345 +0xdf3 fp=0xc0002c1e48 sp=0xc0002c1a68 pc=0xb7e4d3: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: google.golang.org/grpc.(*Server).handleStream(0xc000536000, {0x123ad60, 0xc000105380}, 0xc0001f6a20, 0x0): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: /scratch/gocache/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1722 +0xa36 fp=0xc0002c1f68 sp=0xc0002c1e48 pc=0xb835f6: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: google.golang.org/grpc.(*Server).serveStreams.func1.2(): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: /scratch/gocache/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:966 +0x98 fp=0xc0002c1fe0 sp=0xc0002c1f68 pc=0xb7be38: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: runtime.goexit(): driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: /opt/google/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0002c1fe8 sp=0xc0002c1fe0 pc=0x46c521: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: created by google.golang.org/grpc.(*Server).serveStreams.func1: driver=podman
Jun 13 17:43:42 ip-172-31-24-55 nomad[1410]: 2023-06-13T17:43:42.055Z [DEBUG] client.driver_mgr.nomad-driver-podman: /scratch/gocache/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:964 +0x28a: driver=podman
Whelp this thing is like getting properly wedged on the Podman side. Just issuing podman start <id>
on the dead container from the CLI will hang. Even if we set a context with a short timeout on the Request
passed into .Do
, the call never returns (!). Not being able to forcefully timeout an HTTP client is a new one for me.
One dumb / easy thing we can do here is just set recover_stopped
to default false
instead of true
, avoiding the problem altogether. I'll keep digging a bit more to see if we can't get the http request to correctly fail, but probably won't spend too much more time on something that really seems to be a Podman bug.
Seems very similar to https://github.com/containers/podman/issues/11283 which despite being closed doesn't seem to have been investigated or fixed.
If you reboot a node without draining the jobs first the driver leaves Exited containers which then prevents nomad from starting backup cleanly after the server reboots and come back online.
I rebooted the client node and the server came back online and nomad started but it was not starting jobs. I stopped
nomad
via systemd andnomad-driver-podman
was left hung. You cannot list running containers withpodman ps -a
while the driver is in the hung state. Killing thenomad-driver-podman
andconmon
process and then runningpodman rm --all
to clear out the previous containers left by the reboot allows Nomad to start correctly.In the meantime, I added a systemd job to drain the nomad jobs before shutdown, but I feel
nomad-driver-podman
used to deal with this issue correctly before version v0.4.2