shoenig closed this issue 1 year ago
Spot check: two nodes.
```
ubuntu@ip-172-31-25-137:~$ nomad node status
ID        Node Pool  DC   Name              Class   Drain  Eligibility  Status
ded51f46  default    dc1  ip-172-31-19-192  <none>  false  eligible     ready
faf9c60b  default    dc1  ip-172-31-24-55   <none>  false  eligible     ready
```
One simple redis job is running:
```
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
6942efcd  faf9c60b  cache       0        run      running  51m1s ago  50m45s ago
```
```
ubuntu@ip-172-31-19-192:~$ sudo podman ps -a
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
```
```
ubuntu@ip-172-31-24-55:~$ sudo podman ps -a
CONTAINER ID  IMAGE                      COMMAND       CREATED         STATUS             PORTS  NAMES
fc34996abfdd  docker.io/library/redis:7  redis-server  50 minutes ago  Up 50 minutes ago         redis-6942efcd-afcf-61cd-2aae-36b3ce105e17
```
Reboot the node with the redis alloc:
```
ubuntu@ip-172-31-24-55:~$ sudo reboot
Connection to ec2-54-236-5-125.compute-1.amazonaws.com closed by remote host.
```
The other node now has a new redis alloc:
```
ubuntu@ip-172-31-19-192:~$ sudo podman ps -a
CONTAINER ID  IMAGE                      COMMAND       CREATED       STATUS            PORTS  NAMES
27c33d5ab8ad  docker.io/library/redis:7  redis-server  1 second ago  Up 2 seconds ago         redis-39ce1e95-1d40-f8f7-a8be-9386b68aba59
```
The rebooted node comes back up, the nomad service successfully starts, and no lingering dead podman container exists:
```
ubuntu@ip-172-31-24-55:~$ sudo service nomad start
ubuntu@ip-172-31-24-55:~$ sudo podman ps -a
CONTAINER ID  IMAGE  COMMAND  CREATED  STATUS  PORTS  NAMES
ubuntu@ip-172-31-24-55:~$
```
Any clues why Podman would hang?
My guess is that it's something to do with the Go http client ignoring timeouts when talking over a Unix domain socket (UDS), but I don't know.
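For what it's worth, that hypothesis is easy to exercise in isolation. The sketch below is my own illustration, not driver code: it stands up a UDS server that accepts connections but never replies (simulating a hung Podman API socket), then issues an HTTP GET through a client whose `DialContext` targets the socket. With `http.Client.Timeout` left at zero the call would block forever; setting it bounds the wait. The socket path and URL path are made up for the demo.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// pingHungSocket simulates talking to a hung Podman API socket: a UDS
// listener accepts connections but never writes a response. Without a
// client-side Timeout (or context deadline) the GET would block forever.
func pingHungSocket() error {
	dir, err := os.MkdirTemp("", "uds-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	sock := filepath.Join(dir, "podman.sock") // hypothetical socket path
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return err
	}
	defer ln.Close()

	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			defer conn.Close() // hold the connection open; never reply
		}
	}()

	client := &http.Client{
		// Zero means "wait forever"; a deadline turns the hang into an error.
		Timeout: 500 * time.Millisecond,
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", sock)
			},
		},
	}
	// Host in the URL is arbitrary; DialContext always dials the socket.
	_, err = client.Get("http://unix/v1.0.0/libpod/containers/json")
	return err
}

func main() {
	fmt.Println("request error:", pingHungSocket())
}
```

With the timeout set, the request fails after roughly half a second instead of hanging, which is the behavior you'd want from the driver's client.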
> Would `recover_stopped` cause the rebooted client to restart the task, leaving two running (one on each client)?
No, the leftover podman task on the rebooted client would remain in the `exited` state.
The use of `recover_stopped` may cause the Nomad agent to hang on startup after a node reboot, as the plugin tries to start an exited podman task. Podman itself will hang forever in this state, and the http client on the Nomad side is also unable to time out. The result is a permanently hung Nomad agent until someone force-kills either Nomad or Podman.

Also emit a log warning that `recover_stopped` should not be used. We leave it in place for compatibility.

Fixes #229