Closed akamensky closed 1 year ago
Hi @akamensky! That's intentional! The Nomad client is designed so that you can upgrade it in place without having to restart all the workloads. The executors get reparented to PID1, but when the client starts back up it can reattach to the "task handle" and continue to manage the workload without interruption. The exception to this behavior is when you're running the agent with the -dev flag, which cleans up after itself because that's intended for development use cases.
If you want to stop the workloads when you shut down a Nomad client, you can use the leave_on_interrupt/leave_on_terminate options along with drain_on_shutdown, or you can drain the workloads manually.
Thanks for the clarification @tgross. If the intention is for the process to keep running, that would make sense.
> The executors get reparented to PID1 but when the client starts back up it can reattach to the "task handle" and continue to manage the workload without interruption
If it actually reattached (even without parenting the process, but for example by talking to it via a Unix socket or such), that would be fine. That's not what I observe, however. Instead what I see is:
Trying this with the configuration options on the agent set to:
leave_on_interrupt = true
leave_on_terminate = true

client {
  drain_on_shutdown {
    deadline           = "60s"
    force              = false
    ignore_system_jobs = false
  }
}
drain_on_shutdown only works if either leave_on_interrupt or leave_on_terminate is set to true. It does nothing when those are set to false.
With the leave_on_... options set to true on the agent, once the agent comes back online, no tasks are ever assigned to it again until it is manually marked as "eligible" on the servers.
While I understand the reason behind leave_on_..., which makes sense if the node is intentionally taken down for maintenance, I think that:
drain_on_shutdown should work independently of the leave_on_... settings. Defaulting to no drain is also fine.
> Nomad agent process comes back, and instead of re-attaching to executor it kills all and starts a new process
Was the agent offline long enough that the server rescheduled the workloads? If not, that's likely a bug. The client logs and server logs will have more details which would be helpful if you could share.
> drain_on_shutdown should work independent of the leave_on_... settings. Defaulting to no drain is also fine.
The intent is that it's used for turning down the node, and that you'd use a different signal for ordinary in-place upgrades of the binary. There's another open feature request around allowing the node to mark itself eligible again though.
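To make the "different signal" idea concrete, here is a minimal sketch of an agent configuration that only leaves (and therefore only drains) on one signal. The choice of SIGINT for decommissioning and SIGTERM for in-place upgrades is an assumption made for illustration, not something confirmed in this thread; the drain values are copied from the configuration quoted earlier.

```hcl
# Sketch only: which signal maps to which use case is an assumption.
leave_on_interrupt = true   # SIGINT: agent leaves, so drain_on_shutdown applies
leave_on_terminate = false  # SIGTERM: agent exits without leaving; workloads keep running

client {
  drain_on_shutdown {
    deadline           = "60s"
    force              = false
    ignore_system_jobs = false
  }
}
```

Which signal systemctl stop actually sends depends on the KillSignal setting in the unit file, so check that before relying on a split like this.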
Nomad version
Operating system and Environment details
Fedora 36 Server. Nomad deployed using the YUM repository.
Issue
Nomad job/task keeps running after the Nomad service has been stopped
Reproduction steps
Run a job using the exec driver with command = "sleep" and args = ["infinity"]
Run systemctl stop nomad on that node
Expected Result
Actual Result
Job file (if appropriate)
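The job file itself is not preserved here. A minimal sketch consistent with the reproduction steps above might look like the following; the job, group, and task names and the datacenter value are hypothetical placeholders, and only the driver, command, and args come from the report.

```hcl
# Hypothetical reconstruction: names and datacenter are placeholders.
job "sleep" {
  datacenters = ["dc1"]
  type        = "service"

  group "sleep" {
    task "sleep" {
      driver = "exec"

      config {
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}
```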
Screenshots:
Before systemctl stop nomad:
After systemctl stop nomad:
Nomad logs
PS: After starting Nomad again, it does clean up the orphaned processes, but it should do that on shutdown instead.