Open louievandyke opened 7 months ago
The repro here isn't a minimal repro and has a lot of moving parts, so let's boil it down to the essentials:
leader = true, and shutdown_delay = "20s"
lifecycle.hook = "prestart" and lifecycle.sidecar=true
lifecycle.hook = "prestart" and lifecycle.sidecar=false (doesn't restart at all, as documented)
Something important to note here is that most of these fields control the order we start tasks. Only "leader" has any controls on when tasks are shut down. The leader=true field docs say:
Specifies whether the task is the leader task of the task group. If set to true, when the leader task completes, all other tasks within the task group will be gracefully shutdown. The shutdown process starts by applying the shutdown_delay if configured. It then stops the leader task first, followed by non-sidecar and non-poststop tasks, and finally sidecar tasks. Once this process completes, post-stop tasks are triggered. See the lifecycle documentation for a complete description of task lifecycle management.
This is all strictly true and works as described. If the leader task were to be shut down, we'd see the other tasks shut down in that order. But nomad job restart restarts all the tasks unless the -task option is used:
-task=: Specify the task to restart. Can be specified multiple times. If groups are also specified the task must exist in at least one of them. If no task is set only tasks that are currently running are restarted. For example, non-sidecar tasks that already ran are not restarted unless -all-tasks is used instead. This option cannot be used with -all-tasks or -reschedule.
The leader flag never ends up being consulted because those tasks are already stopped. (I would not be shocked if there was a race condition here though where it's possible for one of the sidecar tasks to start back up quickly enough to get shut down when the leader completes shutdown.)
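To make the documented order concrete, here is a minimal Python sketch of the phases the leader docs describe. The Task class, its fields, and shutdown_phases are hypothetical names chosen for illustration, not Nomad internals. The sketch only models ordering, and shutdown_delay appears as a label rather than a real wait.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Task:
    # Hypothetical stand-in for a task in a task group; not a Nomad type.
    name: str
    leader: bool = False
    hook: Optional[str] = None  # e.g. "prestart" or "poststop"
    sidecar: bool = False


def shutdown_phases(tasks: List[Task]) -> List[Tuple[str, List[str]]]:
    """Group task names into the phases described in the leader docs.

    Assumes `tasks` holds the tasks still running plus any poststop tasks.
    """
    def is_poststop(t: Task) -> bool:
        return t.hook == "poststop"

    leaders = [t.name for t in tasks if t.leader]
    others = [
        t.name for t in tasks
        if not t.leader and not t.sidecar and not is_poststop(t)
    ]
    sidecars = [
        t.name for t in tasks
        if t.sidecar and not t.leader and not is_poststop(t)
    ]
    poststop = [t.name for t in tasks if is_poststop(t)]

    return [
        ("apply shutdown_delay, then stop leader", leaders),
        ("stop non-sidecar, non-poststop tasks", others),
        ("stop sidecar tasks", sidecars),
        ("trigger poststop tasks", poststop),
    ]


if __name__ == "__main__":
    group = [
        Task("main", leader=True),
        Task("proxy", hook="prestart", sidecar=True),
        Task("worker"),
        Task("cleanup", hook="poststop"),
    ]
    for label, names in shutdown_phases(group):
        print(f"{label}: {names}")
```

This ordering only comes into play when the leader task itself completes or is stopped. A restart that signals every running task at once never reaches it.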
So for cases where we're shutting down all the tasks, what we probably want to do is see if there's a leader=true flag set on any of the running tasks and stop only that task via the RPC, so that the other tasks can stop in the expected order. I'll mark this for roadmapping.
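The selection rule proposed here could look roughly like the following Python sketch. RunningTask and tasks_to_stop are hypothetical names, and the code is only a model of the idea, not how Nomad implements restarts or its RPC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningTask:
    # Hypothetical stand-in for a currently running task; not a Nomad type.
    name: str
    leader: bool = False


def tasks_to_stop(running: List[RunningTask]) -> List[str]:
    """Pick which running tasks a restart should stop directly.

    If any running task is marked leader, stop only the leader(s) and rely
    on the leader shutdown path to stop the rest in the documented order.
    Otherwise fall back to stopping every running task.
    """
    leaders = [t.name for t in running if t.leader]
    if leaders:
        return leaders
    return [t.name for t in running]


if __name__ == "__main__":
    with_leader = [RunningTask("main", leader=True), RunningTask("proxy")]
    without_leader = [RunningTask("a"), RunningTask("b")]
    print(tasks_to_stop(with_leader))
    print(tasks_to_stop(without_leader))
```

With a leader present, only that task is targeted, so the sidecar is stopped by the leader-completion path rather than signaled at the same moment.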
I apologize for sharing the spec where I hadn't specified the leader (I had been tweaking it during testing), but this is great info to be aware of as I hadn't targeted -task.
Nomad version
Output from nomad version
Operating system and Environment details
Issue
When a job restart command is issued, the tasks are restarted without honoring either the lifecycle rules (prestart tasks first, etc.) or the leader flag.
Reproduction steps
Have one task group with prestart tasks, a leader task, and shutdown_delay set.
Restart the job, e.g. nomad job restart sleep-while-lifecycle
Expected Result
The lifecycle of the task group should be honored
Actual Result
Leader and sidecar tasks are sent a signal at the same time
Job file (if appropriate)
Nomad Server logs (if appropriate)
I’ve added the log output from the tasks; you can see they all receive the signal at the same time when I initiate a job restart.
Fri Dec 1 18:24:54 UTC 2023 - Starting. SLEEP_SECS=2
Nomad Client logs (if appropriate)