Closed djenriquez closed 2 years ago
@drewbailey this is related to the work in #6746, I don't think I'm missing anything in my job config, but yea, it does not seem to be working as expected, I'm not sure why.
Hey @djenriquez, thanks for reporting. There seem to be a few things going on, so I wanted to share a reproduction job file to discuss the different scenarios around it.
`shutdown_delay` applies to a service, so it will not be registered if a service doesn't exist. The serviceHook handles waiting for the delay in its preKilling step.
edit: I haven't been able to reproduce the deployment issue. It seems to be waiting on the task's shutdown_delay, and the new alloc will wait in pending until the time has elapsed. Could you confirm that the shutdown_delays are working as expected as long as a service is also being registered?
edit 2:
Regarding bullet 2 and `shutdown_delay` being tied to service registration, we will be treating this as a bug and allow `shutdown_delay` to run regardless of service registration, since they are not explicitly tied together in the job spec.
Hi @drewbailey, yes, sorry, there might have been some confusion in my reporting. The `shutdown_delay` does go into effect if the allocation's definition defines one at the taskgroup level. I guess I was expecting the task's `shutdown_delay` of 60s to go into effect, which is why I reported that during the deployment the shutdown signal was sent after only 10s, and not 60s (60s was the task shutdown delay, 10s was the taskgroup shutdown delay).
However, you clarified that `shutdown_delay` only goes into effect IF the task has a service. Since we register services at the taskgroup level and not the task level (because of network namespacing), the task's `shutdown_delay` will never go into effect. Is this a true statement?
Does your edit no. 2 allow for tasks' shutdown_delay to go into effect regardless of service registration?
Interestingly, I'm looking at your `repro.hcl` and I see that both your task and taskgroup have service stanzas defined. I didn't realize you can define services at both the taskgroup and task level. How would you define checks for tasks in this case? Checks require `PortLabel`, which is not available in the context of a task if a network namespace is defined, right?
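For readers following along, the setup described in this comment can be sketched roughly as the job file below. It is only an illustration under assumptions: the job, group, task, and service names, the image, and the port are placeholders and are not taken from the original job file or from `repro.hcl`. Only the 10s group delay and the 60s task delay come from the report.

```hcl
# Hypothetical job illustrating the reported setup; all names, the image,
# and the port mapping are placeholders.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    # Group-level delay (10s in the report).
    shutdown_delay = "10s"

    network {
      mode = "bridge"

      port "http" {
        to = 8080
      }
    }

    # Service registered at the group level because of network namespacing.
    service {
      name = "example-app"
      port = "http"
    }

    task "web" {
      driver = "docker"

      # Task-level delay (60s in the report). Per the discussion above, this
      # was not honoured at the time because the task has no service of its own.
      shutdown_delay = "60s"

      config {
        image = "example/web:latest"
      }
    }
  }
}
```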
2. Currently for tasks, `shutdown_delay` applies to a service, so it will [not be registered if a service doesn't exist](https://github.com/hashicorp/nomad/blob/3284a34b4289d449970fa510d66c14135277eb66/client/allocrunner/taskrunner/task_runner_hooks.go#L99). The serviceHook handles waiting for the delay in its [preKilling step](https://github.com/hashicorp/nomad/blob/3284a34b4289d449970fa510d66c14135277eb66/client/allocrunner/taskrunner/service_hook.go#L134).
It would be great to have `shutdown_delay` honoured in any type of task. My use case is that I have batch jobs that run really fast, and along with them I have a filebeat sidecar to send the logs to logstash. I have `shutdown_delay` set in the `filebeat` task to give it some time to read and push the logs, but that does not happen and filebeat is killed immediately after the lead task is finished.
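A rough sketch of that use case might look like the following. The job and task names, images, and the 30s value are assumptions made for illustration; the comment above does not give the actual job file. The idea is a short-lived leader task plus a log-shipping sidecar that should be given extra time before it is stopped.

```hcl
# Hypothetical batch job with a leader task and a filebeat sidecar.
# Names, images, and the delay value are placeholders.
job "fast-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    task "main" {
      driver = "docker"

      # When the leader task exits, the other tasks in the group are stopped.
      leader = true

      config {
        image = "example/batch-job:latest"
      }
    }

    task "filebeat" {
      driver = "docker"

      # Intended grace period for shipping the remaining logs. As reported
      # above, this was not honoured without a service on the task.
      shutdown_delay = "30s"

      config {
        image = "example/filebeat:latest"
      }
    }
  }
}
```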
@danlsgiga a really, really bad workaround for now would be to set the shutdown signal to something that won't terminate the process, then use the `kill_timeout` as your actual shutdown delay. Yea I know, but it'll work.
Edit: Nevermind, you said batch job. So that'll run to completion by itself.
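For reference, the workaround mentioned above could be sketched like this. It assumes the task's process does not exit on the chosen signal, which depends entirely on the application; `SIGUSR1` is only a placeholder, and many programs do terminate on it by default. The task name and image are also hypothetical.

```hcl
# Hypothetical task using kill_signal plus kill_timeout as a stand-in
# for shutdown_delay. Only valid if the process ignores the signal.
task "sidecar" {
  driver = "docker"

  # Assumption: the process ignores this signal, so nothing happens when
  # Nomad first signals the task.
  kill_signal = "SIGUSR1"

  # Time Nomad waits after sending kill_signal before force-killing the task.
  kill_timeout = "60s"

  config {
    image = "example/sidecar:latest"
  }
}
```

Note that the effective `kill_timeout` may be capped by the client's `max_kill_timeout` setting, so a long value here might not be honoured in full.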
yeah, didn't think about that option. Thanks for that... but I prefer to wait for the fix because... you know... if I set that hack, it will be there forever 😄
Also having the same issue, does shutdown_delay only apply to service type jobs? I have a system job which is not using the shutdown_delay
Hey there
Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.
Thanks!
Having the same issue. Nomad version: v1.1.2. Job type is service. Deploying with canaries (with the canary count matching the group count), shutdown_delay at the task level is set to 2min, but right after canary promotion I see allocs in a completed state.
UPD shutdown_delay at group level works though
UPD shutdown_delay at group level does not work as expected. After canary promotion, instead of: 1) deregistering services from consul 2) waiting shutdown_delay 3) sending the shutdown signal
I have: 1) waiting shutdown_delay 2) deregistering services from consul 3) sending the shutdown signal
That is a huge disappointment, since shutdown_delay is vital for us to let external LBs update their configs with consul template.
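A minimal job along the lines of this report might look like the sketch below. Everything except the 2 minute task-level delay and the "canary count equals group count" arrangement is an assumption: the names, the count of 2, and the image are placeholders, since the comment does not include the job file.

```hcl
# Hypothetical service job deployed with canaries; names, count, and
# image are placeholders.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 2

    update {
      # Canary count matches the group count, as in the report.
      canary = 2
    }

    task "web" {
      driver = "docker"

      # Task-level delay of 2 minutes, as in the report.
      shutdown_delay = "2m"

      service {
        name = "example-web"
      }

      config {
        image = "example/web:latest"
      }
    }
  }
}
```

With a job like this, the old allocations are the ones expected to go through the deregister, wait, then signal sequence after the deployment is promoted.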
@tgross Sorry to bother you, but is there anything to do with this issue? I am willing to help with any investigations if needed
@sashayakovtseva this issue probably should have been closed when https://github.com/hashicorp/nomad/pull/7663 was merged for 0.11.1.
What you're describing isn't what this ticket is about (not respecting `shutdown_delay` at all unless there were also `service` blocks). So that's why your request got a bit lost. Can you open a new issue describing your problem, along with a reproduction if possible?
I've double checked this issue again. Task level `shutdown_delay` works, group level `shutdown_delay` does not. But this is not a problem for me anymore, I got the behaviour that I needed. Thanks and sorry for my mistake again.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Server and Clients both on v0.10.4.
Operating system and Environment details
Issue
ShutDownDelay as defined in the Nomad docs is not being considered during deployments. Below is a jobfile that describes two tasks within a taskgroup with different ShutDownDelays. However, during deployment, it doesn't seem like they are considered based on two pieces of evidence:
Also, it's worth mentioning that when the TaskGroup's shutdown delay is updated, job plans do not detect the change. When changing the TaskGroup's ShutdownDelay from `null` to `60000000000`, the plan shows this:
Lastly, even though the tasks have shutdown delays, they seem to be completely ignored until the group's shutdown delay is defined; allocations were getting their shutdown signal immediately. Only after adding a 10s shutdown delay to the group did I notice delays, but it did not trickle down to the tasks. After I reverted the group's delay back to `null`, the allocations kept the 10s delay that was set before the revert.
Reproduction steps
`Meta` map)
Throughout all of this, watch as the tasks' shutdown delays are never considered.
Job file (if appropriate)