hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Restarting allocations does not seem to respect lifecycle and shutdown_delay constraints #10578

Open scyd-cb opened 3 years ago

scyd-cb commented 3 years ago

Nomad version

Nomad 1.0.2

Operating system and Environment details

CentOS 8

Issue

When restarting a running allocation via the GUI or CLI:

  1. The task's shutdown_delay is not applied when the task is killed.
  2. Each task appears to be stopped and started individually, without applying lifecycle rules (prestart tasks first, etc.) or the leader flag.

Reproduction steps

  1. Create a job with one task group containing prestart tasks, a leader task, and tasks with shutdown_delay.
  2. Restart the allocation.

Expected Result

Nomad should:

  1. Stop all the tasks, applying shutdown_delay where specified (like a standard allocation stop).
  2. Once all tasks are stopped/dead, start the tasks again, applying lifecycle rules and the leader flag.
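
To make the expected behavior concrete, here is a minimal sketch of the two-phase restart described above. It is not Nomad code. The `Task` model, the `expected_restart_plan` helper, and the task names `init`, `app`, and `worker` are all hypothetical. Stopping the leader before the other tasks is an assumption about what "honoring the leader flag" would mean on stop.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical, simplified view of a task in a task group."""
    name: str
    lifecycle_hook: Optional[str] = None  # "prestart", or None for a main task
    leader: bool = False
    shutdown_delay_s: int = 0


def expected_restart_plan(tasks: List[Task]) -> List[str]:
    """Return the ordered steps requested in this issue (illustrative only)."""
    plan: List[str] = []

    # Phase 1: stop every task before anything is started again.
    # Assumption: the leader is stopped first; the sort is stable, so the
    # remaining tasks keep their declared order.
    for t in sorted(tasks, key=lambda t: not t.leader):
        if t.shutdown_delay_s > 0:
            plan.append(f"wait {t.shutdown_delay_s}s before killing {t.name}")
        plan.append(f"stop {t.name}")

    # Phase 2: start tasks again, prestart tasks before main tasks.
    prestart = [t for t in tasks if t.lifecycle_hook == "prestart"]
    main = [t for t in tasks if t.lifecycle_hook is None]
    for t in prestart + main:
        plan.append(f"start {t.name}")

    return plan


tasks = [
    Task("init", lifecycle_hook="prestart"),
    Task("app", leader=True, shutdown_delay_s=30),
    Task("worker"),
]
plan = expected_restart_plan(tasks)
for step in plan:
    print(step)
```

The point of the sketch is only the ordering: nothing starts until everything has stopped, and shutdown_delay is waited out before the kill.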

Actual Result

Tasks are restarted in no particular order.

Job file (if appropriate)

(Job file attached as an image: Screen Shot 2021-05-12 at 9 50 04 PM)

If possible please post relevant logs in the issue.

2021-05-13T04:53:29.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task2 reason= delay=0s
    client.alloc_runner.task_runner: running exited hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=stats_hook start="2021-05-13 04:53:26.707639186 +0000 UTC m=+35.630485729"
    client.alloc_runner.task_runner: finished exited hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=stats_hook end="2021-05-13 04:53:26.7076722 +0000 UTC m=+35.630518742" duration=33.013µs
    client.alloc_runner.task_runner: running exited hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=consul_services start="2021-05-13 04:53:26.707691538 +0000 UTC m=+35.630538077"
    client.alloc_runner.task_runner: finished exited hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=consul_services end="2021-05-13 04:53:26.707709892 +0000 UTC m=+35.630556439" duration=18.362µs
    client.alloc_runner.task_runner: finished exited hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader end="2021-05-13 04:53:26.707724459 +0000 UTC m=+35.630571004" duration=187.402µs
    client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader reason= delay=0s
    client.alloc_runner.task_runner: setting task state: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader state=pending event=Restarting
    client.alloc_runner.task_runner: running pre kill hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart start="2021-05-13 04:53:26.709188246 +0000 UTC m=+35.632034802"
    client.alloc_runner.task_runner: running prekill hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart name=consul_services start="2021-05-13 04:53:26.709235553 +0000 UTC m=+35.632082116"
    client.alloc_runner.task_runner: finished prekill hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart name=consul_services end="2021-05-13 04:53:26.709265597 +0000 UTC m=+35.632112194" duration=30.078µs
    client.alloc_runner.task_runner: finished pre kill hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart end="2021-05-13 04:53:26.709286298 +0000 UTC m=+35.632132852" duration=98.05µs
    client.alloc_runner: handling task state update: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 done=false
    client.alloc_runner.task_runner: running prestart hooks: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader start="2021-05-13 04:53:26.710412782 +0000 UTC m=+35.633259319"
    client.alloc_runner.task_runner: skipping done prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=validate
    client.alloc_runner.task_runner: running prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=task_dir start="2021-05-13 04:53:26.710592955 +0000 UTC m=+35.633439509"
    client.alloc_runner.task_runner: finished prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=task_dir end="2021-05-13 04:53:26.710622261 +0000 UTC m=+35.633468809" duration=29.3µs
    client.alloc_runner.task_runner: running prestart hook: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30_leader name=logmon start="2021-05-13 04:53:26.711265971 +0000 UTC m=+35.634112520"

2021-05-13T04:53:27.708Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart reason= delay=0s
:                                                              :
:                                                              :                        
2021-05-13T04:53:28.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task1 reason= delay=0s
:                                                              :
:                                                              :
2021-05-13T04:53:29.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task2 reason= delay=0s
:                                                              :
:                                                              :
2021-05-13T04:53:30.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task3 reason= delay=0s
:                                                              :
:                                                              :
2021-05-13T04:53:31.706Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30 reason= delay=0s
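
For anyone trying to confirm the same pattern in their own client logs, the sketch below extracts the "restarting task" lines and prints the order and spacing of the restarts. The sample lines are copied from the timestamped INFO entries above. The regular expression and the `restart_order` helper are assumptions written for this log format, not part of Nomad.

```python
import re
from datetime import datetime
from typing import List, Tuple

# Timestamped INFO lines copied from the log excerpt in this issue.
LOG = """\
2021-05-13T04:53:27.708Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart reason= delay=0s
2021-05-13T04:53:28.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task1 reason= delay=0s
2021-05-13T04:53:29.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task2 reason= delay=0s
2021-05-13T04:53:30.705Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=main_task3 reason= delay=0s
2021-05-13T04:53:31.706Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=fa89cf36-e47b-c9ea-a9c8-c1d4ffcef1d0 task=prestart_with_shutdown_delay_30 reason= delay=0s
"""

# Hypothetical pattern: captures the timestamp and the task name.
RESTART_RE = re.compile(
    r"^(\S+) \[INFO\]\s+client\.alloc_runner\.task_runner: "
    r"restarting task: .*\btask=(\S+)"
)


def restart_order(log_text: str) -> List[Tuple[datetime, str]]:
    """Return (timestamp, task) pairs for every 'restarting task' line."""
    events = []
    for line in log_text.splitlines():
        m = RESTART_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
            events.append((ts, m.group(2)))
    return events


events = restart_order(LOG)
order = [task for _, task in events]
# Seconds elapsed between consecutive restarts.
gaps = [
    (later[0] - earlier[0]).total_seconds()
    for earlier, later in zip(events, events[1:])
]

for (ts, task), gap in zip(events, [0.0] + gaps):
    print(f"{ts.time()}  +{gap:.3f}s  {task}")
```

On this excerpt the tasks are restarted one at a time, roughly one second apart, each with delay=0s. Judging only by its name, prestart_with_shutdown_delay_30 is restarted after the three main tasks and without its 30-second shutdown delay.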

This prevents us from restarting our allocations, since the tasks need to be started and shut down in a specific order.

Thanks for reviewing my ticket.

drewbailey commented 3 years ago

Hi @scyd,

This looks related to, or a duplicate of, #9464 and #9841. I'll leave this issue open since I'm not sure whether the others are using shutdown_delay, which is something we'll take a look at when addressing those issues.

scyd-cb commented 3 years ago

@drewbailey thanks for accepting this issue. To summarize, allocation restart should honor these 3 features: