hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Add ability to restart all running tasks/allocs of a job #698

Closed supernomad closed 1 year ago

supernomad commented 8 years ago

So I would love the ability to restart tasks, at the very least restart an entire job, but preferably single allocations. This is very useful for when a particular allocation or job happens to get in a bad state.

I am thinking something like nomad restart <job> or nomad alloc-restart <alloc-id>.

One of my specific use cases: I have a cluster of RabbitMQ nodes, and at some point one of the nodes gets partitioned from the rest of the cluster. I would like to restart that specific node (an allocation, in Nomad parlance), or be able to perform a rolling restart of the entire cluster (a job, in Nomad parlance).

Does this sound useful?

dadgar commented 8 years ago

It's not a bad idea! In the meantime, if you just want to restart the job you can stop it and then run it again.

mkabischev commented 8 years ago

I think it would be a good feature. Right now I can stop and then run the job, but it won't be graceful.

gpaggi commented 8 years ago

+1 Another use case: most of our services read their configuration either from static files or Consul, and when any of the properties change the services need to be rolling-restarted. Stopping and starting the job would cause a service interruption, and a blue/green deployment for a configuration change is a bit overkill.

@Supernomad did you get a chance to look into it?

jtuthehien commented 8 years ago

+1 for this feature

c4milo commented 8 years ago

This is much needed in order to effectively reload configurations without downtime. As mentioned above, blue/green doesn't really scale well when you have many tasks, and it is somewhat unpredictable since it depends on the specific app being deployed coping well with multiple versions of itself running at the same time.

liclac commented 8 years ago

I'd very much like to see this, for a slightly different use case:

I have something running as a system job (in this case, a wrapper script that essentially does docker pull ... && docker run ...; it needs to mount a host directory to work, which is a workaround for #150). To roll out an update, I currently need to change a dummy environment variable, or Nomad won't know anything changed.

mohitarora commented 8 years ago

+1

dennybaa commented 8 years ago

Why not? Guys, please add it; it should be trivial.

jippi commented 7 years ago

👍 on this feature as well :)

xyzjace commented 7 years ago

:+1: For us, too.

ashald commented 7 years ago

We would be happy to see this feature as well. Sometimes... services just need a manual restart. :( Would be nice if it was possible to restart individual tasks or task groups.

rokka-n commented 7 years ago

Having a rolling "restart" option is a very valid case for tasks/jobs.

jippi commented 7 years ago

What I've done as a hack is to have a key_or_default inline template{} stanza in the task stanza for each of these keys, simply writing them to some random temp file,

each with change_mode = "restart" (or "signal" with the appropriate change_signal value),

so I can do a manual rolling restart of any Nomad task by simply changing or creating one of those Consul keys in my cluster programmatically... at my own pace, to do a controlled restart too :)

Writing to Consul KV /apps/${NOMAD_JOB_NAME} will restart all tasks in the job.
Writing to Consul KV /apps/${NOMAD_JOB_NAME}/${NOMAD_TASK_NAME} will restart all instances of that task within the job.
Writing to Consul KV /apps/${NOMAD_JOB_NAME}/${NOMAD_TASK_NAME}/${NOMAD_ALLOC_INDEX} will restart one specific task index within the job.
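
For illustration, a minimal sketch of what one of those template stanzas could look like (the Consul key path, destination file, and task name are placeholders, not jippi's actual setup):

    task "app1" {
      # Re-rendering this template restarts the task; the rendered file itself is never read.
      template {
        # keyOrDefault keeps the task starting even when the Consul key does not exist yet.
        data        = "{{ keyOrDefault (printf \"apps/%s/%s/%s\" (env \"NOMAD_JOB_NAME\") (env \"NOMAD_TASK_NAME\") (env \"NOMAD_ALLOC_INDEX\")) \"\" }}"
        destination = "local/restart-trigger"
        change_mode = "restart"
      }
    }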

ashald commented 7 years ago

@jippi that's super smart! Thanks, I guess I'll use that for the time being. :)

But that level of control is something that would be great to see in Nomad's native API.

P.S.: That reminds me of my hack/workaround to secure any resource in Nginx (e.g., the Nomad API) using Consul ACL tokens with auth_request against some read-only API endpoints. :D

pznamensky commented 7 years ago

Would be useful for us too.

dansteen commented 7 years ago

This would also be useful for the new deployment stuff. The ability to re-trigger a deployment would be great.

JewelPengin commented 7 years ago

Throwing in my +1 but also my non-consul based brute force way:

export NOMAD_ADDR=http://[server-ip]:[admin-port]

curl $NOMAD_ADDR/v1/job/:jobId | jq '.TaskGroups[0].Count = 0 | {"Job": .}' | curl -X POST -d @- $NOMAD_ADDR/v1/job/:jobId

sleep 5

curl $NOMAD_ADDR/v1/job/:jobId | jq '.TaskGroups[0].Count = 1 | {"Job": .}' | curl -X POST -d @- $NOMAD_ADDR/v1/job/:jobId

It requires the jq binary to be installed (which I would highly recommend anyway), but it will first grab the job, modify the task group count to 0, post it back to update, then do it all over again with the count back at 1 (or whatever number is needed).

Again, kinda brute force and not as elegant as @jippi's, but it works if I need to get something done quickly.

danielwpz commented 7 years ago

Really useful feature! Please do it :D

sullivanchan commented 7 years ago

I have done some verification following @jippi's suggestion, with data = "{{ key apps/app1/app1/${NOMAD_ALLOC_INDEX} }}" in the template stanza, but the job always stays pending on start. It seems environment variables are only available via {{ env "ENV_VAR" }} (https://www.nomadproject.io/docs/job-specification/template.html#inline-template), so I want to know how to interpolate an environment variable into the key string. Does anybody have the same question?

mildred commented 7 years ago

This is a standard Go template:

          {{keyOrDefault (printf "apps/app1/app1/%s" (env "NOMAD_ALLOC_INDEX")) ""}}
mildred commented 7 years ago

I suggest you use keyOrDefault instead of just key; plain key will prevent your service from starting unless the key exists in Consul.

thevilledev commented 6 years ago

As a workaround I've been using Nomad's meta stanza to control restarts. Meta keys are exposed to tasks as environment variables, so whenever the meta block is changed all related tasks (or task groups) are restarted. Meta blocks can be defined at the top level of the job, per task group, or per task.

For example, to restart all tasks in all task groups you could run this:

$ nomad inspect some-job | \
jq --arg d "$(date)" '.Job.Meta={restarted_at: $d}' | \
curl -X POST -d @- nomad.service.consul:4646/v1/jobs

This follows the update stanza as well.
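
For reference, a rough sketch of where such a meta block could live in the job file (the job name and key are illustrative, not from the comment above):

    job "some-job" {
      # Changing any meta value creates a new job version, so the affected
      # tasks are restarted/redeployed according to the job's update stanza.
      meta {
        restarted_at = "2018-02-22T10:00:00Z"
      }

      # ... groups and tasks stay unchanged ...
    }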

maihde commented 6 years ago

I have made a first pass at implementing this; you can find my changes here.

Basically, I've added a -restart flag to nomad run. For example:

nomad run -restart myjob.nomad

When the -restart flag is applied it triggers an update, the same as if you had changed the meta block, so you get the benefits of canaries and rolling restarts without having to actually change the job file.

If there is agreement that this implementation is going down the right path, I will go to the trouble of writing tests and making sure it works for the system scheduler, parameterized jobs, etc.

jovandeginste commented 6 years ago

Why not implement this without the need for a plan? Basically, nomad restart myjobname (which should use the current plan)

As a sysop, I sometimes need to force a restart of a job, but I don't have the plan (and don't want to go through nomad inspect | parse)

rkettelerij commented 6 years ago

Agreeing with @jovandeginste here. A restart shouldn't need a job definition in my opinion, since the job definition is already known inside Nomad.

jovandeginste commented 6 years ago

I do see the case for re-submitting an existing job with a plan that may or may not have changed while always wanting to force a restart (of the whole job) on submit. So both are interesting options.

maihde commented 6 years ago

I agree with your suggestion to allow 'nomad job restart JOBID' without having the plan. I will start working on that next.

I also agree with your observation that there are use-cases for both.

Thanks for the feedback!

maihde commented 6 years ago

I just pushed to my fork code that adds support for nomad restart JOBID in addition to nomad run -restart JOBFILE. This new code should address the request from @jovandeginste.

rkettelerij commented 6 years ago

@maihde looks great, are you planning to make a PR from your fork?

maihde commented 6 years ago

Here it is (https://github.com/hashicorp/nomad/pull/3949)

marcosnils commented 6 years ago

It's not a bad idea! In the meantime, if you just want to restart the job you can stop it and then run it again.

@dadgar Is there a way to do this without incurring downtime? Stopping and running the job won't honor the update stanza.

maihde commented 6 years ago

@marcosnils the workaround I've used is placing something in the meta stanza that can be changed as described in this post.

https://github.com/hashicorp/nomad/issues/698#issuecomment-367789499

Of course this is kinda annoying, hence the reason I made the pull-request that added the restart behavior directly.

upccup commented 5 years ago

Hope it's coming soon.

camerondavison commented 5 years ago

looks like https://github.com/hashicorp/nomad/pull/5502 is out for allocs 🎉

tgross commented 4 years ago

Doing some issue cleanup: this was released in Nomad 0.9.2. https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#092-june-5-2019

multani commented 4 years ago

@tgross Actually ... not exactly (unless I missed something!)

Although it's cool to be able to restart specific allocations (which landed in 0.9.2), it would be very cool if there were a simple way to restart all the allocations while honoring the restart/upgrade properties of the job.

In our case, almost every time we had to force restart a particular allocation it was because all of them were actually stuck in some kind of buggy behavior, and we ended up restarting all of them anyway. It can definitely be scripted, but it would also make sense (IMO!) in terms of "UI" (web or CLI) to have something simple to restart the whole job :+1:

rkettelerij commented 4 years ago

In our case, almost every time we had to force restart a particular allocation it was because all of them were actually stuck in some kind of buggy behavior, and we ended up restarting all of them anyway. It can definitely be scripted, but it would also make sense (IMO!) in terms of "UI" (web or CLI) to have something simple to restart the whole job 👍

I second that.

tgross commented 4 years ago

Fair enough. I'll re-open. And although re-running nomad job run ends up restarting all the allocations, it isn't quite the same, as it reschedules them as well.

joec4i commented 4 years ago

Fair enough. I'll re-open. And although re-running nomad job run ends up restarting all the allocations, it isn't quite the same, as it reschedules them as well.

I just want to mention that nomad job run would only re-deploy canaries if canary deployment is enabled and there are canary allocations. It'd be great if a job-level restart were supported.

analytically commented 4 years ago

I'd second this - I run (stateful) Airflow as Docker containers (web, workers, scheduler) where the DAG files are mounted as volumes (using the artifact stanza), and we'd like to restart all allocations from our CI upon a git push.

taiidani commented 4 years ago

I ran into this problem because I'm using the "exec" driver and SSHing subsequent updates to my binary. Sending another run won't restart the process because the job specification hasn't changed.

Would love a run -restart option so that I don't need to build 2 separate workflows for initial provision + subsequent code deploys!

sbrl commented 4 years ago

Just ran into this issue too. My use case is that I want to restart jobs / tasks in order to update to a newer version of a Docker container.

For context, I'm attempting to set up the following workflow:

  1. Check to see if the container needs rebuilding
  2. If necessary, rebuild the Docker container
  3. If the Docker container was rebuilt, restart all dependent Nomad jobs / tasks

yishan-lin commented 4 years ago

On our radar - thanks all for the input!

scorsi commented 3 years ago

Can we hope to see that feature implemented in Nomad 1.0? :)

mxab commented 3 years ago

I currently do:

nomad job status my-job-name | grep -E 'run\s+running' | awk '{print $1}' | xargs -t -n 1 nomad alloc restart

Use ... | xargs -P 5 ... to run 5 restarts in parallel.

geokollias commented 3 years ago

Any update on this issue would be great! Thank you!

Oloremo commented 3 years ago

looking forward to this as well

datadexer commented 3 years ago

same here!

OmarQunsul commented 3 years ago

I am also surprised this feature doesn't exist. In Docker Swarm, for example, there is docker service update --force SERVICE_NAME. I was expecting something under the job command, like nomad job restart, that restarts each alloc without downtime for the whole job.

jpasichnyk commented 3 years ago

+1 for this feature. We just moved to Nomad 1.x and are trying to move to the built-in Nomad UI (from HashiUI - https://github.com/jippi/hashi-ui), and having the ability to restart a job from there would be great. Sometimes we have application instances that go unhealthy from a system perspective but are still running fine in Docker. In this case we don't want to force restart them, since depending on the reason they are unhealthy they may not be able to restart safely. Restarting the whole job via a rolling restart is a great way to fix this state, but there is no way for us to do that other than building a new container version and promoting a new job over the existing one (even if the bits being deployed are identical). HashiUI can restart via a rolling restart or a stop/start; the Nomad UI and CLI should support doing this as well.