Open · hynek opened this issue 7 years ago
@hynek Yeah, this is an interesting one. Since the deployment has failed, the scheduler avoids placing new instances of the job because it assumes they will fail as well. It's trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler to do it anyway.
We potentially need a `nomad run -force` command to override.
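For anyone who finds this later: newer Nomad releases added commands that can serve as this kind of override. A minimal sketch, assuming a job named `example` (the job name and deployment ID are placeholders):

```sh
# Inspect the deployment the scheduler considers failed.
nomad job deployments example
nomad deployment status <deployment-id>

# Create a new evaluation and force-reschedule failed allocations,
# overriding the scheduler's reluctance to place them again.
nomad job eval -force-reschedule example
```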
@dadgar We need this too! This is what happened:
Our task fails due to a broken connection to the underlying database, which leaves the allocation in a failed state. A `nomad job run` wouldn't allow me to bring it back (after fixing the underlying database issue). I have to stop the job, wait for it to be stopped, and rerun it.
I'd like to be able to restart the job without killing all tasks.
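For the record, the manual workaround described above amounts to something like this (a sketch; the job name `example` and the file name are placeholders):

```sh
# Stop the job and wait for all of its allocations to shut down...
nomad job stop example

# ...then submit it again from scratch, recreating every task.
nomad job run example.nomad
```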
When Docker's storage is on a NAS that happens to freeze during a deployment, the deployment will fail (I wouldn't expect otherwise). After fixing the NAS, I'd like to re-deploy without having to alter the job file, which is currently not possible.
In short: being able to restart a job's allocations without killing all tasks would prevent downtime when issues originate from other sources.
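Until something like `-force` exists, the usual workaround is to make a no-op change to the job file, for example bumping a `meta` value, so that Nomad registers a new job version and runs a fresh deployment. A sketch; the `redeploy` key is an arbitrary name I made up:

```hcl
job "example" {
  # Bumping this value creates a new job version, so "nomad job run"
  # schedules a fresh deployment even though nothing else changed.
  meta {
    redeploy = "2"
  }

  # ... rest of the job is unchanged ...
}
```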
Is there a way to even do this currently? I tried all the `nomad deployment` commands, and each said it cannot operate on a terminal deployment ("can't resume terminal deployment", etc.).
@gregory112 Deployments that are complete won't ever get run again. Depending on your specific circumstance, the `nomad alloc stop` command may be able to help you out here by forcing a reschedule of a broken allocation.
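A minimal sketch of that flow, assuming a job named `example` (the allocation ID is a placeholder):

```sh
# Find the ID of the broken allocation.
nomad job status example

# Stop just that allocation; Nomad reschedules a replacement
# without touching the job's other allocations.
nomad alloc stop <alloc-id>
```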
I give a +1 for `nomad job run -force` then, as it would really help when numerous allocations fail, especially for jobs that have more than one instance. We use a CI server to deploy most of our jobs, so manually interacting with allocations and stopping them is quite a chore.
If my understanding is correct, a deployment with `auto_revert` disabled, on a job spec that only reschedules (and doesn't restart), will on a long enough timeline end up with zero running tasks in that deployment.
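For illustration, the configuration described above would look roughly like this (a sketch with made-up values, not a recommendation; once the `reschedule` attempts are exhausted and there are no local restarts, failed allocations are never replaced):

```hcl
job "example" {
  update {
    # No automatic rollback to the last stable version on failure.
    auto_revert = false
  }

  group "app" {
    # Never restart a failed task in place on the same node...
    restart {
      attempts = 0
      mode     = "fail"
    }

    # ...only reschedule it, and only a limited number of times.
    # Once these attempts are used up, the allocation stays dead.
    reschedule {
      attempts  = 3
      interval  = "1h"
      unlimited = false
    }
  }
}
```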
@dadgar -
> Since the deployment has failed, the scheduler avoids placing new instances of the job because it assumes they will fail as well. It's trying to protect you from potentially taking out your job. But in this case you really want to tell the scheduler to do it anyway.
Is this called out in the docs anywhere? I just found out this behaviour is the source of some long-running problems I'm experiencing, and I don't want anyone else to have the same issues.
@lattwood As it turns out, we were just talking about that internally, and we definitely want to put together a doc that brings together all of deployments and the `reschedule`, `restart`, and `update` blocks.
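In the meantime, here is a rough sketch of how the three blocks divide the work, as I understand it (all values are illustrative only):

```hcl
job "example" {
  # update: governs deployments; auto_revert rolls back to the
  # last stable job version when a deployment fails.
  update {
    max_parallel = 1
    auto_revert  = true
  }

  group "app" {
    # restart: retries a failed task locally, on the same node.
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    # reschedule: once local restarts are exhausted, place the
    # allocation on a different node instead.
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }
  }
}
```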
#### Nomad version
Nomad v0.6.0

#### Operating system and Environment details
Ubuntu Xenial running in LXD

#### Issue
So this is a bit more obscure than I initially thought.
We had a bit of a rough time (hilariously, Consul running in an LXD container hung up a whole metal server) and I had to kill off a node in the middle of a deployment because it just hung while supposedly downloading a Docker container.
At this point it had already placed an alloc on another node, but I can't get it to retry placing the second one. Both `nomad run` and `nomad plan` just pretend everything is fine.
I was able to fill up clients that we lost during an outage tonight, so it seems to be specific to the failed deployment?
I could only fix it by forcing a change in the plan (I just rebuilt the container).
#### Nomad Status
#### Nomad plan
#### Nomad run
#### Nomad alloc-status for the broken alloc
#### Nomad eval-status
#### Job file (if appropriate)