pznamensky opened this issue 3 years ago
Hi @pznamensky! This is definitely a tricky one. Setting aside the question of stopping the old version of the job, would it be OK in your case to run the old job and the new job concurrently? If so, you might be able to work around this by using a parameterized job and then dispatching it.
Using this example job:
job "example" {
datacenters = ["dc1"]
type = "batch"
parameterized {
payload = "optional"
}
group "worker" {
task "worker" {
driver = "docker"
config {
image = "busybox:1"
command = "/bin/sh"
args = ["-c", "echo 'this looks like work'; sleep 300"]
}
resources {
cpu = 255
memory = 128
}
dispatch_payload {
file = "something.txt"
}
}
}
}
I run the job and then dispatch it multiple times:
$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1613491256-bd343422
Evaluation ID     = 86227431
==> Monitoring evaluation "86227431"
    Evaluation triggered by job "example/dispatch-1613491256-bd343422"
    Allocation "d127d7b9" created: node "0565ecff", group "worker"
==> Monitoring evaluation "86227431"
    Allocation "d127d7b9" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "86227431" finished with status "complete"

$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1613491259-62564958
Evaluation ID     = 4f8b9ebc
==> Monitoring evaluation "4f8b9ebc"
    Evaluation triggered by job "example/dispatch-1613491259-62564958"
    Allocation "7ec275ea" created: node "0565ecff", group "worker"
==> Monitoring evaluation "4f8b9ebc"
    Allocation "7ec275ea" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4f8b9ebc" finished with status "complete"
This results in multiple dispatched children of the job running concurrently:
$ nomad job status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2021-02-16T16:00:42Z
example/dispatch-1613491256-bd343422  batch                50        running  2021-02-16T16:00:56Z
example/dispatch-1613491259-62564958  batch                50        running  2021-02-16T16:00:59Z
If I then edit the job and re-run it, it only changes the "parent" job, and I can dispatch yet a third concurrent job:
$ nomad job run ./bin/jobs/dispatch.nomad
Job registration successful

$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1613491290-8161a16d
Evaluation ID     = 43cc1b22
==> Monitoring evaluation "43cc1b22"
    Evaluation triggered by job "example/dispatch-1613491290-8161a16d"
    Allocation "79ac2d69" created: node "0565ecff", group "worker"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "43cc1b22" finished with status "complete"
$ nomad job status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2021-02-16T16:01:50Z
example/dispatch-1613491256-bd343422  batch                50        running  2021-02-16T16:00:56Z
example/dispatch-1613491259-62564958  batch                50        running  2021-02-16T16:00:59Z
example/dispatch-1613491290-8161a16d  batch                50        running  2021-02-16T16:01:30Z
Hi @tgross! Unfortunately, in our case we can't run multiple jobs at the same time. But your workaround looks quite interesting; I believe someone with a similar problem will find it useful.
Ok. Unfortunately there's not a better way to do that as far as I can tell. The shutdown_delay is itself a workaround, so I'm going to reword the title of this issue as a feature request for "one at a time" batch scheduling for the roadmap.
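In the meantime, one rough way to approximate "one at a time" behaviour on top of the dispatch workaround above is to gate each dispatch on whether a previously dispatched child is still running. This is only a sketch that greps the nomad job status listing shown earlier; the job name and the text matching are assumptions, not anything Nomad provides natively:

#!/bin/sh
# Sketch only: dispatch a new child of the parameterized "example" job
# only if none of its previously dispatched children still shows as
# running in the `nomad job status` listing.
if nomad job status | grep 'example/dispatch-' | grep -q 'running'; then
  echo "a dispatched child is still running; not dispatching a new one"
  exit 0
fi

nomad job dispatch example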
Nomad version
Nomad v0.12.9 (45c139e53f2407a44b1290385b5818b46ea3a62c)
Issue
We've got a long-running batch job (it runs for several days) which we definitely don't want to stop during a deploy. But it looks like there is no straightforward way to avoid stopping the currently running batch job when a new version is submitted. So we're trying to work around it with shutdown_delay (since it is supported in batch jobs: https://github.com/hashicorp/nomad/issues/7271). And it works ... but not as precisely as we'd like.

The main problem is that shutdown_delay doesn't honor task state: it blocks execution of the new version even though the old one has already finished. For instance:

- a batch job with shutdown_delay = 3d is started
- the job finishes before the delay has elapsed (say, two days in)
- but because shutdown_delay hasn't expired yet, the new job version still has to wait another day

It would be perfect if shutdown_delay took into account that the batch job has already finished, and started the new job version immediately after the current one completes.
Reproduction steps

1. Start the following job (the docker container will be stopped after 120s).
2. Submit a new version of the job (just change JOB_VERSION to something else).
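As a rough illustration of these steps, such a job might look like the sketch below. The job name, the busybox image, the JOB_VERSION value, and the 72h delay (standing in for the "3d" mentioned earlier, since Go-style durations have no "d" unit) are all assumptions rather than the reporter's actual spec:

job "long-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "worker" {
    task "worker" {
      driver = "docker"

      # Changing this value and re-running the job registers a new version (step 2).
      env {
        JOB_VERSION = "1"
      }

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        # The container finishes on its own after 120 seconds (step 1).
        args    = ["-c", "echo 'long batch work'; sleep 120"]
      }

      # Delay the kill signal so that submitting a new version does not stop
      # the running task immediately; the 72h value is illustrative.
      shutdown_delay = "72h"

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}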
Some closing thoughts:

- Perhaps it would be better to support the update {} stanza for batch jobs instead of the shutdown_delay workaround; if you think so, please ask me to file a new issue.
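For context on that closing thought, this is roughly the update stanza Nomad already provides for service jobs today (the field values below are illustrative); the suggestion would be for something analogous that batch jobs could use instead of the shutdown_delay workaround:

# Illustrative only: the update stanza as it exists for service jobs.
# It is not currently honored for batch jobs, which is what this
# closing thought is asking about.
update {
  max_parallel     = 1
  health_check     = "task_states"
  min_healthy_time = "10s"
  healthy_deadline = "5m"
  auto_revert      = false
  canary           = 0
}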