hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

one-at-a-time batch scheduling #10015

Open pznamensky opened 3 years ago

pznamensky commented 3 years ago

Nomad version

Nomad v0.12.9 (45c139e53f2407a44b1290385b5818b46ea3a62c)

Issue

We've got a long-running batch job (it runs for several days), which we definitely don't want to stop during a deploy. But it looks like there is no straightforward way to avoid stopping a currently running batch job after submitting a new version. So we're trying to work around it with shutdown_delay (since it is supported in batch jobs: https://github.com/hashicorp/nomad/issues/7271). And it works, but not as accurately as we'd like. The main problem is that shutdown_delay doesn't honor task state: it blocks execution of the new version even though the old one has already finished. See the reproduction steps below for an example.

It would be perfect if shutdown_delay took into account that the batch task has already finished and started the new job version immediately after the current one completes.

Reproduction steps

  1. start the following job (the docker container exits on its own after 120s):

    job "example" {
    datacenters = ["dc1"]
    type = "batch"
    
    group "example" {
    task "example" {
      driver = "docker"
      env {
          DELAY_BEFORE_EXIT_SEC = "120s"
          EXIT_CODE = "0"
    
          JOB_VERSION = "0"
      }
      shutdown_delay = "10m"
      config {
        # the source code: https://github.com/pznamensky/docker-fail-image/tree/1.0
        image = "pznamensky/fail:1.0"
      }
    
      resources {
        cpu    = 500
        memory = 256
      }
    }
    }
    }
  2. wait until the job is started
  3. submit a new version of the job (just change JOB_VERSION to something else):

    job "example" {
    datacenters = ["dc1"]
    type = "batch"
    
    group "example" {
    task "example" {
      driver = "docker"
      env {
          DELAY_BEFORE_EXIT_SEC = "120s"
          EXIT_CODE = "0"
    
          JOB_VERSION = "1"
      }
      shutdown_delay = "10m"
      config {
        # the source code: https://github.com/pznamensky/docker-fail-image/tree/1.0
        image = "pznamensky/fail:1.0"
      }
    
      resources {
        cpu    = 500
        memory = 256
      }
    }
    }
    }
  4. wait 120 seconds - the task exits and the job is actually finished
  5. but Nomad still considers it running, and the new version starts only after the remaining 8 minutes of the shutdown_delay (10m - 120s) have elapsed (see the timeline sketch below)
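
To make the timing concrete, the approximate timeline with the job above looks like this (a sketch; exact offsets depend on scheduling latency):

    t=0s    new version (JOB_VERSION = "1") submitted; the old allocation enters its shutdown_delay
    t=120s  the old task exits on its own (DELAY_BEFORE_EXIT_SEC elapsed) - the work is done
    t=600s  shutdown_delay expires; only now does Nomad start the new version's allocation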

Some closing thoughts:

  1. perhaps I'm missing something and this type of deployment should be done another way, but I don't have any ideas
  2. it would probably be better to implement this deployment logic inside the update {} stanza instead of working around it with shutdown_delay - if you think so, please ask me to file a new issue (a purely illustrative sketch follows this list)
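
For illustration only, the idea could look something like this (wait_for_batch_completion is a made-up parameter, not real Nomad syntax):

    update {
      # hypothetical option: hold the new job version until the
      # currently running batch allocation exits on its own
      wait_for_batch_completion = true
    }
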
tgross commented 3 years ago

Hi @pznamensky! This is definitely a tricky one. If it weren't for stopping the old version of the job, would it be OK in your case to run the old and new versions concurrently? If so, you might be able to work around this by using a parameterized job and dispatching it.

Using this example job:

job "example" {
  datacenters = ["dc1"]

  type = "batch"

  parameterized {
    payload = "optional"
  }

  group "worker" {

    task "worker" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "echo 'this looks like work'; sleep 300"]
      }

      resources {
        cpu    = 255
        memory = 128
      }

      dispatch_payload {
        file = "something.txt"
      }
    }

  }
}

I run the job and then dispatch it multiple times:

$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1613491256-bd343422
Evaluation ID     = 86227431

==> Monitoring evaluation "86227431"
    Evaluation triggered by job "example/dispatch-1613491256-bd343422"
    Allocation "d127d7b9" created: node "0565ecff", group "worker"
==> Monitoring evaluation "86227431"
    Allocation "d127d7b9" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "86227431" finished with status "complete"

$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1613491259-62564958
Evaluation ID     = 4f8b9ebc

==> Monitoring evaluation "4f8b9ebc"
    Evaluation triggered by job "example/dispatch-1613491259-62564958"
    Allocation "7ec275ea" created: node "0565ecff", group "worker"
==> Monitoring evaluation "4f8b9ebc"
    Allocation "7ec275ea" status changed: "pending" -> "running" (Tasks are running)
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4f8b9ebc" finished with status "complete"

This results in multiple dispatched children of the job running concurrently:

$ nomad job status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2021-02-16T16:00:42Z
example/dispatch-1613491256-bd343422  batch                50        running  2021-02-16T16:00:56Z
example/dispatch-1613491259-62564958  batch                50        running  2021-02-16T16:00:59Z

If I then edit the job and re-run it, it only changes the "parent" job, and I can dispatch yet a third concurrent job:

$ nomad job run ./bin/jobs/dispatch.nomad
Job registration successful

$ nomad job dispatch example
Dispatched Job ID = example/dispatch-1613491290-8161a16d
Evaluation ID     = 43cc1b22

==> Monitoring evaluation "43cc1b22"
    Evaluation triggered by job "example/dispatch-1613491290-8161a16d"
    Allocation "79ac2d69" created: node "0565ecff", group "worker"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "43cc1b22" finished with status "complete"

$ nomad job status
ID                                    Type                 Priority  Status   Submit Date
example                               batch/parameterized  50        running  2021-02-16T16:01:50Z
example/dispatch-1613491256-bd343422  batch                50        running  2021-02-16T16:00:56Z
example/dispatch-1613491259-62564958  batch                50        running  2021-02-16T16:00:59Z
example/dispatch-1613491290-8161a16d  batch                50        running  2021-02-16T16:01:30Z
pznamensky commented 3 years ago

Hi @tgross! Unfortunately, in our case we can't run multiple jobs at the same time. But your workaround looks quite interesting; I believe someone with a similar problem will find it useful.

tgross commented 3 years ago

Ok. Unfortunately, there isn't a better way to do this as far as I can tell. The shutdown_delay approach is itself a workaround, so I'm going to reword the title of this issue as a feature request for "one-at-a-time" batch scheduling for the roadmap.
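
For anyone who needs strictly one-at-a-time execution with the dispatch workaround in the meantime, a rough sketch is to gate each dispatch on whether a child of the parent job is still running. This greps the human-readable nomad job status output shown above and assumes the parent job is named example; a more robust version would query the HTTP API:

    #!/bin/sh
    # Dispatch a new child of "example" only if no previously
    # dispatched child is still in the "running" state.
    if nomad job status | grep '^example/dispatch-' | grep -q 'running'; then
      echo "a dispatched child is still running; not dispatching"
      exit 1
    fi
    nomad job dispatch example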