hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

`job plan` on a system job with a stopped allocation returns wrong exit code #20502

Open jamesooo opened 6 months ago

jamesooo commented 6 months ago

Nomad version

Nomad v1.5.17
BuildDate 2024-04-16T10:06:37Z
Revision 1d9f249e552e570d3484dd00901e31351f7edc6f

Operating system and Environment details

Ubuntu Focal

Issue

The behavior of `nomad job plan` for system jobs appears to have changed between 1.3 and 1.5: when a single allocation of a system job is stopped, `plan` no longer signals through its exit code that a change in allocations is required.

Reproduction steps

Start a Nomad job

$ nomad job run serv.hcl
==> 2024-04-30T19:35:07Z: Monitoring evaluation "ed08e485"
    2024-04-30T19:35:07Z: Evaluation triggered by job "serv"
    2024-04-30T19:35:08Z: Allocation "4bc6e1df" created: node "58b3c381", group "serv"
    2024-04-30T19:35:08Z: Allocation "d2cdc694" created: node "3e0df965", group "serv"
    2024-04-30T19:35:08Z: Allocation "d3642f20" created: node "6975ee9a", group "serv"
    2024-04-30T19:35:08Z: Evaluation status changed: "pending" -> "complete"
==> 2024-04-30T19:35:08Z: Evaluation "ed08e485" finished with status "complete"

Stop a single allocation of that job

admin@delegate:~$ nomad job status serv
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
4bc6e1df  58b3c381  serv        2        run      running   10s ago  10s ago
d2cdc694  3e0df965  serv        2        run      running   10s ago  10s ago
d3642f20  6975ee9a  serv        2        run      running   10s ago  10s ago
admin@delegate:~$ nomad alloc stop 4bc6e1df
==> 2024-04-30T19:35:26Z: Monitoring evaluation "fd183a0c"
    2024-04-30T19:35:26Z: Evaluation triggered by job "serv"
    2024-04-30T19:35:27Z: Evaluation status changed: "pending" -> "complete"
==> 2024-04-30T19:35:27Z: Evaluation "fd183a0c" finished with status "complete"

Confirm allocation is stopped

admin@build-cluster-data:~$ nomad job status serv
...
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
4bc6e1df  58b3c381  serv        2        stop     complete  42s ago    18s ago
d2cdc694  3e0df965  serv        2        run      running   42s ago    42s ago
d3642f20  6975ee9a  serv        2        run      running   42s ago    42s ago

Note that because the job is type = "system", no new allocation is started to take the place of the stopped one.

Now run plan against the hcl file and observe the exit status

admin@build-cluster-data:~$ nomad job plan serv.hcl 
Job: "serv"
Task Group: "serv" (1 create, 2 ignore)
  Task: "serv"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 131568
To submit the job with version verification run:

nomad job run -check-index 131568 serv.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
admin@build-cluster-data:~$ echo $?
0

Previous Nomad versions returned 1 here, indicating that running the job would place a new allocation. Despite the exit code of 0, run still places the new allocation:

admin@build-cluster-data:~$ nomad job run serv.hcl 
==> 2024-04-30T19:52:22Z: Monitoring evaluation "a149d2a2"
    2024-04-30T19:52:22Z: Evaluation triggered by job "serv"
    2024-04-30T19:52:23Z: Allocation "05e457e9" created: node "58b3c381", group "serv"
    2024-04-30T19:52:23Z: Evaluation status changed: "pending" -> "complete"
==> 2024-04-30T19:52:23Z: Evaluation "a149d2a2" finished with status "complete"
admin@build-cluster-data:~$ nomad job status serv
...
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
05e457e9  58b3c381  serv        2        run      running   8s ago     4s ago
4bc6e1df  58b3c381  serv        2        stop     complete  50s ago    26s ago
d2cdc694  3e0df965  serv        2        run      running   50s ago    50s ago
d3642f20  6975ee9a  serv        2        run      running   50s ago    50s ago
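
For context on why the exit code matters: the `nomad job plan` documentation lists exit code 0 for "no allocations created or destroyed", 1 for "allocations created or destroyed", and 255 for an error determining the plan. Below is a minimal sketch of a deployment gate that depends on those codes. The helper names `describe_plan_exit` and `plan_gate` are hypothetical and are not part of Nomad. With the behavior described in this issue, such a gate would skip the run even though a new allocation is needed.

```shell
#!/bin/sh
# Hypothetical helper: translate a `nomad job plan` exit code into a label.
# Documented meanings: 0 = no allocations created or destroyed,
# 1 = allocations created or destroyed, 255 = error determining the plan.
describe_plan_exit() {
  case "$1" in
    0)   echo "no-changes" ;;
    1)   echo "changes" ;;
    255) echo "error" ;;
    *)   echo "unknown" ;;
  esac
}

# Hypothetical gate: only submit the job when plan reports changes.
plan_gate() {
  jobfile="$1"
  nomad job plan "$jobfile" >/dev/null 2>&1
  status=$?
  case "$(describe_plan_exit "$status")" in
    changes)    nomad job run "$jobfile" ;;
    no-changes) echo "nothing to do for $jobfile" ;;
    *)          echo "plan failed with exit code $status" >&2; return 1 ;;
  esac
}
```

Calling `plan_gate serv.hcl` in the state shown above would print "nothing to do for serv.hcl", because plan exits 0 even though its own output lists "1 create".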

Expected Result

`nomad job plan` should return exit code 1 to indicate that `nomad job run` will create an allocation.

Actual Result

`nomad job plan` instead returns exit code 0, indicating that no changes are required to meet the jobspec, even though its own output reports "1 create".
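
Until the exit code is reliable, one possible stopgap is to inspect the plan's text output instead. This is only a sketch under a loud assumption: that the task group summary keeps the `Task Group: "name" (N action, ...)` form shown in the plan output above. That text is not a stable interface, and `plan_has_changes` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `nomad job plan` output on stdin and succeeds
# (returns 0) if any task group summary lists a nonzero count for an action
# other than "ignore". Assumes the summary format shown in this issue.
plan_has_changes() {
  grep 'Task Group:' \
    | sed -n 's/.*(\(.*\)).*/\1/p' \
    | tr ',' '\n' \
    | sed 's/^ *//' \
    | grep -v ' ignore$' \
    | grep -q '^[1-9]'
}
```

Usage would look like `nomad job plan serv.hcl | plan_has_changes && nomad job run serv.hcl`. For the output in this report, the summary `(1 create, 2 ignore)` is split into `1 create` and `2 ignore`, the `ignore` entry is dropped, and the remaining nonzero count makes the function succeed.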

Job file (if appropriate)

job "serv" {
  datacenters = ["default"]
  type = "system"

  group "serv" {

    task "serv" {
      driver = "docker"
      config {
        image = "ubuntu:focal"
        command = "/usr/bin/sleep"
        args = ["3600"]
      }
    }
  }
}

tgross commented 4 months ago

Hi @jamesooo! The fact that `job plan` doesn't return the expected exit code while `job run` works as expected points to a problem in how we're reporting the diff, rather than a scheduler bug (fortunately!). Going back through the changelog for the now-unsupported versions, I found a potential culprit in https://github.com/hashicorp/nomad/pull/14492, where the exit code was changed, but I would not expect the diff type to be None in the case you've described either.

I'm going to edit the title on this slightly and mark it for further investigation.