hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

batch jobs get rescheduled due to certain constraints #6471

Closed shantanugadgil closed 7 hours ago

shantanugadgil commented 5 years ago

Nomad version

Nomad v0.10.0-rc1 (c49bf41779d2ee462f46e2f58328e58c51970a83)

Operating system and Environment details

Ubuntu 16.04 + updates

Issue

When I have only a single constraint (the kernel linux thing) on a batch job, and I do nomad run ... multiple times, the job is not updated on subsequent runs because nothing has changed. :+1:

When I have two constraints (the kernel linux and node class) on a batch job, and I do nomad run ... multiple times, the job is updated on subsequent runs despite nothing changing. :-1:

Subsequent nomad run ... invocations on a job file that has not changed should be a no-op (no update, no new alloc, etc.).
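As a concrete illustration of that expectation, here is a rough wrapper sketch that relies on the documented nomad plan exit codes (0 = no allocation changes, 1 = allocations would be created/destroyed, 255 = error) and only submits when something would actually change:

#!/usr/bin/env bash
# Sketch only: submit the job only when "nomad plan" says allocations would change.
JOBFILE="uptime.nomad"   # placeholder path

nomad plan "$JOBFILE"
rc=$?
if [ "$rc" -eq 0 ]; then
  echo "job unchanged; nothing to do"
elif [ "$rc" -eq 1 ]; then
  nomad run "$JOBFILE"
else
  echo "nomad plan failed (exit $rc)" >&2
  exit "$rc"
fi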

Reproduction steps

  1. Run the job mentioned below with the node.class constraint commented out; the task executes on the first run only, and rerunning the job does not execute it again. (Do "nomad run uptime.nomad" multiple times.)

  2. Un-comment the node.class constraint and resubmit the job. If you do "nomad run uptime.nomad" repeatedly, you'll observe that the task runs every time, even when one and only one machine matches the constraint.

Job file (if appropriate)


job "j_uptime" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

#  constraint {
#    attribute = "${node.class}"
#    value = "my-builder-class"
#  }

  group "g_uptime" {
    count = 1
    task "t_uptime" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/uptime"
      }
    }
  }
}

Expected behavior

Nomad should understand that the job had completed successfully (exit 0) and not re-run the task.

cgbaker commented 5 years ago

thanks, @shantanugadgil ... looking at this.

cgbaker commented 5 years ago

@shantanugadgil , some clarification, please. When you say the following:

If you do "nomad run uptime.nomad", you'll observe that the task runs every time you mean that the first instance of nomad run uptime.nomad after adding the constraint will run the job? but additional invocations of nomad run do not run the job, correct?

what i'm seeing is that any change to the job file causes the task to run again, but i'm not getting repeated executions of the task without changing the file.

shantanugadgil commented 5 years ago

@cgbaker the steps I did are as follows:

Scenario 1: with only the single kernel.name constraint, repeated "nomad run" invocations of the unchanged job do not create new allocations.

Scenario 2: with both the kernel.name and node.class constraints, every "nomad run" invocation of the unchanged job creates a new allocation.

So, I think, adding that particular ${node.class} constraint causes the task to be allocated again.

cgbaker commented 5 years ago

ok, thank you. that's what i thought you were saying.

having said that, i wasn't able to reproduce scenario 2; my first run with the node.class constraint creates an allocation; follow-up invocations of run do not create an allocation:

$ nomad-0.10.0-rc1 run repro.nomad
==> Monitoring evaluation "048473ce"
    Evaluation triggered by job "j_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "048473ce" finished with status "complete"

$ nomad-0.10.0-rc1 run repro.nomad
==> Monitoring evaluation "19ee38ac"
    Evaluation triggered by job "j_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "19ee38ac" finished with status "complete"

$ nomad-0.10.0-rc1 run repro.nomad
==> Monitoring evaluation "7a77e1bc"
    Evaluation triggered by job "j_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7a77e1bc" finished with status "complete"

My job status is as expected; one version, one allocation:

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
b31f406d  8b21b2b9  g_uptime    0        run      complete  2m25s ago  2m24s ago

Can you paste your server logs?

shantanugadgil commented 5 years ago

Until I scrub the server logs ... here are the commands and more outputs ...

BTW, this is a cluster upgraded from beta1. Also, I have 1 server and 2 nodes; the client section is also enabled on the server.

*** ALL jobs are purged at this point ...

user@host:~/sources/nomad_jobs$ nomad run uptime.nomad
==> Monitoring evaluation "80a2b37b"
    Evaluation triggered by job "j_uptime"
    Allocation "21b38ff7" created: node "747a3313", group "g_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "80a2b37b" finished with status "complete"

*** wait for some time 

user@host:~/sources/nomad_jobs$ nomad run uptime.nomad
==> Monitoring evaluation "9f854593"
    Evaluation triggered by job "j_uptime"
    Allocation "0f5a41e0" created: node "747a3313", group "g_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "9f854593" finished with status "complete"

*** run again; wait included in the command ...

user@host:~/sources/nomad_jobs$ sleep 30; nomad run uptime.nomad
==> Monitoring evaluation "b842e095"
    Evaluation triggered by job "j_uptime"
    Allocation "2dedb994" created: node "747a3313", group "g_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b842e095" finished with status "complete"

*** logs of all three allocations

user@host:~/sources/nomad_jobs$ nomad logs 21b38ff7
 10:25:43 up 1 day,  4:25,  3 users,  load average: 0.12, 0.07, 0.02
user@host:~/sources/nomad_jobs$ nomad logs 0f5a41e0
 10:26:08 up 1 day,  4:26,  3 users,  load average: 0.08, 0.06, 0.01
user@host:~/sources/nomad_jobs$ nomad logs 2dedb994
 10:26:58 up 1 day,  4:27,  3 users,  load average: 0.03, 0.05, 0.01
shantanugadgil commented 5 years ago

@cgbaker

nomad_debug.log

cgbaker commented 5 years ago

can i see the output from nomad job status j_uptime ?

shantanugadgil commented 5 years ago

@cgbaker apologies, I forgot to send the outputs sooner. Here is a brand new batch:

user@host:~/sources/nomad_jobs$ nomad run uptime.nomad
==> Monitoring evaluation "ff84cfcd"
    Evaluation triggered by job "j_uptime"
    Allocation "162a8387" created: node "747a3313", group "g_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ff84cfcd" finished with status "complete"

user@host:~/sources/nomad_jobs$ nomad logs 162a8387
 10:53:48 up 2 days, 7 min,  3 users,  load average: 0.04, 0.06, 0.03

user@host:~/sources/nomad_jobs$ nomad run uptime.nomad
==> Monitoring evaluation "f26193dd"
    Evaluation triggered by job "j_uptime"
    Allocation "1d931cb1" created: node "747a3313", group "g_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f26193dd" finished with status "complete"

user@host:~/sources/nomad_jobs$ nomad logs 1d931cb1
 10:54:04 up 2 days, 8 min,  3 users,  load average: 0.03, 0.06, 0.03

user@host:~/sources/nomad_jobs$ nomad run uptime.nomad
==> Monitoring evaluation "5eb40148"
    Evaluation triggered by job "j_uptime"
    Allocation "5e409393" created: node "747a3313", group "g_uptime"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "5eb40148" finished with status "complete"

user@host:~/sources/nomad_jobs$ nomad job status j_uptime
ID            = j_uptime
Name          = j_uptime
Submit Date   = 2019-10-13T10:54:44-07:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
g_uptime    0       0         0        0       3         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
5e409393  747a3313  g_uptime    2        run      complete  3s ago   3s ago
1d931cb1  747a3313  g_uptime    1        run      complete  44s ago  43s ago
162a8387  747a3313  g_uptime    0        run      complete  59s ago  59s ago

The log file on the server for this run is attached to this post again.

nomad_debug_2.log

shantanugadgil commented 5 years ago

Hi @cgbaker, were you able to reproduce this on a clustered setup? I am in the process of creating a 3-4 node Vagrant-based CentOS 7 setup to see if I can recreate this.

shantanugadgil commented 5 years ago

@cgbaker I want to confirm that this is happening on my Vagrant (VirtualBox) setup, which uses CentOS 7 as the OS. Vagrantfile: https://github.com/shantanugadgil/hashistack/blob/master/scripts/Vagrantfile.compact

*** The Vagrantfile sets up the server configs fine; the client configs need to be set up after the first "vagrant up"

Do let me know if you are able to reproduce the issue.

N.B. I am no Ruby/Vagrant expert and the Vagrantfile has been put together from various sources. 😎

Cheers, Shantanu Gadgil

shantanugadgil commented 4 years ago

poke 👉

Just wondering if anyone else was able to repro the issue using the Vagrantfile.compact file (above), or on their own clustered setup using 0.10.1 GA?

cgbaker commented 4 years ago

hi @shantanugadgil , thank you for the vagrantfile. we haven't replicated it yet, but we've got this on our board for 0.10.x; we won't forget it!

stale[bot] commented 4 years ago

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

shantanugadgil commented 4 years ago

/comment

stale[bot] commented 4 years ago

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

shantanugadgil commented 4 years ago

/need-verification-with-latest-version

schmichael commented 3 years ago

Sorry for the silence @shantanugadgil !

I just wanted to clarify why this has not been a priority for us:

Right now any jobspec change causes a new version of the job to be created and run. The benefit here is that nomad run ... is generally easy to reason about: if something has changed, it is run. If nothing has changed, nothing is updated.

nomad plan ... will output (1 create, 1 ignore) in this case to indicate that the existing allocation is ignored because it's terminal, but a new allocation will be created.

What is your use case around altering a job, submitting it, and wanting it to not be run again?

We could differentiate between "meaningful" and "meaningless" jobspec changes, but any logic we introduce will necessarily be harder to understand than our current naive "if it changed, run it again."
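If it helps, you can check whether a given submission actually registered a new version of the job; a quick sketch (the version numbers below are only placeholders):

# show every recorded version of the job; -p prints the diff between successive versions
nomad job history -p j_uptime

# or dump two specific versions and diff them by hand
nomad job inspect -version=0 j_uptime > v0.json
nomad job inspect -version=1 j_uptime > v1.json
diff v0.json v1.json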

shantanugadgil commented 3 years ago

Hi @schmichael, I think my original bug description may have been ambiguous or confusing.

The issue I was observing was: When I have only a single constraint (the kernel linux thing), and I do nomad run ... multiple times, the job behaves in one way.

When I have two constraints (the kernel linux and node class), and I do nomad run ... multiple times, the job behaves in a different way.

When I add a second constraint, I fully expect the job to re-run the first time, which it does. But, it should not do anything for subsequent runs (with no further change in the job), correct?

shantanugadgil commented 3 years ago

Nomad version:

Nomad v1.0.1 (c9c68aa55a7275f22d2338f2df53e67ebfcb9238)

job "uptime1" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "mygroup" {
    count = 1

    task "mytask" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/uptime"
      }
    }
  }
}

$ nomad run uptime1.nomad 
==> Monitoring evaluation "963d04d6"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "963d04d6"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "963d04d6" finished with status "complete"

$ nomad run uptime1.nomad 
==> Monitoring evaluation "4d21fa75"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "4d21fa75"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4d21fa75" finished with status "complete"

$ nomad run uptime1.nomad 
==> Monitoring evaluation "886272ef"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "886272ef"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "886272ef" finished with status "complete"

$ nomad run uptime1.nomad 
==> Monitoring evaluation "0fdaf19a"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "0fdaf19a"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0fdaf19a" finished with status "complete"

$ nomad run uptime1.nomad 
==> Monitoring evaluation "6a17a108"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "6a17a108"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6a17a108" finished with status "complete"

Notice: no new allocation, which I think is appropriate.

job "uptime2" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  constraint {
    attribute = "${node.class}"
    value     = "git-builder"
  }

  group "mygroup" {
    count = 1

    task "mytask" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/uptime"
      }
    }
  }
}

$ nomad run uptime2.nomad 
==> Monitoring evaluation "28b3a697"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "28b3a697"
    Allocation "29e59367" created: node "6e2b10f5", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "28b3a697" finished with status "complete"

$ nomad run uptime2.nomad 
==> Monitoring evaluation "5c91d8bf"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "5c91d8bf"
    Allocation "477c5f9c" created: node "abbc3eb4", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "5c91d8bf" finished with status "complete"

$ nomad run uptime2.nomad 
==> Monitoring evaluation "fd6ca454"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "fd6ca454"
    Allocation "6a51dfb8" created: node "abbc3eb4", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "fd6ca454" finished with status "complete"

$ nomad run uptime2.nomad 
==> Monitoring evaluation "e545d228"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "e545d228"
    Allocation "75021be3" created: node "abbc3eb4", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "e545d228" finished with status "complete"

$ nomad run uptime2.nomad 
==> Monitoring evaluation "19dc083b"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "19dc083b"
    Allocation "b4421d2f" created: node "abbc3eb4", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "19dc083b" finished with status "complete"

Notice: a new allocation is triggered every time, which I think is incorrect.

BTW: for both of the nomad run ... tests, all I am doing is waiting a few seconds, then pressing UP, ENTER.
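(For what it's worth, the manual test is roughly equivalent to this loop, in case scripting the repro helps; the sleep value is arbitrary:)

# scripted version of the manual repro: resubmit the unchanged job a few times,
# then check how many allocations exist (one allocation is the expected result)
for i in 1 2 3 4 5; do
  nomad run uptime2.nomad
  sleep 10
done
nomad job status uptime2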

schmichael commented 3 years ago

Thanks @shantanugadgil - that makes sense.

When I have only a single constraint (the kernel linux thing), and I do nomad run ... multiple times, the job behaves in one way.

When I have two constraints (the kernel linux and node class), and I do nomad run ... multiple times, the job behaves in a different way.

Sounds like a bug to me as it breaks the "if nothing changes Nomad won't reschedule" behavior I mentioned above.

I'm going to lightly edit the original issue to make the issue+repro+solution easier to find.

tgross commented 3 years ago

Hi, so unfortunately I was unable to reproduce this behavior on 1.0.3-dev HEAD or the released 1.0.1, on either a single-node Vagrant cluster or an E2E cluster deployed to AWS. (I wasn't able to get your Vagrant stack up, @shantanugadgil... there's something I'm missing about the context you're running it in I think but it was easier for me to just launch a real cluster.)

My configuration diff from the E2E cluster:

$ git diff
diff --git a/e2e/terraform/.terraform.lock.hcl b/e2e/terraform/.terraform.lock.hcl
old mode 100755
new mode 100644
diff --git a/e2e/terraform/config/dev-cluster/nomad/client-linux/client.hcl b/e2e/terraform/config/dev-cluster/nomad/client-linux/client.hcl
index 439aa72f3..455c30bb7 100644
--- a/e2e/terraform/config/dev-cluster/nomad/client-linux/client.hcl
+++ b/e2e/terraform/config/dev-cluster/nomad/client-linux/client.hcl
@@ -3,6 +3,8 @@ plugin_dir = "/opt/nomad/plugins"
 client {
   enabled = true

+  node_class = "test-class"
+
   options {
     # Allow jobs to run as root
     "user.denylist" = ""

And the resulting cluster:

$ nomad server members
Name                    Address       Port  Status  Leader  Protocol  Build  Datacenter  Region
ip-172-31-0-6.global    172.31.0.6    4648  alive   false   2         1.0.1  dc1         global
ip-172-31-3-99.global   172.31.3.99   4648  alive   true    2         1.0.1  dc1         global
ip-172-31-7-238.global  172.31.7.238  4648  alive   false   2         1.0.1  dc1         global

$ nomad node status
ID        DC   Name              Class       Drain  Eligibility  Status
09075547  dc1  ip-172-31-13-234  test-class  false  eligible     ready
99912af3  dc1  ip-172-31-14-81   test-class  false  eligible     ready

The first job which we know should behave correctly:

job "uptime1" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "mygroup" {
    task "mytask" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/uptime"
      }
    }
  }
}

Run the job and wait for it to complete:

$ nomad job run ./uptime1.nomad
==> Monitoring evaluation "e9953a07"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "e9953a07"
    Allocation "f806eabf" created: node "7fdc43e8", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "e9953a07" finished with status "complete"

$ nomad alloc status f806
...
Recent Events:
Time                  Type        Description
2021-01-27T18:12:48Z  Terminated  Exit Code: 0
2021-01-27T18:12:48Z  Started     Task started by client
2021-01-27T18:12:48Z  Task Setup  Building Task Directory
2021-01-27T18:12:48Z  Received    Task received by client

Then run the job again, unchanged, and observe we get no new allocation, just as expected.

$ nomad job run ./uptime1.nomad
==> Monitoring evaluation "9823beb6"
    Evaluation triggered by job "uptime1"
==> Monitoring evaluation "9823beb6"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "9823beb6" finished with status "complete"

$ nomad job status uptime1
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
f806eabf  7fdc43e8  mygroup     0        run      complete  23s ago  22s ago

So far so good. Now the second job:

job "uptime2" {
  datacenters = ["dc1"]

  type = "batch"

  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  constraint {
    attribute = "${node.class}"
    value     = "test-class"
  }

  group "mygroup" {
    task "mytask" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/uptime"
      }
    }
  }
}

Run it and wait for it to complete:

$ nomad job run ./uptime2.nomad
==> Monitoring evaluation "660f615e"
    Evaluation triggered by job "uptime2"
    Allocation "25fbbd2b" created: node "09075547", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "660f615e" finished with status "complete"

$ nomad alloc status 25f
...
Recent Events:
Time                       Type        Description
2021-01-27T13:35:19-05:00  Terminated  Exit Code: 0
2021-01-27T13:35:19-05:00  Started     Task started by client
2021-01-27T13:35:19-05:00  Task Setup  Building Task Directory
2021-01-27T13:35:19-05:00  Received    Task received by client

Re-run the job unchanged, and note no new allocations are run:

$ nomad job run ./uptime2.nomad
==> Monitoring evaluation "a2cd7234"
    Evaluation triggered by job "uptime2"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "a2cd7234" finished with status "complete"

$ nomad job status uptime2
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
25fbbd2b  09075547  mygroup     0        run      complete  17s ago  16s ago

I've tried a couple different combinations of valid constraints for the cluster and was unable to reproduce. @shantanugadgil I'm sorry this issue got left open so long but I'm not really sure what else to explore here without more info. The log files that were previously provided are all client logs and not server logs, so there might be more info available if we could figure out what the scheduler thinks is going on.
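For server-side logs, something along these lines should work (assuming the servers run under systemd with a unit named nomad; adjust the unit name, paths, and time window for your setup):

# pull recent agent logs from a server node (unit name is an assumption)
journalctl -u nomad --since "15 minutes ago" > nomad_server.log

# or stream scheduler-level detail live; -server-id targets a specific server
nomad monitor -log-level TRACE -server-id <server-name>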

shantanugadgil commented 3 years ago

Some differences (none of these should really matter, but mentioning them anyway):

Let me try and capture some server logs and see if I could attach them here.

shantanugadgil commented 3 years ago

Forgot to mention, I have since updated to v1.0.2

Q: what would be a correct way to capture the logs on the server? (I did the following on the server node: ran the following command in one shell and ran "nomad run" on the job 3 times in another shell.)

$ nomad monitor -log-level TRACE
2021-01-27T11:37:26.799-0800 [TRACE] client: next heartbeat: period=17.713792849s

2021-01-27T11:37:27.708-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=122.016µs
2021-01-27T11:37:27.710-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=1.637791ms

2021-01-27T11:37:31.395-0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:53004
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-01-27T11:37:32.488-0800 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-01-27T11:37:32.501-0800 [DEBUG] worker: dequeued evaluation: eval_id=8d8efd36-293f-949d-0f31-a699ff14e961
2021-01-27T11:37:32.501-0800 [DEBUG] http: request complete: method=PUT path=/v1/jobs duration=13.229461ms
2021-01-27T11:37:32.501-0800 [TRACE] worker.batch_sched.binpack: NewBinPackIterator created: eval_id=8d8efd36-293f-949d-0f31-a699ff14e961 job_id=uptime2 namespace=default algorithm=binpack
2021-01-27T11:37:32.501-0800 [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=8d8efd36-293f-949d-0f31-a699ff14e961 job_id=uptime2 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 4) (canary 0)"
2021-01-27T11:37:32.502-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8d8efd36-293f-949d-0f31-a699ff14e961 duration=107.568µs
2021-01-27T11:37:32.502-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8d8efd36-293f-949d-0f31-a699ff14e961/allocations duration=81.43µs
2021-01-27T11:37:32.521-0800 [DEBUG] worker: submitted plan for evaluation: eval_id=8d8efd36-293f-949d-0f31-a699ff14e961
2021-01-27T11:37:32.521-0800 [DEBUG] worker.batch_sched: setting eval status: eval_id=8d8efd36-293f-949d-0f31-a699ff14e961 job_id=uptime2 namespace=default status=complete
2021-01-27T11:37:32.539-0800 [DEBUG] worker: updated evaluation: eval="<Eval "8d8efd36-293f-949d-0f31-a699ff14e961" JobID: "uptime2" Namespace: "default">"
2021-01-27T11:37:32.539-0800 [DEBUG] worker: ack evaluation: eval_id=8d8efd36-293f-949d-0f31-a699ff14e961
2021-01-27T11:37:33.503-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8d8efd36-293f-949d-0f31-a699ff14e961 duration=140.739µs
2021-01-27T11:37:33.503-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/8d8efd36-293f-949d-0f31-a699ff14e961/allocations duration=141.833µs
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-01-27T11:37:34.739-0800 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-01-27T11:37:34.752-0800 [DEBUG] http: request complete: method=PUT path=/v1/jobs duration=13.581588ms
2021-01-27T11:37:34.753-0800 [DEBUG] worker: dequeued evaluation: eval_id=962b31d3-f234-6572-03ce-5fde330a62b7
2021-01-27T11:37:34.753-0800 [TRACE] worker.batch_sched.binpack: NewBinPackIterator created: eval_id=962b31d3-f234-6572-03ce-5fde330a62b7 job_id=uptime2 namespace=default algorithm=binpack
2021-01-27T11:37:34.753-0800 [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=962b31d3-f234-6572-03ce-5fde330a62b7 job_id=uptime2 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 5) (canary 0)"
2021-01-27T11:37:34.754-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/962b31d3-f234-6572-03ce-5fde330a62b7 duration=232.393µs
2021-01-27T11:37:34.754-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/962b31d3-f234-6572-03ce-5fde330a62b7/allocations duration=80.57µs
2021-01-27T11:37:34.769-0800 [DEBUG] worker: submitted plan for evaluation: eval_id=962b31d3-f234-6572-03ce-5fde330a62b7
2021-01-27T11:37:34.769-0800 [DEBUG] worker.batch_sched: setting eval status: eval_id=962b31d3-f234-6572-03ce-5fde330a62b7 job_id=uptime2 namespace=default status=complete
2021-01-27T11:37:34.793-0800 [DEBUG] worker: updated evaluation: eval="<Eval "962b31d3-f234-6572-03ce-5fde330a62b7" JobID: "uptime2" Namespace: "default">"
2021-01-27T11:37:34.793-0800 [DEBUG] worker: ack evaluation: eval_id=962b31d3-f234-6572-03ce-5fde330a62b7
2021-01-27T11:37:35.755-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/962b31d3-f234-6572-03ce-5fde330a62b7 duration=139.371µs
2021-01-27T11:37:35.756-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/962b31d3-f234-6572-03ce-5fde330a62b7/allocations duration=183.652µs
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-01-27T11:37:36.803-0800 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-01-27T11:37:36.815-0800 [DEBUG] http: request complete: method=PUT path=/v1/jobs duration=12.314817ms
2021-01-27T11:37:36.815-0800 [DEBUG] worker: dequeued evaluation: eval_id=55b43296-7bbf-d920-3795-e852ef63c96c
2021-01-27T11:37:36.815-0800 [TRACE] worker.batch_sched.binpack: NewBinPackIterator created: eval_id=55b43296-7bbf-d920-3795-e852ef63c96c job_id=uptime2 namespace=default algorithm=binpack
2021-01-27T11:37:36.815-0800 [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=55b43296-7bbf-d920-3795-e852ef63c96c job_id=uptime2 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 6) (canary 0)"
2021-01-27T11:37:36.816-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/55b43296-7bbf-d920-3795-e852ef63c96c duration=82.055µs
2021-01-27T11:37:36.816-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/55b43296-7bbf-d920-3795-e852ef63c96c/allocations duration=57.883µs
2021-01-27T11:37:36.833-0800 [DEBUG] worker: submitted plan for evaluation: eval_id=55b43296-7bbf-d920-3795-e852ef63c96c
2021-01-27T11:37:36.833-0800 [DEBUG] worker.batch_sched: setting eval status: eval_id=55b43296-7bbf-d920-3795-e852ef63c96c job_id=uptime2 namespace=default status=complete
2021-01-27T11:37:36.845-0800 [DEBUG] worker: updated evaluation: eval="<Eval "55b43296-7bbf-d920-3795-e852ef63c96c" JobID: "uptime2" Namespace: "default">"
2021-01-27T11:37:36.845-0800 [DEBUG] worker: ack evaluation: eval_id=55b43296-7bbf-d920-3795-e852ef63c96c
2021-01-27T11:37:37.709-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=162.612µs
2021-01-27T11:37:37.713-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=109.054µs
2021-01-27T11:37:37.817-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/55b43296-7bbf-d920-3795-e852ef63c96c duration=169.489µs
2021-01-27T11:37:37.818-0800 [DEBUG] http: request complete: method=GET path=/v1/evaluation/55b43296-7bbf-d920-3795-e852ef63c96c/allocations duration=169.296µs
2021-01-27T11:37:41.396-0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:53042
tgross commented 3 years ago

That works. Ok, so check out these lines from the log:

Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 4) (canary 0)"
Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 5) (canary 0)"
Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 6) (canary 0)"

Notice how the ignore count keeps increasing but we're getting a place anyway? That's definitely weird. This is what it looks like on my reproduction:

Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)"
Desired Changes for "mygroup": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 1) (canary 0)"

Can you provide the nomad alloc status :alloc_id output for the two most recent allocations of the job? I suspect that'll give us more information on what it thinks might be changing here.

shantanugadgil commented 3 years ago

server_trace_1.log alloc_status_1.log

tgross commented 3 years ago

Ok, so the allocation status just shows what appears to be an identical allocation, and the server logs are showing nothing unusual.

I looked at your server.hcl and client.hcl files and didn't see anything weird there.

The job files you shared are the exact jobs you're running and not a redacted version? I'm worried there's a block that's being treated differently between them that's not in the job I'm running.

Maybe the verbose plan will give us some more clues? Can you run nomad plan -verbose uptime2.nomad? It should show something like:

$ nomad plan -verbose uptime2.nomad
Job: "uptime2"
Task Group: "mygroup" (1 ignore)
  Task: "mytask"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 10
To submit the job with version verification run:

nomad job run -check-index 10 uptime2.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

If that doesn't tell us anything, then, as a somewhat extreme approach, you could patch the job-diffing function on the server to dump the diff object to the logs; that might give us more info. But really that should all be in the plan diff.

diff --git a/nomad/structs/diff.go b/nomad/structs/diff.go
index 6a08e7c4e..5f56c52a8 100644
--- a/nomad/structs/diff.go
+++ b/nomad/structs/diff.go
@@ -6,6 +6,7 @@ import (
        "sort"
        "strings"

+       "github.com/davecgh/go-spew/spew"
        "github.com/hashicorp/nomad/helper/flatmap"
        "github.com/mitchellh/hashstructure"
 )
@@ -166,6 +167,7 @@ func (j *Job) Diff(other *Job, contextual bool) (*JobDiff, error) {
                }
        }

+       spew.Dump(diff)
        return diff, nil
 }
shantanugadgil commented 3 years ago

@tgross I forgot to answer the questions you asked previously:

the uptime2 job is the same (minor differences: an explicit count of 1, and the node_class value is different)

(image attachment)

Also, when I run the following ...

$ nomad plan -verbose uptime2.nomad
Job: "uptime2"
Task Group: "mygroup" (1 create, 9 ignore)
  Task: "mytask"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 460297
To submit the job with version verification run:

nomad job run -check-index 460297 uptime2.nomad

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

... I see the monitor logs triggering the following ...

2021-01-28T11:23:03.981-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=160.25µs
2021-01-28T11:23:04.117-0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:44114
2021-01-28T11:23:08.213-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=170.534µs
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-01-28T11:23:09.235-0800 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-01-28T11:23:09.236-0800 [TRACE] nomad.job.batch_sched.binpack: NewBinPackIterator created: eval_id=5cd9dfa7-b3d7-cda5-bad5-694c6256b60b job_id=uptime2 namespace=default algorithm=binpack
2021-01-28T11:23:09.236-0800 [DEBUG] nomad.job.batch_sched: reconciled current state with desired state: eval_id=5cd9dfa7-b3d7-cda5-bad5-694c6256b60b job_id=uptime2 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "mygroup": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 9) (canary 0)"
2021-01-28T11:23:09.236-0800 [DEBUG] nomad.job.batch_sched: setting eval status: eval_id=5cd9dfa7-b3d7-cda5-bad5-694c6256b60b job_id=uptime2 namespace=default status=complete
2021-01-28T11:23:09.236-0800 [DEBUG] http: request complete: method=PUT path=/v1/job/uptime2/plan duration=875.326µs
2021-01-28T11:23:13.982-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=client duration=150.419µs
2021-01-28T11:23:14.118-0800 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:44124
2021-01-28T11:23:18.214-0800 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=108.325µs
2021-01-28T11:23:21.682-0800 [TRACE] client: next heartbeat: period=13.191028761s
shantanugadgil commented 3 years ago

@tgross please ignore my previous conclusion about starting the monitor on the agents... that doesn't seem to be related to anything. 🐠

... but the following does seem related: if I restart the Nomad server after the batch job has reached "complete", then the server remembers the job's state properly.

[root][nomad0][~/sources/jobs]
$ nomad stop -purge uptime2 
==> Monitoring evaluation "57e28c9a"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "57e28c9a"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "57e28c9a" finished with status "complete"

[root][nomad0][~/sources/jobs]
$ nomad run uptime2.nomad 
==> Monitoring evaluation "05fb3a01"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "05fb3a01"
    Allocation "94d159af" created: node "6e2b10f5", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "05fb3a01" finished with status "complete"

[root][nomad0][~/sources/jobs]
$ nomad run uptime2.nomad 
==> Monitoring evaluation "43700b8f"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "43700b8f"
    Allocation "d29c7b44" created: node "abbc3eb4", group "mygroup"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "43700b8f" finished with status "complete"

[root][nomad0][~/sources/jobs]
$ systemctl restart nomad

[root][nomad0][~/sources/jobs]
$ nomad run uptime2.nomad 
==> Monitoring evaluation "f646920e"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "f646920e"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "f646920e" finished with status "complete"

[root][nomad0][~/sources/jobs]
$ nomad run uptime2.nomad 
==> Monitoring evaluation "c34b5dce"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "c34b5dce"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "c34b5dce" finished with status "complete"

[root][nomad0][~/sources/jobs]
$ nomad run uptime2.nomad 
==> Monitoring evaluation "966926ba"
    Evaluation triggered by job "uptime2"
==> Monitoring evaluation "966926ba"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "966926ba" finished with status "complete"
tgross commented 3 years ago

Ok, thanks @shantanugadgil. I'll keep digging and see if I can come up with a possible way this could be triggered.

tgross commented 3 years ago

Going to make sure this gets surfaced for roadmapping but so far there's no reproduction.