hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad server reports 0 allocations placed, but is wrong #9054

Open benbuzbee opened 4 years ago

benbuzbee commented 4 years ago

Nomad version

0.12.1

Operating system and Environment details

Linux kernel 5.4.70; Debian

Issue

We are using the API to schedule a batch job. This batch job is new: Nomad has never seen it before in its current state. I know this because it has randomly generated data inside of it, though there may have been a previous job with the same name.

The API returns the following Evaluation structure indicating "no allocations were placed"

2020-10-08T19:44:55.335Z:  {ID:53563aed-6b2d-b607-4d17-27788affbb93 Priority:50 Type:batch TriggeredBy:job-register Namespace:default JobID:browser-131m9-10100-10199 JobModifyIndex:892821 NodeID: NodeModifyIndex:0 DeploymentID: Status:complete StatusDescription: Wait:0s WaitUntil:0001-01-01 00:00:00 +0000 UTC NextEval: PreviousEval: BlockedEval: FailedTGAllocs:map[] ClassEligibility:map[] EscapedComputedClass:false QuotaLimitReached: AnnotatePlan:false QueuedAllocations:map[browsers:0] SnapshotIndex:892828 CreateIndex:892821 ModifyIndex:892830 CreateTime:1602186294990126740 ModifyTime:1602186295310028622}

The Nomad server DEBUG logs agree, saying it ignored 2 allocs and placed 0:

2020-10-08T19:44:55.309Z [DEBUG] worker: dequeued evaluation: eval_id=53563aed-6b2d-b607-4d17-27788affbb93
2020-10-08T19:44:55.309Z [DEBUG] worker.batch_sched: setting eval status: eval_id=53563aed-6b2d-b607-4d17-27788affbb93 job_id=browser-131m9-10100-10199 namespace=default status=complete
2020-10-08T19:44:55.309Z [DEBUG] worker.batch_sched: reconciled current state with desired state: eval_id=53563aed-6b2d-b607-4d17-27788affbb93 job_id=browser-131m9-10100-10199 namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 0)
worker: dequeued evaluation: eval_id=53563aed-6b2d-b607-4d17-27788affbb93
worker.batch_sched: reconciled current state with desired state: eval_id=53563aed-6b2d-b607-4d17-27788affbb93 job_id=browser-131m9-10100-10199 namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 0) Desired Changes for "browsers": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 2) (canary 0)"
worker.batch_sched: setting eval status: eval_id=53563aed-6b2d-b607-4d17-27788affbb93 job_id=browser-131m9-10100-10199 namespace=default status=complete

(These logs are confusing: some have timestamps, some don't, and there seem to be two very similar ones for a few lines.)

Question 1: Why don't the logs or the API tell me WHY it wasn't placed? The current information is not very useful.

HOWEVER, I could confirm in the Nomad client logs that it WAS placed, through some trickery. There is no proof, so you will have to trust me that by piecing things together I can tell this run was allocation e8699e91-91e6-e610-6d8b-0f18ba4996dd.

Question 2: Is there some way to correlate the eval with an allocation in the server or client logs? I cannot find any.

client.alloc_runner.task_runner.task_hook.logmon: starting plugin: alloc_id=e8699e91-91e6-e610-6d8b-0f18ba4996dd task=browser path=/usr/bin/nomad args=[/usr/bin/nomad, logmon]
2020-10-08T19:44:55.313Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: waiting for RPC address: alloc_id=e8699e91-91e6-e610-6d8b-0f18ba4996dd task=browser path=/usr/bin/nomad

It actually runs fine for a few minutes before it stops for some reason, but I don't care about that right now.

Main bug: What happened? Why does the server think no alloc was placed, when in fact I can see that one was?

Possibly related, but maybe not

The node exited drain mode less than a second before this happened

2020-10-08T19:44:54.909Z [DEBUG] http: request complete: method=PUT path=/v1/node/aec2d08a-c72c-1166-eaca-1fbb79a373a2/drain duration=36.181106ms
2020-10-08T19:44:55.030Z [DEBUG] http: request complete: method=PUT path=/v1/node/aec2d08a-c72c-1166-eaca-1fbb79a373a2/eligibility duration=1.323457ms

tgross commented 4 years ago

Hi @benbuzbee!

This batch job is new - nomad has never seen it before in its current state - i know this because it has randomly generated data inside of it - though there may have been a previous job with the same name.

The last bit about the previous job name may be the clue here. It might be helpful to know what you mean by "randomly generated data inside of it" to figure out why Nomad isn't seeing this as an update to the previous version of the job. The List Job Versions API may have more detail for you.

Question 1: Why don't the logs or the API tell me WHY it wasn't placed? The current information is not very useful.

It's not being scheduled because Nomad doesn't see what's been registered as an update. Trying to say why it doesn't see an update would be essentially listing every field in the job and noting it hasn't changed. (You can imagine the log entries: "job name hasn't changed, template hasn't changed, taskgroup[0].task[0].resource.cpu hasn't changed, artifact hasn't changed", etc, etc.)

(These logs are confusing: some have timestamps, some don't, and there seem to be two very similar ones for a few lines.)

Are these logs getting post-processed by a remote log collector? (Like an ELK stack or the like.) If not, can you share your exact journalctl command line for fetching the logs? That's very strange, especially in stripping off Nomad's own timestamps and info values.

HOWEVER, I could confirm in the Nomad client logs that it WAS placed, through some trickery. There is no proof, so you will have to trust me that by piecing things together I can tell this run was allocation e8699e91-91e6-e610-6d8b-0f18ba4996dd. Question 2: Is there some way to correlate the eval with an allocation in the server or client logs? I cannot find any.

I believe you want List Allocations For Evaluation, which maps from eval to allocation(s), or Read Allocation, which includes the EvalID field. Can you confirm that the alloc that was placed has that EvalID?
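For reference, a rough sketch of that lookup with the Go api client (github.com/hashicorp/nomad/api); the eval ID here is just the one from your report, and error handling is trimmed to the basics:

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // List Allocations For Evaluation: eval -> allocation(s).
    evalID := "53563aed-6b2d-b607-4d17-27788affbb93"
    stubs, _, err := client.Evaluations().Allocations(evalID, nil)
    if err != nil {
        log.Fatal(err)
    }
    for _, stub := range stubs {
        // Read Allocation: the full allocation carries its EvalID,
        // so the mapping can be checked in the other direction too.
        alloc, _, err := client.Allocations().Info(stub.ID, nil)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("alloc %s was placed by eval %s\n", alloc.ID, alloc.EvalID)
    }
}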

Main bug: What happened? Why does the server think no alloc was placed, when in fact I can see that one was?

You may be able to determine more information by checking List Job Versions to get the definition of the previous version of the job (as understood by the scheduler) vs the one you have.
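If it helps, here is a minimal sketch of pulling those versions with the Go api client; the diffs=true argument asks the API to compute field-level diffs between stored versions, and the job ID is the one from your logs:

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    versions, diffs, _, err := client.Jobs().Versions("browser-131m9-10100-10199", true, nil)
    if err != nil {
        log.Fatal(err)
    }
    for _, v := range versions {
        if v.Version != nil && v.JobModifyIndex != nil {
            fmt.Printf("stored version %d (modify index %d)\n", *v.Version, *v.JobModifyIndex)
        }
    }
    for _, d := range diffs {
        // Each diff describes what changed relative to the previous version.
        fmt.Printf("diff: type=%s, changed fields=%d, changed objects=%d, changed groups=%d\n",
            d.Type, len(d.Fields), len(d.Objects), len(d.TaskGroups))
    }
}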

benbuzbee commented 4 years ago

Thanks Tim for the very detailed response.

Random data: We generate a new cert with a random GUID in the SAN every time we schedule a job

Would it make sense for the API to return something that just indicates "Status: Evaluation resulted in no updates because the job has not changed"? Maybe you are saying that is implied because there are no FailedTGAllocs in the response?

The logs are processed by Logstash :( We can ignore that problem for now since it's a bit of a tangent; if it is interesting to you I can try to get better data for this specifically.

As far as this:

I believe you want List Allocations For Evaluation, which maps from eval to allocation(s), or Read Allocation, which includes the EvalID field. Can you confirm that the alloc that was placed has that EvalID?

That is hard for this case since the event is in the past and was only caught in the logs; Nomad no longer seems to remember the evaluation. I was hoping it would make sense to add to the Nomad client logs the evaluation that created an allocation, for easier client <-> server correlation.

On the "drain" theory I mentioned I will try to repro it and get better data, but for this occurance I can't easily

Finally, the way we use the API is that we purge the batch job before we reschedule it. I know we're doing things a little oddly. This is because we're using the API for lots of dynamic scheduling - I could give you more details on a call sometime if you are interested - but it means it's hard to look at past versions.

Ok summary for now:

tgross commented 4 years ago

Would it make sense for the API to return something that just indicates "Status: Evaluation resulted in no updates because the job has not changed"? Maybe you are saying that is implied because there are no FailedTGAllocs in the response?

Right, so you'll poll on Read Evaluation until the status is complete, and at that point FailedTGAllocs tells you if something went wrong. If there is nothing there, then there were no changes.
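As a rough sketch (not the only way to do it), that polling loop with the Go api client could look something like this; the terminal statuses checked and the 500ms interval are just illustrative choices:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/hashicorp/nomad/api"
)

// waitForEval polls Read Evaluation until the eval reaches a terminal status,
// then reports any per-task-group placement failures from FailedTGAllocs.
func waitForEval(client *api.Client, evalID string) error {
    for {
        eval, _, err := client.Evaluations().Info(evalID, nil)
        if err != nil {
            return err
        }
        switch eval.Status {
        case "complete", "failed", "cancelled":
            for tg, metric := range eval.FailedTGAllocs {
                fmt.Printf("group %q not placed: nodes evaluated=%d, exhausted=%d\n",
                    tg, metric.NodesEvaluated, metric.NodesExhausted)
            }
            if len(eval.FailedTGAllocs) == 0 {
                // No failures recorded: allocs were placed, or there was nothing to change.
                fmt.Println("eval", eval.ID, "finished:", eval.Status)
            }
            return nil
        }
        time.Sleep(500 * time.Millisecond)
    }
}

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }
    // The eval ID would come from the job register response in practice.
    if err := waitForEval(client, "53563aed-6b2d-b607-4d17-27788affbb93"); err != nil {
        log.Fatal(err)
    }
}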

If it does mean "no changes detected" maybe you can point me to some code where I can add some logs to our own build to maybe gather more information like (1) where does it "create" an allocation and send it to a client and (2) where does it decide to reply to the client with "no changes detected"

Sure! The starting point for you is the computeJobAllocs method in the scheduler. That's ultimately what turns evals into allocations.

For detecting whether there are any changes that need to be made, the eval is passed into the reconciler, which decides whether new allocations or deployments need to be created. The reporting line you see in your logs is constructed in (*reconcileResults) GoString, which gets called when we log the results at DEBUG level in the generic_sched for non-system jobs. We log everything useful that's in that reconcileResults struct, so to get more granular you'll probably need to add log lines in the Compute() function's logic.

Once we get the results back from Compute, the scheduler will call computePlacements, and that's where brand new allocations are created.

(Just a heads up that this is one of the more computationally expensive areas of Nomad, so we try to keep IO to a minimum. If you add logging that isn't just a temporary hack, do some benchmarking to make sure it's not going to drag scheduling performance down.)

If you want to dig into the scheduler workflow further, I'd recommend my colleague @schmichael's excellent deep-dive on Nomad Under the Hood.

Does it make sense to add to the debug client logs the evalid that created an allocation for easier correlation

It'd be fine to do, but it's not obvious to me where it would be best to hook that in. Immediately after its creation? Or do we wait until after the allocation has been placed? But literally anywhere an allocation is getting logged, you could log its EvalID field safely, because there will always be an eval ID there.
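As a toy illustration only (this is not the client's actual code path; the logger name is copied from the log lines above and the IDs are the ones from this issue), such a line with hclog would look like:

package main

import hclog "github.com/hashicorp/go-hclog"

func main() {
    logger := hclog.New(&hclog.LoggerOptions{
        Name:  "client.alloc_runner",
        Level: hclog.Debug,
    })
    // In the client these values would come from the allocation being handled.
    allocID := "e8699e91-91e6-e610-6d8b-0f18ba4996dd"
    evalID := "53563aed-6b2d-b607-4d17-27788affbb93"
    logger.Debug("running allocation", "alloc_id", allocID, "eval_id", evalID)
}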

You may also be interested in keeping an eye on the Event Stream feature my colleague @drewbailey is working on in https://github.com/hashicorp/nomad/pull/9013.

benbuzbee commented 4 years ago

Edit: Hi @tgross, I spammed you a bit here as my thinking changed. If you want to ignore all the spam, I think I finally found a real repro of a real bug in my last comment here: https://github.com/hashicorp/nomad/issues/9054#issuecomment-708087851

Update so far:

I haven't been able to repro this locally, but I can repro it on our servers by toggling drain:

sudo nomad node drain -enable -yes daea6d66-b3bf-dff9-ada0-3be7145040e9
sudo nomad node drain -disable -yes daea6d66-b3bf-dff9-ada0-3be7145040e9

Our code sees the node leaving the drain state, tries to schedule a job on it immediately, and is told that there are no allocations for the evaluation that resulted (details included for timestamps):

2020-10-13T21:16:39.335Z    Oct 13, 2020 @ 21:16:39.335659000   Scheduling browser_host job
2020-10-13T21:16:39.554Z | Oct 13, 2020 @ 21:16:39.554301000 | no allocations for evaluation: &{ID:89cf608e-c849-06ef-257c-1f42b0ef5001 Priority:50 Type:batch TriggeredBy:job-register Namespace:default JobID:browser-131m7-10100-10199 JobModifyIndex:910991 NodeID: NodeModifyIndex:0 DeploymentID: Status:complete StatusDescription: Wait:0s WaitUntil:0001-01-01 00:00:00 +0000 UTC NextEval: PreviousEval: BlockedEval: FailedTGAllocs:map[] ClassEligibility:map[] EscapedComputedClass:false QuotaLimitReached: AnnotatePlan:false QueuedAllocations:map[browsers:0] SnapshotIndex:910994 CreateIndex:910991 ModifyIndex:910997 CreateTime:1602623799344793081 ModifyTime:1602623799533941334}

and so we fail. However, in spite of this, alloc 71e27cff-9c80-6353-d23e-d1516059a0f9 is placed:

sudo nomad status 71e27cff
ID                  = 71e27cff-9c80-6353-d23e-d1516059a0f9
Eval ID             = d4a85794
Name                = browser-131m7-10100-10199.browsers[0]
Node ID             = daea6d66
Node Name           = 131m7

Oh but what is that eval ID? d4a85794?

sudo nomad status -verbose d4a85794
ID                 = d4a85794-4a61-5932-9500-c361e705acdf
Create Time        = 2020-10-13T21:16:39Z
Modify Time        = 2020-10-13T21:16:39Z
Status             = complete
Status Description = complete
Type               = batch
TriggeredBy        = node-update
Node ID            = daea6d66-b3bf-dff9-ada0-3be7145040e9
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

Triggered by node update? What about the one that failed?

bbuzbee@131m2:~$ sudo nomad status -verbose 89cf608e
ID                 = 89cf608e-c849-06ef-257c-1f42b0ef5001
Create Time        = 2020-10-13T21:16:39Z
Modify Time        = 2020-10-13T21:16:39Z
Status             = complete
Status Description = complete
Type               = batch
TriggeredBy        = job-register
Job ID             = browser-131m7-10100-10199
Priority           = 50
Placement Failures = false
Previous Eval      = <none>
Next Eval          = <none>
Blocked Eval       = <none>

Triggered by job-register. The timestamps are the same, so let's go to the server logs for better details:

2020-10-13T21:16:39.501Z     -  worker.batch_sched: reconciled current state with desired state: eval_id=d4a85794-4a61-5932-9500-c361e705acdf job_id=browser-131m7-10100-10199 namespace=default results="Total changes: (place 1) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "browsers": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 4) (canary 0)"
2020-10-13T21:16:39.535Z     -  worker.batch_sched: reconciled current state with desired state: eval_id=89cf608e-c849-06ef-257c-1f42b0ef5001 job_id=browser-131m7-10100-10199 namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "browsers": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 5) (canary 0)"

I am not sure what to make of this so far, but it seems to me that the server creates 2 evals: one from our API call (89cf), which it decides doesn't get any allocation, and another magic one (d4a85), which it creates itself and which is the one that gets the allocations.

benbuzbee commented 4 years ago

Is this the race? I'm just guessing, no science here: two things create evaluations, the node exiting the drain state (which creates a node-update eval) and the API call that registered the job in the first place (which creates a job-register eval).

The eval broker sorts them according to priority

func (p PendingEvaluations) Less(i, j int) bool {
    // Priority only matters when the evals are for different jobs AND
    // have different priorities...
    if p[i].JobID != p[j].JobID && p[i].Priority != p[j].Priority {
        return !(p[i].Priority < p[j].Priority)
    }
    // ...otherwise fall through to FIFO ordering by CreateIndex (oldest first).
    return p[i].CreateIndex < p[j].CreateIndex
}

This line: if p[i].JobID != p[j].JobID && p[i].Priority != p[j].Priority {

will evaluate to false, because JobID != JobID is false (both evals are for the same job).

So it will sort them by CreateIndex:

{"CreateIndex":910988,"CreateTime":1602623799147310587,"ID":"d4a85794-4a61-5932-9500-c361e705acdf","JobID":"browser-131m7-10100-10199","ModifyIndex":910994,"ModifyTime":1602623799514387339,"Namespace":"default","NodeID":"daea6d66-b3bf-dff9-ada0-3be7145040e9","NodeModifyIndex":910987,"Priority":50,"QueuedAllocations":{"browsers":0},"SnapshotIndex":910991,"Status":"complete","TriggeredBy":"node-update","Type":"batch"}
{"CreateIndex":910991,"CreateTime":1602623799344793081,"ID":"89cf608e-c849-06ef-257c-1f42b0ef5001","JobID":"browser-131m7-10100-10199","JobModifyIndex":910991,"ModifyIndex":910997,"ModifyTime":1602623799533941334,"Namespace":"default","Priority":50,"QueuedAllocations":{"browsers":0},"SnapshotIndex":910994,"Status":"complete","TriggeredBy":"job-register","Type":"batch"}

The smaller CreateIndex is first in the min-heap, so the winner is d4a85794, which means it runs first; by the time the second one runs, the job hasn't changed because they were both created from the same job!
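To sanity-check that, here is a tiny standalone version of the comparison, fed the two evals above. The struct is a cut-down stand-in for structs.Evaluation with only the fields Less looks at, and plain sort.Sort stands in for the broker's min-heap:

package main

import (
    "fmt"
    "sort"
)

// eval is a cut-down stand-in for structs.Evaluation with only the
// fields the broker's Less function compares.
type eval struct {
    ID          string
    JobID       string
    Priority    int
    CreateIndex uint64
}

type pendingEvals []eval

func (p pendingEvals) Len() int      { return len(p) }
func (p pendingEvals) Swap(i, j int) { p[i], p[j] = p[j], p[i] }

// Same logic as the broker: same job (or same priority) falls through
// to ordering by CreateIndex, oldest first.
func (p pendingEvals) Less(i, j int) bool {
    if p[i].JobID != p[j].JobID && p[i].Priority != p[j].Priority {
        return !(p[i].Priority < p[j].Priority)
    }
    return p[i].CreateIndex < p[j].CreateIndex
}

func main() {
    evals := pendingEvals{
        {ID: "89cf608e", JobID: "browser-131m7-10100-10199", Priority: 50, CreateIndex: 910991},
        {ID: "d4a85794", JobID: "browser-131m7-10100-10199", Priority: 50, CreateIndex: 910988},
    }
    sort.Sort(evals)
    fmt.Println("dequeued first:", evals[0].ID) // prints d4a85794, the node-update eval
}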

What I can't figure out is how the job can exist for eval d4a85794 to consider before it is created by eval 89cf608e, which is what we get from the job register API; maybe I just don't understand how the node-update eval is created per job.

benbuzbee commented 4 years ago

Digging further, I realized the code has a version check that makes job registration and eval creation atomic! I can repro the problem 100% with this diff:

-               if eval != nil && ServersMeetMinimumVersion(j.srv.Members(), minJobRegisterAtomicEvalVersion, false) {
+               if false && eval != nil && ServersMeetMinimumVersion(j.srv.Members(), minJobRegisterAtomicEvalVersion, false) {
                        args.Eval = eval
                        submittedEval = true
                }

-
+       time.Sleep(7 * time.Second)
        // used for multiregion start
        args.Job.JobModifyIndex = reply.JobModifyIndex

 func (n *Node) createNodeEvals(nodeID string, nodeIndex uint64) ([]string, uint64, error) {
+       time.Sleep(2 * time.Second)
        // Snapshot the state

So I'm guessing we have a bug on our end where we fall into the submittedEval = false case. I am deploying a change to validate that :/ sorry

benbuzbee commented 4 years ago

Victory! I repro'd it with 2 sleeps; here is the diff with explanation. I think this is a real bug?

How to repro

  1. Write code that schedules a job on a node as soon as it goes from ineligible to eligible (a rough sketch of what this could look like follows this list)
  2. Apply this diff to force timing
  3. Toggle drain state of node on and off a few times
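Step 1 might look roughly like the sketch below against the Go api client. The polling interval, the raw_exec driver, and the buildBrowserJob helper are made up for illustration; the node ID and the job/group/task names are the ones from above:

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/hashicorp/nomad/api"
)

// buildBrowserJob is a hypothetical stand-in for however the real system
// constructs the batch job; the task config is deliberately elided.
func buildBrowserJob() *api.Job {
    job := api.NewBatchJob("browser-131m7-10100-10199", "browser-131m7-10100-10199", "global", 50)
    tg := api.NewTaskGroup("browsers", 1)
    tg.AddTask(api.NewTask("browser", "raw_exec"))
    job.AddTaskGroup(tg)
    return job
}

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    nodeID := "daea6d66-b3bf-dff9-ada0-3be7145040e9"
    wasEligible := true

    for {
        node, _, err := client.Nodes().Info(nodeID, nil)
        if err != nil {
            log.Fatal(err)
        }
        eligible := node.SchedulingEligibility == api.NodeSchedulingEligible
        if eligible && !wasEligible {
            // The node just flipped back to eligible: register the job immediately.
            resp, _, err := client.Jobs().Register(buildBrowserJob(), nil)
            if err != nil {
                log.Fatal(err)
            }
            fmt.Println("registered job, eval:", resp.EvalID)
        }
        wasEligible = eligible
        time.Sleep(100 * time.Millisecond)
    }
}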

Eventually the eval returned from the job registration endpoint will report no allocs even though the registration did result in allocs being placed.

diff --git a/nomad/eval_broker.go b/nomad/eval_broker.go
index c30eb3cfe..7d23a9746 100644
--- a/nomad/eval_broker.go
+++ b/nomad/eval_broker.go
@@ -344,6 +344,7 @@ SCAN:
                if timeoutTimer != nil {
                        timeoutTimer.Stop()
                }
+               time.Sleep(2 * time.Second)
                return eval, token, nil
        }

diff --git a/nomad/job_endpoint.go b/nomad/job_endpoint.go
index e4fac539f..ea1295f4f 100644
--- a/nomad/job_endpoint.go
+++ b/nomad/job_endpoint.go
@@ -352,6 +352,7 @@ func (j *Job) Register(args *structs.JobRegisterRequest, reply *structs.JobRegis
                }

                // Commit this update via Raft
+               time.Sleep(500 * time.Millisecond)
                fsmErr, index, err := j.srv.raftApply(structs.JobRegisterRequestType, args)
                if err, ok := fsmErr.(error); ok && err != nil {
                        j.logger.Error("registering job failed", "error", err, "fsm", true)

Analysis

  1. Job J is scheduled on node N as alloc A1
  2. Node N is drained, alloc A1 enters the "Completed" state
  3. Node N is undrained, in func (n *Node) createNodeEvals(nodeID string, nodeIndex uint64) it checks for all allocations it used to have, finds alloc A1, and queues an evaluation E1 for job J (version 1)
  4. The code we wrote sees the change to node N and schedules version 2 of job J
  5. The job register endpoint queues an evaluation E2 for job J (version 2)
  6. The eval broker runs, sorts by create index, pops off E1 which references J, queues alloc A2 on node N
  7. The eval broker pops off E2 which references J, but realizes J hasn't changed since E1 ran so it returns nothing
  8. Ben gets confused because he doesn't understand how the job post could have failed to create an allocation

We force this timing by delaying the dequeue by 2 seconds and delaying the job enqueue by 500ms

Suggestions

This seems buggy to me, specifically the node scheduling an eval that just references the job by name and not caring if the job may have since been purged completely and replaced. LMK if you disagree.

I think the cleanest solution is one of

  1. Queue an evaluation along with a JobModifyIndex and change the eval broker to ignore evaluations for old versions of a job. I do not know if this has any edges so I am scared of it but it seems like the best fix.
  2. Ignore allocs in createNodeEvals that are in the completed/dead/etc. state. Again, I don't know if this has unintentional side effects, but it seems strange to me that these dead allocs are re-queued for evaluation on start.

tgross commented 4 years ago

Thanks so much for this thorough investigation @benbuzbee!

I walked thru your analysis, wrote most of a reply explaining why I didn't think it was quite right, double-checked the code again, and then realized you're totally right. 😀 Nice work here.

This is interesting because it demonstrates a crack in the atomicity of registering+eval that we introduced in 0.12.1 in https://github.com/hashicorp/nomad/pull/8435: just because we wrote the eval to Raft atomically with the job registration doesn't guarantee that the eval isn't going to lose a race against another eval that processes the same job.

  1. Queue an evaluation along with a JobModifyIndex and change the eval broker to ignore evaluations for old versions of a job. I do not know if this has any edges so I am scared of it but it seems like the best fix.

At first glance that seems right, but it turns out we intentionally don't include the job Version, because we need the eval to be able to tell us to clean up old versions of the job.

  2. Ignore allocs in createNodeEvals that are in the completed/dead/etc. state. Again, I don't know if this has unintentional side effects, but it seems strange to me that these dead allocs are re-queued for evaluation on start.

This is definitely what I'm thinking as well, and I'm surprised this isn't checking whether the allocation is terminal before creating the eval, but I'll need to spend some time trying to figure out all the edge cases here.


It might take a little bit for me to circle back to that createNodeEvals logic, so I want to make sure you think you have a reasonable workaround in the meantime. I suspect you can avoid this problem by giving the node a few seconds to finish processing the node registration RPC before sending the new job registration. Other than waiting for the node-update eval to be marked complete I'm not sure what would be a good trigger to watch.

And although I don't think it helps in this case, make sure you're using the EnforceIndex field in the API. If you're submitting jobs programmatically with any kind of speed, it's a good way to avoid write skew behaviors.
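A minimal sketch of what that check-and-set register looks like with the Go api client, assuming the job already exists so Info returns its current modify index (job ID copied from above):

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/nomad/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Read the current job to learn its modify index.
    current, _, err := client.Jobs().Info("browser-131m7-10100-10199", nil)
    if err != nil {
        log.Fatal(err)
    }

    // In practice this would be the freshly built job definition rather
    // than the one we just read back.
    updated := current

    // EnforceRegister rejects the write if the job has been modified
    // since the index we pass in (the EnforceIndex behavior).
    resp, _, err := client.Jobs().EnforceRegister(updated, *current.JobModifyIndex, nil)
    if err != nil {
        log.Fatal(err) // an index mismatch shows up here as an error
    }
    fmt.Println("registered, eval:", resp.EvalID)
}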

benbuzbee commented 4 years ago

We can work around it by adding a random GUID to our job name so it doesn't match the job name the node finds when it wakes up. Appreciate your patience with this and feel free to let me know if you encounter any bumps when you get back to it - thanks!

notnoop commented 3 years ago

Hi @benbuzbee. Wow, lots of good and subtle details you analyzed here; sorry for taking so long to follow up. I will need to chew on this information for a while, but I wanted to share my initial thoughts (read: gut feelings) so you know where I am and can push back.

First I want to clarify the following:

  1. The eval broker runs, sorts by create index, pops off E1 which references J, queues alloc A2 on node N
  2. The eval broker pops off E2 which references J, but realizes J hasn't changed since E1 ran so it returns nothing

Is A2 running J (Version 2) in this case? If so, I am not sure where the bug lies, and it might be that we provide a faulty mental model of evals.

Job evals are the mechanism for triggering a scheduler to reconcile desired vs. actual allocations for a job and to decide whether Nomad should perform any follow-up steps (e.g. starting new allocations, shutting some down, etc.). For the most part, the scheduler does not factor in the cause of an eval (e.g. job register vs. deregister vs. manual force); it does a "full" review of the state to determine next actions.

I suspect you can also trigger this condition with the job eval command and the modification here: if you force-eval the job and then immediately update it, the scheduler may place the job's allocation as a result of the manual force eval rather than the job-register eval.

The atomicity we aimed for in job registration+eval in #8435 is about avoiding the scenario where a job registration update succeeds but eval enqueueing fails, resulting in nothing being scheduled. As @tgross noted, it does not preclude the job registration's placement happening due to other interleaved evals (e.g. due to drain, the force-eval API, or a failed alloc requiring rescheduling).

Interleaved evals can cause other interesting UX. If a client submits multiple updates to a job in rapid succession (e.g. Versions 2, 3, and 4), it is possible for the V2 eval to actually place V4 allocations, with the V3/V4 evals becoming no-ops. If multiple users use the CLI to update the same job, they may get confusing eval info depending on the ordering of Raft commit messages and broker dequeue activity!

At this point, I want to preserve this scheduler behavior, to be honest mostly because the change is risky, as you point out regarding your first suggestion.

That said, the UX is clearly non-ideal, and I'd like to explore how much we can improve it while minimizing scheduler changes. I'll chew on this with teammates and would appreciate feedback.