Closed: dhv closed this issue 5 years ago
Hi, thanks for reporting this issue. There is quite a bit at play here so I want to make sure I understand how to reproduce this issue.
Steps to reproduce:
1) Submit a job (constrained to a single node) for which the allocation succeeds.
2) Update the job (constrained to the same node); the update is expected to succeed.
3) Force garbage collection of the old allocation from step 1.
4) The expected-to-succeed allocation from step 2 becomes stuck in pending after attempting to set up the task directory.
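In command form, I read that as roughly the following (a sketch only, assuming the default agent address; the job file name and alloc ID are placeholders):
$ nomad run example.nomad                          # step 1: initial submission; alloc lands on the constrained node
$ nomad run example.nomad                          # step 2: submit the updated job targeting the same node
$ curl -X PUT http://127.0.0.1:4646/v1/system/gc   # step 3: force garbage collection of the old allocation
$ nomad alloc-status <new-alloc-id>                # step 4: new alloc remains pending after "Building Task Directory"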
Is there a specific reason this job is constrained to a single host?
At first glance, this looks similar to https://github.com/hashicorp/nomad/issues/3121 but further information will be helpful.
This is not limited to jobs which are constrained to single hosts. I just deployed a system-scheduled monitoring job which stays in the "pending" state indefinitely on a few clients. This is different from #3121 in that it is not specific to long image downloads, and it does not resolve unless I restart the nomad client itself. It clears immediately once the client restarts; restarting always works.
Allocations
ID Node ID Task Group Version Desired Status Created Modified
06a73888 68501606 monitoring 1 run pending 1h23m ago 1h23m ago
2d6792c8 2877565f monitoring 1 run pending 1h23m ago 1h23m ago
30803133 ecc9c006 monitoring 1 run pending 1h23m ago 1h23m ago
314e0f54 925b929a monitoring 1 run running 1h23m ago 1h22m ago
5b567fb9 182fd8e3 monitoring 1 run running 1h23m ago 1h23m ago
69e3deac 70e6d284 monitoring 1 run running 1h23m ago 1h22m ago
70676651 ae2ec5ca monitoring 1 run pending 1h23m ago 1h23m ago
82b17a73 69f0e2d4 monitoring 1 run running 1h23m ago 1h23m ago
8ff65f9a 335994ec monitoring 1 run running 1h23m ago 1h22m ago
91bcf6ac 4ef48180 monitoring 1 run running 1h23m ago 1h23m ago
b2c80f09 87842616 monitoring 1 run running 1h23m ago 1h22m ago
d0e8f661 140244bb monitoring 1 run running 1h23m ago 1h22m ago
dd6b4946 49874349 monitoring 1 run running 1h23m ago 1h23m ago
df5786ad 06999874 monitoring 1 run running 1h23m ago 1h22m ago
e1deec2b e38b936c monitoring 1 run running 1h23m ago 1h23m ago
f7eb1533 f37fb995 monitoring 1 run running 1h23m ago 1h23m ago
nomad alloc-status -region REGION 06a73888
ID = 06a73888
Eval ID = 88334edb
Name = REGION-cadvisor.monitoring[0]
Node ID = 68501606
Job ID = REGION-cadvisor
Job Version = 1
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 1h24m ago
Modified = 1h24m ago
nomad alloc-status -region REGION 2d6792c8
ID = 2d6792c8
Eval ID = 88334edb
Name = REGION-cadvisor.monitoring[0]
Node ID = 2877565f
Job ID = REGION-cadvisor
Job Version = 1
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 1h24m ago
Modified = 1h24m ago
nomad alloc-status -region REGION 30803133
ID = 30803133
Eval ID = 88334edb
Name = REGION-cadvisor.monitoring[0]
Node ID = ecc9c006
Job ID = REGION-cadvisor
Job Version = 1
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 1h24m ago
Modified = 1h24m ago
I am trying to narrow down the reproduction steps. Let me know what other logs or config I can dump to help.
Hey, try seeing if your nomad Raft log indexes are in sync on all servers:
nomad agent-info | grep last_log_index
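For example, to compare the indexes across all servers in one go (server hostnames and port here are placeholders, assuming default API settings):
$ for s in nomad-server-1 nomad-server-2 nomad-server-3; do echo -n "$s: "; nomad agent-info -address="http://$s:4646" | grep last_log_index; done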
We encountered this as well in our cluster and it seemed to be the reason. https://github.com/hashicorp/nomad/issues/3227
@dhv Next time this happens, would you be willing to kill the particular nomad agent that isn't starting the allocations with a SIGQUIT (3) so we can get a stack dump?
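In case it helps, a rough sketch of what that looks like, assuming the client runs under systemd with a unit named nomad (adjust for your setup):
$ sudo kill -QUIT "$(pgrep -o -x nomad)"                           # Go runtime prints a goroutine dump to stderr, then the agent exits
$ journalctl -u nomad --since "1 minute ago" > nomad-stack-dump.txt   # capture the dump from the journal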
@dhv Are the allocations that get stuck always using the Docker driver? What else do they have in common: are they all using templates, Vault, service definitions, etc.? Could you potentially share the job files of a few that have gotten stuck?
I've got something similar with nomad 0.8.3. We often get jobs stuck in "pending". I'll check the Raft log indexes and I'll try SIGQUIT.
Bumped into this as well. Restarting the agent seems to help.
@dadgar I am running 0.8.3 and have an allocation that is stuck in pending, but the node it is allocated to is down, so I cannot send SIGQUIT. Any idea what else could get it unstuck?
Allocations
ID Node ID Task Group Version Desired Status Created Modified
cc6b1a9c 9b9f60bb ___-group 0 run pending 19h49m ago 19h49m ago
bd7b6760 6ed39d79 ___-group 0 stop lost 19h50m ago 19h49m ago
ID = 9b9f60bb
Name = ip-172-32-15-26
Class = worker
DC = ___
Drain = false
Eligibility = eligible
Status = down
Driver Status = docker,exec,raw_exec,rkt
Node Events
Time Subsystem Message
2018-06-15T23:12:39Z Cluster Node Registered
Allocated Resources
CPU Memory Disk IOPS
0/120000 MHz 0 B/156 GiB 0 B/462 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/120000 MHz 0 B/156 GiB
Hi @camerondavison!
What does nomad node status 9b9f60bb return for that node that's down? Strange that the client status didn't get updated to lost when your node went down.
Could you try running nomad job eval <jobid>? That should force a scheduler evaluation, and if the node is marked as down, that'll transition any pending allocs to lost.
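For reference, the invocation would look roughly like this, using the dispatched job ID from your output (a sketch; adjust the ID as needed):
$ nomad job eval ___-job/dispatch-1529435710-54c7a659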
What I posted above was the node status; sorry if that was not clear.
$ nomad node-status 9b9f60bb
error fetching node stats: Unexpected response code: 404 (No path to node)
ID = 9b9f60bb
Name = ip-172-32-15-26
Class = worker
DC = ____
Drain = false
Eligibility = eligible
Status = down
Driver Status = docker,exec,raw_exec,rkt
Node Events
Time Subsystem Message
2018-06-15T23:12:39Z Cluster Node Registered
Allocated Resources
CPU Memory Disk IOPS
0/120000 MHz 0 B/156 GiB 0 B/462 GiB 0/0
Allocation Resource Utilization
CPU Memory
0/120000 MHz 0 B/156 GiB
error fetching node stats: actual resource usage not present
Allocations
ID Node ID Task Group Version Desired Status Created Modified
cc6b1a9c 9b9f60bb ____-group 0 run pending 20h48m ago 20h48m ago
And the eval
$ nomad alloc-status -verbose cc6b1a9c
ID = cc6b1a9c-ebe1-ddbf-8131-163b4c6fb138
Eval ID = 311cd23e-b2ed-91fc-0fce-b79d9fad326a
Name = ___-job/dispatch-1529435710-54c7a659.____-group[0]
Node ID = 9b9f60bb-a79d-3133-431e-b7d5e287003d
Job ID = ___-job/dispatch-1529435710-54c7a659
Job Version = 0
Client Status = pending
Client Description = <none>
Desired Status = run
Desired Description = <none>
Created = 2018-06-19T14:15:32-05:00
Modified = 2018-06-19T14:15:32-05:00
Evaluated Nodes = 3
Filtered Nodes = 0
Exhausted Nodes = 1
Allocation Time = 75.566µs
Failures = 0
Couldn't retrieve stats: Unexpected response code: 404 (No path to node)
Placement Metrics
* Resources exhausted on 1 nodes
* Class "worker" exhausted on 1 nodes
* Dimension "cpu" exhausted on 1 nodes
* Score "9b9f60bb-a79d-3133-431e-b7d5e287003d.binpack" = 10.829454
* Score "089c9fa5-9e2c-8da0-7ea1-a7b633ca3e5f.binpack" = 6.519062
$ nomad eval-status -verbose 311cd23e-b2ed-91fc-0fce-b79d9fad326a
ID = 311cd23e-b2ed-91fc-0fce-b79d9fad326a
Status = complete
Status Description = complete
Type = batch
TriggeredBy = node-update
Node ID = 6ed39d79-a7b5-5134-57dc-32233450a276
Priority = 50
Placement Failures = false
Previous Eval = <none>
Next Eval = <none>
Blocked Eval = <none>
That looks like a potential bug in the scheduler, because it didn't mark the desired state of cc6b1a9c as stop after the node update that marked that node as down. I'll dig into this some more and update this thread.
Could you also post the output of nomad job status? Specifically, I wanted to know whether it actually created a replacement allocation for the alloc that ended up in this pending state.
Did running another evaluation on the job, like I mentioned above with nomad job eval <jobid>, do anything?
Oh sorry @preetapan, I did not realize that you wanted me to force a new eval of the job. I thought that you just wanted the eval status. Looks like after 24 hours maybe a garbage collector or something ran and it got re-evaluated by itself.
Note:
$ nomad node-status 9b9f60bb
No node(s) with prefix "9b9f60bb" found
nomad job status ___-job/dispatch-1529435710-54c7a659
ID = ___-job/dispatch-1529435710-54c7a659
Name = ___-job/dispatch-1529435710-54c7a659
Submit Date = 2018-06-19T14:15:10-05:00
Type = batch
Priority = 50
Datacenters = ___
Status = dead
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
___-group 0 0 0 1 0 2
Allocations
ID Node ID Task Group Version Desired Status Created Modified
873a418f c050b1ac ___-group 0 run failed 6m38s ago 5m30s ago
cc6b1a9c 9b9f60bb ___-group 0 stop lost 1d14m ago 6m38s ago
bd7b6760 6ed39d79 ___-group 0 stop lost 1d14m ago 1d14m ago
The failure was expected, but the lost one was not.
@camerondavison I opened another issue for this: #4437. The other issues on this ticket indicate something on the client kept jobs at pending, but the situation you described today is different. I didn't want to keep piling onto this ticket, so let's continue the discussion on that one.
If you still have that alloc and job, can you also add the full JSON responses from our API for the alloc, the job, and any evaluations for that job (via nomad job status -evals <jobid>) to #4437? Sorry for the back and forth on this one!
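One way to grab those JSON payloads is straight from the HTTP API; a rough sketch, assuming the default agent address and the IDs from the output above (the slash in the dispatch job ID needs to be URL-encoded):
$ curl -s http://127.0.0.1:4646/v1/allocation/cc6b1a9c-ebe1-ddbf-8131-163b4c6fb138 > alloc.json
$ curl -s "http://127.0.0.1:4646/v1/job/___-job%2Fdispatch-1529435710-54c7a659" > job.json
$ curl -s "http://127.0.0.1:4646/v1/job/___-job%2Fdispatch-1529435710-54c7a659/evaluations" > evals.json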
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
v0.7.1
Operating system and Environment details
Ubuntu 16.04 Xenial
Issue
Some allocations will stay stuck in pending and the deployment fails. Usually the allocation shows the "Building Task Directory" event or the "Downloading image" event from Docker. I've confirmed on the client that the image has finished downloading. The only way I've found to get it unstuck is to restart the nomad client.
Reproduction steps
No deterministic way to reproduce other than to keep submitting jobs. Might have to do with the constraints being used on this particular job, but I have seen it happen on other jobs as well.
With this particular job, I have it pinned to a specific host with a constraint. The new alloc first gets stuck in the state shown below because there's an older version of the same job. Then, I force garbage collection and it's able to clean up the old allocation. It then proceeds to "Building Task Directory" and gets stuck there, until I restart the nomad client.
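In command form, that sequence is roughly the following (a sketch, assuming the default agent address and a systemd-managed client with a unit named nomad; the alloc ID is a placeholder):
$ curl -X PUT http://127.0.0.1:4646/v1/system/gc   # force garbage collection of the old allocation
$ nomad alloc-status <stuck-alloc-id>              # still stuck after "Building Task Directory"
$ sudo systemctl restart nomad                     # restarting the client lets the task start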
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Edit: I realized I copied and pasted an older alloc than the one below, but I can reproduce this behavior for the same allocation. The main issue is that it gets stuck after the "Building Task Directory" event.
Force gc, then:
Logs:
Then, after restarting the nomad client, the task gets unstuck and works fine:
Logs:
Job file (if appropriate)
I will update with more info as I encounter this.