Open justenwalker opened 7 years ago
I was able to finally GC these nodes.
There were a few allocations that were still pending on those down nodes, and apparently the scheduler was not re-allocating them. Those pending allocations were /periodic-<TS> children of a batch job type.
First, I tried stopping the periodic jobs directly with the /v1/job/<ID> endpoint. This didn't do anything, but it did complete successfully.
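For reference, stopping a job over Nomad's HTTP API is a DELETE against /v1/job/<ID>. A minimal curl sketch; the address and job ID below are placeholders, and the live call is commented out:

```shell
# Assumed server address; adjust for your cluster
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
# Hypothetical periodic child job ID (real ones look like <job>/periodic-<TS>)
JOB_ID="example-batch/periodic-1493000000"

# Deregister (stop) the job; uncomment to run against a live cluster:
# curl -s -X DELETE "${NOMAD_ADDR}/v1/job/${JOB_ID}"
echo "${NOMAD_ADDR}/v1/job/${JOB_ID}"
```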
Next, I tried forcing an evaluation of the node with the /v1/node/<ID>/evaluate endpoint, which seems to have completed the pending allocation and allowed the node to be GC'd.
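Forcing the evaluation is a write against /v1/node/<ID>/evaluate. A sketch with a hypothetical node ID (real IDs can be listed via /v1/nodes); the live call is commented out:

```shell
# Assumed server address; adjust for your cluster
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"
# Hypothetical node ID; list real ones with: curl "${NOMAD_ADDR}/v1/nodes"
NODE_ID="4e7f0f3b-0000-0000-0000-000000000000"

# Create a new evaluation for the node; uncomment against a live cluster:
# curl -s -X POST "${NOMAD_ADDR}/v1/node/${NODE_ID}/evaluate"
echo "${NOMAD_ADDR}/v1/node/${NODE_ID}/evaluate"
```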
@justenwalker Do you have any of the logs from the workaround step? The logs you gave in the first post don't seem to show any of the GCs you ran.
Why were the allocations on the nodes in pending status? Do you have client logs?
@justenwalker Also, I don't think this is a bug. We only GC nodes once all allocations that were placed on the node are also garbage collected. With batch jobs, the allocations are not GC'd right away because the scheduler needs to know about previous allocations. So until those jobs stop, the node will be kept around even if there is nothing on it.
I don't have client logs; those nodes were destroyed along with their logs. The servers still had the allocations marked as pending though, for whatever reason.
I don't think any allocation should be pending on a down node, though - that seems like a bug.
Going to rename the issue.
Nomad version
Nomad v0.5.6
Operating system and Environment details
3-node server cluster:
Issue
I've shut down about 20 Nomad client nodes. Most of them were cleaned up by the garbage collector, but a couple of them are not going away. I tried to force a GC with /v1/system/gc, but they still won't go away. Client nodes are/were Windows 2012 R2 Datacenter Edition on the same Nomad version.
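For completeness, forcing the garbage collection is a PUT against /v1/system/gc. A sketch with the live call commented out; the address is a placeholder:

```shell
# Assumed server address; adjust for your cluster
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

# Trigger a cluster-wide garbage collection; uncomment against a live cluster:
# curl -s -X PUT "${NOMAD_ADDR}/v1/system/gc"
echo "${NOMAD_ADDR}/v1/system/gc"
```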
Reproduction steps
This is how it happened for me, but may not happen every time:
/v1/system/gc
(Optional) Nomad Server logs (if appropriate)
nomad.log.zip