HINT-SJ opened this issue 3 years ago
Hi @HINT-SJ! It looks like there are two issues here: the UI issue and the stale allocation that won't be GC'd. I'm going to break off the UI issue over to https://github.com/hashicorp/nomad/issues/9658 so that the UI folks can focus their attention there, and we can discuss the GC issue in this thread, if that's ok.
For the GC issue, can you get debug logs from the cluster while you do a `nomad system gc`? It might help us see if there's a persistent error with that alloc.
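Something along these lines should capture it; this is just a rough sketch, so adjust the log level and which server you target for your setup:

```sh
# Terminal 1: stream server logs at DEBUG (or TRACE) level
nomad monitor -log-level=DEBUG

# Terminal 2: force a garbage collection run while the monitor is attached
nomad system gc
```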
Hey @tgross, so a `nomad monitor -log-level=TRACE` and a parallel `nomad system gc` showed:
2020-12-18T13:32:29.741Z [DEBUG] worker: dequeued evaluation: eval_id=58c211c2-d90c-fa65-aacf-0d18965b9922
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced job GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced eval GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced deployment GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced plugin GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=18446744073709551615 csi_plugin_gc_threshold=1h0m0s
2020-12-18T13:32:29.742Z [TRACE] core.sched: garbage collecting unclaimed CSI volume claims: eval.JobID=force-gc
2020-12-18T13:32:29.742Z [DEBUG] core.sched: forced volume claim GC
2020-12-18T13:32:29.742Z [DEBUG] core.sched: CSI volume claim GC scanning before cutoff index: index=18446744073709551615 csi_volume_claim_gc_threshold=5m0s
2020-12-18T13:32:29.742Z [DEBUG] core.sched: forced node GC
2020-12-18T13:32:29.742Z [DEBUG] worker: ack evaluation: eval_id=58c211c2-d90c-fa65-aacf-0d18965b9922
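For completeness, here's a rough sketch of how one might confirm the allocation is still known to the servers after the forced GC (assumes NOMAD_ADDR points at the cluster; the jq filter is optional):

```sh
# Look the allocation up by ID prefix via the CLI
nomad alloc status 05bf3c66

# Or query the HTTP API directly for its desired/client status
curl -s "$NOMAD_ADDR/v1/allocation/05bf3c66-9dbb-b06a-4cf3-3216e6e922e4" \
  | jq '{DesiredStatus, ClientStatus, NodeID}'
```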
For reference, a `nomad alloc stop 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4` shows:
2020-12-18T13:36:11.019Z [DEBUG] worker: dequeued evaluation: eval_id=79cfe6ec-5542-012f-6387-205fee147223
2020-12-18T13:36:11.019Z [DEBUG] http: request complete: method=PUT path=/v1/allocation/05bf3c66-9dbb-b06a-4cf3-3216e6e922e4/stop duration=5.053966ms
2020-12-18T13:36:11.019Z [TRACE] worker.service_sched.binpack: NewBinPackIterator created: eval_id=79cfe6ec-5542-012f-6387-205fee147223 job_id=authentication namespace=default algorithm=spread
2020-12-18T13:36:11.019Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=79cfe6ec-5542-012f-6387-205fee147223 job_id=authentication namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "hermes": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 2) (canary 0)"
2020-12-18T13:36:11.019Z [DEBUG] worker.service_sched: setting eval status: eval_id=79cfe6ec-5542-012f-6387-205fee147223 job_id=authentication namespace=default status=complete
2020-12-18T13:36:11.024Z [DEBUG] worker: updated evaluation: eval="<Eval "79cfe6ec-5542-012f-6387-205fee147223" JobID: "authentication" Namespace: "default">"
2020-12-18T13:36:11.024Z [DEBUG] worker: ack evaluation: eval_id=79cfe6ec-5542-012f-6387-205fee147223
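For reference, the same cleanup can also be triggered over the HTTP API; a sketch, assuming NOMAD_ADDR is set and an ACL token is supplied if ACLs are enabled:

```sh
# Force a server-side garbage collection (equivalent to `nomad system gc`)
curl -s -X PUT "$NOMAD_ADDR/v1/system/gc"

# Reconcile job summaries (equivalent to `nomad system reconcile summaries`)
curl -s -X PUT "$NOMAD_ADDR/v1/system/reconcile/summaries"
```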
Nomad version
Nomad v1.0.0 (cfca6405ad9b5f66dffc8843e3d16f92f3bedb43)
Operating system and Environment details
Linux, AWS
Issue
After upgrading our clusters to Nomad v1.0.0 we cannot see the Topology view. After a quick investigation we found out that the UI is producing the following error:
After checking the allocation ID we get this:
The job itself:
So it seems like Nomad keeps this allocation, even after about a whole year... The client node is long gone.
We even tried to stop and purge the job (and recreated it, with a fresh job version counter), but this allocation is still there.
Neither ... nor ... changed anything.
I even tried to stop the allocation, but it yields the same output every time and nothing actually changes.
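Roughly what was tried, as a sketch (the job name is taken from the logs above; the job file name is a placeholder):

```sh
# Stop and purge the job, then resubmit it with a fresh version counter
nomad job stop -purge authentication
nomad job run authentication.nomad   # placeholder file name

# Try to stop the stale allocation directly
nomad alloc stop 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4

# Force a garbage collection pass on the servers
nomad system gc
```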
Reproduction steps
I don't know how to reproduce this, but we have a few more allocations like this, also pointing to different (long-gone) nodes.
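In case it helps triage, here's a rough way to list allocations that still reference nodes the cluster no longer has; a sketch using the HTTP API and jq, assuming NOMAD_ADDR is set:

```sh
# List every allocation with the node it claims to run on
curl -s "$NOMAD_ADDR/v1/allocations" \
  | jq -r '.[] | "\(.ID) \(.NodeID) \(.ClientStatus)"'

# Compare against the nodes the servers still know about
nomad node status -verbose
```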