hashicorp / nomad


(very) old allocation still existing and breaking new Topology view #9617

Open HINT-SJ opened 3 years ago

HINT-SJ commented 3 years ago

Nomad version

Nomad v1.0.0 (cfca6405ad9b5f66dffc8843e3d16f92f3bedb43)

Operating system and Environment details

Linux, AWS

Issue

After upgrading our clusters to Nomad v1.0.0 we can no longer see the Topology view. After a quick investigation we found that the UI produces the following error:

Uncaught (in promise) Error: Node b37eb7a6-ff6b-af73-e756-e019f2cabf57 for alloc 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4 not in index.

After checking the allocation ID we get this:

nomad alloc status 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4
ID                  = 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4
Eval ID             = f373dfb7
Name                = authentication.hermes[0]
Node ID             = b37eb7a6
Node Name           = ip-10-169-141-162.eu-central-1.compute.internal
Job ID              = authentication
Job Version         = 86
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = stop
Desired Description = alloc not needed due to job update
Created             = 1y1mo ago
Modified            = 11mo28d ago
Deployment ID       = 7eb18ddf
Deployment Health   = unset
Canary              = true

Couldn't retrieve stats: Unexpected response code: 500 (Unknown node b37eb7a6-ff6b-af73-e756-e019f2cabf57)

Task "authentication" is "dead"
Task Resources
CPU       Memory   Disk     Addresses
1000 MHz  768 MiB  300 MiB  api: 10.169.141.162:22564

Task Events:
Started At     = N/A
Finished At    = 2019-11-01T10:17:37Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type      Description
2019-12-19T08:25:28+01:00  Received  Task received by client
2019-12-19T07:10:22+01:00  Killing   Sent interrupt. Waiting 5s before force killing
2019-12-19T06:54:40+01:00  Received  Task received by client
2019-12-19T06:49:22+01:00  Killing   Sent interrupt. Waiting 5s before force killing
2019-12-19T06:28:49+01:00  Received  Task received by client
2019-12-19T06:18:38+01:00  Killing   Sent interrupt. Waiting 5s before force killing
2019-12-19T05:57:56+01:00  Received  Task received by client
2019-12-19T05:52:38+01:00  Killing   Sent interrupt. Waiting 5s before force killing
2019-12-19T05:42:32+01:00  Received  Task received by client
2019-12-19T05:27:09+01:00  Killing   Sent interrupt. Waiting 5s before force killing

The job itself:

nomad job status authentication
ID            = authentication
Name          = authentication
Submit Date   = 2020-12-11T10:01:57+01:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
hermes      0       0         2        0       3         0

Latest Deployment
ID          = dfbb1013
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
hermes      true         true      2        1         2       2        0          2020-12-11T09:08:05Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
c7ee4423  bd478068  hermes      173      run      running   2m58s ago   2m17s ago
074106b4  72c99dc4  hermes      173      run      running   3m25s ago   2m59s ago
8e0e3a95  a9e0ff4a  hermes      172      stop     complete  16h43m ago  2m42s ago
2f7934ad  bd478068  hermes      172      stop     complete  16h45m ago  2m42s ago
05bf3c66  b37eb7a6  hermes      86       stop     complete  1y1mo ago   11mo28d ago

So it seems like Nomad keeps this allocation even after about a whole year... The client node is long gone.
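
A quick way to double-check that the node really is unknown to the cluster (a sketch, using the node ID from the alloc status above; for a long-gone node this should simply report that no matching node exists):

nomad node status -verbose b37eb7a6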

We even tried to stop and purge the job (and recreated it with a fresh job version counter), but this allocation is still there.

Neither

nomad system gc

nor

nomad system reconcile summaries

changed anything.
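
For reference, these commands correspond to the following HTTP API calls (a sketch, assuming NOMAD_ADDR points at a server and NOMAD_TOKEN is set where ACLs are enabled):

# force a cluster-wide garbage collection
curl -s -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/system/gc"
# reconcile all job summaries
curl -s -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/system/reconcile/summaries"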

I even tried to stop the allocation, but it yields the same output and result every time, without changing anything.

nomad alloc stop -verbose 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4
==> Monitoring evaluation "1fbae262-dc1b-1dd3-ef24-faa5272c2763"
    Evaluation triggered by job "authentication"
==> Monitoring evaluation "1fbae262-dc1b-1dd3-ef24-faa5272c2763"
    Evaluation within deployment: "dfbb1013-14f8-b248-7742-5a8b99c5e6dd"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "1fbae262-dc1b-1dd3-ef24-faa5272c2763" finished with status "complete"
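
The same stop can also be issued through the HTTP API (a sketch; this is the /v1/allocation/:alloc_id/stop path that shows up in the server logs further down), with the same outcome to be expected:

curl -s -X PUT -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/allocation/05bf3c66-9dbb-b06a-4cf3-3216e6e922e4/stop"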

Reproduction steps

I don't know how to reproduce this, but we have a few more allocations like this, also pointing to different (long-gone) nodes.
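
For anyone hunting for the rest of them, a rough sketch (assuming curl and jq, with NOMAD_ADDR and, where ACLs are enabled, NOMAD_TOKEN exported) that flags allocations whose NodeID no longer appears in the node list:

# IDs of all currently registered nodes
nodes=$(curl -s -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/nodes" | jq -r '.[].ID')

# print any allocation that points at a node ID missing from that list
curl -s -H "X-Nomad-Token: $NOMAD_TOKEN" "$NOMAD_ADDR/v1/allocations" |
  jq -r '.[] | [.ID, .NodeID, .ClientStatus] | @tsv' |
  while IFS=$'\t' read -r alloc node status; do
    echo "$nodes" | grep -qx "$node" || echo "stale: alloc=$alloc node=$node status=$status"
  done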

tgross commented 3 years ago

Hi @HINT-SJ! It looks like there are two issues here: the UI issue and the stale allocation that won't be GC'd. I'm going to break the UI issue off into https://github.com/hashicorp/nomad/issues/9658 so that the UI folks can focus their attention there, and we can discuss the GC issue in this thread, if that's ok.

For the GC issue, can you get debug logs from the cluster while you run a nomad system gc? It might help us see whether there's a persistent error with that alloc.

HINT-SJ commented 3 years ago

Hey @tgross, so a nomad monitor -log-level=TRACE with a parallel nomad system gc showed:

2020-12-18T13:32:29.741Z [DEBUG] worker: dequeued evaluation: eval_id=58c211c2-d90c-fa65-aacf-0d18965b9922
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced job GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced eval GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced deployment GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: forced plugin GC
2020-12-18T13:32:29.741Z [DEBUG] core.sched: CSI plugin GC scanning before cutoff index: index=18446744073709551615 csi_plugin_gc_threshold=1h0m0s
2020-12-18T13:32:29.742Z [TRACE] core.sched: garbage collecting unclaimed CSI volume claims: eval.JobID=force-gc
2020-12-18T13:32:29.742Z [DEBUG] core.sched: forced volume claim GC
2020-12-18T13:32:29.742Z [DEBUG] core.sched: CSI volume claim GC scanning before cutoff index: index=18446744073709551615 csi_volume_claim_gc_threshold=5m0s
2020-12-18T13:32:29.742Z [DEBUG] core.sched: forced node GC
2020-12-18T13:32:29.742Z [DEBUG] worker: ack evaluation: eval_id=58c211c2-d90c-fa65-aacf-0d18965b9922

For reference, a nomad alloc stop 05bf3c66-9dbb-b06a-4cf3-3216e6e922e4 shows:

2020-12-18T13:36:11.019Z [DEBUG] worker: dequeued evaluation: eval_id=79cfe6ec-5542-012f-6387-205fee147223
2020-12-18T13:36:11.019Z [DEBUG] http: request complete: method=PUT path=/v1/allocation/05bf3c66-9dbb-b06a-4cf3-3216e6e922e4/stop duration=5.053966ms
2020-12-18T13:36:11.019Z [TRACE] worker.service_sched.binpack: NewBinPackIterator created: eval_id=79cfe6ec-5542-012f-6387-205fee147223 job_id=authentication namespace=default algorithm=spread
2020-12-18T13:36:11.019Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=79cfe6ec-5542-012f-6387-205fee147223 job_id=authentication namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 0)
Desired Changes for "hermes": (place 0) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 2) (canary 0)"
2020-12-18T13:36:11.019Z [DEBUG] worker.service_sched: setting eval status: eval_id=79cfe6ec-5542-012f-6387-205fee147223 job_id=authentication namespace=default status=complete
2020-12-18T13:36:11.024Z [DEBUG] worker: updated evaluation: eval="<Eval "79cfe6ec-5542-012f-6387-205fee147223" JobID: "authentication" Namespace: "default">"
2020-12-18T13:36:11.024Z [DEBUG] worker: ack evaluation: eval_id=79cfe6ec-5542-012f-6387-205fee147223