hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Nomad discovery - services contain reference to non-existing allocation #18641

Closed: marekhanzlik closed this issue 1 year ago

marekhanzlik commented 1 year ago

Nomad version

Nomad v1.6.2
BuildDate 2023-09-13T16:47:25Z
Revision 73e372ad94033db2ceaf53468b270a31544c23fd

Operating system and Environment details

Ubuntu 22.04

Issue

Hello, we've gotten into a state where some of our jobs report multiple service endpoints even though there should be only one, and the allocation behind the second endpoint has already been garbage collected:

curl "nomad1:4646/v1/service/cm-micrometer?namespace=devel1" | jq .                                                                                                   
[
  {
    "ID": "_nomad-task-317b42ec-910b-d340-7249-edc9d35ab02f-cm-cm-micrometer-micrometer",
    "ServiceName": "cm-micrometer",
    "Namespace": "devel1",
    "NodeID": "aff6d4c3-223e-3a85-224a-ab227300997c",
    "Datacenter": "tower",
    "JobID": "cm",
    "AllocID": "317b42ec-910b-d340-7249-edc9d35ab02f",
    "Tags": [
      "monitoring.enable=true",
      "monitoring.type=jvm"
    ],
    "Address": "10.10.67.163",
    "Port": 26617,
    "CreateIndex": 219215,
    "ModifyIndex": 219215
  },
  {
    "ID": "_nomad-task-4b44a1f5-1cbe-2e48-e0bd-e4be401a6753-cm-cm-micrometer-micrometer",
    "ServiceName": "cm-micrometer",
    "Namespace": "devel1",
    "NodeID": "4473cb7b-02b7-26ef-71f2-2bc11d664eef",
    "Datacenter": "tower",
    "JobID": "cm",
    "AllocID": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
    "Tags": [
      "monitoring.enable=true",
      "monitoring.type=jvm"
    ],
    "Address": "10.10.67.200",
    "Port": 20362,
    "CreateIndex": 154065,
    "ModifyIndex": 154065
  }
]

But the allocation with ID 4b44a1f5 was deleted and garbage collected a long time ago:

curl "nomad1:4646/v1/allocation/4b44a1f5-1cbe-2e48-e0bd-e4be401a6753"                                                                                                     
alloc not found
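
For anyone checking whether they're affected, the same cross-check can be scripted by listing the service's AllocIDs and probing the allocations API for each one (a rough sketch reusing the host, namespace, and service name from the commands above):

# List every AllocID registered for the service, then check whether each allocation
# still exists; an HTTP 404 means the registration points at a GC'd allocation.
for alloc in $(curl -s "nomad1:4646/v1/service/cm-micrometer?namespace=devel1" | jq -r '.[].AllocID'); do
  status=$(curl -s -o /dev/null -w '%{http_code}' "nomad1:4646/v1/allocation/${alloc}")
  echo "${alloc} -> HTTP ${status}"
done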

Reproduction steps

We suspect this happened around the time we upgraded Nomad to 1.6.2, so possibly during the upgrade process?
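
For context, one way to confirm whether the servers were still running mixed versions during that window is to check each server's build (a small sketch, assuming CLI access to the cluster):

# The Build column lists the Nomad version each server is currently running.
nomad server members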

Nomad Client logs (if appropriate)

Here are the logs showing that the alloc was GC'd:

{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2023-09-26T07:20:57.192206+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "failed": false,
  "msg": "Task received by client",
  "task": "cm",
  "type": "Received"
}
{
  "@level": "debug",
  "@message": "failed to copy cpuset",
  "@module": "client.driver_mgr.docker",
  "@timestamp": "2023-09-26T07:20:59.164748+02:00",
  "driver": "docker",
  "error": "openat2 /sys/fs/cgroup/nomad.slice/4b44a1f5-1cbe-2e48-e0bd-e4be401a6753.cm.scope/cpuset.cpus: no such file or directory"
}
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2023-09-26T07:20:59.507422+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "failed": false,
  "msg": "Sent interrupt. Waiting 5s before force killing",
  "task": "cm",
  "type": "Killing"
}
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2023-09-26T07:21:35.060405+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "failed": false,
  "msg": "Exit Code: 143, Exit Message: \"Docker container exited with non-zero exit code: 143\"",
  "task": "cm",
  "type": "Terminated"
}
{
  "@level": "debug",
  "@message": "received EOF, stopping recv loop",
  "@module": "client.alloc_runner.task_runner.task_hook.logmon.stdio",
  "@timestamp": "2023-09-26T07:21:41.214028+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "err": "rpc error: code = Unavailable desc = error reading from server: EOF",
  "task": "cm"
}
{
  "@level": "debug",
  "@message": "reattached plugin process exited",
  "@module": "client.alloc_runner.task_runner.task_hook.logmon",
  "@timestamp": "2023-09-26T07:21:42.271141+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "task": "cm"
}
{
  "@level": "debug",
  "@message": "plugin exited",
  "@module": "client.alloc_runner.task_runner.task_hook.logmon",
  "@timestamp": "2023-09-26T07:21:42.271434+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "task": "cm"
}
{
  "@level": "debug",
  "@message": "task run loop exiting",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2023-09-26T07:21:42.271605+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "task": "cm"
}
{
  "@level": "info",
  "@message": "marking allocation for GC",
  "@module": "client.gc",
  "@timestamp": "2023-09-26T07:21:42.271820+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753"
}
{
  "@level": "info",
  "@message": "garbage collecting allocation",
  "@module": "client.gc",
  "@timestamp": "2023-09-26T08:30:44.802182+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "reason": "forced collection"
}
{
  "@level": "info",
  "@message": "Task event",
  "@module": "client.alloc_runner.task_runner",
  "@timestamp": "2023-09-26T08:30:45.036341+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753",
  "failed": false,
  "msg": "Sent interrupt. Waiting 5s before force killing",
  "task": "cm",
  "type": "Killing"
}
{
  "@level": "debug",
  "@message": "alloc garbage collected",
  "@module": "client.gc",
  "@timestamp": "2023-09-26T08:30:45.092062+02:00",
  "alloc_id": "4b44a1f5-1cbe-2e48-e0bd-e4be401a6753"
}
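
As a stopgap, the stale entry can in principle be removed by hand through the service registration delete endpoint (a sketch only; the service ID is copied from the output above and I haven't verified this workaround against 1.6.2 specifically):

# Remove the stale registration (DELETE /v1/service/:service_name/:service_id),
# then list the service again to confirm only the live allocation remains.
curl -X DELETE "nomad1:4646/v1/service/cm-micrometer/_nomad-task-4b44a1f5-1cbe-2e48-e0bd-e4be401a6753-cm-cm-micrometer-micrometer?namespace=devel1"
curl "nomad1:4646/v1/service/cm-micrometer?namespace=devel1" | jq '.[].AllocID'
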
lgfa29 commented 1 year ago

Hi @marekhanzlik 👋

Thank you for the report. I suspect this is the same problem as reported in https://github.com/hashicorp/nomad/issues/16616 so I will close this one as a duplicate, but let me know if you think they're different problems.

We suspect this happened around the time we upgraded Nomad to 1.6.2, so possibly during the upgrade process?

Just for additional data, which version were you running before the upgrade?

marekhanzlik commented 1 year ago

We used version 1.6.1 before the update.
I've read through the issue you mentioned, and yes, it looks like it's the same problem. Thanks