hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

pending allocations stuck in pending state after adoption by a new deployment #23305

Open lattwood opened 2 months ago

lattwood commented 2 months ago

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

All Linux

Issue

Reproduction steps

Expected Result

Actual Result

Screenshots of Absurdity

In this screenshot, all the allocs are part of the same job, but different task groups. The colours correspond to the actual task groups. Note the varying versions, and that some allocations have a client assigned while others do not.

Additionally, you can see the Modified time is the same for the allocations that are staying up to date but isn't changing on the others, and the Created times are all over the place.

Screenshot 2024-06-11 at 3 35 54 PM

In this screenshot, you can see we have a pending allocation that has been rescheduled, and that rescheduled allocation is marked pending as well. Neither allocation has been assigned to a client as far as the Nomad WebUI is concerned.

Screenshot 2024-06-11 at 4 52 00 PM

Reporter's speculation

Maybe it has something to do with how the allocations are being adopted by the rapid deployments? This definitely reeks of a race condition.

Nomad logs

Useless, unfortunately, due to this bug: https://github.com/hashicorp/nomad/issues/22431

lattwood commented 2 months ago
Screenshot 2024-06-11 at 5 44 59 PM

In trying to kick the broken allocs, I've managed to get Nomad to try sending an interrupt to a pending allocation!

edit: had to killall -9 nomad on ca11.

lattwood commented 2 months ago

I figured out how to make it worse! If I drain the node and mark it as ineligible, then re-enable eligibility, all of the system jobs end up with additional allocations. The only solution here is restarting/killing the Nomad agent.

Screenshot 2024-06-11 at 6 09 06 PM
lattwood commented 2 months ago

Just adding for additional flavour (I found this hilarious):

image
lattwood commented 2 months ago

Pending alloc example with logs from the Nomad agent. The times are all consistent with each other.

Screenshot 2024-06-12 at 10 54 36 AM
root@HOSTNAME:~# TZ=America/Halifax journalctl --unit nomad --since '14 hours ago' | grep -v '\(runner\)'
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"received EOF, stopping recv loop","@module":"client.driver_mgr.docker.docker_logger.stdio","@timestamp":"2024-06-12T02:28:35.213106Z","driver":"docker","err":"rpc error: code = Unavailable desc = error reading from server: EOF"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.224019Z","driver":"docker","id":"2362096","plugin":"/usr/local/bin/nomad"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.225535Z","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.430987Z","driver":"docker","image_id":"sha256:31d1924d3c7e119a450ce9d0d2fbd417a1aba168781625acbc7bf77b398cee33","references":0}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"(watcher) stopping all views","@module":"agent","@timestamp":"2024-06-12T02:28:35.472850Z"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"stopped container","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.610787Z","container_id":"02ae65871d36198b058a7b98387c5b40bc857e7dfc8eb6bd6de60407e6f28305","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"received EOF, stopping recv loop","@module":"client.driver_mgr.docker.docker_logger.stdio","@timestamp":"2024-06-12T02:28:35.626094Z","driver":"docker","err":"rpc error: code = Unavailable desc = error reading from server: EOF"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"error","@message":"error encountered while scanning stdout","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.631515Z","driver":"docker","error":"read |0: file already closed"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.631600Z","driver":"docker","id":"2361837","plugin":"/usr/local/bin/nomad"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"plugin exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-06-12T02:28:35.632337Z","driver":"docker"}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"debug","@message":"image id reference count decremented","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:28:35.654011Z","driver":"docker","image_id":"sha256:6331aa0368327f0829316ca10b45844eb7303caf5f63cc979063df0ef7ffca01","references":0}
Jun 11 23:28:35 HOSTNAME nomad[3215]: {"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-06-12T02:28:35.669645Z","alloc_id":"221a2ab1-43ea-22bb-1616-24d9ce047609"}
Jun 11 23:31:40 HOSTNAME nomad[3215]: {"@level":"debug","@message":"cleanup removed downloaded image","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:31:40.406819Z","driver":"docker","image_id":"sha256:31d1924d3c7e119a450ce9d0d2fbd417a1aba168781625acbc7bf77b398cee33"}
Jun 11 23:31:41 HOSTNAME nomad[3215]: {"@level":"debug","@message":"cleanup removed downloaded image","@module":"client.driver_mgr.docker","@timestamp":"2024-06-12T02:31:41.201500Z","driver":"docker","image_id":"sha256:6331aa0368327f0829316ca10b45844eb7303caf5f63cc979063df0ef7ffca01"}

From what I can tell, all the logs from ~23:28:35 onwards are due to the failed allocation, and there is nothing in the agent's logs about the pending allocation.

Generally speaking, forcibly killing Nomad (we have draining enabled) tends to fix these issues, but these state inconsistency bugs are terrifying.

jrasell commented 2 months ago

Hi @lattwood, and thanks for raising this issue with the detail included. I'll add this to our backlog to investigate further. In the meantime, if you discover any other information or a concrete reproduction, please let us know.

lattwood commented 2 months ago

More log spelunking: after the following shows up in the logs on an agent/client, no further alloc updates occur, and the drain issue with the pending allocs starts occurring as well.

Jun 11 14:50:29 HOSTNAME nomad[2835]: {"@level":"debug","@message":"allocation updates applied","@module":"client","@timestamp":"2024-06-11T14:50:29.411617Z","added":0,"errors":0,"ignored":4,"removed":0,"updated":0}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"yamux: keepalive failed: i/o deadline reached","@module":"client","@timestamp":"2024-06-11T14:50:50.303589Z"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"yamux: Failed to read stream data: read tcp [IPv6_ADDRESS_1]:42188-\u003e[IPv6_ADDRESS_2]:4647: use of closed network connection","@module":"client","@timestamp":"2024-06-11T14:50:50.303834Z"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.304920Z","error":"rpc error: EOF","rpc":"Node.UpdateAlloc","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.305307Z","error":"rpc error: EOF","rpc":"Node.UpdateAlloc","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error updating allocations","@module":"client","@timestamp":"2024-06-11T14:50:50.305578Z","error":"rpc error: EOF"}
Jun 11 14:50:50 HOSTNAME nomad[2835]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-11T14:50:50.305401Z","error":"rpc error: EOF","rpc":"Alloc.GetAllocs","server":{"IP":"IPv6_ADDRESS_2","Port":4647,"Zone":""}}
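
For illustration, here is a minimal standalone sketch (not Nomad code) of the failure mode these logs point at: a Read on a github.com/hashicorp/yamux stream has no deadline unless the caller sets one, so once the peer goes silent the call can block indefinitely.

// Minimal sketch: a yamux stream Read with no deadline blocks until data
// arrives or the stream is torn down. Standalone example, not Nomad code.
package main

import (
	"fmt"
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func main() {
	clientConn, serverConn := net.Pipe()

	// Server side: accept the stream but never write to it, standing in for
	// a peer that has stopped responding.
	go func() {
		sess, err := yamux.Server(serverConn, yamux.DefaultConfig())
		if err != nil {
			return
		}
		if _, err := sess.AcceptStream(); err != nil {
			return
		}
		select {} // hold the stream open without ever writing
	}()

	sess, err := yamux.Client(clientConn, yamux.DefaultConfig())
	if err != nil {
		panic(err)
	}
	stream, err := sess.OpenStream()
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		// No read deadline is set, so this blocks until the peer sends data
		// or the session is closed -- potentially forever.
		buf := make([]byte, 1)
		_, _ = stream.Read(buf)
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("read returned")
	case <-time.After(3 * time.Second):
		fmt.Println("read still blocked after 3s")
	}
}

The goroutine dump in the next comment has exactly this shape: a client goroutine parked in a select inside yamux.(*Stream).Read.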
lattwood commented 2 months ago

Grabbed a goroutine stack dump and found a clue.

The same node is blocked here, and has been for 1489 minutes, which puts the start of the block just after the "error performing RPC to server" errors in the comment above: https://github.com/hashicorp/yamux/blob/6c5a7317d6e3b6e7f85db696d6c83ed353e7cb4c/stream.go#L145-L153

Maybe timeouts aren't getting set on retries? idk.

edit: the deserialization methods for a jobspec are also evident in the stack trace.

Jun 12 15:40:35 HOSTNAME nomad[2835]: goroutine 254 gp=0xc0006888c0 m=nil [select, 1489 minutes]:
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.gopark(0xc0024a8d00?, 0x2?, 0x25?, 0xa9?, 0xc0024a8c94?)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/proc.go:402 +0xce fp=0xc0024a8b38 sp=0xc0024a8b18 pc=0x44630e
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.selectgo(0xc0024a8d00, 0xc0024a8c90, 0xc003ed46f0?, 0x0, 0xc0024a8cd8?, 0x1)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/select.go:327 +0x725 fp=0xc0024a8c58 sp=0xc0024a8b38 pc=0x458385
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/yamux.(*Stream).Read(0xc0032fa000, {0xc002cc1000, 0x1000, 0xc001cc0b00?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/yamux@v0.1.1/stream.go:145 +0x40e fp=0xc0024a8d50 sp=0xc0024a8c58 pc=0xdb2ace
Jun 12 15:40:35 HOSTNAME nomad[2835]: bufio.(*Reader).Read(0xc002081e00, {0xc001cc0b00, 0xb, 0xdf9db0?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         bufio/bufio.go:241 +0x197 fp=0xc0024a8d88 sp=0xc0024a8d50 pc=0x6ca857
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.decReadFull({0x3970800, 0xc002081e00}, {0xc001cc0b00, 0xb, 0x40})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/go-msgpack@v1.1.6-0.20240304204939-8824e8ccc35f/codec/decode.go:3086 +0xa3 fp=0xc0024a8dd8 sp=0xc0024a8d88 pc=0xdfd303
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.(*ioDecReader).readb(0xc0035bde00, {0xc001cc0b00, 0xb, 0x7ed07119d5b8?})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/go-msgpack@v1.1.6-0.20240304204939-8824e8ccc35f/codec/decode.go:422 +0x4d fp=0xc0024a8e28 sp=0xc0024a8dd8 pc=0xdf1aed
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/go-msgpack/codec.(*decReaderSwitch).readbIO(0xc002081e00?, {0xc001cc0b00?, 0x0?, 0x17?})
...
elided for brevity
...
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).rpc(0xc00091f888, {0x3112c54, 0xf}, {0x2f9c6c0, 0xc000485d90}, {0x2c6b860, 0xc0008f1860})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:104 +0x23e fp=0xc0024a9988 sp=0xc0024a9818 pc=0x1dcdb9e
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).RPC(...)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:61
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).rpc(0xc00091f888, {0x3112c54, 0xf}, {0x2f9c6c0, 0xc000485d90}, {0x2c6b860, 0xc0008f1860})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:154 +0xb2d fp=0xc0024a9af8 sp=0xc0024a9988 pc=0x1dce48d
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).RPC(...)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/rpc.go:61
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).watchAllocations(0xc00091f888, 0xc0008f1740)
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:2395 +0x96e fp=0xc0024a9fc0 sp=0xc0024a9af8 pc=0x1dacf6e
Jun 12 15:40:35 HOSTNAME nomad[2835]: github.com/hashicorp/nomad/client.(*Client).run.gowrap1()
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:1850 +0x25 fp=0xc0024a9fe0 sp=0xc0024a9fc0 pc=0x1da9d65
Jun 12 15:40:35 HOSTNAME nomad[2835]: runtime.goexit({})
Jun 12 15:40:35 HOSTNAME nomad[2835]:         runtime/asm_amd64.s:1695 +0x1 fp=0xc0024a9fe8 sp=0xc0024a9fe0 pc=0x47ebe1
Jun 12 15:40:35 HOSTNAME nomad[2835]: created by github.com/hashicorp/nomad/client.(*Client).run in goroutine 247
Jun 12 15:40:35 HOSTNAME nomad[2835]:         github.com/hashicorp/nomad/client/client.go:1850 +0x86
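
To make the "timeouts aren't getting set" idea concrete, here is a hedged sketch of a caller-side guard: yamux streams implement net.Conn, so a read can be bounded with SetReadDeadline. readWithTimeout is a hypothetical helper used only for illustration; this is not Nomad's RPC code or the eventual fix.

// Hedged sketch of bounding a stream read with a deadline. readWithTimeout is
// a hypothetical helper, not a Nomad or yamux API.
package main

import (
	"fmt"
	"net"
	"time"
)

func readWithTimeout(stream net.Conn, buf []byte, timeout time.Duration) (int, error) {
	if err := stream.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	n, err := stream.Read(buf)
	// Clear the deadline so later reads on the same stream are unaffected.
	if derr := stream.SetReadDeadline(time.Time{}); derr != nil && err == nil {
		err = derr
	}
	return n, err
}

func main() {
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()

	// Nothing ever writes to the other end, so an unguarded Read would block
	// forever; with the deadline it returns a timeout error instead.
	buf := make([]byte, 16)
	_, err := readWithTimeout(a, buf, 2*time.Second)
	fmt.Println("read error:", err)
}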
lattwood commented 2 months ago

@jrasell I found the bug, it was in yamux. PR: https://github.com/hashicorp/yamux/pull/127

lattwood commented 2 months ago

I'm suspecting the whole pending-allocations issue is caused by this yamux bug.

jrasell commented 2 months ago

Hi @lattwood, amazing work and thanks so much for digging into this and raising a PR. I'll poke around internally and see who I need to get as reviewers on the PR.

lattwood commented 2 months ago

@jrasell good stuff. I'm likely going to be cutting a 1.7.7-dev or 1.7.8 build with the yamux change and rolling it out on our side today, once I change the return to a goto.
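
For anyone wanting to do the same, building Nomad against a patched yamux generally comes down to a replace directive in Nomad's go.mod before running the usual build. The path below is a placeholder for a local checkout containing the fix, not the actual fork used here.

// In Nomad's go.mod -- sketch only; "../yamux" is a placeholder for a local
// checkout of yamux with the patch applied (a fork module path and
// pseudo-version would also work).
replace github.com/hashicorp/yamux => ../yamux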

lattwood commented 2 months ago

Just added some more bug/correctness fixes to the PR.

lattwood commented 2 months ago

https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork

I threw a build up there with the yamux fix.

lattwood commented 2 months ago

Confirmed that hashicorp/yamux#127 fixes the issue.

We'll be running a forked Nomad build until there's a release of Nomad with the yamux fix in it.

edit: idk if I'm out of line for saying this, but the yamux fix probably should be backported everywhere it's in use 😅

schmichael commented 3 weeks ago

Did you have TLS configured for Nomad's RPC?

lattwood commented 3 weeks ago

@schmichael yup. (our nomad agents are distributed around the world)

lattwood commented 2 weeks ago

Just saw another issue opened on yamux this week that could also be responsible for what we're seeing here: https://github.com/hashicorp/yamux/issues/133

lattwood commented 2 days ago

hmm, i don't think so? heh.

Replying by email to the notification from Michael Schurter:

Closed #23305 (https://github.com/hashicorp/nomad/issues/23305) as completed via hashicorp/yamux#127 (https://github.com/hashicorp/yamux/pull/127).


schmichael commented 2 days ago

hmm, i don't think so? heh.

:facepalm: Sorry. Evidently GitHub treats "Should fix #..." as meaning "Absolutely does fix #...". Reopening.