hashicorp / yamux

Golang connection multiplexing library
Mozilla Public License 2.0

Streams should check for Session shutdown when waiting for data & clean up timers #127

Closed lattwood closed 3 days ago

lattwood commented 2 months ago

Stream.Close is called when the Session's keepalive fails. This call wakes up Stream.Read & Stream.Write waiters. The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are unbuffered channels.

In Stream.Read this send can be attempted between the call to s.stateLock.Unlock() and the select statement at the bottom of the method. When this occurs, if deadlines haven't been set on the Stream in question, the call to Stream.Read will block forever.

By adding Session.shutdownCh to the channels in the select statement, this coordination gap is addressed.

This uses goto START instead of return 0, io.EOF, since returning immediately could leave a final read sitting in the buffer in the Read case.
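To sketch the shape of the change (purely illustrative; this is not the actual yamux code, and the helper and channel names here are stand-ins): the wait select gains a case on the session's shutdown channel, and both wakeups loop back to re-check stream state so a final buffered read is still delivered before EOF.

```go
package main

import (
	"fmt"
	"io"
	"sync/atomic"
	"time"
)

// readWait mimics the wait at the bottom of a Stream.Read-style method.
// Without the shutdownCh case, a notification that races with the state check
// can be missed, leaving the reader blocked forever once the session dies.
func readWait(buffered func() int, closed func() bool, recvNotifyCh, shutdownCh <-chan struct{}) (int, error) {
START:
	if n := buffered(); n > 0 {
		return n, nil // always drain buffered data before reporting EOF
	}
	if closed() {
		return 0, io.EOF
	}
	select {
	case <-recvNotifyCh: // normal wakeup: new data or a stream state change
	case <-shutdownCh: // the fix: also wake when the whole session shuts down
	}
	goto START // re-check state instead of returning straight away
}

func main() {
	recvNotifyCh := make(chan struct{}, 1)
	shutdownCh := make(chan struct{})
	var streamClosed atomic.Bool

	// Simulate a failed keepalive: the stream is force-closed and the session
	// shutdown channel is closed, but the reader never sees the notify send.
	go func() {
		time.Sleep(50 * time.Millisecond)
		streamClosed.Store(true)
		close(shutdownCh)
	}()

	n, err := readWait(func() int { return 0 }, streamClosed.Load, recvNotifyCh, shutdownCh)
	fmt.Println(n, err) // 0 EOF
}
```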

Should fix hashicorp/nomad#23305. It will definitely fix a bug I haven't filed yet, where allocations get stuck in the pending state until the Nomad Agent is restarted.


The issue was discovered by sending a "broken" Nomad Agent a SIGQUIT and seeing that the goroutine responsible for fetching allocation updates from the Nomad Servers had been blocked in the select statement at the bottom of Stream.Read for > 24 hours (more precisely, for however long the allocation had been in the pending state), with a lot of msgpack and jobspec-field frames in the middle of the stack trace.

lattwood commented 2 months ago

~I need to update this PR; I just realized the goto implementation would protect against missing a final read in the happy case on shutdown.~

Done

lattwood commented 2 months ago

One thing that does alarm me is how Stream.recvNotifyCh is used.

It's only ever blocked on in Read. The only place that should need to unblock it is in Stream.recvData, called by Session.handleStreamMessage when it's de-multiplexing the connection into individual streams.

But it's also used for...

lattwood commented 2 months ago

PR feedback addressed, I think.

lattwood commented 2 months ago

https://github.com/lattwood/nomad/releases/tag/v1.7.8-yamux-fork

For anyone following along at home, there's a zip with a Linux amd64 Nomad build with this fix.

If you're paranoid, however, I recommend just building it yourself.

lattwood commented 2 months ago

~So all hell broke loose while rolling out the custom Nomad binary.~

Now that things have stabilized, I'm going to see if this issue continues to appear.

lattwood commented 2 months ago

All hell broke loose due to a provider network outage.

After 24 hours with this custom fork of Nomad deployed to both a ~100 node cluster, and a ~2300 node cluster, we do not have any stuck/pending allocs.

Confirmed, this fixes hashicorp/nomad#23305.

lattwood commented 2 months ago

*Correction: it reduces the frequency of the bug, but doesn't fully fix it.

still happening, sadly, and the output of killall -SIGQUIT nomad points to a hung Read on the yamux tunnel again.

When I'm back from vacation there's a good chance I'm going to just shoehorn a timer of some kind into yamux's Read method to get it out of that select statement at the bottom of the method.
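To sketch that idea (hypothetical; nothing here is yamux code, and the names are made up): bound the wait with a timer, and stop the timer on the way out so it doesn't leak.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitWithTimeout bounds a Read-style wait so it can't hang forever even if a
// wakeup is missed. The timer is stopped on return so it doesn't leak.
func waitWithTimeout(notifyCh, shutdownCh <-chan struct{}, max time.Duration) error {
	t := time.NewTimer(max)
	defer t.Stop()
	select {
	case <-notifyCh:
		return nil
	case <-shutdownCh:
		return errors.New("session shutdown")
	case <-t.C:
		return errors.New("timed out waiting for stream data")
	}
}

func main() {
	// Nothing ever notifies or shuts down here, so the timer fires.
	err := waitWithTimeout(make(chan struct{}), make(chan struct{}), 50*time.Millisecond)
	fmt.Println(err)
}
```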

Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"yamux: keepalive failed: i/o deadline reached","@module":"client","@timestamp":"2024-06-26T17:17:21.287415Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"yamux: Failed to read stream data: read tcp [IPv6]:4647: use of closed network connection","@module":"client","@timestamp":"2024-06-26T17:17:21.288316Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"warn","@message":"yamux: failed to send go away: session shutdown","@module":"client","@timestamp":"2024-06-26T17:17:21.301811Z"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322083Z","error":"rpc error: EOF","rpc":"Node.Register","server":{"IP":"2a05:IPV6","Port":4647,"Zone":""}}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server which is not safe to automatically retry","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322195Z","error":"rpc error: EOF","rpc":"Node.Register","server":{"IP":"2a05:IPV6","Po>
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error registering","@module":"client","@timestamp":"2024-06-26T17:17:21.322214Z","error":"rpc error: EOF"}
Jun 26 17:17:21 HOSTNAME nomad[398995]: {"@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-06-26T17:17:21.322165Z","error":"rpc error: EOF","rpc":"Alloc.GetAllocs","server":{"IP":"2a05:IPv6","Port":4647,"Zone":""}}

Then the node gets marked as Down in Nomad, but the live CPU utilization stuff continues to get reported to the UI.

schmichael commented 2 months ago

Thanks for the update @lattwood! I just wanted to let you know we still intend to merge this PR after some more internal validation across teams since many HashiCorp products rely on yamux. We're also planning on at least improving the archaic infrastructure to pass go test -race and go vet.

Lots of vacations this time of year for us as well, so it will probably be a couple weeks before we're able to merge this.

schmichael commented 1 month ago

Sorry for the delay! I've finally been testing this patch. Playing around for a while with tc and nodesim, I haven't discovered any problems with the patch, but unfortunately I haven't reproduced the original Nomad issue either. I'm not too surprised, as there are a lot of variables at play, and reproducing production network/compute conditions that trigger race conditions is extremely challenging.

The fact that this patch seems to have solved your issue makes me lean toward merging this even if we can't come to a satisfactory explanation of why.

> The wakeup is via a non-blocking send to recvNotifyCh/sendNotifyCh, which are **unbuffered** channels.

(Emphasis mine) These are buffered channels: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L75-L76

I don't see a way for Stream.Read to miss a send on recvNotifyCh as there's only one receive: https://github.com/hashicorp/yamux/blob/v0.1.1/stream.go#L146

I know it's been a long time, but do you still have a goroutine dump handy? Is it possible another goroutine was blocked in session.go for a similar amount of time? It seems likely not, but perhaps there's a way Session.Close() can get deadlocked between close(s.shutdownCh) (which your PR would now check and avoid the deadlock) and stream.forceClose() (which should always send on recvNotifyCh in a way that unblocks Stream.Read()).

I tried to write a test to exercise the idea that recvNotifyCh alone is insufficient to notice disconnects, but I can't get it to fail on v0.1.1 as I would expect. Any chance you could take a look and see if you can find a way to make it fail? https://gist.github.com/schmichael/eae76f127aa31475fde152a47ef4a38b

I sprinkled some fmt.Println()s into Stream.Read() as well to try to ensure the Read was waiting before I closed the Session, but I might be missing something.

schmichael commented 1 month ago

Argh sorry that #130 caused a merge conflict. Tests now pass with the race detector enabled, and we finally enabled GHA CI!

We've been testing your patch more internally, so I wanted to leave a quick update:

  1. My test above tried to exercise the bug by explicitly calling Session.Close, but @rboyer pointed out a missed yamux heartbeat is probably key to exercising the bug.
  2. @rboyer has observed a (the?) bug in his own test, but not consistently yet. So far it appears this patch fixes it!
  3. #128 may be relevant here. If Nomad is doing that, then that's a bug in Nomad we need to fix. I intend to dig into that a bit. That being said, even if there's a bug due to yamux misuse in the Nomad code base, I'm not opposed to merging a patch in yamux that makes the misuse less pathological.

I'm hoping to get this all finished up this week so it makes it into the next Nomad patch release.

lattwood commented 2 weeks ago

@schmichael so, without revealing too much, what we're doing with Nomad regularly sees us have our hosting providers firewall off the Nomad Masters and the Nomad Agents for upwards of 30 minutes at a time. Agents can initiate TCP connections to the Masters when this happens, but all traffic from the Masters to the Agents becomes blocked. So the client SYN makes it to the server, but the server SYN-ACK doesn't leave the provider's network.

I think that could actually manifest as io.ReadFull blocking on the line in question, since it's reading from a bufio.Reader wrapped around the conn passed into newSession, which comes from the io.ReadWriteCloser passed to Server or Client. If no deadline is set on the underlying net.Conn, then when that firewalling of traffic occurs, the read would block at least until the OS realizes the other side of the TCP connection has gone away.

So I think you're onto something here.
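To make that concrete, here's a standalone sketch (using net.Pipe as a stand-in for the real transport; nothing here is Nomad or yamux code) showing how a read deadline turns an otherwise indefinite io.ReadFull into a timeout error the session loop could act on:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

func main() {
	// net.Pipe gives an in-memory net.Conn pair; the silent end stands in for
	// a peer whose return traffic the provider's firewall is dropping.
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	hdr := make([]byte, 12) // yamux frames start with a 12-byte header

	// Without a deadline this io.ReadFull would block until the OS (or the
	// peer) finally tore the connection down; with a read deadline it returns
	// a timeout error the caller can act on.
	if err := client.SetReadDeadline(time.Now().Add(200 * time.Millisecond)); err != nil {
		panic(err)
	}
	_, err := io.ReadFull(client, hdr)
	nerr, ok := err.(net.Error)
	fmt.Println(err, ok && nerr.Timeout()) // a timeout error, true
}
```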


With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux.

I'll see about grabbing some goroutine stacktraces the next time I'm kicking nodes.

And I'll get this PR's conflicts addressed.

lattwood commented 2 weeks ago

~ > nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | wc -l
      37

How many do ya want? :D

schmichael commented 2 weeks ago

> With all that said, we are still seeing the issue occur periodically, but now we don't have such a smoking gun in the form of yamux.

Hm, phooey. Is it significantly better than without the yamux patch?

> How many do ya want? :D

Working on a distributed system written in Go means the team is pretty used to combing through huge mountains of goroutines, so share whatever you're able. :grin:

lattwood commented 2 weeks ago

It has been better. Here's (what I think is) the relevant snippet from the goroutine stacks-

[image: goroutine stack snippet]

blocked on a read due to no io deadline? and this is the goroutine running watchAllocations

lattwood commented 2 weeks ago

@schmichael stacks sent.

It's a fork of Nomad 1.7.7 using this version of yamux- https://github.com/lattwood/yamux/tree/shutdown_edgecase (not the rebased branch for this PR)

```bash
#!/usr/bin/env bash
# get_stack.sh: grab a goroutine dump from one node. SIGQUIT makes the Go
# runtime print every goroutine's stack before the process exits; the dump
# lands in the journal, which we then pull back locally.

set -euo pipefail
IFS=$'\n\t'

target=$1
ssh "root@${target}" 'killall -SIGQUIT nomad && sleep 3'
ssh "root@${target}" 'journalctl --unit nomad.service --since "5 minutes ago" > gostacks.txt'
scp "root@${target}:gostacks.txt" "out/${target}_stacks.txt"
```

Run against every node with a pending alloc:

```bash
parallel -j16 './get_stack.sh {}' ::: $(nomad job status -json | jq -r '.[].Allocations[] | select(.ClientStatus == "pending") | .NodeName' | sort -u)
```

lattwood commented 2 weeks ago

PR rebased against hashicorp/yamux/main

lattwood commented 2 weeks ago

(this patch does reduce the frequency of it getting blocked tho)

lattwood commented 1 week ago

@schmichael I think this is good to merge at this point?

Also, any chance you happened to look at that yamux fork mentioned on the Nomad issue about potentially swapping out for SPDY, in terms of compatibility issues? If it doesn't seem like a horrible idea, I think we could be open to running a Nomad fork on our end to kick the tires on it as it relates to solving the issue we're experiencing.

lattwood commented 5 days ago

Friendly poke @schmichael

schmichael commented 3 days ago

Peeked at a few of the goroutine dumps you sent, and something is clearly blocking Session.recvLoop in io.ReadFull, which (without this PR) could cause Stream.Read to block even when the session is closed.

Merging! Thanks for your extensive help and even more extensive patience @lattwood. We can continue debugging on the Nomad issue.