libp2p / go-libp2p

libp2p implementation in Go
MIT License

Bug: Graceful shutdown issues #2968

Closed burdiyan closed 1 month ago

burdiyan commented 1 month ago

We recently started facing issues with graceful shutdown in our app. After receiving a termination signal, the app hangs and never exits until it is forcefully shut down.

After spending some time debugging, I've found out that this place in libp2p never returns:

https://github.com/libp2p/go-libp2p/blob/v0.36.3/config/host.go#L28

To clarify, we are using libp2p with AutoRelay, HolePunching, DHT, and other things. The node needs to run for a while before this problem occurs. I suspect that it could be AutoRelay that's causing this, because the problem starts occurring after AutoRelay starts doing periodic relay finding.

So, closableRoutedHost.Close() gets called, but the underlying fx.App's Stop method never returns.

burdiyan commented 1 month ago

It's not easy for me to provide a clean reproduction for this, but you could clone this repo: https://github.com/seed-hypermedia/seed and run go run ./backend/cmd/seed-daemon. Leave it running for a while (until the periodic AutoRelay logs appear), then press Ctrl+C: the logs show that shutdown started, but it gets stuck.

Doing some very tedious and manual debugging I figured out that it gets stuck in the place I shared previously.

sukunrt commented 1 month ago

Can you check if setting the environment variable GODEBUG="asynctimerchan=1" fixes the issue? It's probably because of https://github.com/golang/go/issues/69312

Alternatively, you can change your go version in your go.mod to go1.22.
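
Since the workaround depends on an environment variable that is easy to forget in a deployment, a startup check can make its absence visible. The sketch below is an assumption-laden illustration: godebugHas is a hypothetical helper, it only inspects the GODEBUG environment variable (it cannot see settings applied in other ways, such as through the go.mod go version), and it assumes the last occurrence of a key wins if the key is repeated.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// godebugHas reports whether the comma-separated GODEBUG string sets
// key to want. If key appears more than once, the last occurrence is
// used (an assumption about precedence made for this sketch).
func godebugHas(godebug, key, want string) bool {
	found := false
	for _, kv := range strings.Split(godebug, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(kv), "=")
		if ok && k == key {
			found = v == want
		}
	}
	return found
}

func main() {
	// Only the environment is checked here.
	if !godebugHas(os.Getenv("GODEBUG"), "asynctimerchan", "1") {
		fmt.Fprintln(os.Stderr, "warning: GODEBUG=asynctimerchan=1 is not set; shutdown may hang on affected Go 1.23 releases")
	}
}
```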

vyzo commented 1 month ago

I think I found a solution for the timer problem (will make pr for pubsub as well):

if !timer.Stop() {
  select {
  case <-timer.C:
  default:
  }
}

burdiyan commented 1 month ago

@sukunrt Oooh, I see. Unfortunately I can't use Go 1.22 at this point, because I'm already using iterators in some places :)

I think the solution @vyzo proposes could work. I remember doing something similar in my own code at some point.

marten-seemann commented 1 month ago

@vyzo I'd advise against making any changes to production code. This was a Go bug and is going to get fixed in Go 1.23.2. Just use the GODEBUG setting @sukunrt mentioned for now.

burdiyan commented 1 month ago

@sukunrt can you point me to the exact timer that could be causing the shutdown issues?

burdiyan commented 1 month ago

Confirming that running with GODEBUG="asynctimerchan=1" fixes the problem for me.

sukunrt commented 1 month ago

@vyzo that solution is racy for versions <= go1.22.

if !timer.Stop() {
  select {
  case <-timer.C:
  default:
  }
}

When timer.Stop returns false, it doesn't mean the value has already been pushed to the channel. It only means that Stop didn't prevent the timer from firing; the value may already be available in the channel, or it may be pushed shortly afterwards.

vyzo commented 1 month ago

ok, fair enough; let's wait for the upstream fix then.

sukunrt commented 1 month ago

> @sukunrt can you point me to the exact timer that could be causing the shutdown issues?

One is in quic-go: see https://github.com/quic-go/quic-go/pull/4659

One is in autonat: https://github.com/libp2p/go-libp2p/blob/master/p2p/host/autonat/autonat.go#L221

I'm sure there are some others in go-libp2p and the dependencies.

I'm keeping this issue open. I'll add some text in the next patch release regarding this, and close the issue.

vyzo commented 1 month ago

there is one in pubsub too

sukunrt commented 1 month ago

fixed by v0.36.4