libp2p / go-libp2p

libp2p implementation in Go
MIT License
6.1k stars 1.07k forks source link

Issues with identify: identify failed to complete #2983

Open vyzo opened 1 month ago

vyzo commented 1 month ago

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indication that identified failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.

Relevant logs:

2024-09-27T12:48:24.680Z    DEBUG   basichost   basic/basic_host.go:483 negotiated: /ipfs/id/push/1.0.0 (took 27.52µs)
2024-09-27T12:48:24.680Z    DEBUG   net/identify    identify/id.go:543  /ipfs/id/push/1.0.0 received message from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT /ip4/45.139.212.29/udp/17846/quic-v1
2024-09-27T12:48:30.475Z    DEBUG   basichost   basic/basic_host.go:483 negotiated: actor/root/messages/0.0.1 (took 48.66µs)
2024-09-27T12:48:30.479Z    DEBUG   actor   actor/dispatch.go:205   dispatching message from {BAAREICLTJJ63SCP5MUFVGX7GUM42QL2I5E46I355CASAEDIL55VX4SHTA====== did:key:z6MkjYP7Q8m862LdqNtExNMcH1BzgAkYTVp6QC5gHesktKbu {12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT root}} to /public/hello
2024-09-27T12:48:30.479Z    DEBUG   node    node/public.go:59   hello from 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT
2024-09-27T12:48:30.479Z    DEBUG   swarm2  swarm/swarm.go:475  [12D3KooWFcvZkrs1LTQCdGEVPYi7diCL76Z9f1LcgcMMspujKWTj] opening stream to peer [12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT]
2024-09-27T12:48:33.450Z    WARN    libp2p  libp2p/libp2p.go:380    send: failed to open stream to peer 12D3KooWEuVEZ26PwP7f8mYxuQ97ChiqLvXL5JeDfmXmqN4ds3YT: identify failed to complete: context deadline exceeded
Version Information
    github.com/libp2p/go-libp2p v0.36.4
vyzo commented 1 month ago

I have worked around it by manually making streams and skipping the identify wait, but there certainly seem to be a bug in identify here.

sukunrt commented 1 month ago

Is there any more info you can provide? A trivial connect -> disconnect -> connect test doesn't reproduce this for me.

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

Do you mean when you open a stream to the peer after reconnecting, the error is identify failed to complete?

I see DEBUG log about NewStream failing with id failed to complete Do you get any DEBUG logs from the identify package which are relevant to the peer?

vyzo commented 1 month ago

I am trying to understand the problem better, hipefully i can get you a good log package to diagnose this.

But yes, we fail to open streams because identify fails to complete.

Wondertan commented 1 month ago

We observed similar behavior when event bus subscriptions were not read fast enough on our side. A client connects and initiates identify; a server processes a new connection in the swarm and blocks never reaching start and thus not processing identify streams. JFYI

MarcoPolo commented 1 month ago

when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream

I'm not sure I parse this sentence correctly. I'm understanding it as meaning:

  1. Peer A is connected to peer B.
  2. Peer A disconnects to peer B; then reconnects to peer B.
  3. Peer B attempts to open a stream to peer A.
  4. Peer B fails to open a stream.

Is that understanding correct? Or is it instead:

  1. Peer A is connected to peer B.
  2. Peer A disconnects to peer B; then reconnects to peer B.
  3. Peer A attempts to open a stream to peer B.
  4. Peer A fails to open a stream.

The issue kind of sounds like we aren't picking the best conn, and then we try to identify on it.

vyzo commented 1 month ago

It is the first scenario.

MarcoPolo commented 1 month ago

ah okay. That makes sense. The issue is probably that Peer B doesn't realize that the "best connection" it picked is actually closed/disconnected. So it times out on waiting for that connection to Identify (it never will).

We can be smarter here and interrupt with an even better connection if a new one appears.

vyzo commented 1 month ago

Maybe we should have a collective identify completion channel for the peer, and not one for each conn.

Stebalien commented 1 month ago

This relates to #2355 and attaching protocol information to connections instead of peers.

MarcoPolo commented 3 weeks ago

fyi, I have a branch I'm working on that should solve this issue and improve the best connection logic. The basic idea is to create a small new service that subscribes to Identify events and fulfills request to return the best connection for a peer that supports a given protocol and other criteria (e.g. is it a limited connection? Prefer a connection with existing streams). I'll get it pushed soon after the next go-libp2p release.

MarcoPolo commented 1 week ago

These comments together:

The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indication that identified failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.

and

I have worked around it by manually making streams and skipping the identify wait

Makes me rule out that this is only an issue with a dropped connection, like what I mentioned. Since in that case the workaround of manually making streams should also not work.

My current theory is that this is related to the eventbus being stalled as @Wondertan points out. It would be good to revisit https://github.com/libp2p/go-libp2p/issues/2361. The main argument against it was "We already have metrics, we don't need metrics AND logs." Looking at this again now, I still think having logs in addition would be nice, since setuping grafana and prometheus is non-trivial (and a big ask of our users), and issues like these should be easier to debug.

I'll make a PR to add this logging. I'll tag vyzo to try it and see if that is indeed their issue.