Open vyzo opened 1 month ago
I have worked around it by manually making streams and skipping the identify wait, but there certainly seems to be a bug in identify here.
Is there any more info you can provide? A trivial connect -> disconnect -> connect test doesn't reproduce this for me.
when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream
Do you mean that when you open a stream to the peer after reconnecting, the error is "identify failed to complete"?
I see a DEBUG log about NewStream failing with id failed to complete.
Do you get any DEBUG logs from the identify package which are relevant to the peer?
I am trying to understand the problem better; hopefully I can get you a good log package to diagnose this.
But yes, we fail to open streams because identify fails to complete.
We observed similar behavior when event bus subscriptions were not read fast enough on our side. A client connects and initiates identify; the server processes the new connection in the swarm and blocks, never reaching start and thus never processing identify streams. JFYI
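To make the stalled-subscriber scenario concrete, here is a minimal sketch using only the standard library. The `bus` type and its methods are hypothetical stand-ins, not the go-libp2p eventbus API; the assumption is simply that each subscriber has a bounded queue and that the emitter has to wait once that queue is full.

```go
package main

// bus is a hypothetical, simplified event bus with a single subscriber.
// It is NOT the go-libp2p eventbus API; it only models a bounded queue.
type bus struct {
	sub chan string // bounded subscriber queue
}

// newBus creates a bus whose subscriber queue holds up to buf events.
func newBus(buf int) *bus {
	return &bus{sub: make(chan string, buf)}
}

// tryEmit attempts to deliver evt without blocking and reports whether it
// was queued. A false result marks the point where a blocking emitter
// would stall until the subscriber reads from its channel.
func (b *bus) tryEmit(evt string) bool {
	select {
	case b.sub <- evt:
		return true
	default:
		return false
	}
}
```

With a queue of size 1 and a subscriber that never reads, the first event is queued and the second one is refused. In a blocking design, that second emit is where connection handling would hang, and anything scheduled after it (such as serving identify streams) would never run.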
when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream
I'm not sure I parse this sentence correctly. I'm understanding it as meaning:
Is that understanding correct? Or is it instead:
The issue kind of sounds like we aren't picking the best conn, and then we try to identify on it.
It is the first scenario.
ah okay. That makes sense. The issue is probably that Peer B doesn't realize that the "best connection" it picked is actually closed/disconnected. So it times out on waiting for that connection to Identify (it never will).
We can be smarter here and interrupt with an even better connection if a new one appears.
Maybe we should have a collective identify completion channel for the peer, and not one for each conn.
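A rough sketch of what a per-peer completion channel could look like, assuming peers are keyed by a plain string ID. Every name here (`peerIdentify`, `wait`, `complete`, `forget`) is hypothetical and not part of go-libp2p; it only illustrates the idea that identify finishing on any connection releases everyone waiting on that peer.

```go
package main

import "sync"

// peerIdentify keeps one completion channel per peer rather than one per
// connection. All names are hypothetical, not go-libp2p API.
type peerIdentify struct {
	mu   sync.Mutex
	done map[string]chan struct{} // keyed by peer ID
}

func newPeerIdentify() *peerIdentify {
	return &peerIdentify{done: make(map[string]chan struct{})}
}

// chanLocked returns the peer's channel, creating it if needed.
// The caller must hold p.mu.
func (p *peerIdentify) chanLocked(peer string) chan struct{} {
	ch, ok := p.done[peer]
	if !ok {
		ch = make(chan struct{})
		p.done[peer] = ch
	}
	return ch
}

// wait returns a channel that is closed once identify has completed on
// any connection to the peer.
func (p *peerIdentify) wait(peer string) <-chan struct{} {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.chanLocked(peer)
}

// complete marks the peer as identified. It may be called once per
// connection; only the first call closes the channel.
func (p *peerIdentify) complete(peer string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := p.chanLocked(peer)
	select {
	case <-ch: // already closed by an earlier connection
	default:
		close(ch)
	}
}

// forget drops the peer's state, e.g. when its last connection closes,
// so that a later reconnect waits for a fresh identify.
func (p *peerIdentify) forget(peer string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.done, peer)
}

// identified reports, without blocking, whether identify has completed.
func (p *peerIdentify) identified(peer string) bool {
	select {
	case <-p.wait(peer):
		return true
	default:
		return false
	}
}
```

In this shape, a caller blocked on `wait` for a peer whose old connection is dead is still released as soon as identify completes on the new connection, instead of timing out on the stale one.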
This relates to #2355 and attaching protocol information to connections instead of peers.
FYI, I have a branch I'm working on that should solve this issue and improve the best connection logic. The basic idea is to create a small new service that subscribes to Identify events and fulfills requests to return the best connection for a peer that supports a given protocol and meets other criteria (e.g. is it a limited connection? Prefer a connection with existing streams). I'll get it pushed soon after the next go-libp2p release.
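For illustration only, the selection criteria mentioned above could be sketched as follows. This is not the code from that branch: `connInfo`, `bestConn` and `better` are hypothetical names, and the ordering (skip closed connections, require the protocol, prefer unlimited over limited, then prefer more existing streams) is an assumption about how those criteria might be combined.

```go
package main

// connInfo is a hypothetical summary of one connection to a peer.
type connInfo struct {
	id        int
	closed    bool
	limited   bool
	streams   int
	protocols map[string]bool // protocols learned via identify on this conn
}

// better reports whether a should be preferred over b:
// unlimited connections first, then the one with more existing streams.
func better(a, b connInfo) bool {
	if a.limited != b.limited {
		return !a.limited
	}
	return a.streams > b.streams
}

// bestConn returns the preferred open connection that supports proto.
// The second result is false when no connection qualifies.
func bestConn(conns []connInfo, proto string, allowLimited bool) (connInfo, bool) {
	var best connInfo
	found := false
	for _, c := range conns {
		// Never pick a closed connection or one without the protocol.
		if c.closed || !c.protocols[proto] {
			continue
		}
		if c.limited && !allowLimited {
			continue
		}
		if !found || better(c, best) {
			best, found = c, true
		}
	}
	return best, found
}
```

The point of the sketch is the first filter: a closed connection is never a candidate, however many streams it used to carry, so a caller would not end up waiting on identify for a connection that is already gone.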
These comments together:
The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indicating that identify failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.
and
I have worked around it by manually making streams and skipping the identify wait
make me rule out that this is only an issue with a dropped connection, as I suggested earlier, since in that case the workaround of manually making streams should not work either.
My current theory is that this is related to the eventbus being stalled, as @Wondertan points out. It would be good to revisit https://github.com/libp2p/go-libp2p/issues/2361. The main argument against it was "We already have metrics, we don't need metrics AND logs." Looking at this again now, I still think having logs in addition would be nice, since setting up Grafana and Prometheus is non-trivial (and a big ask of our users), and issues like these should be easier to debug.
I'll make a PR to add this logging. I'll tag vyzo to try it and see if that is indeed their issue.
The problem: when I disconnect and reconnect to a peer, the latter finds it impossible to open a stream, with the error indicating that identify failed to complete. This happens consistently in our node, and I can trigger it reliably; so there is some bug related to identify.
Relevant logs:
Version Information