vacp2p / nim-libp2p

libp2p implementation in Nim
https://vacp2p.github.io/nim-libp2p/docs/
MIT License
251 stars 55 forks source link

Hard dependency on identify protocol? #924

Open thomaseizinger opened 1 year ago

thomaseizinger commented 1 year ago

Hi,

I did some work on the rust-yamux implementation lately and when trying to integrate it with rust-libp2p, I ran into interop errors with nim-libp2p: https://github.com/libp2p/rust-libp2p/pull/3013

After some debugging and comparing of before and after outputs of our logs, I suspect that nim-libp2p has a hard dependency on the identify protocol. That is, if the other node does not speak the identify protocol, nim-libp2p will refuse to open streams of other protocols too.

In my testing, I observed that nim-libp2p would open a first stream with the identify protocol and in previous versions then follow-up with a stream for the ping protocol.

There also seems to be a race condition here. My guess is that my work on the rust-yamux implementation reduced the latency and thus we answer the initial multistream-select handshake for the identify protocol faster than before.

In summary, the behaviour seems to be:

Note that outbound pings from the rust-libp2p binary were working fine this entire time.

For now, the problem is resolved by adding the identify protocol to our interop test binary. I do think that there are multiple bugs in nim-libp2p here:

  1. Failing to open a stream for a protocol should always result in an error observable by the user
  2. Identify not being supported by the other node should be a no-op. In my view (YMMV), no protocol of the libp2p spec should be considered "mandatory".
  3. What identify reports should at most be treated advisory. Again, YMMV but I think it should be possible to "force" execution of a protocol. Only the multistream-select handshake will tell whether a protocol is actually supported.
Menduist commented 1 year ago

Thanks for the report, was that happening with nim-libp2p as the dialer or receiver? We do have a strict dep on identify for now, it will be removed later, but should still be handled gracefully

EDIT: regarding point 3, we most of the time rely on multistream-select to know if a protocol is supported. Identify is just mandatory for legacy reasons, not that the values are actually required for libp2p

thomaseizinger commented 1 year ago

Thanks for the report, was that happening with nim-libp2p as the dialer or receiver?

As the dialer. Listener was working fine!

Menduist commented 1 year ago

Turns out we do not have a hard dep on identify: https://github.com/status-im/nim-libp2p/blob/c45f9705ab292ac81d65feae03802d916c36e38d/libp2p/peerstore.nim#L206-L217 (read: if we can't negotiate the stream, just close it & keep on going)

Most likely, the actual issue is just a race condition which doesn't happen with identify enabled, because identify is delaying the connection upgrading process

thomaseizinger commented 1 year ago

Okay, thanks for debugging! Should identify be removed from the interop test then?

Menduist commented 1 year ago

I would like to understand what is going on, but it seems no one has succeeded to reproduce the issue outside of CI at this point

thomaseizinger commented 1 year ago

I would like to understand what is going on

I think that would be good!

May I suggest the following:

This would unblock other implementations and allows you to iterate on the problem without any pressure. Perhaps long-term, we shouldn't even add it back given that we just want to test the ping protocol.

Menduist commented 1 year ago

Removing identify on our side will not fix anything, the only reason why currently it works with identify is probably thanks to the extra round trip required for identify, which gives enough time to avoid the race condition

However, this test was working for months, and started to break when js-libp2p & rust-libp2p tweaked their Yamux implementation. While it's likely that the race condition is on our side, it's also possible that the recent changes broke something And wherever the bug is, it might cause issues in live networks

I apologize for the delay caused in rust & js PRs being merged, but we evidently at this point have no idea what is going on, and I'd rather stay on the safe side. I'm looking into this actively

thomaseizinger commented 1 year ago

Okay, I had assumed there was no race condition with identify disabled :)

There is no rush from our end, I am happy to merge the PR in rust-libp2p with identify added and remove it later once the race condition is resolved.

Let me know if you need any help debugging rust-yamux.