juligasa opened 1 year ago
It sounds like there are 2 issues here:

1. The protocol list for the peer disappearing from the peer store.
2. NewStream failing.

Regarding (1), I suspect that this is due to the cleanup logic, although I can't see any bug. When a peer disconnects (i.e. when all of the (potentially multiple) connections we had with a peer are gone), we clean up the protocol list from the peer store after a short period. You can find the logic here: https://github.com/libp2p/go-libp2p/blob/fc1c9f5d4e7aef936847ffa8915a8eaffb34eba2/p2p/host/pstoremanager/pstoremanager.go#L81.
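For illustration, a condensed sketch of that grace-period pattern, with simplified, assumed names rather than the actual pstoremanager code linked above:

```go
package pstorecleanup

import (
	"sync"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/peerstore"
)

// cleaner condenses the grace-period pattern described above. All names
// are illustrative; the real logic lives in p2p/host/pstoremanager.
type cleaner struct {
	mu     sync.Mutex
	ps     peerstore.Peerstore
	grace  time.Duration
	queued map[peer.ID]*time.Timer
}

// onDisconnected schedules removal of the peer's entries after the grace period.
func (c *cleaner) onDisconnected(p peer.ID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.queued[p] = time.AfterFunc(c.grace, func() {
		c.mu.Lock()
		delete(c.queued, p)
		c.mu.Unlock()
		c.ps.RemovePeer(p)
	})
}

// onConnected cancels a pending removal if the peer comes back in time.
func (c *cleaner) onConnected(p peer.ID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.queued[p]; ok {
		t.Stop()
		delete(c.queued, p)
	}
}
```

Because the Connected/Disconnected notifications are delivered asynchronously over the event bus, the timer can fire after the peer has already reconnected but before onConnected runs; that window matters for the race discussed below.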
The bug here is pretty obvious: we call RemovePeer even if the peer has connected back to us in the meantime. This will leave the peer's list of addresses permanently empty, at least until the next Identify event happens (which, depending on your application, may not happen frequently in practice).
Regarding (2), you're probably running into this error: https://github.com/libp2p/go-libp2p/blob/fc1c9f5d4e7aef936847ffa8915a8eaffb34eba2/p2p/host/basic/basic_host.go#L696. I'm not sure why that happens; the peer should have completed the multistream protocol negotiation.
> The bug here is pretty obvious: we call RemovePeer even if the peer has connected back to us in the meantime. This will leave the peer's list of addresses permanently empty, at least until the next Identify event happens (which, depending on your application, may not happen frequently in practice).
I'm not sure if this is true. We cancel the pending cleanup when the peer reconnects: https://github.com/libp2p/go-libp2p/blob/fc1c9f5d4e7aef936847ffa8915a8eaffb34eba2/p2p/host/pstoremanager/pstoremanager.go#L108-L111
I think there's still a race condition, since eventbus notifications are async. The peer might have reconnected, Identify completed, the timer fired, and RemovePeer called, all before we process the Connected event. We should fix this by actively checking whether the peer is connected before calling RemovePeer.
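A minimal sketch of that guard, assuming the cleanup code can reach a network.Network handle (the actual pstoremanager only holds a peerstore, so plumbing that handle through would be part of the fix; all names here are illustrative):

```go
package pstorecleanup

import (
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/peerstore"
)

// removeIfDisconnected only purges a peer's entries if the peer is still
// disconnected at the moment the cleanup timer fires, instead of trusting
// the (possibly stale) event-bus view of connectedness.
func removeIfDisconnected(net network.Network, ps peerstore.Peerstore, p peer.ID) {
	if net.Connectedness(p) == network.Connected {
		// The peer reconnected while the Disconnected event was still in
		// flight; skip the cleanup so its protocol list survives.
		return
	}
	ps.RemovePeer(p)
}
```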
This sounds like an unlikely event though, and I'm not sure if it explains what you're seeing.
@juligasa How do you know that you had the list of protocols for the peer at one point, and that it then disappeared? Do you have any logs?
In our particular case we always check the protocols right after host.Connect() returns successfully, which is why I think it does not have to do with cleaning up protocols on disconnected peers.
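For reference, a minimal sketch of that flow; names like checkProtocols are illustrative, not our actual code:

```go
package example

import (
	"context"
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/peer"
)

// checkProtocols connects to the given peer and immediately asks the
// peerstore for its protocol list, mirroring the flow described above.
func checkProtocols(ctx context.Context, info peer.AddrInfo) error {
	h, err := libp2p.New()
	if err != nil {
		return err
	}
	defer h.Close()

	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	if err := h.Connect(ctx, info); err != nil {
		return err
	}
	// Connect returned successfully; per the discussion above, Identify
	// should have run, so the peerstore should already list the protocols.
	protos, err := h.Peerstore().GetProtocols(info.ID)
	if err != nil {
		return err
	}
	fmt.Println("protocols:", protos)
	return nil
}
```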
I have created a test branch with multiple traces to see what's going on:
https://github.com/MintterHypermedia/mintter/blob/playground/zero-protocols/backend/mttnet/connect.go#L86-L124
Do you make sure that this code is run after Identify has completed?

This line doesn't look correct; you'll never be able to return a stream opened after the list was cleared: https://github.com/MintterHypermedia/mintter/blob/7666ed6622e48abe494da77fb946e14f9cabe92f/backend/mttnet/connect.go#L98
In production we have a retry loop (that eventually times out) that gets the protocols every 50ms, in the hope that Identify completes first, but no luck. The stream creation was for testing purposes, to see if there was any hint.
@marten-seemann I wonder if this looks similar to this problem https://discuss.ipfs.tech/t/files-are-rarely-available-through-public-gateways/17144 (which you thought might be resource manager related)
> In production we have a retry loop (that eventually times out) that gets the protocols every 50ms, in the hope that Identify completes first, but no luck.
Are you familiar with the event bus? You can subscribe to the identify event, which will fire once the list of protocols is available.
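A minimal sketch of that subscription, assuming a freshly constructed host (the event type is event.EvtPeerIdentificationCompleted):

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/event"
)

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// Subscribe to identify-completed events; one fires per peer once
	// Identify finishes and the protocol list is in the peerstore.
	sub, err := h.EventBus().Subscribe(new(event.EvtPeerIdentificationCompleted))
	if err != nil {
		panic(err)
	}
	defer sub.Close()

	for e := range sub.Out() {
		evt := e.(event.EvtPeerIdentificationCompleted)
		protos, _ := h.Peerstore().GetProtocols(evt.Peer)
		fmt.Println(evt.Peer, "identified; protocols:", protos)
	}
}
```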
> Are you familiar with the event bus? You can subscribe to the identify event, which will fire once the list of protocols is available.
@marten-seemann I guess that wouldn't be necessary, because libp2p currently waits for Identify to complete during Connect, so we assume that checking for protocols after Connect returns successfully should work, but sometimes it doesn't.
So it looks like the problem is that sometimes we have peers that we've connected with, and done Identify with, but even after that we have 0 protocols for them in our peer store.
> Are you familiar with the event bus? You can subscribe to the identify event, which will fire once the list of protocols is available.
I see, we could do that. However, I wonder why the response from some peers is instant, while for other peers it takes more than 10s? We give up retrying at 10 seconds, which is already beyond the user's expectation when connecting to a peer that is apparently online. After restarting the app, we get a new random set of peers reporting empty protocols.
To summarize, the problem we're facing at Mintter is:

1. libp2p.Connect returns successfully, which, assuming the current code in libp2p, also means that Identify has already happened.
2. Even so, the peer store sometimes reports zero protocols for some peers.
3. NewStream would also fail for those peers.

I was hoping that libp2p v0.32 would have fixed the issue (there's a mention of fixing the error of not waiting for transient connections, which sounded like exactly the problem we face), but according to the tests @juligasa made recently, the problem is still happening.
I'm confused how we can get Context Deadline Exceeded (from NewStream) after the peer reported zero protocols. In MintterHypermedia/mintter@playground/zero-protocols/backend/mttnet/connect.go#L86-L124, I see that the error on NewStream is not wrapped or printed anywhere. Can you print err there too? Also, is it possible to share log lines for a zero-protocols peer?
@aschmahmann that's unlikely, because in that case the error is supposed to be stream reset or protocol negotiation failed. We shouldn't see Context Deadline Exceeded with those issues.
Sure. I re-ran the whole test again, but the results are a bit different from last time. Now I get fewer zero-protocol errors, and most of them are at the beginning. At the next sync round (we do one every minute) the peers seem to have protocols. However, for those fewer cases where the protocol check is still failing, we get three types of outcomes:

1. NewStream succeeds.
2. NewStream fails with the error context deadline exceeded.
3. NewStream fails with the error failed to negotiate protocol: context deadline exceeded.
The difference at the network level compared to the tests from the weeks before is that by now most of the network should have upgraded to the latest code, which includes the libp2p upgrade from v0.31.0 to v0.32.1.
Apologies for the delay here.
I don't have the complete explanation of what's causing this, but my best guess currently is that the peer has gone away and is unresponsive.
Can you run your tests with the GOLOG_LOG_LEVEL="net/identify=debug" env flag set? And is it possible to increase this context timeout to 45 seconds (30 seconds for the connection keep-alive to expire and 15 as a buffer)? If the peer is indeed unresponsive, we should close the connection within 45 seconds and the error will then change to ErrStreamReset. Checking the identify debug logs will tell us why the initial Identify failed, leading to 0 protocols.
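A sketch of the suggested change (names are illustrative; the env var is set when launching the test binary, not in code):

```go
package example

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

// openWithLongTimeout bumps the NewStream timeout to 45s: 30s for the
// connection keep-alive to detect an unresponsive peer, plus a 15s buffer.
// Run the test binary with GOLOG_LOG_LEVEL="net/identify=debug" to capture
// the identify debug logs mentioned above.
func openWithLongTimeout(h host.Host, p peer.ID, proto protocol.ID) (network.Stream, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
	defer cancel()
	// With the longer timeout, an unresponsive peer should surface as a
	// stream reset instead of context deadline exceeded.
	return h.NewStream(ctx, p, proto)
}
```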
I ran the test again. This time the difference is that after zero protocols I can always open a stream with the peer reporting zero protocols, so those peers shouldn't have gone away. The timeout does not make any difference here.

As for the net/identify=debug output, it is super verbose; I don't really know what to look at.
As I understand it, you can now open the streams, so the only problem is that these peers have provided zero protocols in Identify, or Identify has potentially failed.

In the identify logs, can you check if there are any entries for the peers that have zero protocols? In particular, I'm interested in this line: https://github.com/libp2p/go-libp2p/blob/master/p2p/protocol/identify/id.go#L399C2-L399C2
> As I understand it, you can now open the streams, so the only problem is that these peers have provided zero protocols in Identify, or Identify has potentially failed.
Correct
As for the failed to identify traces, I see them in two flavors:

2023-12-22T16:46:32.485+0100 WARN net/identify identify/id.go:399 failed to identify 12D3KooWHZG1TAfuW2ARQm3Anq8QVhqBwWgnyZpn8U5xuxz1qA5v: Application error 0x1 (remote): conn-5798258: system: cannot reserve inbound connection: resource limit exceeded

(95% of the occurrences)

2023-12-22T16:46:32.575+0100 WARN net/identify identify/id.go:399 failed to identify 12D3KooWEhAvHBjfy1at22dJisN8wvD8KvvCyNYh45smcsMFrLML: Application error 0x47415445 (remote): connection gated

(5% of the occurrences)

But none of the peers in those traces correspond with any peer reporting 0 protocols. Actually, searching for peers reporting zero protocols, I find quite sensible traces, like:
2023-12-22T16:46:33.852+0100 DEBUG net/identify identify/id.go:512 /ipfs/id/1.0.0 received message from 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A /ip4/23.20.24.146/udp/4002/quic-v1/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit
2023-12-22T16:46:33.852+0100 DEBUG net/identify identify/id.go:461 /ipfs/id/1.0.0 sending message to 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A /ip4/23.20.24.146/udp/4002/quic-v1/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit
2023-12-22T16:46:33.852+0100 DEBUG net/identify identify/id.go:778 12D3KooWM38pA4drTk7zVttN6rgMVZyGdQfemj8sdLvt8QD6QZ8s received listen addrs for 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A: [/ip4/23.20.24.146/tcp/4002/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/23.20.24.146/udp/4002/quic-v1/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/23.20.24.146/udp/4002/quic-v1/webtransport/certhash/uEiCKDdWcj0Ana-uOxOvfxPeCa5AySnFnndOF3IluqDCS8g/certhash/uEiDX-M06wPhYAwfCBWFYdGeJzF1WEicNV0t4D03wSc6tUA/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/52.22.139.174/tcp/4002/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/52.22.139.174/udp/4002/quic-v1/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/52.22.139.174/udp/4002/quic-v1/webtransport/certhash/uEiDLL3ZwtJrzbvsBVY1qEZlnqdZzrZkuqoBerkADK7k7Tg/certhash/uEiBo3TlTcEq5bDfHELfzrR4fYGs4a7ja0Zm2313x5K8gpQ/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/192.168.1.78/tcp/56002 /ip4/192.168.1.78/udp/56002/quic-v1 /ip4/192.168.1.78/udp/56002/quic-v1/webtransport/certhash/uEiB25I72CuDBslKIys4RHN7-oUIA3DkGUt9cmp88ZD9DYQ/certhash/uEiCWnNHBBXctx6gu1Zc4atA3FZY3XMz_CKMVe7h-o6GzRw]
2023-12-22T16:46:34.363+0100 DEBUG net/identify identify/id.go:461 /ipfs/id/1.0.0 sending message to 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A /ip4/88.1.95.131/udp/56002/quic-v1
2023-12-22T16:46:34.372+0100 DEBUG net/identify identify/id.go:512 /ipfs/id/1.0.0 received message from 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A /ip4/88.1.95.131/udp/56002/quic-v1
2023-12-22T16:46:34.372+0100 DEBUG net/identify identify/id.go:778 12D3KooWM38pA4drTk7zVttN6rgMVZyGdQfemj8sdLvt8QD6QZ8s received listen addrs for 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A: [/ip4/23.20.24.146/tcp/4002/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/23.20.24.146/udp/4002/quic-v1/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/23.20.24.146/udp/4002/quic-v1/webtransport/certhash/uEiCKDdWcj0Ana-uOxOvfxPeCa5AySnFnndOF3IluqDCS8g/certhash/uEiDX-M06wPhYAwfCBWFYdGeJzF1WEicNV0t4D03wSc6tUA/p2p/12D3KooWNmjM4sMbSkDEA6ShvjTgkrJHjMya46fhZ9PjKZ4KVZYq/p2p-circuit /ip4/52.22.139.174/tcp/4002/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/52.22.139.174/udp/4002/quic-v1/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/52.22.139.174/udp/4002/quic-v1/webtransport/certhash/uEiDLL3ZwtJrzbvsBVY1qEZlnqdZzrZkuqoBerkADK7k7Tg/certhash/uEiBo3TlTcEq5bDfHELfzrR4fYGs4a7ja0Zm2313x5K8gpQ/p2p/12D3KooWGvsbBfcbnkecNoRBM7eUTiuriDqUyzu87pobZXSdUUsJ/p2p-circuit /ip4/192.168.1.78/tcp/56002 /ip4/192.168.1.78/udp/56002/quic-v1 /ip4/192.168.1.78/udp/56002/quic-v1/webtransport/certhash/uEiB25I72CuDBslKIys4RHN7-oUIA3DkGUt9cmp88ZD9DYQ/certhash/uEiCWnNHBBXctx6gu1Zc4atA3FZY3XMz_CKMVe7h-o6GzRw]
2023-12-22T16:47:32.612+0100 DEBUG net/identify identify/id.go:461 /ipfs/id/1.0.0 sending message to 12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A /ip4/88.1.95.131/udp/56002/quic-v1
2023-12-22T16:47:32.667+0100 DEBUG net/identify identify/id.go:316 already sent this snapshot to peer {"peer": "12D3KooWJmT4B5QVfpBXh1XMJyq1dV9L6e64mwWc4peSU9h49M6A", "seq": 7}
What about those resource limit errors? Should we be worried? Are they related to https://github.com/libp2p/go-libp2p/issues/2628?
In our network, we identify peers talking the hypermedia protocol by checking if that protocol is present in the list of protocols when connecting to a peer. We get the list of protocols by doing <caller>.Peerstore().GetProtocols(<called_pid>) (we could be using FirstSupportedProtocol, but for the sake of this issue, getting all supported protocols is best) after connecting with a peer on the network. The usual response is like:

[/ipfs/lan/kad/1.0.0 /hypermedia/0.2.0 /ipfs/bitswap /ipfs/bitswap/1.0.0 /ipfs/bitswap/1.1.0 /ipfs/bitswap/1.2.0 /ipfs/id/push/1.0.0 /ipfs/id/1.0.0 /ipfs/ping/1.0.0 /libp2p/circuit/relay/0.2.0/stop]

However, some peers report no protocols at all ([], as if we had issued RemoveProtocols), and the only way to make them work properly again is either restarting the called node or restarting the caller node. Note that while they are in this protocol-less state, any subsequent call to NewStream() will fail (with the error failed to negotiate protocol: context deadline exceeded), which is the real blocker in our case. After restarting the caller, a new random set of peers will face the same issue. In unit tests, this seems to work. Any idea on what's going on?

Version Information