Open Giulio2002 opened 1 year ago
Its a go-libp2p problem, not a pubsub bug.
Hrm, it might be relates to the notifiee trying to open a stream? Hard to tell what's going on here.
Disabling pubsub does not make my peers get kicked out so I think it is pubsub or you mean something else? What I know is that after the upgrade with no change of config whatsoever, this issue arose. I would also like to note that disabling gossip publishing does NOT fix the issue.
pubsub doesn't do anything out of the ordinary, so it is libp2p bug triggered by pubsub probably.
uhm, okay so I will try to get the debug logs for @MarcoPolo then, I think that is my only choice in light of this plot twist
I think this also happened to me also. Trying to bump from v0.8.1 to v0.9.3.
When comparing this https://github.com/libp2p/go-libp2p-pubsub/compare/v0.8.1...v0.9.3 I've stumbled across this
which I have trouble understanding.
Before, it was ok, if the read bytes could have not been able to be unmarshalled to a rpc.RPC then the peer was considered dead. But now we have
_, err := s.Read([]byte{0})
. This can actually happen in a normal case where the pubsub fetched 1 byte. err is nil, a log debug that nobody reads appeared but then the peer was considered dead.
Pubsub streams are unidirectional.
Prevously it would read incoming data and ignore them, now it resets the stream, which makes it stricter.
Are you interoping with a faulty peer?
no, just a peer that uses the following libp2p libs versions:
github.com/libp2p/go-libp2p v0.22.0
github.com/libp2p/go-libp2p-kad-dht v0.18.0
github.com/libp2p/go-libp2p-kbucket v0.4.7
github.com/libp2p/go-libp2p-pubsub v0.8.1
github.com/multiformats/go-multiaddr v0.6.0
We have a go-libp2p-pubsub fork and now we are trying with the reverted code for the handlePeerDead function to track if that is the cause. I've suspected pubsub lib for the following reason: The MultiversX node (ex Elrond Network) handles in 2 way the connected peers for various tasks:
host.Network().Conns()
function are handled in the network sharding subsystempubsub.ListPeers()
function is used to determine the correct connected peers to request data from (since we are a sharding protocol).
With v0.9.3 we've seen some discrepancies between what the host.Network().Conns()
function returned versus the pubsub.ListPeers()
when running a network with half of the nodes on the lib versions from above (older libs) and half of the network on the latest libp2p libs versions and go compiler v1.20.3.By discrepancies, I mean the host.Network().Conns()
returns a list of connections the node has, pubsub.ListPeers()
returns 0 connected peers on topics that normally should have peers.
running with debug logs enabled, I've stumbled across this section when the discrepancies between the host's network and pubsub became evident:
2023-06-23T09:17:15.225Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/gossipsub.go:526 PEERUP: Add new peer 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN using /meshsub/1.1.0
2023-06-23T09:17:15.225Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/comm.go:184 writing message to 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN: stream reset
2023-06-23T09:17:15.225Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/gossipsub.go:555 PEERDOWN: Remove disconnected peer 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.234Z DEBUG basichost basic/basic_host.go:737 host 16Uiu2HAmS9igZyR75wKEJ9GgpgtRGTKZqLxiSJmenqaRQ8HsgNfC dialing 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.241Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/pubsub.go:695 already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.246Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/pubsub.go:695 already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.252Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/pubsub.go:695 already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.304Z DEBUG pubsub go-libp2p-pubsub@v0.9.3/validation.go:476 validation done; took 264.307µs
I think the PEERDOWN event shouldn't have happened :thinking:
Interesting, seems like an issue with multiple connections. Maybe a bug with the reset handling, can you investigate further around there?
Yes, it takes ~1h-2h for each test but we will pursue finding what is wrong. I will add some more logs on our pubsub fork and we will post here the results. The work will be carried out in this branch https://github.com/multiversx/go-libp2p-pubsub/pull/11
I have this sequence on 2 peers that are both running the pubsub 0.9.3 version. Something went wrong on the muxer (please note that besides pubsub, we've implemented a custom direct-send protocol that makes use of separate streams):
PID 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG logs:
DEBUG[2023-06-23 16:03:48.496] refreshed peers on topic topic = unsignedTransactions_1_2 connected peers =
2023-06-23T16:03:48.734Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/gossipsub.go:526 PEERUP: Add new peer 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN using /meshsub/1.1.0
2023-06-23T16:03:48.734Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:170 trying to write message to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN...
2023-06-23T16:03:48.734Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:151 peer dead 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN, error: stream reset, buff: 00
2023-06-23T16:03:48.734Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:172 message wrote to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN, error: stream reset
2023-06-23T16:03:48.734Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:188 writing message to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN: stream reset
2023-06-23T16:03:48.734Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/gossipsub.go:555 PEERDOWN: Remove disconnected peer 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN
PID 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN logs:
2023-06-23T16:03:48.736Z DEBUG pubsub go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:115 opening new stream to peer: stream reset16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z DEBUG basichost basic/basic_host.go:413 protocol mux failed: stream reset (took 1.052892ms)
2023-06-23T16:03:48.736Z DEBUG basichost basic/basic_host.go:754 host 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN finished dialing 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z DEBUG dht go-libp2p-kad-dht@v0.23.0/query.go:554 connected. dial success.
2023-06-23T16:03:48.736Z DEBUG basichost basic/basic_host.go:737 host 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN dialing 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z DEBUG basichost basic/basic_host.go:433 negotiated: /meshsub/1.1.0 (took 1.304288ms)
will try to dig further based on these logs
After trying with the new go-libp2p version v0.28.1 everything worked fine: no more connectedness discrepancies between the pubsub and the host, or at least not detectable. Probably the streams do not reset after normal message write/read. We also tested the backward compatibility with the pubsub v0.8.0 & go-libp2p 0.22.0 and we did not find any issues. The final go.mod imports that we use and found that are working properly are these:
github.com/gogo/protobuf v1.3.2
github.com/ipfs/go-log v1.0.5
github.com/jbenet/goprocess v0.1.4
github.com/libp2p/go-libp2p v0.28.1
github.com/libp2p/go-libp2p-kad-dht v0.23.0
github.com/libp2p/go-libp2p-kbucket v0.6.3
github.com/libp2p/go-libp2p-pubsub v0.9.3
github.com/multiformats/go-multiaddr v0.9.0
thank you for the detailed investigation on this!
On Thu, Jun 29, 2023 at 12:11 PM Iulian Pascalau @.***> wrote:
After trying with the new go-libp2p version v0.28.1 everything worked fine: no more connectedness discrepancies between the pubsub and the host, or at least not detectable. Probably the streams do not reset after normal message write/read. We also tested the backward compatibility with the pubsub v0.8.0 & go-libp2p 0.22.0 and we did not find any issues. The final go.mod imports that we use and found that are working properly are these:
github.com/gogo/protobuf v1.3.2github.com/ipfs/go-log v1.0.5github.com/jbenet/goprocess v0.1.4github.com/libp2p/go-libp2p v0.28.1github.com/libp2p/go-libp2p-kad-dht v0.23.0github.com/libp2p/go-libp2p-kbucket v0.6.3github.com/libp2p/go-libp2p-pubsub v0.9.3github.com/multiformats/go-multiaddr v0.9.0
— Reply to this email directly, view it on GitHub https://github.com/libp2p/go-libp2p-pubsub/issues/530#issuecomment-1612682040, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAI4SWKGTBTQ6VZKISA2I3XNVBFTANCNFSM6AAAAAAWI44LKQ . You are receiving this because you commented.Message ID: @.***>
Copypasted from the go-libp2p issue
Hello, I am from the Erigon team, one of the Ethereum clients, I am trying to build a consensus client and I am having issues after upgrading libp2p-pubsub to 0.9.3. What seems to happen is that after the upgrade I lose all the peers I have. I found a work around that seems to somehow fix this as long as I do not do any request which is calling host.RemoveStreamHandler("/p2p/id/delta/1.0.0") . If I do that when I only subscribe and publish to gossip it seems to not give any issue, however if I repeatedly make requests to the network, I will get kicked out from the network (or at least lose all peers and be unable to reconnect).
I get kicked out after a while from the ETH 2.0 network after upgrading libp2p version.
I first posted this issue on the go-libp2p github but then I found out that disabling pubsub fixed the issue for seemingly no reason so now it is here. the original issue and discussion is at: https://github.com/libp2p/go-libp2p/issues/2202 . @MarcoPolo