libp2p / go-libp2p-pubsub

The PubSub implementation for go-libp2p
https://github.com/libp2p/specs/tree/master/pubsub
Other
309 stars 178 forks source link

Libp2p Pubsub kicks all of my peers after upgrading from 0.8.0 to 0.9.3 #530

Open Giulio2002 opened 1 year ago

Giulio2002 commented 1 year ago

Copypasted from the go-libp2p issue

Hello, I am from the Erigon team, one of the Ethereum clients, I am trying to build a consensus client and I am having issues after upgrading libp2p-pubsub to 0.9.3. What seems to happen is that after the upgrade I lose all the peers I have. I found a work around that seems to somehow fix this as long as I do not do any request which is calling host.RemoveStreamHandler("/p2p/id/delta/1.0.0") . If I do that when I only subscribe and publish to gossip it seems to not give any issue, however if I repeatedly make requests to the network, I will get kicked out from the network (or at least lose all peers and be unable to reconnect).

I get kicked out after a while from the ETH 2.0 network after upgrading libp2p version.

I first posted this issue on the go-libp2p github but then I found out that disabling pubsub fixed the issue for seemingly no reason so now it is here. the original issue and discussion is at: https://github.com/libp2p/go-libp2p/issues/2202 . @MarcoPolo

vyzo commented 1 year ago

Its a go-libp2p problem, not a pubsub bug.

vyzo commented 1 year ago

Hrm, it might be relates to the notifiee trying to open a stream? Hard to tell what's going on here.

Giulio2002 commented 1 year ago

Disabling pubsub does not make my peers get kicked out so I think it is pubsub or you mean something else? What I know is that after the upgrade with no change of config whatsoever, this issue arose. I would also like to note that disabling gossip publishing does NOT fix the issue.

vyzo commented 1 year ago

pubsub doesn't do anything out of the ordinary, so it is libp2p bug triggered by pubsub probably.

Giulio2002 commented 1 year ago

uhm, okay so I will try to get the debug logs for @MarcoPolo then, I think that is my only choice in light of this plot twist

iulianpascalau commented 1 year ago

I think this also happened to me also. Trying to bump from v0.8.1 to v0.9.3. When comparing this https://github.com/libp2p/go-libp2p-pubsub/compare/v0.8.1...v0.9.3 I've stumbled across this image which I have trouble understanding. Before, it was ok, if the read bytes could have not been able to be unmarshalled to a rpc.RPC then the peer was considered dead. But now we have _, err := s.Read([]byte{0}). This can actually happen in a normal case where the pubsub fetched 1 byte. err is nil, a log debug that nobody reads appeared but then the peer was considered dead.

vyzo commented 1 year ago

Pubsub streams are unidirectional.

Prevously it would read incoming data and ignore them, now it resets the stream, which makes it stricter.

Are you interoping with a faulty peer?

iulianpascalau commented 1 year ago

no, just a peer that uses the following libp2p libs versions:

github.com/libp2p/go-libp2p v0.22.0
github.com/libp2p/go-libp2p-kad-dht v0.18.0
github.com/libp2p/go-libp2p-kbucket v0.4.7
github.com/libp2p/go-libp2p-pubsub v0.8.1
github.com/multiformats/go-multiaddr v0.6.0

We have a go-libp2p-pubsub fork and now we are trying with the reverted code for the handlePeerDead function to track if that is the cause. I've suspected pubsub lib for the following reason: The MultiversX node (ex Elrond Network) handles in 2 way the connected peers for various tasks:

  1. the connections returned by the host.Network().Conns() function are handled in the network sharding subsystem
  2. the pubsub.ListPeers() function is used to determine the correct connected peers to request data from (since we are a sharding protocol). With v0.9.3 we've seen some discrepancies between what the host.Network().Conns() function returned versus the pubsub.ListPeers() when running a network with half of the nodes on the lib versions from above (older libs) and half of the network on the latest libp2p libs versions and go compiler v1.20.3.

By discrepancies, I mean the host.Network().Conns() returns a list of connections the node has, pubsub.ListPeers() returns 0 connected peers on topics that normally should have peers.

iulianpascalau commented 1 year ago

running with debug logs enabled, I've stumbled across this section when the discrepancies between the host's network and pubsub became evident:

2023-06-23T09:17:15.225Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/gossipsub.go:526    PEERUP: Add new peer 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN using /meshsub/1.1.0
2023-06-23T09:17:15.225Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/comm.go:184 writing message to 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN: stream reset
2023-06-23T09:17:15.225Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/gossipsub.go:555    PEERDOWN: Remove disconnected peer 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.234Z    DEBUG   basichost   basic/basic_host.go:737 host 16Uiu2HAmS9igZyR75wKEJ9GgpgtRGTKZqLxiSJmenqaRQ8HsgNfC dialing 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.241Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/pubsub.go:695   already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.246Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/pubsub.go:695   already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.252Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/pubsub.go:695   already have connection to peer: 16Uiu2HAm1obk3M5SXWadTjGm6VX5AQtsQAw6weyAMmU2d7s4SzeN
2023-06-23T09:17:15.304Z    DEBUG   pubsub  go-libp2p-pubsub@v0.9.3/validation.go:476   validation done; took 264.307µs

I think the PEERDOWN event shouldn't have happened :thinking:

vyzo commented 1 year ago

Interesting, seems like an issue with multiple connections. Maybe a bug with the reset handling, can you investigate further around there?

iulianpascalau commented 1 year ago

Yes, it takes ~1h-2h for each test but we will pursue finding what is wrong. I will add some more logs on our pubsub fork and we will post here the results. The work will be carried out in this branch https://github.com/multiversx/go-libp2p-pubsub/pull/11

iulianpascalau commented 1 year ago

I have this sequence on 2 peers that are both running the pubsub 0.9.3 version. Something went wrong on the muxer (please note that besides pubsub, we've implemented a custom direct-send protocol that makes use of separate streams):

PID 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG logs:

DEBUG[2023-06-23 16:03:48.496]   refreshed peers on topic                 topic = unsignedTransactions_1_2 connected peers =  
2023-06-23T16:03:48.734Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/gossipsub.go:526  PEERUP: Add new peer 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN using /meshsub/1.1.0
2023-06-23T16:03:48.734Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:170   trying to write message to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN...
2023-06-23T16:03:48.734Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:151   peer dead 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN, error: stream reset, buff: 00
2023-06-23T16:03:48.734Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:172   message wrote to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN, error: stream reset
2023-06-23T16:03:48.734Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:188   writing message to 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN: stream reset
2023-06-23T16:03:48.734Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/gossipsub.go:555  PEERDOWN: Remove disconnected peer 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN

PID 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN logs:

2023-06-23T16:03:48.736Z    DEBUG   pubsub  go-libp2p-pubsub@v0.3.3-0.20230623141024-03d2ddf255ea/comm.go:115   opening new stream to peer: stream reset16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z    DEBUG   basichost   basic/basic_host.go:413 protocol mux failed: stream reset (took 1.052892ms)
2023-06-23T16:03:48.736Z    DEBUG   basichost   basic/basic_host.go:754 host 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN finished dialing 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z    DEBUG   dht go-libp2p-kad-dht@v0.23.0/query.go:554  connected. dial success.
2023-06-23T16:03:48.736Z    DEBUG   basichost   basic/basic_host.go:737 host 16Uiu2HAmPru7627X1dFqsNsxaSyjyS15ZFAkGmbJsn73PwkQEgMN dialing 16Uiu2HAkuyCrWCEBtubkbhfUKMuw9BJtL7Fe799v1rUnazbcwkFG
2023-06-23T16:03:48.736Z    DEBUG   basichost   basic/basic_host.go:433 negotiated: /meshsub/1.1.0 (took 1.304288ms)

will try to dig further based on these logs

iulianpascalau commented 1 year ago

After trying with the new go-libp2p version v0.28.1 everything worked fine: no more connectedness discrepancies between the pubsub and the host, or at least not detectable. Probably the streams do not reset after normal message write/read. We also tested the backward compatibility with the pubsub v0.8.0 & go-libp2p 0.22.0 and we did not find any issues. The final go.mod imports that we use and found that are working properly are these:

github.com/gogo/protobuf v1.3.2
github.com/ipfs/go-log v1.0.5
github.com/jbenet/goprocess v0.1.4
github.com/libp2p/go-libp2p v0.28.1
github.com/libp2p/go-libp2p-kad-dht v0.23.0
github.com/libp2p/go-libp2p-kbucket v0.6.3
github.com/libp2p/go-libp2p-pubsub v0.9.3
github.com/multiformats/go-multiaddr v0.9.0
vyzo commented 1 year ago

thank you for the detailed investigation on this!

On Thu, Jun 29, 2023 at 12:11 PM Iulian Pascalau @.***> wrote:

After trying with the new go-libp2p version v0.28.1 everything worked fine: no more connectedness discrepancies between the pubsub and the host, or at least not detectable. Probably the streams do not reset after normal message write/read. We also tested the backward compatibility with the pubsub v0.8.0 & go-libp2p 0.22.0 and we did not find any issues. The final go.mod imports that we use and found that are working properly are these:

github.com/gogo/protobuf v1.3.2github.com/ipfs/go-log v1.0.5github.com/jbenet/goprocess v0.1.4github.com/libp2p/go-libp2p v0.28.1github.com/libp2p/go-libp2p-kad-dht v0.23.0github.com/libp2p/go-libp2p-kbucket v0.6.3github.com/libp2p/go-libp2p-pubsub v0.9.3github.com/multiformats/go-multiaddr v0.9.0

— Reply to this email directly, view it on GitHub https://github.com/libp2p/go-libp2p-pubsub/issues/530#issuecomment-1612682040, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAI4SWKGTBTQ6VZKISA2I3XNVBFTANCNFSM6AAAAAAWI44LKQ . You are receiving this because you commented.Message ID: @.***>