libp2p / go-libp2p-pubsub

The PubSub implementation for go-libp2p
https://github.com/libp2p/specs/tree/master/pubsub

Peers not always joining a topic and/or connecting #93

Closed · haadcode closed this issue 4 years ago

haadcode commented 6 years ago

We've been running into some hiccups in OrbitDB recently as we've been upgrading from go-ipfs 0.4.13 to 0.4.16 (and now 0.4.17).

We've observed behaviour where pubsub peers don't always connect to each other. Detailed information on the behaviour and what we've discovered so far is in https://github.com/orbitdb/orbit-db/pull/417. To summarize: the tests are run with two local go-ipfs nodes, both subscribed to the same pubsub topic, but sometimes (in about 50% of test runs) they don't seem to connect to each other at the pubsub level even though they're connected at the swarm level (see this comment: https://github.com/orbitdb/orbit-db/pull/417#issuecomment-407002661).

Looking at the commit log in this repo, I noticed that the pubsub implementation was changed some time ago (yay, great to see gossip pubsub coming together!), so most likely this is a regression. I believe 0.4.13 was working fine, but we haven't verified which version after that stopped working.

If anyone has an idea why this is happening and can work on a fix, we would highly appreciate it. We're (finally) adding back support for go-ipfs in OrbitDB, but the tests are not running reliably due to this bug, so we're hesitant to make a new release until it's fixed (alternatively we could instruct users to use <= 0.4.13, but that's no fun).

If there's any information I can provide to solve this, please let me know!

bigs commented 6 years ago

thanks for the report! the old floodsub implementation is still present and being used by IPFS, though it was changed a non-trivial amount to abstract the routing layer from the rest of the implementation, potentially adding room for a regression. could you enable debug logging in the pubsub module and run the tests a few times to see if we can get some extra information?

haadcode commented 6 years ago

How does one "enable debug logging in the pubsub module"? Please advise.

haadcode commented 6 years ago

@bigs would appreciate advice on this to be able to provide more information to narrow this down and possibly have a fix soon.

vyzo commented 6 years ago

Are you using gossipsub or floodsub? As of #95, gossipsub has quite a bit of logging around internal events, including peer discovery.

haadcode commented 6 years ago

@vyzo using whatever comes with the official release and using it through go-ipfs-dep and js-ipfs-api.

More generally though, my question is: how do I turn on logging for go-ipfs? I used to know, but can't remember anymore and can't find up-to-date documentation on which env vars or flags to set :)

vyzo commented 6 years ago

This is still floodsub, although gossipsub is coming soon to mainline ipfs as well.

You can see logs with the `IPFS_LOGGING` env var -- e.g. you can set it to `DEBUG` for lots of verbosity.

haadcode commented 6 years ago

Thank you @vyzo!

After figuring out how to get the debug logs in ipfsd-ctl, I was able to get them on CI and finally reproduce the failure and the behaviour described above. See https://circleci.com/gh/orbitdb/orbit-db/1523#queue-placeholder/containers/3.

There's a pubsub error `pubsub: already have connection to peer: <peer.ID bUcKci> pubsub.go:205`, and a second one a little further down in the logs. As previously described, this only happens sometimes, and the error doesn't appear in successful test runs. Hopefully this helps!

For those ending up here looking for instructions on how to get go-ipfs logs when using it via js-ipfsd-ctl and js-ipfs-api: logging needs to be turned on in ipfsd-ctl too (this is not documented and was unintuitive to me): `IPFS_LOGGING=DEBUG DEBUG=ipfsd-ctl:daemon:stderr node your-program.js`

haadcode commented 6 years ago

Ping. @vyzo @bigs any pointers on this? Where in the code should we start looking if we wanted to dig deeper?

vyzo commented 6 years ago

The peer discovery mechanism is based on ambient connection events, with entry point here: https://github.com/libp2p/go-floodsub/blob/master/notify.go#L18

The notification is handled here: https://github.com/libp2p/go-floodsub/blob/master/pubsub.go#L201
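For orientation, here is a minimal sketch of that pattern: a notifiee registered on the host's network, so pubsub learns about peers from ambient connection events. This uses today's go-libp2p import paths and the `network.NotifyBundle` helper as assumptions; the linked code predates them.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/network"
)

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// React to ambient connection events, the same mechanism the
	// linked notify.go uses to feed newly connected peers into pubsub.
	h.Network().Notify(&network.NotifyBundle{
		ConnectedF: func(_ network.Network, c network.Conn) {
			// go-floodsub reacts here by opening its protocol
			// stream to c.RemotePeer()
			fmt.Println("connected:", c.RemotePeer())
		},
		DisconnectedF: func(_ network.Network, c network.Conn) {
			fmt.Println("disconnected:", c.RemotePeer())
		},
	})

	select {} // keep the host running
}
```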

thiagozs commented 6 years ago

Hello guys, we use the lib in our project and we hit the same problem. When announcing peers we get:

INFO[0001] Start new node                               
INFO[0001] Register peer...                             
INFO[0001] bootstrapping...                             
INFO[0006] bootstrapping done                           
INFO[0009] Announce, searching for other peers          
INFO[0009] Register Channel                              name=Transactions
11:50:52.660 ERROR     pubsub: already have connection to peer:  <peer.ID byL35h> asm_amd64.s:2361
11:50:52.661 ERROR     pubsub: already have connection to peer:  <peer.ID byL35h> asm_amd64.s:2361

We are trying to use the same example code to connect the peers to each other.

whyrusleeping commented 6 years ago

Hrm... @haadcode are you calling connect from both peers at the same time? That is, are you telling peer A to connect to peer B and telling peer B to connect to peer A? If you are, then I think this might be an issue with the ordering of connection notifications. cc @Stebalien
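For concreteness, a minimal sketch of that dial pattern with two in-process hosts, written against the current go-libp2p API (the `dialBoth` helper is hypothetical, not code from the issue):

```go
package main

import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
)

// dialBoth has A connect to B and B connect to A concurrently,
// the pattern suspected of confusing the connection notifications.
func dialBoth(ctx context.Context, a, b host.Host) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		_ = a.Connect(ctx, peer.AddrInfo{ID: b.ID(), Addrs: b.Addrs()})
	}()
	go func() {
		defer wg.Done()
		_ = b.Connect(ctx, peer.AddrInfo{ID: a.ID(), Addrs: a.Addrs()})
	}()
	wg.Wait()
}

func main() {
	a, _ := libp2p.New()
	b, _ := libp2p.New()
	dialBoth(context.Background(), a, b)
}
```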

bigs commented 6 years ago

apologies for the delays, going to dig into this today and get back to you ASAP

bigs commented 6 years ago

@haadcode @whyrusleeping i think that might be on the money... after spelunking on this, my guess is it's a bizarre situation where the connect notification from A dialing out to B is overridden by the notification of B dialing in to A, with the same happening on the other side. this would cause both sides to cancel the stream that THEY opened in favor of the one the other opened.

i think a solution could be to serialize the notification process at the swarm level OR to make the floodsub logic more robust to compensate (keep track of whether you've initiated a stream for every peer, and only allow re-initialization when there has been a disconnect between connects; see the sketch after this comment)

thoughts @whyrusleeping? i can make this patch tonight or tomorrow.
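A minimal sketch of that second option, purely illustrative and not the actual patch (`peerTracker` and its methods are hypothetical names): remember which peers already have an outbound stream, and only allow re-initialization once a disconnect has been seen.

```go
package tracker

import (
	"sync"

	"github.com/libp2p/go-libp2p/core/peer"
)

// peerTracker remembers which peers already have an outbound
// pubsub stream, so duplicate Connected notifications are ignored.
type peerTracker struct {
	mu     sync.Mutex
	active map[peer.ID]bool
}

func newPeerTracker() *peerTracker {
	return &peerTracker{active: make(map[peer.ID]bool)}
}

// shouldOpenStream reports whether a new stream should be opened
// for p; it returns false if one was already initialized.
func (t *peerTracker) shouldOpenStream(p peer.ID) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active[p] {
		return false
	}
	t.active[p] = true
	return true
}

// disconnected clears the flag so the next connect may re-initialize.
func (t *peerTracker) disconnected(p peer.ID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.active, p)
}
```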

bigs commented 6 years ago

~my gut is the more correct answer lies in swarm so i will take a go at it~

edit: on second thought, we maintain a list of connections to a peer for a reason (i'm guessing), so i'll shore up floodsub and see if that does it.

whyrusleeping commented 6 years ago

@bigs yeah, give it a shot. See if you can reproduce the issue that @haadcode is describing in isolation (a test) first.

haadcode commented 6 years ago

Thank you @whyrusleeping and @bigs! I was able to make the error go away by not doing what @whyrusleeping described, i.e. by not manually connecting the peers. If we don't call swarm.connect at all (in the failing tests both peers call it on each other, as described), the problem goes away. IIRC, calling it from just one peer still results in the same erroneous behaviour.

Wherever this is fixed, my two cents: it should be fixed in a way that one subsystem doesn't affect the behaviour of another from the user's perspective. I.e. calling swarm.connect shouldn't affect pubsub (they may or may not be connected internally, but for the user it shouldn't matter).

bigs commented 6 years ago

failing test and subsequent fix over at #105. feel free to test out that commit, @haadcode

bigs commented 6 years ago

@haadcode quick update... after some discussion on our end, we've got a better fix planned. first attempt will be done today

bigs commented 6 years ago

@haadcode another update... this is actually a bug in our muxing protocol triggered when a tcp simultaneous open occurs between two hosts. there are a few options for fixes, but it will take some time before we settle on something to ship.

haadcode commented 6 years ago

@bigs thanks for the update! Wow, didn't realize this goes so deep, but I'm glad the investigation has led to the lower levels and the bug was caught :) Is there an open issue somewhere else where we can follow the discussion/progress?

bigs commented 6 years ago

@haadcode with any luck today or tomorrow i'll be sketching a fix for yamux!

code4trade commented 5 years ago

@bigs @haadcode Do you have any news about this problem? We are currently commenting out a small part of pubsub.go; is there any other way to work around this problem?

haadcode commented 5 years ago

Friendly bump, any updates @bigs?

bigs commented 5 years ago

@haadcode @code4trade i think the best way forward is for me to push a fork of yamux and register it with a new protocol string. then you could use this fork and avoid this issue. you can also force your transports to use mplex, which doesn't have the issue.
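For the mplex route, a minimal sketch of forcing the muxer when constructing a go-libp2p host directly. This assumes the `libp2p.Muxer` option and the `go-libp2p-mplex` package, and it applies to embedding libp2p in your own Go code rather than to the go-ipfs binary:

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	mplex "github.com/libp2p/go-libp2p-mplex"
)

func main() {
	// Offer only mplex for stream muxing, so yamux (and the
	// simultaneous-open bug discussed above) is never negotiated.
	h, err := libp2p.New(
		libp2p.Muxer("/mplex/6.7.0", mplex.DefaultTransport),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```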

haadcode commented 5 years ago

> i think the best way forward is for me to push a fork of yamux and register it with a new protocol string. then you could use this fork and avoid this issue.

A fork unfortunately doesn't work for us, as we want to be able to use the latest from dist.ipfs.io, i.e. the official release (since that's what our users will use). What's blocking you from fixing it in yamux and pulling the fixed version into the official release?

> you can also force your transports to use mplex which doesn't have the issue

How would one do that? Is this a configuration flag somewhere?

vyzo commented 5 years ago

Hey guys, can you try this again to verify whether the problem still exists? We fixed the multiple-connections issue in #132.

thiagozs commented 5 years ago

Hello @vyzo, I made a quick update of the lib and for me it works like a charm. I'll keep testing. Thanks for the update.

iulianpascalau commented 5 years ago

Peer discovery via go-libp2p-kad-dht works like a charm, but with the go-libp2p/p2p/discovery/mdns mechanism the issues described above still appear: peers have connections, but querying the pubsub subsystem for the peers subscribed to a topic shows bizarre results. Out of 3 connected peers, one peer "sees" the 2 others, but those 2 peers each see only one subscription to that particular topic. This results in "lonely peers" that don't get messages from other peers. It has happened to me a couple of times; the failure does not appear every time I run my tests.
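As a reference for reproducing that kind of check, a minimal sketch that compares the swarm-level peer list against the pubsub-level view of a topic, assuming the go-libp2p-pubsub `ListPeers` API (the topic name and timing are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		panic(err)
	}
	if _, err := ps.Subscribe("my-topic"); err != nil {
		panic(err)
	}

	// ... connect to the other hosts here ...

	time.Sleep(2 * time.Second) // give the mesh a moment to form
	// A "lonely peer" shows up as a mismatch between these two views.
	fmt.Println("swarm peers:", h.Network().Peers())
	fmt.Println("topic peers:", ps.ListPeers("my-topic"))
}
```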

vyzo commented 4 years ago

fixed by #132.