thanks for the report! the old floodsub implementation is still present and being used by IPFS, though it was changed a non-trivial amount to abstract the routing layer from the rest of the implementation, potentially adding room for regression. could you enable debug logging in the pubsub
module and run the tests a few times to see if we can get some extra information?
How does one "enable logging in the pubsub module"? Please advise.
@bigs I'd appreciate advice on this so I can provide more information to narrow this down and hopefully get a fix soon.
Are you using gossipsub or floodsub? As of #95, gossipsub has quite a bit of logging around internal events, including peer discovery.
@vyzo using whatever comes with the official release and using it through go-ipfs-dep and js-ipfs-api.
Generally though, my question is how to turn on logging for go-ipfs? I used to know, but can't remember anymore and can't find up-to-date documentation on what env vars or flags I should set :)
This is still floodsub, although gossipsub is coming soon to mainline ipfs as well.
You can see logs with the IPFS_LOGGING env var, e.g. you can set it to DEBUG for lots of verbosity.
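For what it's worth, programs that embed go-libp2p-pubsub directly (rather than running the go-ipfs daemon) can also raise the level of just the "pubsub" subsystem programmatically via go-log; that's the subsystem name showing up in the error lines quoted later in this thread. A minimal sketch, assuming the go-log import path of this era:

```go
package main

import (
	"fmt"

	logging "github.com/ipfs/go-log"
)

func main() {
	// Bump only the "pubsub" subsystem to DEBUG instead of setting
	// IPFS_LOGGING=DEBUG globally for the whole process.
	if err := logging.SetLogLevel("pubsub", "DEBUG"); err != nil {
		fmt.Println("failed to set log level:", err)
	}

	// ... construct the libp2p host and pubsub router as usual ...
}
```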
Thank you @vyzo!
After figuring out how to get the debug logs in ipfsd-ctl, I was able to get them on CI and finally reproduce the failure and the behaviour described above. See here: https://circleci.com/gh/orbitdb/orbit-db/1523#queue-placeholder/containers/3.
There's a pubsub error pubsub: already have connection to peer: <peer.ID bUcKci> pubsub.go:205
and a second one a little below in the logs. As previously described, this happens only sometimes, and that error doesn't appear in successful test runs. Hopefully this will help!
For those ending up here looking for instructions on how to get go-ipfs logs when it's used via js-ipfsd-ctl and js-ipfs-api: logging needs to be turned on in ipfsd-ctl too (this is not documented and was unintuitive for me):
IPFS_LOGGING=DEBUG DEBUG=ipfsd-ctl:daemon:stderr node your-program.js
Ping. @vyzo @bigs any pointers on this? Where in the code should we start looking if we wanted to dig deeper?
The peer discovery mechanism is based on ambient connection events, with entry point here: https://github.com/libp2p/go-floodsub/blob/master/notify.go#L18
The notification is handled here: https://github.com/libp2p/go-floodsub/blob/master/pubsub.go#L201
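In rough terms, the hook looks something like the simplified sketch below (not the actual go-floodsub code; the import paths are from the go-libp2p-net era and may have moved since): the service registers a network Notifiee, and each new connection is forwarded to the pubsub event loop, which then opens a stream to that peer.

```go
package sketch

import (
	inet "github.com/libp2p/go-libp2p-net"
	peer "github.com/libp2p/go-libp2p-peer"
	ma "github.com/multiformats/go-multiaddr"
)

// pubsubNotif is registered on the host's network; each new connection is
// forwarded to the pubsub event loop, which opens a stream to that peer.
type pubsubNotif struct {
	newPeers chan peer.ID // consumed by the pubsub processing loop
}

func (p *pubsubNotif) Connected(n inet.Network, c inet.Conn) {
	// The real implementation does this asynchronously and honours a
	// context; here we just hand the peer to the loop.
	p.newPeers <- c.RemotePeer()
}

// The remaining Notifiee callbacks are not needed for ambient discovery.
func (p *pubsubNotif) Disconnected(n inet.Network, c inet.Conn)   {}
func (p *pubsubNotif) OpenedStream(n inet.Network, s inet.Stream) {}
func (p *pubsubNotif) ClosedStream(n inet.Network, s inet.Stream) {}
func (p *pubsubNotif) Listen(n inet.Network, a ma.Multiaddr)      {}
func (p *pubsubNotif) ListenClose(n inet.Network, a ma.Multiaddr) {}
```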
Hello, we use the lib in our project and we hit the same problem. This is the output we get when the peers announce:
INFO[0001] Start new node
INFO[0001] Register peer...
INFO[0001] bootstrapping...
INFO[0006] bootstrapping done
INFO[0009] Announce, searching for other peers
INFO[0009] Register Channel name=Transactions
11:50:52.660 ERROR pubsub: already have connection to peer: <peer.ID byL35h> asm_amd64.s:2361
11:50:52.661 ERROR pubsub: already have connection to peer: <peer.ID byL35h> asm_amd64.s:2361
We are using the same example code to connect the peers to each other.
Hrm... @haadcode are you calling connect from both peers at the same time? That is, are you telling peer A to connect to peer B and telling peer B to connect to peer A? If you are, then I think this might be an issue with the ordering of connection notifications. cc @Stebalien
apologies for the delays, going to dig into this today and get back to you ASAP
@haadcode @whyrusleeping i think that might be on the money... after spelunking on this, my guess is it's a bizarre situation where the connect notification from A dialing out to B is overridden by the notification of B dialing in to A, with the same happening on the other side. this would cause both sides to cancel the stream that THEY opened in favor of the one the other opened.
i think a solution could be to serialize the notification process on the swarm level OR to make the floodsub logic more robust to make up for this (keep track of whether you've initialized a stream for every peer, only allowing re-initialization when there is a disconnect between connects)
thoughts @whyrusleeping? i can make this patch tonight or tomorrow.
~my gut is the more correct answer lies in swarm so i will take a go of it~
edit: on second thought, we maintain a list of connections to a peer for a reason (i'm guessing) so i'll shore up flood sub and see if that does it.
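For illustration, the per-peer bookkeeping described above could look roughly like the sketch below (hypothetical names, not the actual go-floodsub patch): only open an outbound pubsub stream on the first "connected" notification for a peer, and reset on disconnect.

```go
package sketch

import (
	"sync"

	peer "github.com/libp2p/go-libp2p-peer"
)

type streamTracker struct {
	mu     sync.Mutex
	opened map[peer.ID]bool
}

func newStreamTracker() *streamTracker {
	return &streamTracker{opened: make(map[peer.ID]bool)}
}

// shouldOpen reports whether a new outbound stream should be opened for p.
// A duplicate "connected" notification for the same peer returns false,
// so the stream opened on the first notification is kept.
func (t *streamTracker) shouldOpen(p peer.ID) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.opened[p] {
		return false
	}
	t.opened[p] = true
	return true
}

// onDisconnect clears the flag so a later reconnection re-initializes the stream.
func (t *streamTracker) onDisconnect(p peer.ID) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.opened, p)
}
```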
@bigs yeah, give it a shot. See if you can reproduce the issue that @haadcode is describing in isolation (a test) first.
Thank you @whyrusleeping and @bigs! I was able to make the error go away by not doing what @whyrusleeping was describing, i.e. manually connecting the peers. By not calling swarm.connect at all (in the failing tests we do it as described: both peers call each other) the problem goes away. Iirc, calling it from just one side still results in the same erroneous behaviour.
Wherever this is fixed, my two cents would be that it should be fixed in a way where one sub-system doesn't affect the behaviour of another from the user's perspective. I.e. calling swarm.connect shouldn't affect pubsub (they may or may not be connected internally, but for the user it shouldn't matter).
failing test and subsequent fix over at #105. feel free to test out that commit, @haadcode
@haadcode quick update... after some discussion on our end, we've got a better fix planned. first attempt will be done today
@haadcode another update... this is actually a bug in our muxing protocol triggered when a tcp simultaneous open occurs between two hosts. there are a few options for fixes, but it will take some time before we settle on something to ship.
@bigs thanks for the update! Wow, didn't realize this goes so deep, but I'm glad to hear the investigation led to the lower levels and the bug was caught :) Is there an open issue somewhere else to follow the discussion/progress?
@haadcode with any luck today or tomorrow i'll be sketching a fix for yamux!
@bigs @haadcode Do you have any news about this problem? We are commenting out a small part of pubsub.go as a workaround; is there any other way to work around this problem?
Friendly bump, any updates @bigs?
@haadcode @code4trade i think the best way forward is for me to push a fork of yamux and register it with a new protocol string. then you could use this fork and avoid this issue. you can also force your transports to use mplex, which doesn't have the issue
i think the best way forward is for me to push a fork of yamux and register it with a new protocol string. then you could use this fork and avoid this issue.
A fork doesn't work for us unfortunately, as we want to be able to use the latest from dist.ipfs.io, i.e. the official release (as that's what our users will use). What's blocking you from fixing it in yamux and pulling the fixed version into the official release?
you can also force your transports to use mplex which doesn't have the issue
How would one do that? Is this a configuration flag somewhere?
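(For anyone else wondering: I'm not aware of a go-ipfs config flag for this, but for programs that construct their own libp2p host, a constructor option along these lines should prefer mplex. This is only a sketch; the mplex import path, the protocol string, and the libp2p.New signature vary between go-libp2p releases.)

```go
package main

import (
	"context"

	libp2p "github.com/libp2p/go-libp2p"
	mplex "github.com/libp2p/go-libp2p-mplex"
)

func main() {
	// Register only the mplex stream muxer so yamux is never negotiated.
	h, err := libp2p.New(context.Background(),
		libp2p.Muxer("/mplex/6.7.0", mplex.DefaultTransport),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```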
Hey guys, can you try this again to check whether the problem still exists? We did fix the multiple connections issue in #132.
Hello @vyzo, I made a quick update of the lib and for me it works like a charm. I'll keep testing. Thanks for the update.
While using go-libp2p-kad-dht for peer discovery it works like a charm, but with the go-libp2p/p2p/discovery/mdns discovery mechanism the issues described above still appear: peers are connected, but querying the pubsub subsystem for peers subscribed to a topic gives bizarre results: out of 3 connected peers, one peer "sees" the 2 others, while the other 2 peers each see only one subscription to that topic. This results in "lonely peers" that do not get messages from other peers. It has happened to me a couple of times; the failure does not appear on every test run.
fixed by #132.
We've been running into some hiccups in OrbitDB recently as we've been upgrading from go-ipfs 0.4.13 to 0.4.16 (and now 0.4.17).
We've observed a behaviour where pubsub peers don't always connect to each other. Detailed information on the behaviour and what we've discovered so far is in https://github.com/orbitdb/orbit-db/pull/417. To summarize: the tests are run with two local go-ipfs nodes, both subscribe to the same pubsub topic, but sometimes (in about 50% of test runs) they don't seem to connect to each other at the pubsub level even though they're connected at the swarm level (see this comment https://github.com/orbitdb/orbit-db/pull/417#issuecomment-407002661).
Looking at the commit log in this repo, I noticed that the pubsub implementation was changed some time ago (yay, great to see gossip pubsub coming together!), so most likely this is a regression. I believe 0.4.13 was working fine, but we haven't verified which version after that stopped working.
If anyone has an idea why this is happening and could work on a fix, we would highly appreciate it. We're (finally) adding back support for go-ipfs in OrbitDB, but the tests are not running reliably due to this bug, so we're hesitant to make a new release until it's fixed (alternatively we could instruct users to stay on <= 0.4.13, but that's no fun).
If there's any information I can provide to solve this, please let me know!