ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Pubsub discovery seems to be broken when using MDNS only #7757

Open tchardin opened 3 years ago

tchardin commented 3 years ago

Version information:

github.com/ipfs/go-ipfs v0.7.0

Description:

When subscribing and publishing to the same topic, two IPFS nodes connected to each other do not receive messages. When I use the swarm API to list connected peers, the other peer does appear in the list; however, when listing pubsub peers, the list is empty. When I swap in my own libp2p host, it works as expected. You can check out this repo, which features a basic example demonstrating the issue: simply run the app in separate terminal windows and you can see the swarm is connected but pubsub is not. You can also uncomment the libp2p host to see how it should work. Please let me know if I am misunderstanding something or missing a configuration needed to enable pubsub discovery on top of the swarm. Thanks for your time!
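
A minimal sketch of the pattern the example exercises, assuming two go-ipfs nodes already constructed as libraries with the pubsub option enabled (node construction omitted; apiA and apiB are hypothetical CoreAPI handles, not names from the linked repo; imports of context, fmt, and log are assumed):

// Sketch only: apiA and apiB are iface.CoreAPI handles
// (github.com/ipfs/interface-go-ipfs-core) for two running nodes.
ctx := context.Background()
const topic = "test-topic"

// Node A subscribes to the topic.
sub, err := apiA.PubSub().Subscribe(ctx, topic)
if err != nil {
	log.Fatal(err)
}
defer sub.Close()

// Node B publishes once the swarm shows the connection.
if err := apiB.PubSub().Publish(ctx, topic, []byte("hello")); err != nil {
	log.Fatal(err)
}

// On an affected setup, Next blocks indefinitely even though
// apiB.Swarm().Peers(ctx) lists node A as connected.
msg, err := sub.Next(ctx)
if err != nil {
	log.Fatal(err)
}
fmt.Printf("got %q from %s\n", msg.Data(), msg.From())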

welcome[bot] commented 3 years ago

Thank you for submitting your first issue to this repository! A maintainer will be here shortly to triage and review. In the meantime, please double-check that you have provided all the necessary information to make this process easy! Any information that can help save additional round trips is useful! We currently aim to give initial feedback within two business days. If this does not happen, feel free to leave a comment. Please keep an eye on how this issue will be labeled, as labels give an overview of priorities, assignments and additional actions requested by the maintainers.

Finally, remember to use https://discuss.ipfs.io if you just need general support.

wolfgang commented 3 years ago

I have noticed a similar problem, and in my case it happens randomly: in 10 runs of subscribe/publish pairs, at least one publish fails because the publishing peer does not have the subscriber as a pubsub peer. I have written a "minimal" JavaScript program that demonstrates this:

Do N times:

1. node 1 subscribes to a fresh topic
2. node 2 waits, with a timeout, until node 1 appears in its list of pubsub peers for that topic
3. node 2 publishes a message that node 1 should receive

In my case this always fails before the 10th run, and it fails because node 2 never lists node 1 as a pubsub peer for the topic.

Code is here (yarn && yarn start): https://github.com/wolfgang/ipfs-pubsub-test
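
For reference, a rough Go transliteration of that loop against the go-ipfs CoreAPI (a sketch, not the linked JS code; ctx, api1, and api2 are hypothetical handles for the two nodes, options is github.com/ipfs/interface-go-ipfs-core/options, and imports of fmt, log, and time are assumed):

for i := 0; i < 10; i++ {
	topic := fmt.Sprintf("test-%d", i)
	sub, err := api1.PubSub().Subscribe(ctx, topic)
	if err != nil {
		log.Fatal(err)
	}
	// Wait up to a few seconds for node 2 to list node 1 as a
	// pubsub peer for the topic; on affected runs this times out.
	deadline := time.Now().Add(5 * time.Second)
	for {
		peers, err := api2.PubSub().Peers(ctx, options.PubSub.Topic(topic))
		if err != nil {
			log.Fatal(err)
		}
		if len(peers) > 0 {
			break
		}
		if time.Now().After(deadline) {
			log.Fatalf("run %d: node 2 never saw node 1 as a pubsub peer", i)
		}
		time.Sleep(100 * time.Millisecond)
	}
	if err := api2.PubSub().Publish(ctx, topic, []byte("ping")); err != nil {
		log.Fatal(err)
	}
	sub.Close()
}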

aschmahmann commented 3 years ago

I ran the code from https://github.com/tchardin/ipfs-pubsub-test and noticed that the peers weren't connected at all, which has nothing to do with pubsub. I suspect there are some configuration issues (or perhaps an MDNS bug).

Did you only experience this locally? Have you experienced this with the go-ipfs binary, or only as a library? If as a library, could you post the configuration of your node(s) (ipfs config show)?

alexjc commented 3 years ago

@aschmahmann The code by @wolfgang waits for the peers before doing pubsub. It fails at test 1/100 or 2/100 for me, with a 100% failure rate.

I've noticed this being unreliable since mid-2019, but haven't been able to find a solution, or a workaround that's production-quality...

tchardin commented 3 years ago

> I ran the code from https://github.com/tchardin/ipfs-pubsub-test and noticed that the peers weren't connected at all, which has nothing to do with pubsub. I suspect there are some configuration issues (or perhaps an MDNS bug).

@aschmahmann do you mean both the peers and ipfsPeers logs display 0? When I run it, I see the swarm peers are connected but I have no pubsub peers.

I am using go-ipfs as a library. I experienced this locally using MDNS (I removed the bootstrap peers so I would only see my local peers, and they all connect as expected). The logs show MDNS is working and the swarm is connected. When printing my config I get:

{
  "Identity": {
    "PeerID": "QmYDdMJ9iLMk3YSXb6Z6BCGA8vUBa3KB1qWHzEyTpiiW4r",
    "PrivKey": "..."
  },
  "Datastore": {
    "StorageMax": "10GB",
    "StorageGCWatermark": 90,
    "GCPeriod": "1h",
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "HashOnRead": false,
    "BloomFilterSize": 0
  },
"Addresses": {
    "Swarm": [
      "/ip4/0.0.0.0/tcp/0"
    ],
    "Announce": [],
    "NoAnnounce": [],
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Gateway": "/ip4/127.0.0.1/tcp/8080"
  },
  "Mounts": {
    "IPFS": "/ipfs",
    "IPNS": "/ipns",
    "FuseAllowOther": false
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true,
      "Interval": 10
    }
  },
  "Routing": {
    "Type": "dht"
  },
  "Ipns": {
    "RepublishPeriod": "",
    "RecordLifetime": "",
    "ResolveCacheSize": 128
  },
  "Bootstrap": [],
  "Gateway": {
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "RootRedirect": "",
    "Writable": false,
    "PathPrefixes": [],
    "APICommands": [],
    "NoFetch": false,
    "NoDNSLink": false,
    "PublicGateways": null
  },
  "API": {
    "HTTPHeaders": {}
  },
  "Swarm": {
    "AddrFilters": null,
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "EnableRelayHop": false,
    "EnableAutoRelay": false,
    "Transports": {
      "Network": {},
      "Security": {},
      "Multiplexers": {}
    },
 "ConnMgr": {
      "Type": "basic",
      "LowWater": 600,
      "HighWater": 900,
      "GracePeriod": "20s"
    }
  },
  "AutoNAT": {},
  "Pubsub": {
    "Router": "",
    "DisableSigning": false
  },
  "Peering": {
    "Peers": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "UrlstoreEnabled": false,
    "ShardingEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false
  },
  "Plugins": {
    "Plugins": null
  }
}

aphelionz commented 3 years ago

Chiming in here to say that it seems to be broken for OrbitDB tests as well; see the ipfs-log builds failing on CircleCI.

The call to ipfs.pubsub.peers that seems to be reporting no peers connected is here. Note that it works locally but fails in CI.

Here's a peek at an strace from the CI run:

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 36
connect(36, {sa_family=AF_INET, sin_port=htons(33699), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(13, EPOLL_CTL_ADD, 36, {EPOLLOUT, {u32=36, u64=36}}) = 0
epoll_wait(13, [{EPOLLOUT, {u32=36, u64=36}}], 1024, 0) = 1
getsockopt(36, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
write(36, "POST /api/v0/pubsub/peers?arg=XX"..., 219) = 219
epoll_ctl(13, EPOLL_CTL_ADD, 36, {EPOLLIN, {u32=36, u64=36}}) = -1 EEXIST (File exists)
epoll_ctl(13, EPOLL_CTL_MOD, 36, {EPOLLIN, {u32=36, u64=36}}) = 0
epoll_wait(13, [], 1024, 0)             = 0
epoll_wait(13, [{EPOLLIN, {u32=36, u64=36}}], 1024, 199) = 1
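
For anyone reproducing: the write call above is hitting the node's HTTP RPC API. The same check can be issued directly from Go (a sketch; 5001 is the default API port, "my-topic" is a placeholder, and imports of net/http, io/ioutil, fmt, and log are assumed):

// Query the pubsub peers for a topic over the HTTP RPC API.
resp, err := http.Post("http://127.0.0.1:5001/api/v0/pubsub/peers?arg=my-topic", "", nil)
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
body, _ := ioutil.ReadAll(resp.Body)
fmt.Println(string(body)) // e.g. {"Strings":["QmPeer..."]} once peers are seen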
aschmahmann commented 3 years ago

@aphelionz are you noticing that pubsub isn't registering shared peers, even though IPFS/libp2p reports that the peers are connected?

That wouldn't be extremely surprising on its own: if local MDNS didn't work in CircleCI, that would be unfortunate but not completely insane. What's surprising me here (and what I haven't yet been able to reproduce) is getting the peers connected (e.g. ipfs swarm peers shows the other peer) and yet, even after waiting a few seconds, two pubsub peers subscribed to the same topic do not learn that the other one is part of that topic.

aphelionz commented 3 years ago

> pubsub isn't registering shared peers, even though IPFS/libp2p reports that the peers are connected?

That's exactly correct. ipfs.swarm.peers shows that the peers are connected but ipfs.pubsub.peers does not, even though they are both subscribed to the same topic. I hadn't seen this issue before.

tchardin commented 3 years ago

Fwiw, I tried older versions back to v5.0.0 and it doesn't work on previous versions of ipfs either. Is there a working IPFS pubsub e2e test I could run locally to check whether this has anything to do with my local configuration? This is a blocker for a product I'm working on now, and I'd love to avoid using 2 libp2p hosts to make it work. Thanks for your help!

alexjc commented 3 years ago

@tchardin I don't think there are workarounds for this; I tried multiple things, also going back ~18 months. The E2E tests we have (like @wolfgang's above) also fail... It's not clear it ever worked.

tchardin commented 3 years ago

Yeah, I had to implement a custom IPFS node with my own libp2p config, and it works fine now. It's still unclear where the issue is coming from.
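
For anyone else landing here, a minimal sketch of that kind of workaround: build a bare libp2p host with gossipsub directly instead of going through go-ipfs (discovery and connection wiring omitted; the older go-libp2p constructor that takes a context is assumed):

package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func main() {
	ctx := context.Background()

	// Older go-libp2p releases take a context; newer ones drop it.
	host, err := libp2p.New(ctx)
	if err != nil {
		panic(err)
	}
	ps, err := pubsub.NewGossipSub(ctx, host)
	if err != nil {
		panic(err)
	}
	topic, err := ps.Join("test-topic")
	if err != nil {
		panic(err)
	}
	sub, err := topic.Subscribe()
	if err != nil {
		panic(err)
	}
	// After connecting this host to the other peer (not shown),
	// publish and wait for a message to arrive.
	_ = topic.Publish(ctx, []byte("hello"))
	msg, err := sub.Next(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %q from %s\n", msg.Data, msg.ReceivedFrom)
}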

aschmahmann commented 3 years ago

I think this is basically the result of a race condition: IPFS loads up more things at startup than a basic libp2p node does, which makes the race more likely to occur. Potential fix at: https://github.com/libp2p/go-libp2p-pubsub/pull/393

aphelionz commented 3 years ago

Thank you so much @aschmahmann! Will this be in some patch release soon, and/or available on npm?

aschmahmann commented 3 years ago

@aphelionz, no problem :smile:. This should make it into go-ipfs v0.8.0 RC2

tchardin commented 3 years ago

Updating this issue, as it only occurs with MDNS discovery. Pubsub works fine when peers are connected via bootstrap and DHT discovery. I also tried running with the latest pubsub fix and it doesn't work either. Renaming the issue to be more precise. Thanks!
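
Since the comment above reports that pubsub behaves once peers are connected by other means, one workaround worth trying while the MDNS path is broken is to dial the other node explicitly instead of relying on MDNS, e.g. via the CoreAPI (a sketch; the multiaddr and peer ID are placeholders, api is a hypothetical CoreAPI handle, and imports of github.com/multiformats/go-multiaddr and github.com/libp2p/go-libp2p-core/peer are assumed):

// Explicitly dial the other node before publishing (placeholder address).
addr, err := multiaddr.NewMultiaddr("/ip4/192.168.1.2/tcp/4001/p2p/<peer-id>")
if err != nil {
	log.Fatal(err)
}
pi, err := peer.AddrInfoFromP2pAddr(addr)
if err != nil {
	log.Fatal(err)
}
if err := api.Swarm().Connect(ctx, *pi); err != nil {
	log.Fatal(err)
}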