ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

IPFS loses swarm connection while pinning #5977

Closed. markg85 closed this issue 5 years ago.

markg85 commented 5 years ago

Hi,

I'm playing with IPFS and pinning, and I might have discovered an oddity involving pinning and swarm connections.

The setup is as follows: one IPFS node on a cloud hosting provider and one IPFS node locally. Both are on the latest IPFS version (0.4.18). Both run with --routing=dhtclient. The server is running with IPFS_PROFILE=server.

Locally I added a large folder. On the cloud I'm pinning that same folder, and I'm grepping to check whether I'm still connected to my local machine: docker exec ipfs_host ipfs swarm peers | grep CID
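(For anyone trying to reproduce this, a rough sketch of the steps described above; the folder path, root CID, local peer ID, and the ipfs_host container name are placeholders for this particular setup, not exact values:)

$ ipfs add -r /path/to/large-folder                        # local machine; the last hash printed is the root CID
$ docker exec ipfs_host ipfs pin add <root CID>            # cloud machine; pins the folder from the local node
$ docker exec ipfs_host ipfs swarm peers | grep <local peer ID>   # cloud machine; checks the connection stays up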

Locally, in the web UI, I'm monitoring traffic to see when it's uploading. This shows quite notable gaps: https://i.imgur.com/1whgzx6.png

The server often quickly reconnects to the peer it is pinning from, but sometimes it takes a LONG while or just doesn't reconnect at all anymore (or so it seems). It takes so long that I manually reconnect the peer to the swarm on the server to resume uploading. As you can see in the image linked above, there were a lot of gaps, and in the end the transfer just stalled.

Both locally and on the cloud there were no internet connection issues that might have caused this. Also, it's very much repeatable. Just try the same setup yourself and you will probably see the same thing happening.

Also, most gaps happen to be spaced at roughly 90-second intervals. That might be a coincidence, as I ended up manually reconnecting over and over again until everything was pinned.

Best regards, Mark

raulk commented 5 years ago

This looks like an issue we fixed recently: https://github.com/libp2p/go-libp2p-kad-dht/pull/237#discussion_r251742560

Would you be able to build IPFS from master and try reproducing?

markg85 commented 5 years ago

> This looks like an issue we fixed recently: libp2p/go-libp2p-kad-dht#237 (comment)
>
> Would you be able to build IPFS from master and try reproducing?

If you provide me the commands for the docker ipfs image, yes gladly :)

raulk commented 5 years ago

@markg85 you can just fetch the master tag from Docker Hub: https://hub.docker.com/r/ipfs/go-ipfs/tags
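(For reference, a minimal sketch of switching the cloud container to that tag; the ipfs_host container name and the /data/ipfs host path are assumptions matching this setup, and the exact run flags may differ:)

$ docker pull ipfs/go-ipfs:master
$ docker stop ipfs_host && docker rm ipfs_host
$ docker run -d --name ipfs_host -e IPFS_PROFILE=server -v /data/ipfs:/data/ipfs -p 4001:4001 ipfs/go-ipfs:master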

markg85 commented 5 years ago

> @markg85 you can just fetch the master tag from Docker Hub: https://hub.docker.com/r/ipfs/go-ipfs/tags

Ehh, okay. The cloud node is now running the Docker master image. My local node (Arch Linux distribution package) is still on the latest release (0.4.18).

The master ipfs doesn't appear to be able to connect:

failure: dial attempt failed: <peer.ID QmSuFCF6> --> <peer.ID Qm5SHS8v> dial attempt failed: context deadline exceeded

raulk commented 5 years ago

On which machine are you executing the connect command? Is this local trying to connect to the cloud, or vice versa? Beware that your peer ID could have changed.

markg85 commented 5 years ago

I'm executing the command on the cloud node (whose ID changed) to connect to the local one (which remained as-is). I'm trying to build go-ipfs locally now, just to see if that would work, as both would then be from master.

raulk commented 5 years ago

Thanks. Just one note: I think your issue could be with the connection manager killing the session. You can try to increase the connection manager limits in the IPFS config.

https://github.com/ipfs/go-ipfs/blob/master/docs/config.md
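(For reference, the limits raulk mentions live under Swarm.ConnMgr in the config; a sketch of raising them, where the numbers are only illustrative and the defaults at the time were roughly LowWater 600 / HighWater 900 / GracePeriod 20s:)

$ ipfs config --json Swarm.ConnMgr.LowWater 1000
$ ipfs config --json Swarm.ConnMgr.HighWater 1500
$ ipfs config Swarm.ConnMgr.GracePeriod 40s

The daemon needs a restart for the new limits to take effect.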

markg85 commented 5 years ago

No, I won't. It's currently at the defaults, and that already causes the cloud provider to think I got hacked, due to thousands of connections in mere minutes, as if I'm attacking someone. I'm guessing that improved greatly with your p2p fixes and the recent Bitswap fixes. At least, I hope it did :)

raulk commented 5 years ago

Note that the connection manager and the swarm dialer limit are distinct. The connection rate (inflight dials) is governed by the swarm (what your cloud provider may be complaining about). That has improved with the DHT fixes. The connection manager is in charge of keeping open connections within bounds.

markg85 commented 5 years ago

I'm sorry, but I can't get this working at all anymore. Both instances now run on git master. Executing the swarm connect to my local IPFS node still gives:

failure: dial attempt failed: <peer.ID QmSuFCF6> --> <peer.ID Qm5SHS8v> dial attempt failed: context deadline exceeded

Is there any debug logging I can enable to help trace this? Note: I am online on IRC (markg85) in #ipfs.
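(For anyone tracing the same thing, a rough sketch of turning up logging on a 0.4.x node; the subsystem names and the exact environment variable can vary between versions:)

$ ipfs log level all debug          # raise the log level on the running daemon
$ ipfs log tail                     # stream the daemon's log output

$ IPFS_LOGGING=debug ipfs daemon    # alternatively, set the level at startup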

raulk commented 5 years ago

@markg85 was kind enough to pair with me on this. The issue is that, despite having a static mapping in his router for IPFS on port 4001, current master was discovering a wrong public port (1024, weirdly). This led to his address in the DHT being incorrect, and dials failing due to his NAT dropping the incoming traffic. ipfs swarm addrs local shows the incorrect port number. He will post more details shortly. This issue did not happen with 0.4.18.

markg85 commented 5 years ago

@raulk and I paired on IRC to debug this. It turns out that IPFS is advertising a multiaddr with a bad public port. For example, ipfs swarm addrs local gives (IPs anonymized):

/ip4/127.0.0.1/tcp/4001
/ip4/123.123.123.123/tcp/1025
/ip6/::/tcp/4001
/ip6/::1/tcp/4001

Meanwhile, I have port 4001 open and forwarded. It shows port 1025 in this case, which is wrong.
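(A possible workaround, not a fix: explicitly announce the forwarded address so the wrongly observed port isn't relied on; the IP below is the anonymized placeholder from above:)

$ ipfs config --json Addresses.Announce '["/ip4/123.123.123.123/tcp/4001"]'

After a daemon restart, ipfs id should advertise the forwarded tcp/4001 address instead of the bogus tcp/1025 one.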

raulk commented 5 years ago

@markg85 can you post the equivalent output from 0.4.18, please? Thanks again.

markg85 commented 5 years ago

And as I just tested, 0.4.18 has the same issue.

$ ipfs version
ipfs version 0.4.18

$ ipfs swarm addrs local
/ip4/10.0.3.50/tcp/4001
/ip4/127.0.0.1/tcp/4001
/ip4/123.123.123.123/tcp/1024
/ip6/::1/tcp/4001

markg85 commented 5 years ago

Just a friendly reminder: a new go-ipfs has been released. I had hoped this bug would be magically fixed, but the new version apparently didn't fix it.

Both my local and remote machines now run 0.4.19! Both run IPFS in Docker from the latest image.

On my remote machine there is no port 1024. Good! On my local machine I do still see port 1024 being advertised!

The local machine has a clean IPFS setup (data and config). The remote one kept its setup from the previous version.

Please take a look at this. It causes swarm connections to "sometimes" fail and "sometimes" work.

markg85 commented 5 years ago

How can I raise the attention of the right people for this issue? I have a feeling that the people who need to know about this don't, which causes new releases to be shipped with this very same bug still present.

Stebalien commented 5 years ago

We are working on this but it's just not the only thing we're working on fixing. @raulk is the right person.

markg85 commented 5 years ago

I would suggest marking this a blocker for the next release.

Stebalien commented 5 years ago

That's not going to get the problem fixed any faster, just delay other fixes.

markg85 commented 5 years ago

I understand, but do know that this bug prevents making a connection at all. That little side effect alone should make it quite a high priority.

On the other hand, I have it, but others don't seem to be bothered by it at all. So it might just be occurring with certain router vendors, or some other special, non-obvious factor. And when just using IPFS (i.e. not running commands, but just using it to browse the "IPFS internet"), there seems to be nothing wrong.

remmerw commented 5 years ago

@markg85 I have the same issue with wrong ports being advertised (ipfs id). My observation is: when I run one IPFS node behind a router, after a period of time it reports the public IP of the router with the swarm port (4001) -> this can be correct when doing port forwarding. But when you run a second node behind the same router (on a different machine), it advertises the public IP address and port 1025 (sometimes 1024) [I haven't figured out why yet]. When you run a third node behind the same router (yet another machine), it just increases the port number by one and advertises that. I am not an expert in NAT, but it looks like an issue.

markg85 commented 5 years ago

@remmerw That might be something. Or at least something that makes this easier for the devs to investigate.

In my case, however, I've only ever had one node running behind the router. Never more.

voidao commented 5 years ago

@markg85 @remmerw @raulk Seems like I've got a pretty similar issue (local desktop node fails to swarm connect to a remote cloud node)!

~ $ ipfs swarm connect /ip4/1*.10.6.19/tcp/4009/ipfs/...
failure: context deadline exceeded

~ $ jsipfs swarm connect /ip4/1*.10.6.19/tcp/4009/ipfs/...
No available transports to dial peer

Some clues/findings:

  1. It used to work pretty well and almost always succeeded, but it suddenly ran into trouble without any change.
  2. A newly initialized node (with a new repo location and peer ID, locally) would succeed in connecting, but ran into the issue later on.
  3. Based on 1 & 2, I guess the cause may be some restriction on the remote/cloud side that is triggered by IPFS-related networking operations.

Stebalien commented 5 years ago

@voidao that's likely unrelated to this issue. "Cloud" nodes don't have NAT issues.

WRT this issue, the core problem is that IPFS doesn't know how you've configured your router. It has to guess as best it can.

It does this by:

  1. Asking the router to forward a port using UPnP (and related protocols).
  2. Opening outbound connections using the same port on which it receives connections, and tracking the addresses observed by peers. Many routers will consistently map the same external port to the same internal port, so the external port observed by our peers can often be re-used for inbound connections.

Unfortunately, it doesn't look like either of those are working in this case.


I'm going to close this in favor of https://github.com/libp2p/go-libp2p/issues/559 as that's an actionable solution to this issue.
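(For readers hitting the same symptom, a rough way to check both mechanisms on a 0.4.x node; field names may differ slightly between versions:)

$ ipfs config Swarm.DisableNatPortMap    # false means the UPnP/NAT-PMP port mapping is attempted
$ ipfs config Addresses.Swarm            # the configured listen addresses (normally tcp/4001)
$ ipfs swarm addrs local                 # the addresses the node currently believes it has
$ ipfs id                                # the addresses actually advertised to peers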

voidao commented 5 years ago

@Stebalien Thank you for the detailed explanation! It makes sense to me, and I guess it's caused by the router or something else in the NAT environment.