ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Writeup of router kill issue #3320

Closed: whyrusleeping closed this issue 10 months ago

whyrusleeping commented 7 years ago

So we know that ipfs can kill people's routers. We should do a quick write-up of what the causes are, which routers are typically affected, and maybe propose a couple of ideas for solutions.

@Kubuxu do you think you could handle doing this at some point?

donothesitate commented 7 years ago

My theory is that it exhausts/overloads the NAT table, which on some routers causes lockups. UDP on the same routers keeps working without problems, as do TCP connections that were already open when the lockup occurred.

Possible solution: have a switch to limit the number of peers/connections. Related: #3311
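
For reference, later go-ipfs versions expose exactly such a switch via the `Swarm.ConnMgr` section of the config file (the same section quoted further down in this thread); a minimal example, with illustrative values rather than tuned recommendations:

```json
{
  "Swarm": {
    "ConnMgr": {
      "Type": "basic",
      "LowWater": 100,
      "HighWater": 200,
      "GracePeriod": "20s"
    }
  }
}
```

As later comments note, these are target bounds rather than hard caps, so the daemon may still exceed `HighWater` temporarily.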

ghost commented 7 years ago

That sounds highly likely. nf_conntrack_max on my edge router is set to 1024 by default and ipfs eats 700 of those on its own, per computer I'm running it on.

A lot of those are dead connections too: if I open the webui, which tries to ping them, the count quickly drops to 90 or so.

hsanjuan commented 7 years ago

Running 5 daemons on the local network with a well-known hash (they were pinning dist) kills my Fritzbox.

AFAIK Fritzboxes are held in high esteem as very good routers, not some shitty hardware. The internet reports a NAT table size of around 7000. I find the problem is exacerbated when my nodes are pinning popular content (I suspect this not only consumes all the bandwidth but also increases the number of connections as other peers try to download these blocks).

Kubuxu commented 7 years ago

So my idea of what happens is that the conntrack table fills up (it is small in cheap routers, bigger in good ones) and it starts throwing out other connections. @hsanjuan can you repeat the test, kill the ipfs daemons, and check if the router comes back online?

hsanjuan commented 7 years ago

@Kubuxu yes, things are back up immediately when I kill them. Only once did the router reboot itself, which worried me more.

Kubuxu commented 7 years ago

So the other possibility is that cheap routers have a bigger conntrack limit than their RAM can handle, and the kernel panics or locks up. Not sure how to check that.

whyrusleeping commented 7 years ago

Does UDP eat up conntracker entries? We're moving quickly towards having support for QUIC.

Kubuxu commented 7 years ago

AFAIK, yes. At least going by the time my services were DDoSed with UDP packets, which were much more destructive because of low conntrack limits.

hsanjuan commented 7 years ago

Is it possible that this problem got much worse in recent releases (i.e. >=0.4.5)? I used to be able to run 4 nodes without problems and now it seems I can't, even after clearing their contents.

kakra commented 7 years ago

I'm having issues, too. Maybe ipfs should keep two connection pools and migrate peer connections from a bad-quality pool to a good-quality pool by applying some heuristics to the peers. Peers with higher delays, lower bandwidth and short lifetimes would live in the "bad pool" and be easily replaced by new peers when connection limits are hit. Better peers would migrate to the "good pool" and only be replaced by better peers when limits are hit. Having both pools gives slow peers a chance to be part of the network without being starved by higher-quality peers, which is important for a p2p distributed network.
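
The two-pool idea could be sketched roughly as below. All the types and scores here are hypothetical illustrations, not the libp2p API; the score would come from heuristics like latency, bandwidth and connection age:

```go
package main

import (
	"fmt"
	"sort"
)

// Peer is a hypothetical peer record with a quality score derived
// from latency, bandwidth and connection age (higher is better).
type Peer struct {
	ID    string
	Score float64
}

// Pools holds a capped "good" pool and a capped "bad" pool.
type Pools struct {
	good, bad       []Peer
	goodCap, badCap int
}

// Add places a new peer in the bad pool, evicting the lowest-scoring
// peer if the pool is full.
func (p *Pools) Add(np Peer) {
	p.bad = append(p.bad, np)
	sort.Slice(p.bad, func(i, j int) bool { return p.bad[i].Score > p.bad[j].Score })
	if len(p.bad) > p.badCap {
		p.bad = p.bad[:p.badCap] // drop the worst peer
	}
}

// Promote moves the best bad-pool peer into the good pool when the
// good pool has room, or when it outscores the worst good-pool peer.
func (p *Pools) Promote() {
	if len(p.bad) == 0 {
		return
	}
	best := p.bad[0]
	if len(p.good) < p.goodCap {
		p.good = append(p.good, best)
		p.bad = p.bad[1:]
		return
	}
	sort.Slice(p.good, func(i, j int) bool { return p.good[i].Score > p.good[j].Score })
	if worst := p.good[len(p.good)-1]; best.Score > worst.Score {
		p.good[len(p.good)-1] = best
		p.bad[0] = worst // demote instead of dropping entirely
	}
}

func main() {
	p := &Pools{goodCap: 2, badCap: 2}
	p.Add(Peer{"a", 0.2})
	p.Add(Peer{"b", 0.9})
	p.Add(Peer{"c", 0.5}) // evicts "a", the lowest score
	p.Promote()           // "b" moves up to the good pool
	fmt.Println(len(p.good), len(p.bad))
}
```

The point of the demotion step in Promote is exactly the "give slow peers a chance" property: a displaced good peer falls back to the bad pool rather than being dropped outright.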

BTW, UDP also needs connection tracking, so this wouldn't help here, and UDP tracking tables are usually much smaller and much more short-lived, which adds a lot of new problems. But UDP could probably lower the need for bandwidth, as there's no implicit retransmission and no ACK. Of course, the protocol has to be designed to handle packet loss, and it must take into account that NAT gateways usually drop UDP connection-table entries much faster. It doesn't make sense to deploy UDP and then reimplement retransmission and keep-alives, as this would replicate TCP with no benefit (it would probably even lower performance).

Also, ipfs should limit the number of outstanding packets, not just the number of connections. If there are too many packets in flight, it should throttle further communication with peers, maybe prioritizing some over others. This way it could also auto-tune to the available bandwidth, but I'm not sure.

Looking at what BBR does for network queues, it may be better to throw away some requests instead of queuing up a huge backlog. This can improve overall network performance; bloated buffers are a performance killer. I'd like to run ipfs 24/7, but if it increases my network latency I simply cannot, which hurts widespread deployment.

Maybe ipfs needs to measure latency and throw away slowly responding peers. For this to work properly, it needs to auto-adjust to the available bandwidth, because once network queues fill up, latency spikes exponentially and the aforementioned latency measurement becomes useless.

These big queues are also a problem with many routers, as they tend to use huge queues to increase total bandwidth in benchmarks, but this totally kills latency, and thus prevents important services like DNS from working properly.

I'm running a 400/25 Mbps asymmetric link here, and as soon as "ipfs stats bw" gets beyond a certain point, everything else chokes: browsers become unusable, waiting tens of seconds for websites or ending in DNS errors. Once a web request does get through in such a situation, the website appears almost immediately and completely (minus assets hosted on other hosts), so this is clearly an upstream issue with queues and buffers filling up and improper prioritization (ACKs still seem to pass through the queues early, otherwise downloads would be reduced too).

I don't know if QUIC would really help here... It mainly reduces initial round-trip times (which HTTP/2 also does), which is not really the issue here, as I consider ipfs a bulk-transfer tool, not a latency-sensitive one like web browsing.

Does ipfs properly use TOS/QoS flags in IP packets?

PS: ipfs should not try to sidestep TCP/IP's auto-tuning capabilities by moving to UDP. Instead, it should be nice to competing traffic by keeping latency below a sane limit and letting TCP do the bandwidth tuning. And it should be nice to edge-router equipment (which is most of the time cheap and cannot be avoided) by limiting outstanding requests and the total number of connections. I remember when Windows XP tried to fix this in the TCP/IP stack by limiting outstanding TCP handshakes to ten, then blocking everything else globally. That was a silly idea, but it was thinking in the right direction, I guess.

dsvi commented 6 years ago

I think you might as well not do anything at all, since routers are getting consistently better at supporting higher numbers of connections. My 5-year-old one struggled to support 2 ipfs nodes (about 600 connections each) plus torrents (500 connections). I just got a cheap Chinese one, and it works like a charm. Most cheap routers nowadays have hardware NAT; they don't much care how many connections you throw at them. Also, switching to UDP doesn't help: when I unleashed torrents far beyond the 500-connection limit, it killed the old router just as reliably as ipfs did. And that torrent client uses only UDP.

ghost commented 6 years ago

@dsvi: I'd rather not have to pay hard cash just to use IPFS on the pretence that it's fine to be badly behaved because some other software can be misconfigured to crash routers. A lot of people don't even have the luxury of being allowed to connect to their ISP using their own hardware.

And what a strawman you've picked — a Bittorrent client! A system that evolved its defaults based on fifteen years real world experience for precisely this reason!

No thanks, just fix the code.

kakra commented 6 years ago

@dsvi I wonder if they use their own routers because the page times out upon request... ;-)

But please do not suggest that: many people are stuck with whatever their provider delivers, with no chance to swap that equipment for better stuff. Ipfs not only has to be nice to such equipment but to the overall network traffic on that router, too: if it makes the rest of my traffic unusable, there's no chance for ipfs to evolve, because nobody, or only very few, could run it 24/7. Ipfs won't reach its goal if people only start it on demand.

dsvi commented 6 years ago

Sorry guys, I should have expressed it better. I'll try again from another direction ;)

  1. The Internet is becoming decentralized in general. This is a natural trend visible everywhere, from secure instant messaging and filesharing to decentralized email systems and so on. And creating tons of connections is a natural part of such systems: they are distributed, and to work effectively they have to support tons of connections (distribution channels). It's unavoidable in general. There can be improvements here and there, but it's fundamentally "unfixable".
  2. Hardware vendors have already acknowledged that. Modern router chipsets are way better in this regard nowadays, since all the hardware review sites include at least torrent tests in their review suites. So you no longer need to spend $200+ for something that handles this well. And a year from now it will only get way better, since vendors tend to offload a lot of routing work to hardware. So it already is not much of a problem, and will be even less so with every year.

And what about people who are stuck with relic hardware for whatever reason? Well, I feel sorry for some of them, but progress will go on with or without them.

Calmarius commented 6 years ago

@dsvi

"Internet world is becoming decentralized in general. "

Nope! It's becoming centralized. Almost the whole internet is served by a handful of datacenter companies. For most people, search means Google, e-mail means Gmail, social interaction means Facebook, video means YouTube, chat means Facebook Messenger, picture sharing means Instagram. The rest of the web is hosted at one of the few largest datacenter companies.

At the beginning we used to have Usenet and IRC servers running on our computers at home. Then services got more and more centralized.

I don't see signs of any decentralization, but I do see signs of further centralization. For example, some ISPs don't even give you a public IP address anymore (for example, 4G networks).

"And creating tons of connections is a natural part of such systems."

Having too many simultaneous connections makes the system inefficient. If you have enough peers to saturate your bandwidth it's pointless to add more.

Currently my IPFS daemon opens 2048 connections to peers within several hours, then runs out of file descriptors and becomes useless. This should be fixed.

vext01 commented 5 years ago

I'm using a crappy TalkTalk router provided by the ISP and I've been unable to find a configuration where IPFS doesn't drag my internet connection to its knees.

Using ifstat I usually see between 200 kB/s and 1 MB/s up and down while ipfs is connected to a couple of hundred peers.

I'd like to try connecting to fewer peers, but even with:

      "LowWater": 20,
      "HighWater": 30,

ipfs still connects to hundreds.

vext01 commented 5 years ago

Perhaps this is a dumb question, but why don't you make it so that IPFS stops connecting to more peers once the high water mark is reached?

Stebalien commented 5 years ago

We should implement a max-connections limit, but high/low water marks are really designed as target bounds.

The libp2p team is currently refactoring the "dialer" system in a way that'll make it easy for us to configure a maximum number of outbound connections. Unfortunately, there's really nothing we can do about inbound connections except kill them as soon as we can. On the other hand, having too many connections usually comes from dialing.
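
The "target bounds" behaviour of the high/low water marks can be sketched as a simple trim loop. This is only an illustration of the idea, not the real libp2p connection manager (which also honors grace periods and protection tags):

```go
package main

import (
	"fmt"
	"sort"
)

// Conn is a hypothetical open connection with a last-use timestamp
// used to decide which connections are least valuable.
type Conn struct {
	Peer     string
	LastUsed int64 // unix seconds
}

// Trim closes the least recently used connections whenever the count
// exceeds highWater, cutting back down to lowWater. This is the
// "target bounds" behaviour: a range the manager steers toward,
// not a hard cap on dials.
func Trim(conns []Conn, lowWater, highWater int) []Conn {
	if len(conns) <= highWater {
		return conns
	}
	sort.Slice(conns, func(i, j int) bool {
		return conns[i].LastUsed > conns[j].LastUsed // most recent first
	})
	return conns[:lowWater] // drop everything past the low-water mark
}

func main() {
	conns := []Conn{{"a", 10}, {"b", 30}, {"c", 20}, {"d", 40}}
	trimmed := Trim(conns, 2, 3) // 4 connections > HighWater 3, trim to 2
	fmt.Println(len(trimmed), trimmed[0].Peer)
}
```

Note how nothing in this scheme prevents the count from climbing above HighWater between trims, which is why users further up in the thread still saw hundreds of connections despite low water marks.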

Stebalien commented 5 years ago

Note: there's actually another issue here. I'm not sure limiting the max number of open connections will really fix this problem. I haven't tested this, but I'm guessing that many routers have problems with connection velocity (the rate at which we try to establish connections), not simply with holding a bunch of connections. That's because routers often need to remember connections for a period of time even after they've closed.

@vyzo's work on NAT detection and autorelay should help quite a bit, unless I'm mistaken.

kakra commented 5 years ago

A work-around could be to limit the number of opening connections (in contrast to opened connections) - thus reducing the number of connection attempts running at the same time. I think this could be much more important than limiting the number of total connections.

If such a change propagated through the network, it should also reduce the amount of overwhelming incoming connection attempts - especially those with slow handshaking because the sending side is not that busy with opening many connections at the same time.

Stebalien commented 5 years ago

We actually do that (mostly to avoid running out of file descriptors). We limit ourselves to opening at most 160 TCP connections at the same time.
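
Capping simultaneous in-flight dials (as opposed to total open connections) is commonly done with a buffered-channel semaphore in Go. A minimal sketch of that idea, not go-ipfs's actual implementation; runDials and its parameters are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runDials launches `total` simulated dials but lets at most `limit`
// run concurrently, using a buffered channel as a semaphore.
// It returns the observed peak concurrency.
func runDials(total, limit int) int64 {
	sem := make(chan struct{}, limit)
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{} // acquire a dial slot (blocks when full)
			n := atomic.AddInt64(&inFlight, 1)
			for { // record the high-water mark of concurrent dials
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			atomic.AddInt64(&inFlight, -1)
			<-sem // release the slot when the handshake finishes
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	peak := runDials(50, 4)
	fmt.Println("peak concurrent dials:", peak, "(limit 4)")
}
```

The real daemon's figure of 160 would simply be the channel capacity here; the semaphore guarantees the peak never exceeds it regardless of how many dials are requested.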

kakra commented 5 years ago

@Stebalien Curious, since when? Because I noticed a while ago that running IPFS no longer chokes DNS resolution of my router...

Stebalien commented 5 years ago

We've been doing it for quite a while; however, we may have changed some constants. I'm not sure.

hsanjuan commented 5 years ago

IPFS also chokes DNS resolution on my router @kakra (particularly when building dists, which downloads large files). I haven't checked whether there have been any improvements, but I will keep an eye out. Do you have a Fritzbox too?

kakra commented 5 years ago

@hsanjuan No, it's a Vodafone DOCSIS 3.1 router, 1 gigabit connection.

raulk commented 5 years ago

The connection manager currently works with hard bounds set by configuration. Do you think it's feasible to "sense" appropriate limits from the environment by allowing connections to stay alive until we hit a hiccup, at which point we back off?

Kubuxu commented 5 years ago

Another "router kill" mechanism I've observed is SYN and RST storms when ipfs starts up and shuts down. I sometimes had an ISP router go down when starting and stopping ipfs.

magik6k commented 5 years ago

SYN and RST storms

This was reported in libp2p-swarm some time ago: https://github.com/libp2p/go-libp2p/issues/1550

raulk commented 5 years ago

This was reported in libp2p-swarm some time ago: libp2p/go-libp2p#1550

With the new DHT dial queue, this should not happen any longer: https://github.com/libp2p/go-libp2p-kad-dht/pull/237

However, I was more interested in knowing whether there's an event (or set thereof) that we can interpret as a high-fidelity signal that we're tripping up the router, so we can back off and adjust our connection manager limits adaptively.

Kubuxu commented 5 years ago

However, I was more interested in knowing whether there's an event (or set thereof) that we can interpret as a high-fidelity signal that we're tripping up the router, so we can back off and adjust our connection manager limits adaptively.

Hard to tell; many of them will continue forwarding connections without any problem because they are on the fast path, until they give up.

vext01 commented 5 years ago

With the new DHT dial queue, this should not happen any longer

Great! I'm going to try on my DSL router later!

AndreaCensi commented 5 years ago

On my Mac, IPFS kills (in addition to my router) some applications as well --- e.g. Chrome, which I suspect can no longer do DNS queries, but I am not sure.

What is the current workaround? Can IPFS be set up so that the daemons on my PCs connect only to my server in the cloud, and through it to the entire network? (I do not mean making a private network with the swarm key, but using one known "gateway".)

andrew commented 5 years ago

This still regularly happens on my home network, killing both the router and the wifi extender.

victorb commented 5 years ago

Same here. Development is currently infeasible with anything involving go-ipfs (I'm using ipfslite for Bolivar, but suffer from the same problem).

Just running a node for about ~5 minutes (enough time to connect to 30 peers) kills my router, which is a standard-issue router from my ISP.

Stebalien commented 5 years ago

@victorb could you try running in DHT client mode? I want to see if that makes any difference (relates to #6283).

AndreaCensi commented 5 years ago

@Stebalien FWIW, following the advice in other threads, I am using --routing dhtclient, which makes the situation a bit better, but the problem persists.

Is there a previous version of go-ipfs where this problem is reduced? My problem, like @victorb's, is that I would like to set up a stable development environment with the basic ipfs functionality.

Stebalien commented 5 years ago

No. The problem is likely to be significantly worse in previous versions. This is a long standing issue we've been whittling away at, not a regression.

victorb commented 5 years ago

Thanks @Stebalien. Did the following change: https://github.com/open-services/bolivar/pull/6/files

Seems to have stopped killing my router at home, at least after running the tests a couple of times (without that change, it gets killed every time I run the tests).

While it's good that it's now working, the application does not work as intended after that change, as it won't help local peers find content anymore, making it pretty useless.

Wondering if another kind of DHT client could be introduced: one where the node doesn't respond to all DHT requests, only to ones coming from the local network.

Stebalien commented 5 years ago

Wondering if another DHT client could be introduced. One where the node doesn't respond to all DHT requests, only to ones coming from the local network.

Unfortunately, that won't work if any one of your local nodes is a member of the global DHT. That node will include you in their routing table and tell their peers about you.

as it won't help local peers find content anymore, making it pretty useless.

So, bitswap should still work. Are you trying to get this to work in an offline/disconnected context where local peers can't search the global DHT?

Honestly, a local peer exchange seems like the best solution for issues like this.

victorb commented 5 years ago

Unfortunately, that won't work if any one of your local nodes is a member of the global DHT. That node will include you in their routing table and tell their peers about you.

Hm, I see... I guess what I'm trying to figure out is how to get local (as in same-local-network) peers to collaborate on sending data, while avoiding the issue of ipfs/libp2p killing people's routers.

So, bitswap should still work. Are you trying to get this to work in an offline/disconnected context where local peers can't search the global DHT?

Exactly. The use case is that there are two peers that are locally connected and may or may not have an internet connection. Either way, they should still be able to fetch content from each other, as they both resolve links from within the same hash.

I've done some tests with the DHT client, and my assumption seems correct: they won't be able to find content from each other when both are running as DHT clients. So while it's nice that the router doesn't die anymore, it kind of defeats the p2p use-case, as each node needs to reach something that is not just a DHT client.

Honestly, a local peer exchange seems like the best solution for issues like this.

Is this something that already exists or you're talking about something that might exists in the future?

Stebalien commented 5 years ago

I've done some tests with the DHT client, and my assumption seems correct: they won't be able to find content from each other when both are running as DHT clients. So while it's nice that the router doesn't die anymore, it kind of defeats the p2p use-case, as each node needs to reach something that is not just a DHT client.

MDNS should cause these peers to connect to each other automatically. However, we may need to improve MDNS support.

We could also try using the DHT for local rendezvous (i.e., who else has my same public IP?).

remon-nashid commented 4 years ago

Have there been any recent efforts to tackle this issue? It still renders IPFS useless.

Stebalien commented 4 years ago

Have there been any recent efforts to tackle this issue?

Yes.

remon-nashid commented 4 years ago

Thanks @Stebalien! That sounds quite promising. While IPFS keeps progressing, is there a set of settings that could help avoid the issue without compromising on IPFS features? There are many suggestions in this thread, but some could be outdated or just ineffective. A standard set of settings, preferably in the IPFS documentation, would be awesome.

Stebalien commented 4 years ago

Unfortunately, the best solution is to replace your router.

Barring that, you can try running IPFS in client-only mode by running the daemon with ipfs daemon --routing=dhtclient. This will lead to significantly fewer inbound connections.

chevdor commented 4 years ago

@Stebalien any list anywhere of what you consider a bad/good router?

Stebalien commented 4 years ago

I've never found a router I've been truly happy with. I use a tp-link (TL-R600VPN) and it seems to hold up to IPFS just fine, but it has its own problems (we actually have a work-around in libp2p to deal with them).

SunsBR commented 3 years ago

It's 2020, this started in 2016, and it still crashes routers. I have a fiber ISP router (cannot change it) with a symmetric 100 Mbps up/down connection. The connection count goes up really fast and then I cannot even ping the router (the local LAN keeps working fine). No other p2p software, running on a good number of machines, ever locked up the router. It takes a while to come back after killing the daemon, probably a watchdog resetting the router. ipfs version 0.5.1.

yxuco commented 3 years ago

Same issue here using a Linksys CG7500 cable modem/router on the Comcast network. The router reboots frequently when the ipfs daemon is running. The network is more stable after I set the following config properties, which have kept the number of peer connections below 500 most of the time. However, the performance of curl https://ipfs.io/ipfs/mycid for a file pinned on the local node gradually degraded over time, and after 15 minutes it returns only a Couldn't connect to server error. Does that mean the local daemon became non-responsive to inbound requests even though it was still running with no errors?

    "Swarm": {
        "ConnMgr": {
            "GracePeriod": "10s",
            "HighWater": 90,
            "LowWater": 60,
            "Type": "basic"
        },
RubenKelevra commented 3 years ago

I suspect that the issue is not the number of open connections, but the number of connection attempts to private networks, which the router tries to push to its default route. Since the ISP has no route for them, they probably don't get answered with ICMP but are just silently dropped, which leaves the router unable to clean the entries up in a timely manner.

By default, around 1/3 of all connection attempts go to private networks.
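
Go's standard library can classify such addresses directly (net.IP.IsPrivate covers the RFC 1918 ranges and has been available since Go 1.17), so a dialer could, in principle, filter them before they ever reach the router. A sketch of that filter; dialable is a made-up helper, not a go-ipfs function:

```go
package main

import (
	"fmt"
	"net"
)

// dialable reports whether an address is worth dialing from a node
// on someone else's network: private (RFC 1918), loopback and
// link-local destinations would just be dropped upstream by the ISP.
func dialable(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	return !ip.IsPrivate() && !ip.IsLoopback() && !ip.IsLinkLocalUnicast()
}

func main() {
	for _, a := range []string{"192.168.1.10", "10.0.0.5", "127.0.0.1", "8.8.8.8"} {
		fmt.Println(a, dialable(a))
	}
}
```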

I wrote about this issue here.