Closed: whyrusleeping closed this issue 10 months ago
My theory is that it exhausts/overloads the NAT table, which on some routers causes lockups. UDP on the same routers can keep working without problems, as can TCP connections that were already open when the lockup occurred.
Possible solution: have a switch to limit the number of peers/connections. Related: #3311
That sounds highly likely. `nf_conntrack_max` on my edge router is set to 1024 by default, and ipfs eats 700 of those on its own, per computer I'm running it on.
A lot of those are dead connections, too: if I open the webui, which tries to ping them, the count quickly drops to 90 or so.
Running 5 daemons on my local network with a well-known hash (they were pinning dist) kills my Fritzbox.
AFAIK Fritzboxes are held in high esteem as very good routers, not shoddy hardware. The internet reports a NAT table size of around 7000. I find the problem is exacerbated when my nodes are pinning popular content (I suspect this not only consumes all the bandwidth but also increases the number of connections as other peers try to download these blocks?).
So my idea of what happens is that the conntrack table fills up (it is small in cheap routers, bigger in good ones) and the router starts throwing out other connections. @hsanjuan can you repeat the test, kill the ipfs daemons, and check if it comes back online?
@Kubuxu yeah, things are back up immediately when I kill them. Only once did the router reboot itself, which worried me more.
So the other possibility is that cheap routers have a bigger conntrack limit than their RAM can handle, and the kernel panics or locks up. Not sure how to check that.
Does UDP eat up conntrack entries? We're moving quickly toward having support for QUIC.
AFAIK, yes. At least that was my experience when my services were DDoSed with UDP packets: they were much more destructive because of low conntrack limits.
Is it possible that this problem got much worse in recent releases (i.e. >=0.4.5)? I used to be able to run 4 nodes without problems and now it seems I can't, even after cleaning their contents.
I'm having issues, too. Maybe ipfs should keep two connection pools and migrate peer connections from a bad-quality pool to a good-quality pool by applying some heuristics to the peers. Peers with higher delays, lower bandwidth and short lifetimes would live in the "bad pool" and be easily replaced by new peers if connection limits are hit. Better peers would migrate to the "good pool" and only be replaced by better peers if limits are hit. Having both pools gives slow peers a chance to be part of the network without being starved out by higher-quality peers, which is important for a p2p distributed network.
BTW, UDP also needs connection tracking, so this wouldn't help here, and UDP tracking tables are usually much smaller and much more short-lived, which adds a lot of new problems. But UDP could probably lower the need for bandwidth, as there's no implicit retransmission and no ack. Of course, the protocol has to be designed to handle packet loss, and it must take into account that NAT gateways usually drop UDP connection-table entries much faster. It doesn't make sense to deploy UDP and then reimplement retransmission and keep-alives, as this would replicate TCP with no benefit (it would probably even lower performance).
Also, ipfs should limit the number of outstanding packets, not the number of connections itself. If there are too many packets in flight, it should throttle further communication with peers, maybe prioritizing some over others. This way it could also auto-tune to the available bandwidth, but I'm not sure.
Looking at what BBR does for network queues, it may be better to throw away some requests instead of queuing up a huge backlog. This can improve overall network performance; bloated buffers are a performance killer. I'd like to run ipfs 24/7, but if it increases my network latency, I simply cannot, which hurts widespread deployment.
Maybe ipfs needs to measure latency and throw away slowly responding peers. For this to work properly, it needs to auto-adjust to the available bandwidth, because once network queues fill, latency spikes up exponentially and the aforementioned latency measurement becomes useless.
These big queues are also a problem with many routers, as they tend to use huge queues to increase total bandwidth in benchmarks, but this totally kills latency and thus prevents important services like DNS from working properly.
I'm running a 400/25 Mbps asymmetric link here, and as soon as "ipfs stats bw" gets beyond a certain point, everything else chokes: browsers become unusable, waiting tens of seconds for websites or running into DNS errors. Once a web request does come through in such a situation, the website appears almost immediately and completely (minus assets hosted on other hosts), so this is clearly an upstream issue with queues and buffers filling up and improper prioritization (ACKs still seem to pass through the queues early, otherwise download speed would drop, too).
I don't know if QUIC would really help here... It just reduces initial round-trip times (which HTTP/2 also does), which is not really an issue here, as I consider ipfs a bulk-transfer tool, not a latency-sensitive one like web browsing.
Does ipfs properly use TOS/QoS flags in IP packets?
PS: ipfs should not try to avoid TCP/IP's auto-tuning capabilities by moving to UDP. Instead it should be nice to competing traffic by keeping latency below a sane limit and letting TCP do the bandwidth tuning. And it should be nice to edge-router equipment (which is usually cheap and cannot be avoided) by limiting outstanding requests and the total number of connections. I remember when Windows XP tried to fix this in the TCP/IP stack by limiting outstanding TCP handshakes to ten, globally blocking everything else. That was a silly idea, but it was thinking in the right direction, I guess.
I think you might as well not do anything at all, since routers are getting consistently better at supporting higher numbers of connections. My 5-year-old router struggled to support 2 ipfs nodes (about 600 connections each) + torrents (500 connections). I've just got a cheap Chinese one, and it works like a charm. Most routers nowadays, even cheap ones, have hardware NAT; they don't much care how many connections you throw at them. Also, switching to UDP doesn't help: when I unleashed torrents far beyond the 500-connection limit, that killed the old router just as well as ipfs did, and that torrent client uses only UDP.
@dsvi: I'd rather not have to pay hard cash just to use IPFS on the pretence that it's fine to be badly behaved because some other software can be misconfigured to crash routers. A lot of people don't even have the luxury of being allowed to connect to their ISP using their own hardware.
And what a strawman you've picked: a BitTorrent client! A system that evolved its defaults from fifteen years of real-world experience for precisely this reason!
No thanks, just fix the code.
@dsvi I wonder if they use their own routers because the page times out upon request... ;-)
But please do not suggest that: many people are stuck with what their provider delivers, with no chance to swap that equipment for better stuff. Ipfs not only has to be nice to such equipment but also to the overall network traffic on that router: if it makes the rest of my traffic unusable, there's no chance for ipfs to evolve, because nobody, or only very few people, could run it 24/7. Ipfs won't reach its goal if people only start it on demand.
Sorry guys, I should have expressed it better. I'll try again from another direction ;)
And what about people who are stuck with relic hardware for whatever reason? Well, I feel sorry for some of them, but progress will go on with or without them.
@dsvi
"Internet world is becoming decentralized in general. "
Nope! It's becoming centralized. Almost the whole internet is served by a handful of datacenter companies. For most people search means Google, e-mail means Gmail, social interaction means Facebook, video means YouTube, chat means Facebook Messenger, picture sharing means Instagram. The rest of the web is hosted at one of the few largest datacenter companies.
At the beginning we used to have Usenet and IRC servers running on our computers at home. Then services got more and more centralized.
I don't see signs of any decentralization, but I do see signs of further centralization. For example, some ISPs don't even give you a public IP address anymore (e.g. on 4G networks).
"And creating tons of connections is a natural part of such systems."
Having too many simultaneous connections makes the system inefficient. If you have enough peers to saturate your bandwidth it's pointless to add more.
Currently my IPFS daemon opens 2048 connections to peers within several hours, then runs out of file descriptors and becomes useless. This should be fixed.
I'm using a crappy TalkTalk router provided by the ISP and I've been unable to find a configuration where IPFS doesn't drag my internet connection to its knees.
Using ifstat I usually see between 200 kB/s and 1 MB/s up and down while ipfs is connected to a couple of hundred peers.
I'd like to try connecting to fewer peers, but even with:

```json
"LowWater": 20,
"HighWater": 30,
```

ipfs still connects to hundreds.
Perhaps this is a dumb question, but why don't you make it so that IPFS stops connecting to more peers once the high water mark is reached?
We should implement a max-connections limit, but high/low water are really designed to be target bounds.
The libp2p team is currently refactoring the "dialer" system in a way that'll make it easy for us to configure a maximum number of outbound connections. Unfortunately, there's really nothing we can do about inbound connections except kill them as soon as we can. On the other hand, having too many connections usually comes from dialing.
Note: there's actually another issue here. I'm not sure if limiting the max number of open connections will really fix this problem. I haven't tested this, but I'm guessing that many routers have problems with connection velocity (the rate at which we (try to) establish connections), not simply with having a bunch of open connections. That's because routers often need to remember connections for a period of time even after they've closed.
@vyzo's work on NAT detection and autorelay should help quite a bit, unless I'm mistaken.
A workaround could be to limit the number of opening connections (in contrast to open connections), thus reducing the number of connection attempts running at the same time. I think this could be much more important than limiting the total number of connections.
If such a change propagated through the network, it should also reduce the number of overwhelming incoming connection attempts, especially those with slow handshakes, because the sending side would not be as busy opening many connections at the same time.
We actually do that (mostly to avoid running out of file descriptors). We limit ourselves to opening at most 160 TCP connections at the same time.
@Stebalien Curious, since when? Because I noticed a while ago that running IPFS no longer chokes DNS resolution of my router...
We've been doing it for quite a while; however, we may have changed some constants. I'm not sure.
IPFS also chokes DNS resolution on my router, @kakra (particularly when building dists, which downloads large files). I haven't checked whether there are any improvements, but I will keep an eye on it. Do you have a Fritzbox too?
@hsanjuan No, it's a Vodafone DOCSIS 3.1 router on a 1 gigabit connection.
The connection manager currently works with hard bounds set by configuration. Do you think it's feasible to "sense" appropriate limits from the environment by allowing connections to stay alive until we hit a hiccup, at which point we back off?
Another "router kill" mechanism I've observed is SYN and RST storms when ipfs starts working or is being shut down. I've had an ISP router go down sometimes when starting and stopping ipfs.
SYN and RST storms
This was reported in libp2p-swarm some time ago: https://github.com/libp2p/go-libp2p/issues/1550
With the new DHT dial queue, this should not happen any longer: https://github.com/libp2p/go-libp2p-kad-dht/pull/237
However, I was more interested in knowing whether there's an event (or set thereof) that we can interpret as a high-fidelity signal that we're tripping up the router, so we can back off and adjust our connection manager limits adaptively.
However, I was more interested in knowing whether there's an event (or set thereof) that we can interpret as a high-fidelity signal that we're tripping up the router, so we can back off and adjust our connection manager limits adaptively.
Hard to tell, many of them will continue forwarding connections without any problem because they are on the fast path, until they give up.
With the new DHT dial queue, this should not happen any longer
Great! I'm going to try it on my DSL router later!
On my Mac, IPFS kills (in addition to my router) some applications as well, e.g. Chrome, which I suspect can no longer complete DNS queries, but I am not sure.
What is the current workaround? Can IPFS be set up so that the daemons on my PCs connect only to my server in the cloud, and through it to the entire network? (I do not mean making a private network with a swarm key, but using one known "gateway".)
This still regularly happens on my home network, killing both the router and the wifi extender.
Same here. Development is currently infeasible with anything involving go-ipfs (I'm using ipfslite for Bolivar, but suffering from the same problem).
Just running a node for about ~5 minutes (enough time to connect to 30 peers) kills my router, which is the standard-issue router from my ISP.
@victorb could you try running in DHT client mode? I want to see if that makes any difference (relates to #6283).
@Stebalien FWIW, following the advice in other threads, I am using `--routing=dhtclient`, which makes the situation a bit better, but the problem persists.
Is there a previous version of go-ipfs where this problem is reduced? Like @victorb, my problem is that I would like to set up a stable development environment with the basic ipfs functionality.
No. The problem is likely to be significantly worse in previous versions. This is a long standing issue we've been whittling away at, not a regression.
Thanks @Stebalien. Did the following change: https://github.com/open-services/bolivar/pull/6/files
Seems to have stopped killing my router at home, at least after running the tests a couple of times (without that change, it gets killed every time I run the tests).
While it's good that it's now working, the application does not work as intended after that change, as it won't help local peers find content anymore, making it pretty useless.
Wondering if another DHT client could be introduced. One where the node doesn't respond to all DHT requests, only to ones coming from the local network.
Wondering if another DHT client could be introduced. One where the node doesn't respond to all DHT requests, only to ones coming from the local network.
Unfortunately, that won't work if any one of your local nodes is a member of the global DHT. That node will include you in their routing table and tell their peers about you.
as it won't help local peers find content anymore, making it pretty useless.
So, bitswap should still work. Are you trying to get this to work in an offline/disconnected context where local peers can't search the global DHT?
Honestly, a local peer exchange seems like the best solution for issues like this.
Unfortunately, that won't work if any one of your local nodes is a member of the global DHT. That node will include you in their routing table and tell their peers about you.
Hm, I see... I guess what I'm trying to figure out is how to get local (as in same local network) peers to collaborate on sending data, while avoiding the issue of ipfs/libp2p killing people's routers.
So, bitswap should still work. Are you trying to get this to work in an offline/disconnected context where local peers can't search the global DHT?
Exactly. The use case is that there are two peers that are locally connected and might or might not have an internet connection. Either way, they should still be able to fetch content from each other, as they are both using the same hash to resolve links from within that hash.
I've done some tests with the DHT client, and my assumption seems correct: they won't be able to find content from each other when both are running as DHT clients. So while it's nice that the router doesn't die anymore, it kind of makes the p2p use-case useless, as a node needs to reach something that is not just a DHT client.
Honestly, a local peer exchange seems like the best solution for issues like this.
Is this something that already exists, or are you talking about something that might exist in the future?
I've done some tests with the DHT client, and my assumption seems correct: they won't be able to find content from each other when both are running as DHT clients. So while it's nice that the router doesn't die anymore, it kind of makes the p2p use-case useless, as a node needs to reach something that is not just a DHT client.
MDNS should cause these peers to connect to each other automatically. However, we may need to improve MDNS support.
We could also try using the DHT for local rendezvous (i.e., who else has my same public IP?).
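That rendezvous idea could look roughly like this: every node behind the same public IP derives the same DHT key and announces itself under it. Both the key scheme and the prefix below are entirely hypothetical, not an existing libp2p API:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// rendezvousKey derives a DHT key shared by all nodes behind the
// same public IP. Nodes would "provide" this key in the DHT and
// look up other providers to find LAN neighbors. The prefix and
// scheme are made up for illustration.
func rendezvousKey(publicIP string) [32]byte {
	return sha256.Sum256([]byte("/local-rendezvous/" + publicIP))
}

func main() {
	a := rendezvousKey("203.0.113.7")
	b := rendezvousKey("203.0.113.7")
	fmt.Println(a == b) // true: same public IP, same rendezvous key
}
```

One caveat implied by the thread: this only helps when the global DHT is reachable at all, which is exactly the case the offline/disconnected scenario above lacks, so it complements MDNS rather than replacing it.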
Have there been any recent efforts to tackle this issue? This issue still renders IPFS useless.
Have there been any recent efforts to tackle this issue?
Yes.
Thanks @Stebalien! That sounds quite promising. While IPFS keeps progressing, is there a set of settings that could help to avoid the issue, while not compromising on IPFS features? There are many suggestions in this thread but some could be outdated or just ineffective. A standard set of settings, preferably in IPFS documentation, would be awesome.
Unfortunately, the best solution is to replace your router.
Barring that, you can try running IPFS in client-only mode by running the daemon with `ipfs daemon --routing=dhtclient`. This will lead to significantly fewer inbound connections.
@Stebalien any list anywhere of what you consider a bad/good router?
I've never found a router I've been truly happy with. I use a tp-link (TL-R600VPN) and it seems to hold up to IPFS just fine, but it has its own problems (we actually have a work-around in libp2p to deal with them).
It's 2020, this started in 2016, and it still crashes routers. I have a fiber ISP router (cannot change it) with a symmetric 100 Mbps up/down connection. The connection number goes up really fast and soon I cannot even ping the router (the local LAN continues working fine). Other p2p software running on a good number of machines has never locked up the router. It takes a while to come back after I kill the daemon, probably a watchdog resetting the router. ipfs version 0.5.1.
Same issue here using a Linksys CG7500 cable modem/router on the Comcast network. The router reboots frequently when the ipfs daemon is running. The network is more stable after I set the following config properties, which have kept the number of peer connections below 500 most of the time.
However, the performance of `curl https://ipfs.io/ipfs/mycid` gradually degraded over time for a file pinned on the local node, and after 15 minutes it returns only a "Couldn't connect to server" error. Does this mean the local daemon became non-responsive to inbound requests even though it was still running with no errors?
```json
"Swarm": {
  "ConnMgr": {
    "GracePeriod": "10s",
    "HighWater": 90,
    "LowWater": 60,
    "Type": "basic"
  }
}
```
I suspect that the issue is not the number of open connections, but the number of connection attempts to private networks, which the router tries to push out via its default route. Since the ISP has no route for them, they probably don't get answered with ICMP but are just dropped, which leaves the router unable to clear the entries in a timely manner.
By default around 1/3 of all connection attempts go to private networks.
I wrote about this issue here.
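One mitigation along these lines is to skip dialing private, loopback, and link-local ranges entirely (go-ipfs exposes address filtering via the `Swarm.AddrFilters` config; the sketch below is only an illustration of the idea using Go's standard library):

```go
package main

import (
	"fmt"
	"net"
)

// dialable reports whether dialing addr could ever succeed through
// the ISP's default route. Private, loopback, and link-local ranges
// are unroutable from the outside, so dialing them just clutters
// the router's NAT table with entries that never get answered.
func dialable(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	return !ip.IsPrivate() && !ip.IsLoopback() && !ip.IsLinkLocalUnicast()
}

func main() {
	for _, a := range []string{"192.168.1.7", "10.0.0.5", "8.8.8.8"} {
		fmt.Println(a, dialable(a))
	}
}
```

If roughly 1/3 of attempts go to private ranges as claimed above, a filter like this would cut connection-attempt volume through the router by about a third.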
So we know that ipfs can kill people's routers. We should do a quick write-up of what the causes are, which routers are typically affected, and maybe propose a couple of ideas for solutions.
@Kubuxu do you think you could handle doing this at some point?