Use multiple connections to a node simultaneously, with dynamic load balancing based on speed

MirceaKitsune commented 6 years ago

I recently started a discussion on the forum and the IPFS subreddit, in the wake of which I decided to open an issue here as well. The subject was out-of-the-box LAN support, and it spawned further discussion about IPFS being able to use multiple connections at once.

https://discuss.ipfs.io/t/communication-via-lan-rather-than-just-internet

I wanted to suggest the ability for IPFS nodes (go-ipfs and js-ipfs) to use multiple network connections simultaneously (over LAN and internet alike), even when communicating with the same node. I believe this could reduce the bandwidth load on any single connection and increase transfer speeds, by distributing the load across every connection that can carry the data.

To offer a practical example: Let's presume that I have two computers in my house, both connected to the same WiFi router (192.168.1.1 and 192.168.1.2). Both are connected to the internet, however they can also communicate through the LAN cable connecting them... for the sake of example let's say they are also connected through WiFi. Both computers therefore have the following 3 connections:

An internet connection (10 MB/s).
A local cable connection with each other (100 MB/s).
A local wireless connection with each other (50 MB/s).

Next the IPFS node on one computer wants to download something from the IPFS node on the other computer. From what I understand, IPFS can detect and is thus able to use any of those 3 connections (LAN + internet), so the question is about how it will use them.

Current functionality: IPFS will ping all 3 connections and see which replies first. The first that answers (cable LAN @ 100 MB/s) is considered the fastest and wins, the others are ignored. From what I understand, IPFS will stick to that connection indefinitely for that communication.
Proposed functionality: IPFS will ping all 3 connections and take note of how fast each one replies. It will then transfer pieces of the data through all of them, distributing the load so that the fastest handles the most traffic. New pings occur in the background every few seconds, re-determining the speeds and re-adjusting the balance dynamically. In this case: The 100 MB/s connection handles roughly 60% of the load, the 50 MB/s connection handles roughly 30%, and the 10 MB/s connection handles the remaining 10% or so.
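
To make the proposed split concrete, here is a minimal sketch of the arithmetic, not anything IPFS actually implements. The function names (proportional_shares, smooth), the connection labels and the smoothing factor alpha are all assumptions made up for illustration. A strictly proportional split of 100/50/10 MB/s works out to 62.5% / 31.25% / 6.25%, close to the rounded figures in the example. The second function shows one possible way the periodic re-measurement could be folded in (an exponential moving average), so that a single noisy ping does not swing the balance too hard.

```python
def proportional_shares(speeds):
    """Return each connection's fraction of the total measured speed."""
    total = sum(speeds.values())
    if total <= 0:
        raise ValueError("at least one connection must have a positive speed")
    return {name: speed / total for name, speed in speeds.items()}


def smooth(previous, measured, alpha=0.5):
    """Blend a fresh measurement into the previous estimate.

    alpha is a hypothetical tuning knob: 1.0 trusts only the newest
    measurement, values near 0.0 change the estimate very slowly.
    """
    return {
        name: alpha * measured[name] + (1 - alpha) * previous[name]
        for name in previous
    }


# Speeds from the example, in MB/s (labels are illustrative only).
speeds = {"cable": 100.0, "wifi": 50.0, "internet": 10.0}
shares = proportional_shares(speeds)

# A later background measurement sees the WiFi link drop to 10 MB/s.
speeds = smooth(speeds, {"cable": 100.0, "wifi": 10.0, "internet": 10.0})
rebalanced = proportional_shares(speeds)

for name in shares:
    print(f"{name}: {shares[name]:.2%} -> {rebalanced[name]:.2%}")
```

With alpha at 0.5 the WiFi estimate moves halfway towards the new measurement (from 50 to 30 MB/s), and the shares are recomputed from the smoothed values.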

I'm inclined to believe that proportionally distributing traffic across all available connections to another node is faster than using only the fastest connection detected, provided the distribution formula is efficient enough. The gain may not be large, but even a small difference can often have a noticeable effect for the end user.
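
As a rough way to test that intuition, the sketch below uses a simple greedy rule: each block goes to whichever connection would finish it earliest. This is one possible "distribution formula" chosen for illustration, not the one IPFS uses, and assign_blocks, the connection labels and the block sizes are all hypothetical. It also ignores latency and per-connection overhead, so it only shows the idealised best case.

```python
def assign_blocks(block_sizes, speeds):
    """Greedily assign each block to the connection that finishes it earliest.

    block_sizes: sizes in MB; speeds: MB/s per connection.
    Returns (assignment, finish_time), where assignment maps each
    connection to the list of block indices it should transfer.
    """
    busy_until = {name: 0.0 for name in speeds}
    assignment = {name: [] for name in speeds}
    for index, size in enumerate(block_sizes):
        # Pick the connection whose queue would complete this block soonest.
        best = min(speeds, key=lambda name: busy_until[name] + size / speeds[name])
        busy_until[best] += size / speeds[best]
        assignment[best].append(index)
    return assignment, max(busy_until.values())


speeds = {"cable": 100.0, "wifi": 50.0, "internet": 10.0}
blocks = [1.0] * 160  # 160 blocks of 1 MB each (made-up workload)

assignment, finish_time = assign_blocks(blocks, speeds)
fastest_only = sum(blocks) / max(speeds.values())

counts = {name: len(indices) for name, indices in assignment.items()}
print(counts, f"combined: {finish_time:.2f}s", f"cable only: {fastest_only:.2f}s")
```

Under these idealised assumptions the combined transfer finishes in about one second against 1.6 seconds over the cable alone, and the block counts end up close to proportional to the speeds without computing the percentages explicitly.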

makew0rld commented 6 years ago

I'd like to add that in the discuss link posted, someone mentioned that IPFS does work over LAN, it just might not be enabled by default. I couldn't find evidence of that anywhere. Is that true?

I've thought about IPFS over LAN before, and here's what I said:

IPFS uses a DHT (distributed hash table) which maps content hashes (Qm...) to peers that have that content cached, essentially a list of who has what. The DHT is constantly being updated, and allows you to access content. You want a piece of content, and the DHT says that certain peers have it, so you get it from them. At least that's my understanding. I also believe the bootstrap nodes are the ones who get you the DHT in the first place. If they are down, and your node isn't saving peers, then there's no IPFS for you. For IPFS over LAN to work, your node would scan the network for machines with a specific port open, and then test the port to make sure they give the proper response. This would only happen if there was no Internet. It would then use those nodes as bootstrap nodes, getting a local DHT from them. If there are no other nodes, and no Internet, it will still run, creating its own DHT for others to access if they join the LAN, but obviously it wouldn't be able to access much, only previously cached content.

I think @MirceaKitsune's response neglects the fact that getting content from peers is reliant on the DHT, and the DHT can't handle local peers. (Or am I misunderstanding?) My example above mentioned using a separate DHT if there was no Internet, and neglecting LAN connections when there was, but what @MirceaKitsune mentions about taking data from LAN and WAN simultaneously sounds really good, and could make the network much faster when there are multiple nodes on the same network. I'm not sure how that would work in the context of DHTs though. Maybe there could be a local DHT and a global one? And then your node would coordinate data between the two?

Kubuxu commented 6 years ago

@Cole128 we don't rely on the DHT; in a local network IPFS will use mDNS to discover peers. When we are connected to a peer we can fetch the content from it directly without using DHT.

makew0rld commented 6 years ago

@Kubuxu Thanks. Does it work without Internet, but also with Internet as @MirceaKitsune said? And how do you know who has what without a DHT?

mib-kd743naq commented 6 years ago

When we are connected to a peer we can fetch the content from it directly without using DHT.

@Kubuxu I would also like to know how this works. As an example I have a pathologically badly chunked 32 MB tgz test file on an unfirewalled node (QmZPX4mQ9tx1mhUj3RZCkF3Pf986qZXpSfGYcJsvXavGg3) which reprovides its pins every 6 hours. I have been experimenting with grabbing the data out of it over the past week at different times: it is always dog-slow (we are talking kilobytes per second). Is there a way to instruct ipfs to pull all blocks from my node directly at respectable speeds?

Edit: Here is the complete config of the server node in question

Kubuxu commented 6 years ago

@mib-kd743naq in short: start both nodes with ipfs daemon --dht=none, connect them directly, and use ipfs pin --progress for the transfer. That should be the fastest: there is no DHT overhead on the receiver side, and pin uses Bitswap sessions for additional optimization.

makew0rld commented 6 years ago

@Kubuxu but as we've been talking about, shouldn't IPFS work just as fast with the DHT as without? I.e. it checks local peers first to see if they have the requested content and gets it from them if they do. It could then check the global DHT to see if other peers have it, and get it from them simultaneously. Also, wouldn't having a local DHT be helpful in this scenario?
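
The lookup order described here can be sketched as follows. This is only an illustration of the idea in the comment, not how go-ipfs or js-ipfs resolve providers; find_providers, local_peers and global_lookup are hypothetical names, and the global lookup is stubbed with a plain function instead of a real DHT query.

```python
def find_providers(content_id, local_peers, global_lookup):
    """List peers that have content_id, local peers first.

    local_peers: dict mapping a peer name to the set of content IDs it holds
                 (stand-in for asking directly connected LAN peers).
    global_lookup: callable returning peer names for a content ID
                   (stand-in for a query against the global DHT).
    """
    # Local peers are checked first, so they end up at the front of the list.
    providers = [peer for peer, held in local_peers.items() if content_id in held]
    # Global results are appended afterwards, skipping peers already found locally.
    for peer in global_lookup(content_id):
        if peer not in providers:
            providers.append(peer)
    return providers


# Made-up data: one LAN peer holds the content, the stubbed DHT knows two peers.
local_peers = {"lan-peer": {"QmExample"}, "other-lan-peer": set()}


def global_lookup(content_id):
    return ["wan-peer", "lan-peer"] if content_id == "QmExample" else []


found = find_providers("QmExample", local_peers, global_lookup)
missing = find_providers("QmUnknown", local_peers, global_lookup)
print(found, missing)
```

A real node could then fetch from every peer in the returned list at once; the ordering only expresses a preference for the local peer.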

Kubuxu commented 6 years ago

Most of the overhead comes from announcing to the network that we have a given piece of data. We are working on resolving this by allowing the user to select provider strategies and by improving the DHT.

makew0rld commented 6 years ago

Wouldn't it be helpful for it to work autonomously though? So that local peers are automatically used?

ipfs / notes

Use multiple connections to a node simultaneously, with dynamic load balancing based on speed #346