Equipping the Hydra-Booster with the BFR (Accelerate the .Provide for really large files -> millions of records)

Our baby hydra -- https://github.com/libp2p/hydra-booster -- is growing up to become a super useful type of node that can significantly accelerate .FindPeers and .FindProviders queries in the IPFS network.

What is missing to complete the picture is the ability to accelerate .Provide queries as well, so that nodes storing a lot of data can tell the network what they are storing without incurring a huge time and bandwidth cost.

The particular challenge with providing a large file is that you need to provide one record for each block (each IPLD node) to support random access to the file. Just for reference, a 100GB file turns into roughly 1M blocks when added to IPFS. That means 1M different records have to be put in the DHT, each in several different locations.
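To get a feel for the scale, here is a rough back-of-the-envelope sketch. The 1M block figure is taken from the text above; the 256 KiB chunk size and the replication factor of 20 are assumptions for illustration, and only leaf blocks are counted (intermediate DAG nodes are ignored), so treat the output as an order-of-magnitude estimate rather than an exact number.

```python
def estimate_leaf_blocks(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of fixed-size chunks (leaf blocks) needed to cover the file."""
    # Integer ceiling division, avoids float rounding on large sizes.
    return -(-file_size_bytes // chunk_size_bytes)


def total_record_puts(num_blocks: int, replication: int) -> int:
    """Each block gets one provider record, stored on `replication` peers."""
    return num_blocks * replication


if __name__ == "__main__":
    # Assumption: 256 KiB fixed-size chunks.
    blocks = estimate_leaf_blocks(100 * 10**9, 256 * 1024)
    print("leaf blocks at 256 KiB chunks:", blocks)

    # Assumption: each record is stored on the 20 closest peers.
    print("record puts for 1M blocks:", total_record_puts(1_000_000, 20))
```

Whatever the exact chunk size, the point stands: the number of individual record puts lands in the millions, and each one needs a DHT lookup first.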

What makes things worse is that we end up crawling the DHT multiple times to keep finding nodes that match the "XOR metric closest peers to CID", sometimes having to dial the same peer multiple times. This is highly inefficient.
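For readers less familiar with the "XOR metric closest peers to CID" idea, here is a minimal sketch of it. It assumes that both peer IDs and CIDs are mapped into a common 256-bit keyspace with SHA-256 (which is, as far as I know, what the libp2p Kademlia DHT does) and that distance is the XOR of the two keys read as an integer. The function names and the byte-string IDs are made up for illustration, and a real node only knows a fraction of the peers, which is why it has to crawl.

```python
import hashlib


def kad_key(data: bytes) -> int:
    """Map a peer ID or CID (as raw bytes) into the 256-bit keyspace."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style distance: XOR of the two keys, compared as integers."""
    return kad_key(a) ^ kad_key(b)


def closest_peers(peer_ids, cid: bytes, k: int = 20):
    """Return the k known peers whose keys are XOR-closest to the CID's key."""
    return sorted(peer_ids, key=lambda p: xor_distance(p, cid))[:k]


if __name__ == "__main__":
    # Hypothetical peer IDs and CID, just to exercise the function.
    peers = [f"peer-{i}".encode() for i in range(1000)]
    for peer in closest_peers(peers, b"some-block-cid", k=3):
        print(peer, xor_distance(peer, b"some-block-cid"))
```

Every block of a large file has its own CID, so this "find the closest peers" step has to be repeated for each of the records, and neighbouring lookups often end up at peers that were already dialed.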

A way to improve this (that has been proposed) is to have nodes with very large routing tables, so that the path from the provider to the node that will host the provider record is 1~3 hops at most. This does improve things but is still not ideal, especially for services that provide IPFS pinning, as they will still have to dial into the network a huge number of times to put those records.

So, a question arises: What if they were already all over the network? That's where the hydra-booster comes in. A hydra node has many heads across the network and can answer .FindProviders queries from multiple locations simultaneously.

If we pair a pinning service with a hydra node, the pinning service would only have to tell the hydra node about its records and then let the hydra node do its job. That's it. For this, what we would have to have is a special flag that would tell an ipfs node to put all the provider records in a hydra node.

This would be the first step. The second would be to add a Thermal Dissipation load balancing strategy that replicates records to the closest peers. This enables the hydra nodes to replicate each record to the closest peers of each of their sybils (i.e. each hydra head), so that nodes in that neighborhood hold copies of the record as well, increasing redundancy and resilience to churn.

> a special flag that would tell an ipfs node to put all the provider records in a hydra node

Maybe do this without a special flag on the ipfs node, and instead have the hydra send a special flag and a unique ID for the whole hydra?

This would allow a client to mark this connection as more important, so that it is not closed soon after the first provide.

And if there's a larger queue of things to provide, the node could ask the hydra, with a special query, for the list of node IDs of its heads.

This would allow addressing a hydra over just one persistent connection for the whole provide process, covering all of its heads. If there are enough hydras in the network, this would drastically reduce the number of connections that need to be established and terminated, and crawling the DHT would get much quicker too. All this without losing the fallback of connecting to random nodes, which is what makes a healthy DHT worth all the trouble, and without building a centralized infrastructure.
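A rough sketch of what the client side of this could look like, once it has the list of head IDs: bucket the queued CIDs by whichever head is XOR-closest, then hand each bucket over through the one persistent connection. Everything here is hypothetical. No "list heads" query exists today as far as I know, the IDs are made-up byte strings, and the SHA-256 keyspace mapping is the same assumption as in a plain Kademlia lookup.

```python
import hashlib
from collections import defaultdict


def kad_key(data: bytes) -> int:
    """Map a head ID or CID (as raw bytes) into the 256-bit keyspace."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def group_by_closest_head(head_ids, cids):
    """Assign every queued CID to the hydra head that is XOR-closest to it."""
    head_keys = {head: kad_key(head) for head in head_ids}
    batches = defaultdict(list)
    for cid in cids:
        target = kad_key(cid)
        closest = min(head_ids, key=lambda h: head_keys[h] ^ target)
        batches[closest].append(cid)
    return dict(batches)


if __name__ == "__main__":
    # Hypothetical result of asking one hydra for its head IDs.
    heads = [f"hydra-head-{i}".encode() for i in range(50)]
    # Hypothetical provide queue.
    queue = [f"block-{i}".encode() for i in range(10_000)]

    batches = group_by_closest_head(heads, queue)
    print("connections needed:", 1)  # one persistent connection to the hydra
    print("heads receiving records:", len(batches))
```

This only picks the best head inside one hydra. Whether that head is actually among the closest peers in the whole network still depends on how many heads and hydras there are, which is why the fallback to regular DHT lookups matters.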

If you're a pinning service and you run some hydras, you can just add one node ID from each hydra (picked at random) to your bootstrap or persistent connection list. If the heads of the hydras are distributed enough and there are enough of them, you would end up with the same result, but without having to rely on a single hydra doing the node's job for it.

ipfs/notes issue #430