Provide fewer nodes - Githubissues

eingenito commented 5 years ago

Problem

Providing is an expensive operation and even with significant DHT speedups is likely to remain so. The growth of the IPFS network has made this problem more acute. It would be beneficial if IPFS instances could choose to provide only a subset of all nodes without dramatically impacting the availability and retrieval speed of the data.

Notes

The Go-IPFS team began discussion of this problem in our October 2018 London meetup.

@Kubuxu devised an algorithm and associated visualization that demonstrated the selection of good nodes for providing.
A requirement for all partial providing strategies is associated changes in bitswap (or whatever) that allow it to walk 'up' the dag from a node, searching for providers of ancestor nodes (since the current node may never have been provided).

eingenito commented 5 years ago

Some conversation paraphrased from IRC:

@steb: I think the higher priority work will be the reverse link graph. We're going to run into problems with provider strategies unless we have some way to traverse abck up a dag and find providers for parents. @eingenito: I wonder if the reverse link graph needs to be persisted or if it can be transient and owned by a bitswap session? Note: I think that there are other features enabled by a persistent database of per node metadata and this might a good place to start.

@eingenito: I think Kuba's algorithm was maximally efficient but required analysis of an entire dag to select optimal nodes for providing? Is that right? @Kubuxu : no, it was nowhere near maximum efficiency but it required minimal information about the DAG. It just walked a DAG and made decisions depending on current node (and single number about the past state of the tree as we walked it).

@eingenito - Also, given how expensive providing is always likely to be I wonder if it isn't more useful to assign a priority to a given node wrt providing and then just provide in priority order knowing that many will simply never get provided.

Edit(kubux): s/@kuba/Kubuxu/ (someone else has handle kuba on GH.

Kubuxu commented 5 years ago

The core idea about probabilistic providing bases on the fact that when someone has a given node there is a good chance he has its children. When we are missing a node we can try walking up the tree of nodes and try to resolve providers as we backtrack.

The essential part is having information about backlinks from the current node. This can be implemented in a number of fashions:

backlink/graph DB (hard/costly)
tracking backlinks as we recurse a tree
storing root and path to the current node as we recurse a tree

The probabilistic providing strategy has to optimize which nodes to provide depending on a number of factors:

A: target ratio of nodes we want to provide
B: maximum+average walk up backlinks to find a single provider
C: average walk up backlinks to find M of N providers

Why I propse probabilistic instread of deterministc providing strategy is that in the former the C is much smaller than B, this isn't true in the latter (while maintianing the same A).

I've also discovered a way to describe basic probabilistic providing strategies as markov chains, I used it in my simulations (as markov chains simplify to matrix multiplications). I will explain it if there is interest in it.

eingenito commented 5 years ago

@Kubuxu thanks for the synopsis, it's very helpful. I have an observation, already mentioned above, but I want to restate it. Providing is so expensive, and the load on IPFS nodes so variable that I think it might always be important to model providing (and reproviding) as a best effort service. A probabilistic scheme for choosing good blocks to provide might be best used as a source of weights or priorities for all blocks so the re/provider system can appropriately order actual writing to the DHT. That way we can optimize providing under changing conditions and backfill lower priority provider records when conditions allow. Does that make sense?

Kubuxu commented 5 years ago

It makes a lot of sense. Another thing is, we still have a lot of headway to improve our DHT.

ipfs / kubo

Provide fewer nodes #6155

Problem

Notes