eingenito opened 5 years ago
Some conversation paraphrased from IRC:
@steb: I think the higher priority work will be the reverse link graph. We're going to run into problems with provider strategies unless we have some way to traverse back up a dag and find providers for parents. @eingenito: I wonder if the reverse link graph needs to be persisted or if it can be transient and owned by a bitswap session? Note: I think that there are other features enabled by a persistent database of per-node metadata and this might be a good place to start.
@eingenito: I think Kuba's algorithm was maximally efficient but required analysis of an entire dag to select optimal nodes for providing? Is that right? @Kubuxu: no, it was nowhere near maximum efficiency, but it required minimal information about the DAG. It just walked a DAG and made decisions depending on the current node (and a single number about the past state of the tree as we walked it).
@eingenito - Also, given how expensive providing is always likely to be, I wonder if it isn't more useful to assign each node a priority with respect to providing and then just provide in priority order, knowing that many nodes will simply never get provided.
Edit (Kubuxu): s/@kuba/Kubuxu/ (someone else has the handle kuba on GH).
The core idea of probabilistic providing is based on the fact that when someone has a given node, there is a good chance they also have its children. When we are missing a node, we can walk up the tree of nodes and try to resolve providers for its ancestors as we backtrack.
The essential part is having backlink information for the current node. This index can be implemented in a number of ways, e.g. persisted as per-node metadata or kept transient and owned by a bitswap session (as discussed above).
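As a rough illustration of the walk-up idea, here is a minimal sketch of an in-memory backlink index and the backtracking provider search. The names (`BacklinkIndex`, `FindProviders`) and the use of plain strings instead of `cid.Cid` are assumptions for the sake of a self-contained example, not go-ipfs APIs; `lookup` stands in for a DHT provider query.

```go
package main

import "fmt"

// BacklinkIndex is a hypothetical in-memory map from a node's CID to the
// CIDs of its parents. A real implementation could persist this in the
// datastore or scope it to a bitswap session.
type BacklinkIndex struct {
	parents map[string][]string
}

func NewBacklinkIndex() *BacklinkIndex {
	return &BacklinkIndex{parents: make(map[string][]string)}
}

// AddLink records that parent links to child, so we can later walk "up" the DAG.
func (b *BacklinkIndex) AddLink(parent, child string) {
	b.parents[child] = append(b.parents[child], parent)
}

// FindProviders walks up the DAG from a node whose providers we could not
// find, asking lookup (a stand-in for a DHT query) about each ancestor as we
// backtrack. It returns the providers of the nearest provided ancestor, on
// the assumption that a peer providing a parent likely has its children too.
func (b *BacklinkIndex) FindProviders(node string, lookup func(string) []string) []string {
	seen := map[string]bool{}
	frontier := []string{node}
	for len(frontier) > 0 {
		var next []string
		for _, n := range frontier {
			if seen[n] {
				continue
			}
			seen[n] = true
			if provs := lookup(n); len(provs) > 0 {
				return provs
			}
			next = append(next, b.parents[n]...)
		}
		frontier = next
	}
	return nil
}

func main() {
	idx := NewBacklinkIndex()
	// A tiny DAG: root -> dir -> leaf. Only root was announced to the DHT.
	idx.AddLink("root", "dir")
	idx.AddLink("dir", "leaf")
	lookup := func(cid string) []string {
		if cid == "root" {
			return []string{"peerA"}
		}
		return nil
	}
	fmt.Println(idx.FindProviders("leaf", lookup)) // [peerA]
}
```

Asking for `leaf` finds no direct providers, so the search backtracks through `dir` to `root` and returns `root`'s providers.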
The probabilistic providing strategy has to optimize which nodes to provide depending on a number of factors:
The reason I propose a probabilistic instead of a deterministic providing strategy is that in the former C is much smaller than B; this isn't true in the latter (while maintaining the same A).
I've also discovered a way to describe basic probabilistic providing strategies as Markov chains, and I used it in my simulations (since Markov chains reduce to matrix multiplications). I will explain it if there is interest.
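The exact chain Kubuxu used is not spelled out in this thread, but a toy version shows why the framing is convenient: if a walk over the DAG is modeled with a couple of states (say, "last node was provided" vs. "last node was skipped") and a transition matrix encoding the strategy, the state distribution after n steps is just the initial distribution multiplied by the matrix n times. The states and probabilities below are assumptions purely for illustration.

```go
package main

import "fmt"

// step advances a distribution over two states by one transition of the
// Markov chain: out = dist * P.
func step(dist [2]float64, P [2][2]float64) [2]float64 {
	var out [2]float64
	for j := 0; j < 2; j++ {
		for i := 0; i < 2; i++ {
			out[j] += dist[i] * P[i][j]
		}
	}
	return out
}

func main() {
	// Assumed toy strategy: after providing a node, provide the next one
	// with probability 0.3; after skipping, provide with probability 0.6.
	// State 0 = "provided", state 1 = "skipped".
	P := [2][2]float64{{0.3, 0.7}, {0.6, 0.4}}
	dist := [2]float64{1, 0} // start right after providing the root
	for n := 0; n < 50; n++ {
		dist = step(dist, P)
	}
	// dist[0] approximates the long-run fraction of nodes provided,
	// which for this chain converges to 0.6/(0.6+0.7) = 6/13.
	fmt.Printf("%.4f\n", dist[0]) // 0.4615
}
```

Simulating a strategy thus costs a handful of small matrix multiplications rather than a full DAG walk per experiment.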
@Kubuxu thanks for the synopsis, it's very helpful. I have an observation, already mentioned above, but I want to restate it. Providing is so expensive, and the load on IPFS nodes so variable that I think it might always be important to model providing (and reproviding) as a best effort service. A probabilistic scheme for choosing good blocks to provide might be best used as a source of weights or priorities for all blocks so the re/provider system can appropriately order actual writing to the DHT. That way we can optimize providing under changing conditions and backfill lower priority provider records when conditions allow. Does that make sense?
It makes a lot of sense. Another thing: we still have a lot of headroom to improve our DHT.
Problem
Providing is an expensive operation and even with significant DHT speedups is likely to remain so. The growth of the IPFS network has made this problem more acute. It would be beneficial if IPFS instances could choose to provide only a subset of all nodes without dramatically impacting the availability and retrieval speed of the data.
Notes
The Go-IPFS team began discussion of this problem in our October 2018 London meetup.