whyrusleeping opened this issue 8 years ago
I'm writing these thoughts from a fairly naive point of view; I don't have a deep understanding of the libp2p protocol, and you have obviously thought about this a lot more than I have.
One problem I see with the current libp2p/ipfs stack is that peer connections are long-lived: peers stay connected for a long time, I assume to facilitate some sort of bitswap session, but the protocol also expects every peer to have direct access to a peer that has the content. This makes it impossible to limit the number of connections, because then other peers can't connect.
If you're going to have long-lived connections, then peers need to relay blocks between each other to cater for peers that are unreachable by others.
My preference would be for short-lived connections (less than a second or two); that way a single node could service hundreds of thousands of requests and always be accessible. But obviously bitswap accounting gets trickier.
In your notes you also mention the DHT struggling to keep up with block provider records. Would it not be possible to use summarised provider records? I.e. instead of sending 1000 records for a 500MB file, you just send one provider record indicating you have the whole file. Then, during a session, if the provider disappears or lied about having it, they get flagged as such and we move on to the next provider.
If you have notes that already cover these questions, I'm happy to be pointed in that direction :)
If you're going to have long-lived connections, then peers need to relay blocks between each other to cater for peers that are unreachable by others.
Yeah, we're working on getting relays implemented and set up, but it's tricky. You'll essentially be donating your bandwidth to other peers, which could get weird.
Would it not be possible to use summarised provider records?
The problem here is that you still want to have random access to the sub-blocks. This is the hard part. If I want hash QmCCC, which is a sub-block of QmAAA, and I don't know that QmCCC is a sub-block of QmAAA, I won't be able to find it because I don't know to ask for QmAAA. One thing that would help here is a way to do backlinks, but that gets really expensive and complex as well (and ends up not really solving the problem). What would be interesting would be an efficient way to do backlinks, or some other way to do random access into subgraphs without explicitly announcing everything.
An idea would be to add an indirection level and split the merkle-dag into limited trees that could be considered as a single unit. A record in the DHT would contain the list of all the peers interested in any node from that tree. This record would be the logical root of the tree. When a peer wants to access a block knowing it is below that record, it would fetch the record and use a separate, smaller DHT that is only shared between the nodes interested in that limited merkle-tree.
Advantages: The main DHT is smaller and consumes less bandwidth because individual blocks are not referenced in it. When connecting to a sub-DHT, your peer is close to other peers interested in the same data you are probably interested in. This leverages the principle of data locality.
Disadvantages: Not all blocks are accessible directly, you must first know the record identifier.
Variant, to still reference all the individual blocks in the main DHT: instead of associating the block id with the list of peers that can provide it (dynamic information that is subject to frequent changes, and hence more bandwidth), the main DHT only makes an association from the block id to the record id tracking the tree the block is included in. This information is much more static, or barely changes at all, and will require less bandwidth to keep up to date.
Disadvantages: An extra level of indirection.
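To make the indirection concrete, here is a minimal sketch of the two-level lookup, with plain in-memory maps standing in for the main DHT and the per-tree sub-DHTs (all names here are hypothetical, not an actual libp2p API):

```go
package main

import "fmt"

// Main DHT: maps a block ID to the record ID of the tree that contains it.
// This mapping is nearly static, so it rarely needs to be republished.
var mainDHT = map[string]string{}

// Per-tree sub-DHTs: record ID -> (block ID -> peers providing that block).
// Only peers interested in a given tree participate in its sub-DHT.
var subDHTs = map[string]map[string][]string{}

// Provide announces that peer holds block, which belongs to the tree
// identified by recordID.
func Provide(recordID, block, peer string) {
	mainDHT[block] = recordID
	tree, ok := subDHTs[recordID]
	if !ok {
		tree = map[string][]string{}
		subDHTs[recordID] = tree
	}
	tree[block] = append(tree[block], peer)
}

// FindProviders resolves a block in two steps: the main DHT gives the tree
// record, and the tree's sub-DHT gives the providers.
func FindProviders(block string) []string {
	recordID, ok := mainDHT[block]
	if !ok {
		return nil
	}
	return subDHTs[recordID][block]
}

func main() {
	Provide("QmTreeRoot", "QmCCC", "peer1")
	fmt.Println(FindProviders("QmCCC")) // [peer1]
}
```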
@mildred Thanks for the input!
I do really like the idea of essentially 'sharding' the dht. One hard part here is deciding how these record groups are formed. Ideally this would be a deterministic process so that it would not require any consensus or messaging to other peers for discovery.
Your second proposal is essentially something we've discussed in the past and called "backlinks"; that discussion was about using the DHT to store which hashes are children of which other hashes. Our conclusion there was essentially that the scaling tends to be either exactly the same, or worse in terms of messaging complexity, due to the cost of having to announce blocks into each of the 'groups' they might be in.
I'll have to put more thought into DHT subgroup formation. Perhaps there's something cool here involving probabilistic sets... (maybe treat the whole network as a bloom filter somehow?)
Here's a stray thought: instead of sharding based on parent / child relationships between nodes, why not shard based on the hash itself? Would it be feasible to subdivide the DHT networks by a hash prefix? Say, in an extremely simplified example, an IPFS node elects to route for hashes beginning with 'Q' and only announces itself as a provider to other nodes that are interested in that prefix? I feel like basing it on the hash of the data itself makes it deterministic, and would naturally result in a fairly balanced distribution of resources, but I dunno.
When I proposed this idea to a friend, he suggested that this approach may weaken the network overall, and make it susceptible to Sybil attacks, but I feel like that's true of any sharding scheme. There will need to be a balance between redundancy and performance.
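To make the prefix idea concrete, here is a tiny stdlib-only sketch that derives a deterministic shard index from the leading bytes of the content hash itself (rather than from its base58 text, whose 'Qm' prefix is constant); the shard count here is an arbitrary assumption:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// shardOf maps a content hash to one of numShards shards using its leading
// bytes. Every node computes the same answer with no coordination.
func shardOf(contentHash []byte, numShards uint32) uint32 {
	prefix := binary.BigEndian.Uint32(contentHash[:4])
	return prefix % numShards
}

func main() {
	h := sha256.Sum256([]byte("some block"))
	fmt.Printf("block lands in shard %d of 16\n", shardOf(h[:], 16))
}
```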
Hrm... I don't know if that would save much bandwidth. A large put of many different hashes (from adding a large file) would end up having to interact with pretty much every shard.
Hi. I’m new here and might misunderstand something. Would this kind of summarized provider record work?
The idea is, if a node has a whole sub-tree, it makes sense to only publish a provider record for that sub-tree to DHT, without publishing individual nodes in it.
Let’s say we have a file represented by the DAG
            A
        /   |   \
    B       C     D
   /|\     /|\    |
  E F G   H I J   K
A publisher of the file only publishes A (the list of links to B,C,D) and the fact that it has the whole tree.
A node that wants to download the file finds the record of A and nodes that have the whole file. It can start downloading directly from them, but to avoid bottlenecks when a file has just been published, it traverses the tree and prioritizes downloads from nodes that have the smallest individual parts. Once it has E, it publishes a provider record for E. Same goes for F. Once it also has G, it publishes a provider record saying it has the whole sub-tree rooted at B, and never again publishes records for E, F, or G. And so on.
Thus, nodes that have the whole file send out one provider record (or two). A node that starts up having only partially downloaded a file typically publishes C*log(filesize) records.
At the same time, a node can start downloading a file only after one DHT request. The speed might be slow because of the bottleneck mentioned above, but it could do a simultaneous download of the same blob from nodes discovered through the tree traversal. (download blob E from node1 from left to right, blob E from node2 from right to left, blob E from node3 starting in the middle and going in both directions and so on).
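A rough sketch of the "announce only the highest complete sub-tree" rule described above, assuming an in-memory link map in place of real IPLD traversal (all names are hypothetical):

```go
package main

import "fmt"

// links maps a node to its children; have records which blocks are held locally.
var links = map[string][]string{
	"A": {"B", "C", "D"},
	"B": {"E", "F", "G"},
	"C": {"H", "I", "J"},
	"D": {"K"},
}
var have = map[string]bool{"E": true, "F": true, "G": true, "B": true, "H": true}

// complete reports whether node and all of its descendants are held locally.
func complete(node string) bool {
	if !have[node] {
		return false
	}
	for _, child := range links[node] {
		if !complete(child) {
			return false
		}
	}
	return true
}

// toAnnounce returns the minimal set of roots of complete sub-trees:
// if a node's sub-tree is complete, announce that node alone; otherwise
// announce the node itself (if held) and recurse into its children.
func toAnnounce(node string) []string {
	if complete(node) {
		return []string{node}
	}
	var out []string
	if have[node] {
		out = append(out, node) // held, but sub-tree still incomplete
	}
	for _, child := range links[node] {
		out = append(out, toAnnounce(child)...)
	}
	return out
}

func main() {
	fmt.Println(toAnnounce("A")) // [B H]
}
```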
As for the disadvantages, it is not possible to access an individual block without knowing its ancestor. So, an address to E would probably look like “hashof(A)/hashof(B)/hashof(E)”, or even “hashof(A)/hashof(B)/hashof(E);hashof(X)/hashof(Y)/hashof(E)” if E is part of two files. If someone decides that E is an important piece on its own though, he can pin E separately and that would maintain a provider record for E in the DHT.
Another drawback is that if an address for an individual object is given in the form of “hashof(A)/hashof(B)/hashof(E)”, a node might need to send one DHT request per path component, which is C*log(filesize) requests. It is still a good compromise.
Thank you for reading.
@aszinovyev Summarized provider records aren't a bad idea, and it might be something we move forward with, but the drawback is that you lose out on random access. If most nodes only say they provide A, but some node is trying to find B directly and doesn't know that B is a child of A, it will be very difficult for them to figure out whom to ask for the data.
I think that this compromise is probably okay on a per-file basis, where you can announce the roots of each file and say you have the entire thing, since requesting sub-pieces of files without the context of the root may be a very small use case (though I really don't know how true that is).
Maybe a stupid question, but did you have a look at the work done on this topic around Gnutella, for example http://rfc-gnutella.sourceforge.net/src/qrp.html ? I know DHTs were developed to avoid the problems of Gnutella routing, but maybe it is worth looking back. Delegation, sharding, and routing search data on behalf of others seem to be the only ways to solve this problem, in my opinion.
@ingokeck thanks for the link! I'll read into it and report back.
@ingokeck Hrm... most of that is pretty specific to Gnutella, but one potentially relevant bit is having nodes keep track of the routing tables of other nodes. The ipfs DHT could potentially keep track of the nodes that are in the routing tables of the nodes in its own routing table (information exchanged lazily). This could at least help jumpstart queries by a few hops.
I fear that if you do not shard at the block-hash level, you will create a lot of additional copies of blocks in the network that may be identical but sit below different sub-trees. Or am I getting this wrong?
How much thought has been put into using bloom filters and other probabilistic datastructures?
Bloom filters are used within leveldb to great effect. Perhaps the same principles can be applied to provide random access to who has which blocks across the DHT? I haven't done the math yet, but a few things about BFs that are a nice fit: some formulations support the ability to subtract one BF from another, so you could announce a BF that includes items you have and those that your peers within one hop can provide while still keeping it up to date.
The nice thing about this is you can tune how much background activity there is vs actual request traffic for blocks by varying the size of the BF and its corresponding false positive rate. Since most files will be read infrequently, I imagine that you can achieve significant overall savings.
If this is interesting to the IPFS team, I can follow up via some higher bandwidth medium to discuss more details.
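For what it's worth, here is a minimal counting-Bloom-filter sketch of the kind of announceable structure described above, using counters per cell so entries can be removed (i.e. the "subtract" operation); the sizes and hash choices are arbitrary assumptions:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountingBloom is a counting Bloom filter: each cell is a counter, so
// entries can be removed (or one filter's contents subtracted from another).
type CountingBloom struct {
	cells []uint8
	k     uint32 // number of hash functions
}

func NewCountingBloom(size int, k uint32) *CountingBloom {
	return &CountingBloom{cells: make([]uint8, size), k: k}
}

// indexes derives k cell indexes from the key via double hashing.
func (b *CountingBloom) indexes(key string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	out := make([]uint32, b.k)
	for i := uint32(0); i < b.k; i++ {
		out[i] = (h1 + i*h2) % uint32(len(b.cells))
	}
	return out
}

func (b *CountingBloom) Add(key string) {
	for _, i := range b.indexes(key) {
		b.cells[i]++
	}
}

func (b *CountingBloom) Remove(key string) {
	for _, i := range b.indexes(key) {
		if b.cells[i] > 0 {
			b.cells[i]--
		}
	}
}

// Has reports whether key is probably present: false positives are possible,
// false negatives are not (absent counter overflow).
func (b *CountingBloom) Has(key string) bool {
	for _, i := range b.indexes(key) {
		if b.cells[i] == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewCountingBloom(1<<16, 4)
	f.Add("QmSomeBlock")
	fmt.Println(f.Has("QmSomeBlock"), f.Has("QmOtherBlock")) // true false (probably)
}
```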
Let me know if there is a good set of papers/resources I should read to get up to speed with how content routing works explicitly today (I haven't recently kept up with IPFS development). I want to get my understanding in check around these issues.
The main problem is that there is a main DHT where content is stored, and it fails to scale when nodes publish many records R (say exponentially many) with respect to the number of nodes in the DHT. This results in these nodes needing to publish R provider records to the network. These nodes publish these provider records (once?) to a DHT which indicates that they have the content. Scaling this way isn't working because nodes have varying degrees of connectivity to the network, and communicating with the main DHT is taxing on the average node (which would mean delegating to stronger nodes is an improvement).
The end goal we are interested in would allow the following:
These nodes publish these provider records (once?) to a DHT which indicates that they have the content.
Once per $TIME_PERIOD, which is currently 24 hours. If published records lasted indefinitely, then stale content would quickly fill up the DHT.
Content is not stored in the ipfs DHT, only 'pointers' to content (a provider record).
Your goals sound about right. Primarily, the focus is maintaining random access and censorship resistance while the number of records published increases and the outgoing bandwidth is (hopefully) reduced, though keeping bandwidth the same while increasing the number of records would also be good.
Here's a random thought that may or may not make sense :P. It seems what has been published above covers a decent number of scaling solutions; I'm curious what we can achieve by combining them.
Is there any way to dynamically create shards without resorting to a software update, i.e. have some consensus layer between "supernodes" as to the breakdown of shards?
Consider the scenario where we have N supernodes and K shards, where K is sublinear in N.
Now as part of sharding the network, supernodes need to agree on the fact that there is a need to add a shard (or begin sharding in the first place if we haven't yet sharded). I'm assuming we can dynamically shard portions of the network reliably. Some of the metrics we may have interest in as supernodes:
Now let's assume that our goal is to ensure uniformity of these metrics. Furthermore, we want to ensure that these metrics are bounded by some threshold T = (T_{throughput}, T_{response_time}, T_{shard_size}, ...).
With these updatable parameters T, we run a multi-armed bandit simulation amongst supernodes. Our goal is to increase throughput while at least decreasing or holding fixed the bandwidth expenditure of the network. This is a measurable quantity AFAIK and so we can use it as our payoff for the multi-armed bandit simulation. Since we want to reduce this quantity we maximize the negative log of the average bandwidth expense: see here for a description of MABs. This, over time, will learn the optimal threshold for reducing bandwidth expense overall within this framework.
Imagine we want to pull the slot machine that reduces bandwidth expense the most. Essentially we run simulations until we find the best value T (which includes the number of shards we should have to balance all tradeoffs among other things).
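A toy epsilon-greedy bandit along those lines, where each arm is a candidate threshold vector T and the payoff is the negative log of the measured average bandwidth expense; measureAvgBandwidth and all parameters here are made up for illustration:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Threshold is one candidate parameter vector T (one "arm" of the bandit).
type Threshold struct{ Throughput, ResponseTime, ShardSize float64 }

// measureAvgBandwidth stands in for observing the network's average
// bandwidth expense while running under threshold t (entirely made up).
func measureAvgBandwidth(t Threshold) float64 {
	// Pretend larger shards cost less announce bandwidth, plus noise.
	return 1000.0/t.ShardSize + rand.Float64()*10
}

func main() {
	arms := []Threshold{
		{Throughput: 100, ResponseTime: 0.5, ShardSize: 8},
		{Throughput: 100, ResponseTime: 0.5, ShardSize: 32},
		{Throughput: 100, ResponseTime: 0.5, ShardSize: 128},
	}
	counts := make([]float64, len(arms))
	avgReward := make([]float64, len(arms))
	const epsilon = 0.1

	for round := 0; round < 10000; round++ {
		// Explore with probability epsilon, otherwise exploit the best arm so far.
		arm := rand.Intn(len(arms))
		if rand.Float64() > epsilon {
			for i := range arms {
				if avgReward[i] > avgReward[arm] {
					arm = i
				}
			}
		}
		// Payoff: maximize -log(average bandwidth expense).
		reward := -math.Log(measureAvgBandwidth(arms[arm]))
		counts[arm]++
		avgReward[arm] += (reward - avgReward[arm]) / counts[arm]
	}
	fmt.Println("pulls per arm:", counts)
}
```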
The process works amongst supernodes who are central to messages passed within and between shards. I make the assumption that we want to form groups/shards to more uniformly distribute data and queries amongst the network, but this may not be desirable.
I think a good way to summarize my thoughts above is to parameterize the routing network somehow and use a consensus protocol between important nodes (or any nodes that want to participate) to decide when parameters should be updated. Not sure if this is well-defined.
Very interesting idea, I'll have to do some research into multi-armed bandit simulations.
My main question is how to achieve consensus here. Either you run a fully Byzantine fault tolerant algorithm (which has its own drawbacks), or you limit access via some centralized party (which may work for some groups, but not for the main open network).
Perhaps we can do sharding without consensus somehow? Similar to how Coral's clustering works?
Yea, I can see consensus being a burden, more so with the introduction of outright malicious nodes, or simply due to added complexity. I recently stumbled upon this paper, which seems to take similar ideas by looking at things from an optimization perspective but doesn't include sharding (just a different DHT... I haven't read it in its entirety).
I'll look more into coral too.
Coral seems quite interesting in relation to this. Perhaps consensus can also be loosely achieved without any explicit mechanism, simply because the incentives align and supernodes are rational. Can we design a mechanism so that it's a dominant strategy to reorganize?
Consider the introduction of a cryptocurrency incentive for storage. Additionally, let each supernode have a reputation amongst other supernodes so that we can penalize ones that behave selfishly (more on this later). Consider the case where the network's shards are currently unbalanced with respect to the metrics we are interested in: throughput, size, bandwidth expense, etc. This essentially means that some supernodes are expending more bandwidth than others and are unable to satisfy as many queries per second for regular nodes. If we have plentiful replication, supernodes experiencing these delays are incentivized to vote for, or attempt, a reorganization of the shards to balance out these metrics, i.e. they re-cluster themselves as in Coral (congregate so that 90% of peers have sufficiently low latency). I'm assuming that faster processing of queries carries a greater reward in the underlying filecoin incentive.
In the event that the selfishly benefiting node fails to organize with the rest of the network, the other supernodes can label it unreliable. With enough replication, you would want to simulate a pseudo-punishment for this supernode by routing queries over longer routes that avoid it, to disincentivize this behavior. @jbenet does this sound in sync with how rewards in filecoin could be distributed? I recall you can earn rewards by routing. With this punishment, we effectively redirect rewards when we cannot redistribute them uniformly by sheer service capacity (storage, bandwidth, and routing). While this punishes everyone, it should correct bad behavior.
To this end a reputation is not really necessary, since it's an infinitely repeated game with a sufficiently long but finite punishment phase, but reputation could also be investigated if this makes sense.
This may still need a Byzantine algorithm. I would need to rethink things through but wanted to share my thoughts before they jump out of my head.
In addition, it seems Coral congregates based on sheer latency. Consider optimizing any vector of parameters: latency, throughput, etc.
Regarding summarized provider records, is there a mechanism similar to hierarchical deterministic wallets where we can summarize a tree of block hashes to a single root hash and salt? Given the root and salt, we can deterministically derive hashes which are designed to be id/record hashes of child blocks.
A user would have to brute force through something like the number of nodes in the tree but could do so deterministically, allowing them to match up whether the block they're looking for is a child.
EDIT: I don't think this would work in the case of replication, since parent hashes won't match up in every case.
@whyrusleeping Why not have content be hierarchically addressed? I know IPFS uses a flat namespace, but I'm having trouble seeing why it needs to be completely flat in order to keep using a cryptographic hash. Is it because it might be hard to build a hierarchy over arbitrary content?
Perhaps use a smaller hash function (32-bit) to bucket content and then use the normal cryptographic hash underneath.
Consider the introduction of a cryptocurrency incentive for storage.
I like where this is going. The idea of an incentivized DHT has been floating around for a while now, but we have yet to take a deeper look.
does this sound in sync with how rewards in filecoin could be distributed?
Yeah, that does sound pretty similar to the filecoin retrieval market. It would be worth looking into how those algorithms overlap more concretely. We have a few workable plans for making the filecoin retrieval market fast, but they still need to be formally written down. They generally tend toward federated clusters of nodes (grouped by latency region) using stake as an 'entry fee' to avoid spam.
Is it because it might be hard to build a hierarchy over arbitrary content?
Yeap. This throws wrenches.
Perhaps use a smaller hash function (32-bit) to bucket content and then use the normal cryptographic hash underneath.
I like this direction. My main issue is that we will end up with pieces of the same graph in vastly different segments of the network (which is good and bad). The hard part is that you ideally want the hierarchy to match more closely the graphs that are composed of these objects. One idea I had was to use compact proofs to prove that hash X is a child of hash Y, but that gets... expensive.
There's a type of hash function that Facebook created to solve problems like this; they call it social hash: https://research.fb.com/introducing-social-hash-partitioner-a-scalable-distributed-hypergraph-partitioner/
Their basic goal is to minimize fan out for common queries, but still adequately load balance across available resources. Sounds like a similar problem domain, no? :)
Maybe another piece of the puzzle could be to use IPLD selectors to query content in bulk instead of querying each block individually.
I'm thinking more and more that the 'announce an IPLD selector' approach will be less harmful to random access than I'm assuming. As someone else mentioned, if someone thinks that some subtree is important enough on its own, they can announce it. One piece of that puzzle which I don't know how to do without zero-knowledge proofs (or just plain old blind trust) is announcing a subtree, but also linking to a parent of that subtree. For example, in the tree @aszinovyev posted:
            A
        /   |   \
    B       C     D
   /|\     /|\    |
  E F G   H I J   K
If everyone is announcing records for A/*, and I think that E is important on its own, I could announce "E, descendant of A/*", which would allow other peers to find relevant providers. The problem is that there is no guarantee that E is actually a descendant of A/* without including a merkle proof (or some other proof), and including proofs would be expensive. It might be fine to just trust and verify... but I'm not sure.
Under what circumstances do you believe that it's important to announce the relationship between A and E?
@ajbouh If someone is looking up providers for E, and maybe only gets one record with an unavailable peer, they wouldn't know that if they searched for people who had A, they could find other peers who also have E.
But if the provider of E is not available, it sounds like there's an assumption that the now-absent provider's announcement(s) is/are available via some more durable medium?
From the machine learning dataset perspective, it's common to want to look at a small slice of a massive dataset, but almost always with the knowledge of where that slice came from. This implies if I ever talk about E, it is always in the context of knowing that E is really A->B->E.
Are there many examples of E being coincidentally a part of A?
So this seems so simple that there is probably a good reason not to do it this way, but here's an idea nonetheless:
Instead of having specific delegates or trackers, just say that nodes that provide a given CID SHOULD also provide a list of other nodes that provide that same CID. In other words, nodes that are willing to provide a block will also provide routing for that block. Hopefully this isn't too much of an additional burden.
Now, once we use the DHT to find the first block we care about in a file, we can ask whichever node provides that block if it knows where to find the next block. If that block only exists in one single file, then by locality of reference that node probably has neighboring blocks for the same file as well, so it can route us to providers of those blocks too. If the first block is found in many different files, we probably won't be so lucky and will fall back to the DHT.
In other words, we just ask whoever has a given block about other blocks in the same file, and much of the time they will know.
What's wrong with this? Is the assumption that routing is cheap relative to providing a block wrong?
When a node tells a DHT server that it "provides" some content, that server could respond with all the other providers (and evict one of them to keep the list a constant size). The tricky part here is fitting this into the abstraction.
However, I'm not sure if that's really the bottleneck at this point. Right now, the primary issues are (a) sending too many provider records (bandwidth of the party providing the content) and (b) finding content (using the DHT is slow).
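Still, a small sketch of that constant-size provider list with eviction, ignoring expiry, signatures, and duplicate providers (all names are hypothetical):

```go
package main

import "fmt"

const maxProviders = 20

// providers keeps, per CID, a bounded list of recently seen providers.
var providers = map[string][]string{}

// AddProvider records peer as a provider of cid and returns the other
// providers already known, evicting the oldest entry once the list is full.
func AddProvider(cid, peer string) []string {
	others := append([]string(nil), providers[cid]...)
	list := append(providers[cid], peer)
	if len(list) > maxProviders {
		list = list[1:] // evict the oldest provider
	}
	providers[cid] = list
	return others
}

func main() {
	AddProvider("QmSomeBlock", "peerA")
	fmt.Println(AddProvider("QmSomeBlock", "peerB")) // [peerA]
}
```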
I'm coming to this topic late, but @whyrusleeping and I were talking about it at the Aaron Swartz hackathon.
The concerns I have split between two pain points. a) For censorship circumvention, any solution should avoid single, or even multiple, points of failure - for example, the problem with trackers or their equivalent is that they can be easily blocked. b) For allowing people to access the Internet Archive's content via IPFS: we turned off using the DHT because we had 200 threads running simultaneously doing announces, and that was with just 100k of our roughly 200M files in the DHT.
In our current situation, where all blocks are known about only by our super-peer, there is really no advantage (currently) for either censorship resistance or load-shedding in using IPFS over raw HTTP.
One possible solution I discussed with @whyrusleeping was to use the application as the shard. If every client using our content asks a main DHT for "InternetArchive" (obviously by some hash), then that DHT can keep track of a relatively small number of shards, each with a potentially large number of users. Those clients then form a secondary DHT in which the blocks are asked for.
Nice characteristics are that:
One other potential extension for this to help the hybrid solution as apps move from centralized to decentralized would be to allow super-peers on the DHT, who announce a willingness to serve any block in the app without announcing each of them. This would allow apps with fewer users to ensure accessibility by just ensuring at least one super-peer was always online, and allow apps with a long-tail (such as the Archive) to only announce an order-of-magnitude fewer items while allowing all of the items to be accessible from IPFS.
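A minimal sketch of the two-step lookup this implies, where the application name is hashed to a single record in the main DHT and that record lists the members (including super-peers) of the app's secondary swarm; everything here is hypothetical:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Member of an application shard; super-peers claim to serve any block
// belonging to the application without announcing each block individually.
type Member struct {
	PeerID    string
	SuperPeer bool
}

// mainDHT maps hash(appName) -> members of the app's secondary swarm.
var mainDHT = map[string][]Member{}

func appKey(appName string) string {
	sum := sha256.Sum256([]byte(appName))
	return hex.EncodeToString(sum[:])
}

// JoinApp registers a peer under the application's shard record.
func JoinApp(appName, peerID string, super bool) {
	k := appKey(appName)
	mainDHT[k] = append(mainDHT[k], Member{PeerID: peerID, SuperPeer: super})
}

// FindAppPeers is the single main-DHT query a client needs before it can
// ask the app's own swarm (or its super-peers) for individual blocks.
func FindAppPeers(appName string) []Member {
	return mainDHT[appKey(appName)]
}

func main() {
	JoinApp("InternetArchive", "archive-super-peer", true)
	JoinApp("InternetArchive", "some-client", false)
	fmt.Println(FindAppPeers("InternetArchive"))
}
```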
I was taking a look at how content routing is currently used in ipfs, and from what I can tell, a request to get a bunch of blocks only looks up the first CID: https://github.com/ipfs/go-bitswap/blob/1e9b2c41c5ee322848c70be85464deb72b24226e/bitswap.go#L279
Is there something I'm missing? That sounds like the optimization I was describing a few comments ago, but without going back to the DHT if some later block isn't available.
That sounds like the optimization I was describing a few comments ago, but without going back to the DHT if some later block isn't available.
We do actually go back to the DHT later (after a delay) if we still can't find any providers for a block. This is just an initial search (assuming that if you ask for a bunch of blocks in parallel, they're probably related).
@Stebalien Can you point me to the later call to the DHT?
I think there are a few different things discussed together in this thread - in very interesting ways!
1) The first (not explicitly mentioned above) is how you find content as close as possible to you. For this you need up-to-date content-provider records.
The delegated routing approach is similar to what has been referenced elsewhere as name-based routing. Name-based routing assumes an “on-path” node which is aware of content stored/cached locally and can be very fast in terms of content resolution - 1 (short) RTT. Scaling this up to the whole network is an issue in terms of routing table size, but also a challenge in terms of deployment (i.e., needs collaboration from the infrastructure provider/ISP).
A DHT, on the other hand, is a name-resolution-based routing system, which can however co-exist with delegated routing. A hybrid of name-based routing with a multi-level DHT structure seems like the best approach to scale IPFS and libp2p.
In case of name-based routing there is the issue of trust (as mentioned above), but adding “routing incentives” in addition to storage incentives is something worth investigating.
2) The second is that keeping content provider records up to date is very expensive.
Generally speaking, caching is different from storage. Having mechanisms to know what is cached where is not viable. IMHO, there is no need to have content provider records for cached content at all. For original providers (and therefore more stable and permanent storage), it is in their best interest to keep their content alive; hence, they only need to publish provider records once. Proactively replicating content to other nodes in order to increase the chances of finding closer copies can be done, and published only once, too.
For totally ephemeral caching, i.e., content sitting in the storage of someone who consumed it and is re-seeding, we can use breadcrumb techniques. Assuming that content is delivered successfully (which we can assume if the rest of the system is working fine), the request can leave a “breadcrumb” in some index in the closest DHT or delegate node, saying effectively “this content has recently been sent this way”. Depending on the nature of the end node, e.g., amount of storage committed, uptime vs. downtime, etc. (which can be measured regularly?) and how recently the breadcrumb was left, the delegate node can decide whether it is worth sending the request towards the caching node or not (because the content will likely have expired).
In this case, DHT or delegate nodes have to keep a table/index of “recently satisfied requests” which should probabilistically reflect what is in the cache of downstream nodes. We had a brief discussion on this with @raulk and it looks like it’s worth investigating.
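A minimal sketch of such a "recently satisfied requests" index on a delegate or DHT node; the freshness-only heuristic and all names here are assumptions:

```go
package main

import (
	"fmt"
	"time"
)

// breadcrumb records that some content was recently sent toward a peer.
type breadcrumb struct {
	peer   string
	sentAt time.Time
}

// recentlySatisfied is the per-node index of breadcrumbs, keyed by CID.
var recentlySatisfied = map[string][]breadcrumb{}

// LeaveBreadcrumb is called after a successful delivery toward peer.
func LeaveBreadcrumb(cid, peer string) {
	recentlySatisfied[cid] = append(recentlySatisfied[cid], breadcrumb{peer, time.Now()})
}

// LikelyCaches returns peers that probably still hold cid, judged only by
// how recently the breadcrumb was left (a real node would also weigh the
// peer's committed storage, uptime, and so on).
func LikelyCaches(cid string, maxAge time.Duration) []string {
	var out []string
	for _, b := range recentlySatisfied[cid] {
		if time.Since(b.sentAt) < maxAge {
			out = append(out, b.peer)
		}
	}
	return out
}

func main() {
	LeaveBreadcrumb("QmSomeBlock", "peerA")
	fmt.Println(LikelyCaches("QmSomeBlock", 30*time.Minute)) // [peerA]
}
```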
3) The third is whether it is a good idea to inject “random access” features into the content naming scheme itself (through sharding or similar) in order to make content provider record updating less expensive.
The approaches discussed above are very interesting, but I’m not sure it is worth structuring the whole naming scheme to accommodate caching in (mostly) unreliable nodes. Name aggregation can be a very powerful feature for many reasons, and the most straightforward way of doing it is through hierarchical naming IMO, as also mentioned above. That is, if QmCCC is a subpart of QmAAA, then the node that caches/provides QmCCC needs to send out Qm/AAA/CCC; then, doing longest-prefix matching, you can have random access with reasonable granularity. Among other things, this can accommodate (time-shifted) multicast, if desired.
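A small sketch of the longest-prefix-matching step over such hierarchical names, treating them as plain '/'-separated paths (the names and table here are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// table maps announced hierarchical names to their providers.
var table = map[string][]string{
	"QmAAA":             {"origin"},
	"QmAAA/QmBBB/QmCCC": {"cache-1"},
}

// lookup walks from the full name toward the root and returns the providers
// registered under the longest announced prefix.
func lookup(name string) []string {
	parts := strings.Split(name, "/")
	for i := len(parts); i > 0; i-- {
		prefix := strings.Join(parts[:i], "/")
		if peers, ok := table[prefix]; ok {
			return peers
		}
	}
	return nil
}

func main() {
	fmt.Println(lookup("QmAAA/QmBBB/QmCCC")) // [cache-1]
	fmt.Println(lookup("QmAAA/QmBBB/QmDDD")) // falls back to [origin]
}
```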
Beyond caching, one issue with the DHT is that DHT nodes are unreliable and the DHT forgets records. That's really why sending out provider records is so expensive: we have to keep sending out provider records for all the content we're storing.
So, the real problem is the unreliable nodes (i.e., online vs. offline time), which we're trying to work around by sending provider records frequently. But with some back-of-the-envelope calculations, @whyrusleeping shows that sending frequent provider records is unsustainable even now, never mind in the longer term.
And then there is the issue of whether a given node is still caching some content after some amount of time (during which it might or might not have gone offline), or it has discarded it to cache other content.
In my view these are two separate problems, so how about splitting the solution in two parts:
1) find a way to make sure that the nodes that provide routing info are reliable (i.e., are generally online), e.g., through incentives, or similar, and, 2) make sure we can find content in unreliable nodes that (ephemerally) cache (i.e., not permanently store) some content. e.g., through replication/redundancy to accommodate unreliability.
In this way, we don’t need to report provider records for routing nodes, because we know they’re generally online and we don’t need provider records for caching nodes, because we can find alternative caching nodes and they’re generally not reliable anyway, so it's not worth sending provider records for those.
I think we may be talking about different things when we say "cache". When we say that a node "caches" the content, we mean that someone decided to view the content on their local machine but didn't explicitly save (pin) it. These caches aren't services randomly caching content on the network.
find a way to make sure that the nodes that provide routing info are reliable (i.e., are generally online), e.g., through incentives, or similar, and,
:+1:
make sure we can find content in unreliable nodes that (ephemerally) cache (i.e., not permanently store) some content. e.g., through replication/redundancy to accommodate unreliability.
Replication of the content?
In this way, we don’t need to report provider records for routing nodes,
?
because we know they’re generally online and we don’t need provider records for caching nodes, because we can find alternative caching nodes and they’re generally not reliable anyway
How can we find them without provider records and without just dialing random nodes?
Currently ipfs uses a DHT for all content routing. This works quite well for many use cases and is generally reliable, fast, and durable. The problem we are now facing is that it does not scale well. Sending out provider records for every single block of every file to the DHT uses an obscene amount of bandwidth, not to mention the increased CPU and memory load. To continue improving ipfs, we need to find a better solution.
Delegated routing
The idea of delegated routing is fairly simple: select another node in the network to perform all of your content routing for you. This is a good solution, but it assumes a few things: there are other nodes in the network willing to do this for you, you trust those nodes, and you can reach those nodes. This sort of routing is going to be just about required for mobile applications of ipfs.
Advantages
If you have a low-latency connection to your selected delegate, and the delegate is well connected in the network, routing queries should complete very fast. From the user's perspective, all that is involved is a single RPC between them and the delegate. The resources required here are very minimal.
Another nice feature is that if a single delegate is shared across multiple clients, results could be cached to further reduce the resource usage.
Disadvantages
In order for this to work, you have to have a node on the network willing to be your delegate. This means that you either have to control your own node out in the wild, or convince someone else to let you use theirs. This isn't always easy, especially for casual or mobile users. One solution here is to allow the ipfs gateway nodes to be open routing delegates, but that may end up putting an extreme level of stress on the gateways, and it also makes the whole system a bit more centralized (which happens to be something we're opposed to).
Trackers
The idea of trackers is quite similar to delegated routing. The primary difference here is that while the idea of delegated routing is agnostic to how the routing is performed, the intention was still that the delegates were DHT nodes on the main network. Using the idea of trackers would mean that the delegate just stores all routing information it deals with locally. This is what bittorrent uses (in addition to Mainline DHT).
Advantages
Compared to delegated routing in general, the only main advantage is that all routing queries will be resolved by the tracker, and not require extra work in the background to find the data. This, in an ideal world, is the fastest content routing system.
Disadvantages
Trackers have limited knowledge of what is available in the network. To find content using a tracker, other users who have the content you want must also be using that tracker. Trackers also don't scale well to larger systems; too much load falls on a single point of failure.
Hybrid Delegates
The idea of a 'hybrid' delegate system is something I've been thinking about for a little while now. The basic concept is to use multiple delegates, as well as falling back to the DHT in certain cases. First, nodes can mark themselves as 'supernodes' and announce through some medium that they are willing to fulfill routing queries. Other peers can discover these through some mechanism: mDNS, the DHT, a preconfigured list, or random discovery when connecting for other purposes. Once discovered, if a node is configured to use this system, it can select the 'best' set of discovered supernodes and send all routing queries to those nodes. Depending on how much you trust each of those nodes, you can select more or fewer of them to query, and optionally fall back to making DHT queries. You should also be able to mark certain supernodes as trusted in your configuration so you remember them and can use 'just them' to fulfill queries.
This idea is really a few different things mashed together. Tiered routing, the ability to query multiple content routing providers intelligently, Delegated routing as discussed above, and integrated ways to discover these delegates.
One scenario i'm imagining with this system is that you could run a 'supernode' on your LAN or in your datacenter for all the other ipfs nodes around you to automatically discover (via MDNS) and use.
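A rough sketch of the tiered lookup described above: query trusted delegates first, then other discovered supernodes, then fall back to the DHT. The ContentRouter interface and everything else here is hypothetical, not the actual go-libp2p routing API:

```go
package main

import "fmt"

// ContentRouter is a hypothetical stand-in for anything that can answer
// "who provides this CID": a delegate, a supernode, or the DHT itself.
type ContentRouter interface {
	FindProviders(cid string) []string
}

// TieredRouter queries routers in order of trust and stops at the first
// tier that returns any providers.
type TieredRouter struct {
	Trusted    []ContentRouter // delegates marked trusted in the config
	Discovered []ContentRouter // supernodes found via mDNS, the DHT, etc.
	Fallback   ContentRouter   // the DHT
}

func (t *TieredRouter) FindProviders(cid string) []string {
	for _, tier := range [][]ContentRouter{t.Trusted, t.Discovered, {t.Fallback}} {
		var found []string
		for _, r := range tier {
			if r == nil {
				continue
			}
			found = append(found, r.FindProviders(cid)...)
		}
		if len(found) > 0 {
			return found
		}
	}
	return nil
}

// staticRouter is a toy router backed by a fixed table.
type staticRouter map[string][]string

func (s staticRouter) FindProviders(cid string) []string { return s[cid] }

func main() {
	tr := &TieredRouter{
		Trusted:  []ContentRouter{staticRouter{"QmFoo": {"lan-supernode"}}},
		Fallback: staticRouter{"QmBar": {"dht-peer"}},
	}
	fmt.Println(tr.FindProviders("QmFoo")) // [lan-supernode]
	fmt.Println(tr.FindProviders("QmBar")) // [dht-peer]
}
```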
TODO: make this section more coherent.
Other things that could help
Batched Providing
Currently, the content routing interface only allows a single object to be provided, or looked up for providers, at a time. Changing this interface to allow multiple requests to be batched together should improve resource consumption significantly.
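For illustration, a hedged sketch of what a batched shape might look like next to the current one-at-a-time shape; these interface names are made up, not the actual go-ipfs content routing interface:

```go
package main

import (
	"context"
	"fmt"
)

// Today's shape: one CID per call, so providing N blocks means N separate
// rounds of bookkeeping and network work.
type ContentRouting interface {
	Provide(ctx context.Context, cid string) error
}

// Hypothetical batched shape: many CIDs per call, letting the router group
// records by destination and reuse connections.
type BatchedContentRouting interface {
	ProvideMany(ctx context.Context, cids []string) error
}

// printRouter is a toy implementation of the batched interface.
type printRouter struct{}

func (printRouter) ProvideMany(ctx context.Context, cids []string) error {
	fmt.Printf("announcing %d records in one batch\n", len(cids))
	return nil
}

func main() {
	var r BatchedContentRouting = printRouter{}
	_ = r.ProvideMany(context.Background(), []string{"QmA", "QmB", "QmC"})
}
```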
Multicast
The DHT sends the same message to a large number of different peers throughout the network, requiring it to form individual connections to each of them and send the message many times. If a multicast system were implemented in libp2p, it could be used to reduce the outgoing bandwidth required by the DHT.