qri-io / rfcs

Request For Comments (RFCs) documenting changes to Qri

RFC Request: dynamic p2p dataset replication #37

Open · b5 opened 5 years ago

b5 commented 5 years ago

We've been thinking about dynamic dataset replication for a while. While this is definitely future work, @camhunt had the following question:

Has there been any thought/discussion in leveraging gossipsub so that any node that subscribes to a channel will automatically replicate any data set advertised?

So I think it's time we laid out some thinking on this & started roadmapping. This is all very preliminary, but here's what I've got so far.

First, here's what I think we want: Qri users should be able to express the kinds of dataset they're interested in, and Qri should automatically grab that kind of data when others publish it. Qri shouldn't do stupid things like eat the user's entire hard drive, or download stuff that could get the user in trouble.

Some example kinds of dataset:

Users should also be able to define constraints on what they grab. Examples:

I don't think we need gossipsub for this (yet). We have a custom p2p protocol that's already more efficient than gossipsub in terms of bandwidth (simply by avoiding an extra source of bandwidth consumption). More importantly, Qri p2p "understands" datasets, which we're going to need in the long term to support rules like: "I'd like to replicate any dataset from peer steve that has the tag climate and is less than 100MB". Most of those conditions require understanding what a dataset is, so it'll be easier to build this into Qri from the start.
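To make that concrete, here's a rough sketch of what a rule like that might look like as data. Everything below is hypothetical Go, not anything that exists in Qri today:

```go
package replicate

// Rule describes the kinds of datasets a node is willing to
// auto-replicate, plus constraints that keep replication from doing
// anything stupid. These types are illustrative only.
type Rule struct {
	Peer    string   // only replicate datasets published by this peer ("" = any)
	Tags    []string // dataset must carry all of these tags
	MaxSize int64    // skip anything larger than this many bytes (0 = no limit)
}

// DatasetInfo is the metadata a node would need to evaluate a Rule
// without fetching the dataset itself.
type DatasetInfo struct {
	Peer string
	Tags []string
	Size int64
}

// Matches reports whether a dataset satisfies the rule.
func (r Rule) Matches(d DatasetInfo) bool {
	if r.Peer != "" && d.Peer != r.Peer {
		return false
	}
	if r.MaxSize > 0 && d.Size > r.MaxSize {
		return false
	}
	for _, want := range r.Tags {
		found := false
		for _, tag := range d.Tags {
			if tag == want {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}
```

The example rule above would then be `Rule{Peer: "steve", Tags: []string{"climate"}, MaxSize: 100 << 20}`.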

All the work we're doing on remotes builds toward this. After we land the first remotes RFC (the over-HTTP version of remotes), the next step will be to move dsync into p2p, so that directly connected peers can sync only the blocks of a dataset they're missing in a pure p2p environment, and fall back to pin-and-wait against the IPFS DHT when no direct connection can be established (publishing to places like Qri and non-NAT remotes will help here).
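Sketched as control flow (all names and types here are hypothetical stand-ins for Qri internals, just to show the fallback behaviour):

```go
package replicate

import (
	"context"
	"errors"
)

type PeerID string

type DatasetRef struct {
	Path string // content-addressed path to the dataset root
}

var errNoConn = errors.New("no direct connection to peer")

// tryDirectDsync would open a p2p connection to the publishing peer and
// sync only the blocks of the dataset we're missing.
func tryDirectDsync(ctx context.Context, peer PeerID, ref DatasetRef) error {
	return errNoConn // placeholder
}

// pinAndWait would ask the local IPFS node to pin ref.Path and block
// until the data arrives via the DHT.
func pinAndWait(ctx context.Context, path string) error {
	return nil // placeholder
}

// FetchDataset prefers a direct block-level dsync, and falls back to
// pin-and-wait on the IPFS DHT when no direct connection can be made.
func FetchDataset(ctx context.Context, peer PeerID, ref DatasetRef) error {
	if err := tryDirectDsync(ctx, peer, ref); err == nil {
		return nil
	}
	return pinAndWait(ctx, ref.Path)
}
```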

After that's in place, we'll need two things that will require some thought:

  1. Some way to notify peers about new datasets so they can decide whether or not to pin them
  2. A system for making pinning decisions we can grow into

Announcing New Datasets

Qri's p2p protocol already has the notion of announcing events. So far we can only announce that a peer has connected: https://github.com/qri-io/qri/blob/7fc0f8d44c32dbc89b850e986d40cc1afc801e10/p2p/connected.go#L40, so the distance here is shorter than it looks. Once we can do a dsync over p2p, all we need is an Announce message sent whenever a new dataset is created, structured so that peers selectively propagate it to the peers they're connected to (selectively, to avoid DDoSing the network with announce messages). This is the exact same pattern gossipsub uses, but we can make smarter decisions about where to send those messages because Qri is more domain-specific to datasets. We have lots to learn about how our network behaves at scale before going there.
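A hypothetical shape for that message and its propagation rule (Qri's p2p protocol doesn't define any of this yet):

```go
package p2p

// DatasetAnnounce is a hypothetical message a peer would broadcast when
// it publishes a new dataset version.
type DatasetAnnounce struct {
	Peer string // peer ID of the publisher
	Ref  string // content-addressed path of the new dataset version
	TTL  int    // remaining propagation hops
}

// seen tracks refs we've already relayed so a cycle in the connection
// graph doesn't become an announce storm. (Single-goroutine sketch; real
// code would need locking and eviction.)
var seen = map[string]bool{}

// handleAnnounce decides whether to act on and/or relay an announcement.
func handleAnnounce(msg DatasetAnnounce, relay func(DatasetAnnounce)) {
	if seen[msg.Ref] {
		return
	}
	seen[msg.Ref] = true

	// A pinning decision would happen here (see "Pinning Decisions" below).

	if msg.TTL <= 0 {
		return
	}
	msg.TTL--
	relay(msg) // forward to a subset of connected peers, not all of them
}
```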

When combined with IPFS public proxies, this would even mean that nodes that can't directly connect to each other (e.g., due to NAT issues) could still replicate datasets around.

There are a few nice things coming down the pike on this topic: AutoNAT will land in go-ipfs 0.4.19, and we'll start setting up a NAT relay on our bootstrap nodes, both of which should mean Qri peers have an easier time finding each other. That's part of the reason I'd like to wait.

Pinning Decisions

This doesn't solve the bad actor problem (which we could solve, crudely, with a whitelist of nodes whose new datasets any given node would auto-replicate)

Totally agreed here. Once we can do all of the above, the last piece is a way to describe what you want to pin in a way that's "safe". Manually maintained lists of trusted peers are the perfect place to start on pinning decisions. All the filtering based on types of datasets can wait, because I think we can get a long way on just "pin whatever peer_a publishes". We can use that time to work on the other bit.

To me the big goal here is getting peers into a position of mutual support. That constraint of "don't pin this if the network has it in at least 3 places all the time" is the thing I've wanted since data rescue started.
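As a sketch, a first-pass pinning policy could be as small as a trust list plus a replica target. Everything here is hypothetical; knowing the `networkReplicas` input (how many copies the network actually holds) is the genuinely hard part:

```go
package replicate

// Policy is a hypothetical first pass at pinning decisions: start from a
// manually maintained trust list, and leave room to grow into
// availability-aware rules later.
type Policy struct {
	TrustedPeers   map[string]bool // "pin whatever peer_a publishes"
	TargetReplicas int             // stop pinning once this many copies exist (0 = always pin)
}

// ShouldPin reports whether to pin a dataset from publisher, given how
// many replicas we believe the network already holds.
func (p Policy) ShouldPin(publisher string, networkReplicas int) bool {
	if !p.TrustedPeers[publisher] {
		return false
	}
	if p.TargetReplicas > 0 && networkReplicas >= p.TargetReplicas {
		return false // the network already has it covered
	}
	return true
}
```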

I'm really intrigued by the work @nehbit has been doing on Aether. His notion of "neighborhoods" of slow-rotating peers for propagating content is really great work. I'm hoping we can come up with something that builds toward the notion of a neighborhood, where peers make pinning decisions not only based on what they want to keep, but also, to what degree a dataset is available on the network. This was part of the original promise of IPFS's bitswap algorithm and it hasn't lived up to the high-availability standard, so I don't consider this an easy problem.

I think the best place to start there is to ask people to manually tell us who they trust, which will help us understand how people want to compose themselves. We can use that to inform what will need to be a formally specified algo.

nehbit commented 5 years ago

Hey there 👋

Qri users should be able to express the kinds of dataset they're interested in, and Qri should automatically grab that kind of data when others publish it. Qri shouldn't do stupid things like eat the user's entire hard drive, or download stuff that could get the user in trouble.

This is something I'm working on with Aether as well. Not because the dataset I have is large, but because people have real preferences about which types of content they want to host, regardless of the content's legal status. Some people don't want to host content from certain communities for their own personal reasons. I'm investigating several possible avenues for this as well.

I'm really intrigued by the work @nehbit has been doing on Aether. His notion of "neighborhoods" of slow-rotating peers for propagating content is really great work. I'm hoping we can come up with something that builds toward the notion of a neighborhood, where peers make pinning decisions not only based on what they want to keep, but also, to what degree a dataset is available on the network. This was part of the original promise of IPFS's bitswap algorithm and it hasn't lived up to the high-availability standard, so I don't consider this an easy problem.

For neighbourhoods to work, you have to build them across peers that are equivalents in terms of the data they want. Aether can do them over the whole network because the app itself is defined in such a way that all nodes will want all the data, so that anyone can be in a neighbourhood with anyone else.

If you have nodes that want different pieces of data (or that track different data streams), you might want to create neighbourhoods in a pattern that groups people who want the exact same dataset combination, and make those neighbourhoods more stable (fewer inserts/ejects) so that in effect they become semi-permanent swarms. If you want datasets A, B, and C, in the best case you find one neighbourhood composed of people tracking all of A, B, and C; in the worst case you find three neighbourhoods that track A, B, and C separately.
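One illustrative way to express that matching (invented names; greedy rather than optimal set cover):

```go
package neighbourhood

// Neighbourhood is an invented type for illustration: a group of peers
// and the set of datasets they collectively track.
type Neighbourhood struct {
	ID     string
	Tracks map[string]bool
}

func tracksAll(n Neighbourhood, wants []string) bool {
	for _, w := range wants {
		if !n.Tracks[w] {
			return false
		}
	}
	return true
}

// Choose covers the want-set with neighbourhoods: best case a single one
// tracking everything, worst case one per dataset.
func Choose(wants []string, candidates []Neighbourhood) []Neighbourhood {
	for _, n := range candidates {
		if tracksAll(n, wants) {
			return []Neighbourhood{n} // best case: A, B and C in one place
		}
	}
	var chosen []Neighbourhood
	covered := map[string]bool{}
	for _, w := range wants {
		if covered[w] {
			continue
		}
		for _, n := range candidates {
			if n.Tracks[w] {
				chosen = append(chosen, n)
				for d := range n.Tracks {
					covered[d] = true // this neighbourhood may cover other wants too
				}
				break
			}
		}
	}
	return chosen
}
```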

The current implementation in Aether is very simplistic and aided massively by the fact that Aether nodes want all the data, so all potential neighbours are valid neighbours. When people start to choose the data streams they want to share with others via choosing communities they want to broadcast, then I'll have to expand on this.

b5 commented 5 years ago

For neighbourhoods to work, you have to build them across peers that are equivalents in terms of the data they want.

this ☝️ strikes me as a product of deep thought. Seems like the challenge here is expressive equivalency. Two equal things have only one permutation that works (either they're equal or they aren't), which makes equality a great place to start.

In Aether today all peers are equivalent, because they're hard-wired to want the same data. It seems to me you've introduced nuance around what peers do with that information (if I understand how you do variable-length rolling graph retention correctly). More reasons I think Aether is the bee's knees.

On the Qri side of things, we've started on the other end of the spectrum. By default all Qri peers want no data at all, and must be explicitly told to go get datasets, which is why it's not surprising to me at all that our users want ways to automate data grabbing.

But in both cases I think we're looking for the same thing: some way to broaden the definition of what peers want.

Just jotting an idea down here to explore properties of the problem:

Anyway, that's my first crack at expressiveness-with-equivalency in terms of data interest. I have no idea if this will make sense 😄
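One purely illustrative angle on the equivalency half (invented names, not anything from Qri or Aether): reduce each interest expression to a canonical form, so "do we want the same data?" becomes plain string equality.

```go
package interest

import (
	"sort"
	"strings"
)

// Canonical reduces a set of dataset interests (peer names, tags, etc.)
// to one canonical string: trimmed, lowercased, deduplicated, sorted.
func Canonical(interests []string) string {
	set := map[string]bool{}
	for _, in := range interests {
		set[strings.ToLower(strings.TrimSpace(in))] = true
	}
	keys := make([]string, 0, len(set))
	for k := range set {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

// Equivalent reports whether two interest expressions want the same data:
// two equal things have only one permutation that works.
func Equivalent(a, b []string) bool {
	return Canonical(a) == Canonical(b)
}
```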

nehbit commented 5 years ago

That's interesting! An alternative implementation that might be useful: bloom filters. In this case, the whole network is one neighbourhood that is infinitely subdivisible, and for the data you want, you merge the fingerprints of the data (in Aether's case, communities) into a single bloom filter. Then you can be part of any neighbourhood whose bloom filter includes yours. A fingerprint being in the filter means that peer will likely have at least the data you want.

The good thing is that these neighbourhoods don't need to exist as 'entities'; rather, they are emergent properties that occur when a local node picks its peers based on whether the remote node's bloom filter includes the local node's.

There are some practical concerns, especially the fact that bloom filters can only answer 'might have it', never 'will have it'. They also have a finite capacity before their error rate exceeds the expected bounds. But in the worst case, a bloom filter would assume everybody is in its neighbourhood (a false positive rate of 100%), which is exactly the current behaviour of Aether, so any implementation of a bloom filter as a neighbourhood selector would improve the current state. Users can also choose the size of their bloom filters: a larger filter allows a more granular choice of neighbourhoods (i.e., less unwanted data fetched and discarded) and a lower false-positive rate, in exchange for a larger filter that needs to be transmitted. Bloom filters also improve node privacy: one of the major concerns with choosing a subset of data to take in and broadcast is that it publicly declares the communities you are tracking, which bloom filters avoid.
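A minimal sketch of the mechanics (hand-rolled for illustration; a real implementation would use a tuned library filter, and peers would need to agree on the filter's size and hash count so filters are comparable):

```go
package neighbourhood

import "hash/fnv"

// Bloom is a toy bloom filter: m bits, k hash positions derived from a
// single FNV pass via double hashing.
type Bloom struct {
	bits []bool
	k    int
}

func NewBloom(m, k int) *Bloom { return &Bloom{bits: make([]bool, m), k: k} }

func (b *Bloom) positions(fingerprint string) []int {
	h := fnv.New64a()
	h.Write([]byte(fingerprint))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	if h2 == 0 {
		h2 = 1
	}
	pos := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		pos[i] = int((h1 + uint32(i)*h2) % uint32(len(b.bits)))
	}
	return pos
}

// Add merges a data fingerprint (e.g. a community ID) into the filter.
func (b *Bloom) Add(fingerprint string) {
	for _, p := range b.positions(fingerprint) {
		b.bits[p] = true
	}
}

// MightHave can return false positives, never false negatives.
func (b *Bloom) MightHave(fingerprint string) bool {
	for _, p := range b.positions(fingerprint) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

// CandidateNeighbour reports whether a remote peer's advertised filter
// (probably) contains everything we want. Since this is checked locally
// against each peer, neighbourhoods stay emergent rather than being
// entities anyone has to manage.
func CandidateNeighbour(remote *Bloom, ourWants []string) bool {
	for _, f := range ourWants {
		if !remote.MightHave(f) {
			return false
		}
	}
	return true
}
```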

At this point in Aether, I don't think there are enough peers (it briefly reached about 500 concurrent a week ago) to warrant this kind of logic for choosing between peers; it's still more efficient to just connect to everybody and discard the data you don't actually want. But as it grows beyond that, this might be a good way to find peers that are similar to you, without needing to know exactly how similar, beyond the fact that they are likely tracking the data you want to track as well.