Gossip protocols - useful direction?

atacratic commented 5 years ago

I've been gripped by the urge to try and write a gossip-based protocol in Unison. (I'm OK with the fact I might not be able to typecheck or run it for a little while!)

Mostly this is (a) for fun (b) so I can road-test the type checker as it firms up (c) to try using the Remote API. I'm interested in gossip approaches because I find scalable p2p pretty cool, and because some gossip algorithms are pretty simple to implement.

But it would be great if it could also produce some code that is a useful library for unison / unison.cloud. So I'm looking for some steers as to what to aim for.

I'm not so inclined to work on a Raft implementation instead (as per #114), partly because Arya's already had a crack at it.

In a network that uses gossip-based protocols, the nodes are communicating in a peer-to-peer fashion, disseminating information from one to another so that it spreads through to all the other nodes. Gossip-based protocols often depend on a 'peer sampling service' (as explained in 2.2 and 3 of Gossip-based Networking for Internet-Scale Distributed Systems. It's actually a peer sampling service that I'm thinking of implementing for starters.

A peer-sampling service lets each node that is participating in the protocol discover other nodes. It gives each node a 'partial view' of the full network membership list.

Applications?

Section 1 of Gossip-based peer sampling summarises the sort of things you can implement with the help of a peer sampling service. Paraphrasing:

multicasting information / publish/subscribe groups
aggregation (e.g. finding the node running on a machine with most spare CPU; counting the number of nodes in the network; collecting stats)
load-balancing (i.e. agreeing work handoffs between peers)
building overlay routed networks, e.g. of the kind you need for a distributed hash table.

So that makes me think it would be neat to have one.

Can we think of any applications within Unison? The one that springs to mind is cluster membership management. That's a basic library we need. The alternative approach to implement it would be a consensus protocol like Raft (and #114 proposes Raft as the membership service to support DIndex). But consensus protocols won't scale beyond hundreds of nodes (citation needed...) whereas gossip approaches could scale to millions. Is there any plausible need for a swarm of millions of unison nodes? I guess if you were imagining reimplementing p2p messaging/telephony (like tox) on top of unison then yes.

Even in the absence of a specific million-node-scale application, I think it would be cool if this is something Unison and its ecosystem enabled. "Yeah sure, you can build a p2p app based on Unison and it can grow to internet-scale, with negligible footprint in each node."

Intrusion-tolerance and trust

Where I get stuck on this line of thinking is around intrusion-tolerance and trust. Not every gossip protocol is able to cope with malicious nodes participating in the protocol in invalid ways to try to bring the system down. I'm still chasing references to find the state of the art and what's possible here - so far I've found BRAHMS, plus a few others.

Supposing for the moment it is possible to have an intrusion-tolerant gossip-based peer sampling protocol, is the intrusion-tolerance any use? If you are doing a Remote.transfer of a computation to a malicious node, you're in a pretty bad way. The situation is worse than if you were just writing regular code to send and receive network messages from a malicious peer - at least there you have a chance to parse the message they sent you and apply some checking to it. In unison, if they send you the computation back having ostensibly done some useful processing on it, they are actually sending you a request to run some code of their choosing (plus some data that code will operate on). Is that right? Can you do anything to be sure they're sending you code that you would want to run?

Does this say that any collaboration between peers in unison assumes you trust them to get you to do whatever they want (at least up to the limitations applied by your sandbox on resources and side-effects)?

If so then does communication need to be limited to islands of peers that have mutual trust, where the level of trust relates to their sandbox properties? How does that square with the unison.cloud idea? Does it mean that it needs to be a managed service delivered by a single trust-worthy entity?

Full vs partial views

A peer sampling service is about giving you a partial view of the network membership list - say, ten out of ten million. You could instead take the approach of saying that all nodes should get a complete view of the membership list. When Multi-Hop Peer-to-Peer Routing Matters makes a good argument that if you're implementing a distributed data-store, and you want to be able to serve files of a serious size, then the overhead of storing (say) ten million IPs (=57MB) is negligible, so you might as well go with full views.

I'm still thinking a partial view is the way I want to go because

any O(n) scalability means we can't think of this service as being lightweight at arbitrary scale - 57MB is too big to say "and sure, in any node you spin up you can run this service without a second thought, even if you're running a thousand nodes on the same device"
any distributed file-store implementation is welcome to not use this proposed bit of library code!

However, the choice of partial views interacts with the intrusion-tolerance point. Intrusion-tolerance seems to be easier in the full-view approach - for example fireflies seems to have it well covered.

If you've got this far, thanks for reading. I hope you're OK with using the issue tracker for this kind of chat. I look forward to your comments/suggestions/insights!

aryairani commented 5 years ago

Hey thanks for this write-up; I wanted to reply if even to say I haven't had a chance to go through it yet, but it's definitely appreciated, and we're looking forward to checking it out!

aryairani commented 5 years ago

Right now we don't feel it would be fair to recommend that anyone write Unison programs, given the low quality of error messages coming out of the the type-checker (We are routinely asking ourselves: "Do I have a bug? Does the type-checker have a bug? What is this even telling me?"), but hoping to get it into a less-embarrassing state ASAP.

aryairani commented 5 years ago

P.S. We are definitely hoping that you will still be interested in writing Unison libraries when the tooling is ready! 😄

atacratic commented 5 years ago

Thanks for the reply! I've been following the commits closely enough to know what state stuff is in - am not under any illusions that stuff will compile or run at the mo.

If I get through the planning stage to wanting to code something, I'd been thinking that when I hit typechecker bugs, I could do reductions and add them to the tests on a branch, for someone to get round to. Would be all part of the fun.

aryairani commented 5 years ago

Can't argue with that! 😃

pchiusano commented 5 years ago

Also @atacratic I really love idea of having generic set of distributed computing libraries, all in pure Unison, so excited to get things to a place where you and other folks could start working on this. Like I imagine “I went and read this distributed computing paper and ported it to a super reusable and tiny Unison library” becoming a regular thing around here. 😀

atacratic commented 5 years ago

Here are some more notes-to-self, having thought through the concern under 'Intrusion tolerance and trust' above.

The problem is how to communicate with a possibly malevolent node, where the Unison implementation itself has been tinkered with.

I had been thinking about the standard Unison rendering of the request/response pattern, which I think is as follows.

example = let alice = Remote.here
          Remote.transfer bob
          let x = bob_processing()
          Remote.transfer alice
          do_stuff x

In that pattern Bob can mount quite a few possible attacks:

transfer back to Alice with a continuation do_evil_stuff x instead of do_stuff x
never transfer back to Alice
transfer back to Alice multiple times
transfer back to Alice with some incorrect result x.

The first one allows Bob to entirely hijack Alice's code flow. So better instead to do this:

untrusting = let result = Box.empty
             Remote.fork (request bob result bob_processing)
             let x = await result timeout
             do_stuff x

...where we're using the following helper functions:

await: Box a -> Duration -> {Fail} a
await box t = withTimeout t '(Box.read box)

request : Node -> Box a -> {Remote} 'a -> ()
request target result c = let origin = Remote.here
                          Remote.transfer target
                          let x = !c
                          Remote.transfer origin
                          Box.put x result

That lets Alice cope with all the attacks, at least to the extent she can spot and cope with the final one, the result x being incorrect.

So that's good, but there are also plenty of attacks Bob could mount, which don't map directly to anything at the Unison language level.

Send a computation to Alice whose hash pulls in a large or even infinite amount of code.
Subvert Unison's distributed garbage collection mechanism (that clears up unreferenced Box instances), perhaps by sending too many/few keepalives, or monkeying with the visited-lists in the keepalives, or sending keepalives for boxes it's not entitled to be referencing.
Send a computation that runs for ever, and/or uses lots of memory or durable storage space. (The sandbox mechanism limits this, although that's cold comfort for other, innocent tasks trying to run in the same sandbox.)
Send a computation which finds and shares secret/private information that exists on Alice. (Again, sandbox should prevent this.)
Send a computation which has unwanted effects. (Presumably Sandbox should include policy about what effects a computation is allowed to invoke.)

This all seems like stuff where the impact could be contained appropriately, with the right smarts in the sandbox mechanism, and possibly some blacklisting of bad peers.

But where are you at if someone subverts the sandbox you're using for communication with an untrusted node, for example by maxing out its resource limits? I guess you want a sandbox for each possibly-untrusted node you're talking to, so that one bad node doesn't scupper your dealings with other nodes. And you want some way for the rest of the system to hear that a sandbox, or a task in it, is poorly. Again, seems like it should be soluble.

So I'm coming to the conclusion that it would be possible to engineer Unison so it works in a zero-trust environment, i.e. to allow Unison nodes that don't trust each other to collaborate. I don't think there's anything about Unison's special code/computation transfer model that prevents it.

Which is good :-) and I can keep looking for an intrusion-tolerant gossip-based peer sampling protocol...

pchiusano commented 5 years ago

Blocked until language and editor are a bit further along I'd say.

aryairani commented 2 years ago

Maybe they are further along now?

unisonweb / unison