unisonweb / unison

A friendly programming language from the future
https://unison-lang.org

RFC: Unified APIs for distributed programming, durable state, fault-tolerance and live redeployments (v2) #142

Closed by pchiusano 7 years ago

pchiusano commented 7 years ago

I put together a new version of the API, based on the discussion in #141 and some conversations in the Gitter room. Thanks to everyone who participated in that discussion!

I've checked in the latest version here, so we can track revisions to it:

https://github.com/unisonweb/unison/blob/4b4da116cb10d35e93e2f6d46c01d738072ae6af/docs/distributed-programming-rfc.markdown

The history section of that doc has info on the changes I incorporated. Things got simpler, and the API got smaller.

atacratic commented 7 years ago

Looks really good! :smiley:

Have put some comments/questions inline here: https://github.com/unisonweb/unison/commit/f0d0d9f9206f45b0b8ae968725a9d548f3c19f14

pchiusano commented 7 years ago

@atacratic Once again, I appreciate your detailed comments and questions. I'm going to try to get to these today, or else Real Soon Now.

atacratic commented 7 years ago

@pchiusano OK cool no rush! :-) Am particularly keen to hear what you think on the Durable comments - really want to get my head round that side of things.

pchiusano commented 7 years ago

Hey @atacratic just fyi I didn't get through all your comments yet... will pick up either this weekend sometime or early next week.

pchiusano commented 7 years ago

I've been pondering this a bit more, and I'm not liking the complexity around heartbeats. I might have an idea for how to just eliminate them from the API without losing expressiveness! Will do a writeup this week.

atacratic commented 7 years ago

Ha, I've been pondering as well but only just saw your note above! Looking forward to the writeup :smiley:

pchiusano commented 7 years ago

Okay, here's a quick writeup that I'm becoming fond of (I like deleting stuff!!!). Get rid of Heartbeat entirely; instead have:

-- Fork a computation, returning a `Task` handle that can be used for supervision
Remote.fork : Remote a -> Remote Task

-- Stop a task with the given cause
Task.stop : Cause -> Task -> Remote Unit

-- Be notified when the task completes (possibly call this `Task.await`?)
Task.supervise : Task -> Remote (Remote Cause)

-- Ensure that if either task stops, the other task is stopped as well, with the same cause
-- Exercise: implement in terms of `Task.stop` and `Task.supervise`! (a sketch follows below)
Task.link : Task -> Task -> Remote Unit

-- Pause the current thread for the given duration, then return
Remote.sleep : Duration -> Remote Unit

(Another possible name for Task: Process. I kinda liked the vague analogy with Erlang, but they are different concepts: an Erlang process is an actor with a mailbox, whereas a Task is more like a handle to a lightweight thread, think ThreadId in Haskell.)

That's it. Some notes:

Something like Heartbeat, if useful, could still be an ordinary pure Unison library.
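Re the exercise above: here's one way `Task.link` might go, sketched in Haskell-style do-notation against the signatures above (my sketch, not part of the proposal; it assumes `Remote` is monadic and that stopping an already-stopped task is a no-op):

```
-- Sketch only: link two tasks by watching each one and propagating
-- its stop cause to the other. Exact Unison syntax will differ.
Task.link t1 t2 = do
  Remote.fork (watch t1 t2)   -- watcher: when t1 stops, stop t2
  Remote.fork (watch t2 t1)   -- watcher: when t2 stops, stop t1
  pure ()
  where
    watch t other = do
      completion <- Task.supervise t   -- Remote Cause: fires when t stops
      cause      <- completion         -- block until t actually stops
      Task.stop cause other            -- stop the other with the same cause
```

Note the watchers are themselves forked tasks living on whatever node ran `Task.link`; that's exactly the linkage-node concern that comes up below.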

atacratic commented 7 years ago

I like it :+1:

  • The parent/child setup is nice and familiar from UNIX processes and Erlang.
  • fork and supervise are familiar from pthreads (create and join).
  • It means the Unison concurrency primitives are basically Box and Task, which are both fairly well-trodden ground in other languages/systems. This is a good thing!

The reason this is easier to deal with than Heartbeats is that a Heartbeat came with a set of forks to supervise, which (I think) is inherently a distributed consensus sort of problem. The new proposal dodges this by cutting out any 'supervise a set of things' API. You just supervise a single Task at a time, so your task-chasing polling mechanism works a treat - each poll is just chasing one thing around.

Questions/issues:

  • Task chasing is not resilient to loss of nodes along the path, right? If a computation has been careering round some path from node to node, then the poll path (from the original starting node, say) is going to follow the task path (with loops snipped out), until it hits a node which has gone down. Isn't this a problem? Tasks might briefly visit a node on their way to some longer-term home, and not want to die days later if the briefly-visited node goes down. (Maybe each transfer could send an update to a few carefully chosen previously-visited nodes in order to add some resilience - gossiping back along the path. This could be tricky to make resilient to cases of partition, or where a bunch of nodes fail together.)
  • Task.supervise will only work on nodes that are on the target task's visited path. Is this OK? You nod to this in your 'something unusual' point, but I think it actually needs to be mentioned in the API spec. It's a consequence of not having a single registry (DHT...) of running tasks. Or there is another fix: special-casing serialization of Task IDs so that when they are transferred, some forwarding-map info gets transferred too. Anyone who can refer to a given Task in order to call supervise on it is then definitely running on a node which has an entry (/some entries) for that Task in the Task map.
  • Unless I'm missing something, the Task.link implementation also needs to use Remote.fork (in order to start some supervisors without blocking the main thread that Task.link was run on). But then those supervisor threads are going to be separated from the main thread if the main thread later decides to transfer. What if both the supervisors are left alone on a node that then fails? The linked tasks (off on some other nodes) are then no longer actually linked. This is where I was going with my note "These forked computations are guaranteed to stay up and watching as long as the node the parent is on is up. They move nodes with the parent." I was thinking that the forked supervision helpers need to trail around after the Tasks they are watching, as part of the Remote.transfer process. No, I don't particularly like that idea. Hopefully you have a better one, or maybe I've just missed the right impl of Task.link?
  • Is supervision still all poll and no push? Can we make it so that non-asteroid Task failure sends a push message to trigger supervisor completion? Otherwise people will have to roll their own supervision mechanisms to get timely failure handling in the non-asteroid case.
  • I still think your task-chaser poll messages need a TTL. A poll could chase a Task for ever in a huge loop (around a massive parallel processing cluster, say - imagine someone's written a token-ring task to collect stats or something), never catch up, and never see a status that was less than 10ms old.
  • Is the canonical usage example that a parent does a Task.link whenever it forks a child, to link itself to it? Unless it is explicitly a supervisor, in which case it does a Task.supervise instead?

I still have a nagging feeling that we are butting up against problems which are fundamentally distributed computing problems, and will need some distributed computing machinery (e.g. a distributed hash table, or a raft cluster, or a commutative event log, or whatever) in order to robustly solve - e.g. supervision only working if done from a node on a Task's path history, and path node failures breaking task-chase polling. Unison is a distributed computing platform, so it would not be so shocking if it needed some machinery of the sort I mentioned under the covers. But, all the same, it's completely right to want to avoid adding that machinery unless and until it's proven completely necessary. I do like the idea of the Unison primitives being built on simple soft-state foundations, and that being enough to build robust distributed systems out of in userland code. If it's possible...!

Slight diversion: are you thinking that each Task will occupy an OS thread in the Unison node OS process 100% of the time, even when blocked or sleeping? Am hoping you're going to say that the answer is somehow 'no'. It's very much a first-order property of the Task API whether they are cheap enough not to have to ration. I have a suspicion that Go has nailed this question with its segmented stacks (https://blog.cloudflare.com/how-stacks-are-handled-in-go/), but I guess we might be limited to what Haskell can easily offer the interpreter implementation?

pchiusano commented 7 years ago

> Task chasing is not resilient to loss of nodes along the path, right?

Only before supervision starts. Once you've started supervising, you follow the task as it hops around; you don't go all the way back to the start node. You are almost always going to start supervising right away or not at all, so I think this is fine. Also, node failures are not THAT common; I don't want to overengineer this.

> Task.supervise will only work on nodes that are on the target task's visited path.

No, that's not correct: it can be called from any node. The Task value itself can contain the starting node of the forked computation, and supervisors can then chase from there to the task's current location. Does that make sense?
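To make that concrete, here's a hypothetical sketch of the chase, assuming each node keeps a forwarding entry for tasks that have transferred away (`statusAt`, `Here`, and `MovedTo` are invented names, not part of the RFC):

```
-- Hypothetical: start at the node baked into the Task value and
-- follow forwarding pointers until we reach the task's current home.
chase : Node -> TaskId -> Remote Status
chase n tid = do
  r <- statusAt n tid             -- ask node n what it knows about tid
  case r of
    Here status -> pure status    -- task is running (or finished) on n
    MovedTo n'  -> chase n' tid   -- task transferred away; keep following
```

Only the first lookup depends on the starting node still being reachable; after that, a supervisor tracks the task from wherever it last saw it.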

> Unless I'm missing something, the Task.link implementation also needs to use Remote.fork

I think so. I'm not too concerned about the node implementing the linkage going down. I mean, if you are concerned about that, that's a good reason to establish the linkage on two different nodes on different machines - that seems like a user decision. I'm not sure I fully understood your concern here...

> Is supervision still all poll and no push?

That's an implementation detail, but I'd say status updates not caused by an asteroid strike can be pushed to supervisors as promptly as possible. Asteroid strikes will eventually be discovered also, but will need to wait for timeouts.

> I still think your task chaser poll messages need a TTL.

I don't think so. Maybe I should try to produce that proof; then it will be clearer what assumptions go into it, or whether my reasoning is wrong. 😀

> Is the canonical usage example that a parent does a Task.link whenever it forks a child, to link itself to it?

It probably depends, but I think the common case would be to fork then supervise.
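As a usage sketch of that common case (hypothetical again; `Cause`'s constructors aren't pinned down in the proposal, so `Completed` is made up):

```
-- Fork a worker, supervise it, and restart it once if it fails.
superviseOnce worker = do
  t          <- Remote.fork worker   -- start the child computation
  completion <- Task.supervise t     -- Remote Cause: fires when t stops
  cause      <- completion           -- block until the task stops
  case cause of
    Completed -> pure ()             -- finished normally, nothing to do
    _         -> do
      _ <- Remote.fork worker        -- restart it once
      pure ()
```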

I don't have the same nagging feeling... yet... 😀

Re: threads, yeah, they should be super lightweight, just like in Haskell - not actually blocking OS threads.
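For anyone unfamiliar, that's the GHC model being referenced: `forkIO` threads are green threads multiplexed by the runtime over a small pool of OS threads, so a blocked or sleeping thread costs roughly a kilobyte of heap rather than an OS thread. A quick illustration in real Haskell:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (replicateM_)

-- 100,000 sleeping green threads: none of them pins an OS thread,
-- and the whole program stays within a modest heap.
main :: IO ()
main = do
  replicateM_ 100000 (forkIO (threadDelay (60 * 1000000)))
  putStrLn "forked 100k sleeping threads"
  threadDelay 1000000  -- exit after a second; the threads die with the process
```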


atacratic commented 7 years ago

> The Task value itself can contain the starting node of the forked computation

Cunning!

> [re Task.link] I'm not too concerned about the node implementing the linkage going down. I mean, if you are concerned about that, that is a good reason to establish the linkage on two different nodes on different machines.

Ah yes, and if you want to move a supervisor, you can start a new one and then kill the old one.

I was fretting about the parent starting a supervisor and then being split up from it, because I saw the supervision as something the parent was doing. But that's not the case.

I should probably stop talking about parent/child, because that calls to mind supervision trees, but here we're not being forced into a supervision tree model, and we're certainly not being forced to make supervision follow parent/child (who-forked-whom) relationships. The parent/child setup created by forking is only relevant when killing tasks.

People are naturally going to gravitate towards the assumption that parents supervise children, so I think I won't be the last person slightly confused by this.

Do we talk about parent tasks and child tasks? What's the terminology? Ah you used 'subtasks'.

BTW I think your comment above `Task.stop` should say "Stop a task (and its subtasks)".

> TTL

Probably the extra assumption you're making is that task-chaser poll messages move faster on average than the top speed of the tasks they're chasing - which seems like a fair assumption.
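A back-of-envelope version of that (a sketch, not from the thread): suppose a poll hop takes at most c seconds and a task transfers at most once every m seconds, with c < m. If the task is k hops ahead, the chaser covers those k hops in at most k·c seconds, during which the task adds at most about k·(c/m) < k hops to its lead, so the lead shrinks by roughly a factor of c/m per round; once it's down to one hop, the chaser arrives within c seconds while the task has to sit still for m > c, so the chase terminates without a TTL. If c ≥ m, though, the token-ring scenario from the earlier comment really could outrun the poll forever.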

> I think common case would be to fork then supervise

Or maybe it's actually to fork and then fork a supervisor.

Other stuff:

pchiusano commented 7 years ago

Closed in favor of #144; progress continues, thanks everyone!!