mbraceproject / MBrace.Core

MBrace Core Libraries & Runtime Foundations
http://mbrace.io/
Apache License 2.0

Discussion: cloud workflows and the distribution effect #9

Closed eiriktsarpalis closed 9 years ago

eiriktsarpalis commented 9 years ago

Drawing on the discussion started in this issue, I would like to share a few thoughts on the programming model.

As you may know, cloud workflows are used in every aspect of the MBrace API, from parallel combinators to store operations. For instance, the ICloudDisposable interface has the following signature:

type ICloudDisposable =
    abstract Dispose : unit -> Cloud<unit>

An interesting question arises here: how can one know whether a Dispose implementation introduces distribution? While it makes sense that all primitive store operations should not introduce distribution, this cannot be guaranteed by their type signature. A workflow of type Cloud<unit> could signify an asynchronous store operation, or it could contain a massively distributed computation. In other words, there is no way to statically detect whether a workflow carries the distribution effect.
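To illustrate, here is a minimal sketch, using only cloud { } and Cloud.Parallel from this thread; deleteFromStore is a hypothetical store primitive, not an MBrace API. Both values have the same type, yet only the second carries the distribution effect:

```fsharp
// Illustrative sketch only. 'deleteFromStore' is hypothetical;
// 'cloud' and 'Cloud.Parallel' are the MBrace names discussed above.
let localDispose (deleteFromStore : unit -> Cloud<unit>) : Cloud<unit> =
    cloud { do! deleteFromStore () }            // stays on the current worker

let distributedDispose : Cloud<unit> =
    cloud {
        // same type signature, but may fan out across the whole cluster
        let! _ = Cloud.Parallel [ for _ in 1 .. 100 -> cloud { return () } ]
        return ()
    }
```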

Currently, this is somewhat mitigated using the following primitive:

Cloud.ToLocal : Cloud<'T> -> Cloud<'T>

This has the effect of evaluating the input workflow with thread-pool parallelism semantics, thus giving a dynamic guarantee that the nested computation will never leave the current worker machine. It offers relative sanity, but is hard to reason about and does not work correctly in conjunction with forking operations like Cloud.StartChild.
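A sketch of how this primitive might be used, assuming the Cloud.ToLocal signature above; note that the guarantee is purely dynamic, since nothing in the types stops a caller from passing a distributed workflow:

```fsharp
// Sketch, assuming Cloud.ToLocal : Cloud<'T> -> Cloud<'T> as above.
let safeDispose (d : ICloudDisposable) : Cloud<unit> =
    cloud {
        // evaluated with thread-pool semantics; never exits this worker,
        // but the restriction is enforced at runtime, not by the types
        do! Cloud.ToLocal (d.Dispose())
    }
```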

My proposal for amending this issue is to introduce two brands of computation expression for MBrace workflows, one for local and one for distributed computations. A distributed workflow can compose local workflows, but not the other way around. Store operations will be local workflows and the parallelism primitives will necessarily return distributed workflows. This would allow the distribution effect to be reasoned about statically, at the cost of potentially complicating the programming model.

I have created an experimental branch that attempts to develop these ideas:

- Workflow definitions
- Builder declarations
- Store operations using local workflows
- Cloud.Parallel primitive
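The two-tier model could look roughly like the following sketch. The builder names local and cloud and the Local<'T> type are taken from this thread; readFromStore is a hypothetical store primitive, and the exact lifting rule is an assumption about the experimental branch, not its final API:

```fsharp
// Sketch: store operations live in local workflows...
let readStep (readFromStore : unit -> Local<string>) : Local<string> =
    local { return! readFromStore () }

// ...and distributed workflows may compose them, while the parallelism
// primitive necessarily returns a distributed (cloud) workflow.
let distributedStep (readFromStore : unit -> Local<string>) : Cloud<string[]> =
    cloud {
        let! results =
            Cloud.Parallel
                [ for _ in 1 .. 10 -> cloud { return! readStep readFromStore } ]
        return results
    }

// The reverse direction would not typecheck: a local { } block cannot
// bind a Cloud<'T>, so the distribution effect is visible in the types.
```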

Thoughts?

dsyme commented 9 years ago

Perhaps an alternative is this naming:

type Cloud<'T>
type Cloud<'T, 'Where> : Cloud<T>
type Local  // a tag type

So what we call Local<T> today is always seen as Cloud<T, Local>. Then the average user only ever sees

 Cloud<T>
 Cloud.Parallel: seq<Cloud<T>>  -> Cloud<T[]>

And the power users sees:

cloudLocal { ... }  (or local { ... } if you like)
Cloud<T,Local>
CloudLocal.Parallel: seq<Cloud<T,Local>>  -> Cloud<T[],Local>

For the power user the machinery is at least reusable if they choose to define their own tags, with the added bonus that combinators like CloudLocal.Parallel also preserve localness.
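A minimal phantom-tag sketch of this naming, independent of MBrace and simplified (it drops the Cloud<'T> / Cloud<'T, 'Where> subtyping relation and uses a stand-in representation); parallel' is used since parallel is a reserved keyword in F#:

```fsharp
// 'Local' is just a marker: the 'Where parameter never holds a value.
type Local = class end                 // tag type
type Remote = class end                // a user-defined tag, for illustration

type Cloud<'T, 'Where> = CloudComp of (unit -> 'T)   // stand-in body

// Combinators can preserve the tag, so "localness" survives composition.
module CloudLocal =
    let parallel' (xs : seq<Cloud<'T, Local>>) : Cloud<'T[], Local> =
        CloudComp (fun () -> [| for CloudComp f in xs -> f () |])
```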

palladin commented 9 years ago

The structure as it is right now works beautifully, maybe we need to rename Local<'T> to CloudLocal<'T> in order to give it some context. But local {} is too beautiful to change it.

isaacabraham commented 9 years ago

@palladin @eiriktsarpalis: I personally really like the local { } abstraction. Once I "got" it, it allowed me to more easily reason about my code.

However, I'd also strongly recommend acting on this feedback, even if it's negative or not necessarily in line with what was hoped for. If people are struggling now - and I assume from Don that most of the individuals he's coaching are MSR people - then the average developer will probably experience the same. Most of those individuals won't have an F# expert sitting by their side either. Some of those might well just give up if they get stuck on something like this.

So whilst I think I'm in agreement with you regarding the effectiveness of local { } I'm also worried about users struggling to get up to speed with the different abstractions - something I've seen as well.

palladin commented 9 years ago

@isaacabraham I agree that local {} is not for the average user. I think that the problem is that Local<'T> appears in many entry level APIs. Maybe if we prefix the Local type with Cloud CloudLocal<'T> we can make it more regular as a member of the family of Cloudxxx types and of course more digestible for the newcomer.

eiriktsarpalis commented 9 years ago

I'm not sure that renaming Local<'T> would somehow ease understanding for novices. Combinators that explicitly require local workflows as arguments would produce a type error if supplied with cloud workflows. So if the goal is to delay introduction of the concept, I find it unlikely that a rename will achieve it.

Having played with local workflows quite a bit, I can say that the local/cloud duality is a central point of the programming model as it stands. Perhaps it would make sense to promote this distinction from the very beginning in tutorials.

dsyme commented 9 years ago

Here's an overview of the puppy image real-world scenario from yesterday. Basically the work came down to transporting 2GB (N * M * Size) of (string * int[]) data to the cloud, running N * N * M Set.ofArray/Set.intersection operations, checking if the intersection size was greater than some threshold (indicating similar or duplicate images), and returning strings (indicating the duplicate image names). It turned out it could be decomposed into M independent jobs, each running at most 12 hours, so we used a 150-machine cluster to do it. There were probably a whole lot of optimizations we could have done (bit sets etc.), but we didn't need to bother.

In this scenario the final solution just ended up using nothing but cloud { ... }, CreateProcess, AwaitResult() and that's all. Any data transport to the cloud was implicit in the cloud { ... } blocks.

We experimented with Cloud.Parallel but the overall size of the serialized job was too big, so we broke the work into the M independent CreateProcess calls (this is also why we were trying to parallelize calls to CreateProcess hence our bug report about that). For a while we thought we might have to store the data in the cloud so we started to do M jobs doing CloudCell.New, but then we realized that storage was temporary and could be fused out.

Most of our actual work was preparing/shaping/trialing small trial jobs (e.g. M=3, N=3) to estimate how much compute and process-upload time we were going to need in total.

This scenario seems very typical of the "medium-data-plus-big-compute-in-the-cloud" scenario that MBrace will absolutely excel at. For this scenario, the magic of MBrace is in REPL scripting and seamless-data-plus-code-transport-to-the-cloud. We were pleased with the ease and simplicity of that - Vagrant + MBrace + Brisk is an amazing, simple, exploratory, playful cloud scale-out programming environment.

I'll experiment with a PR for this idea: https://github.com/mbraceproject/MBrace.Core/issues/9#issuecomment-88632280, I'm fairly positive about this.

eiriktsarpalis commented 9 years ago

@dsyme Great, we would love to have his testimonial on the website once he's done.

dsyme commented 9 years ago

@eiriktsarpalis @palladin

Reopening this old chestnut again...

Unfortunately I've had the feedback that "Learning local ... is really confusing" once again.

I'm still not sure the local { ... } v. cloud { ... } distinction is hitting the sweet spot for users (as opposed to combinator implementors), compared to alternatives like "everything is cloud { ... }".

First, the word "local" is still confusing people - is it "local to the worker" or "local to the client" or "local to a machine" or ... People seem to be interpreting it as "local to the client" because of terminology like cluster.RunLocally.

Second, I'm just not sure that cloud v. local is a distinction that's so important to the majority of users. People seem to be really, really confused by it and don't understand what it's giving them. The majority of uses of MBrace involve nothing but either cloud flows or "start lots of jobs and wait for them".

dsyme commented 9 years ago

Do you think it might be possible to somehow enable this distinction optionally, for authors of libraries like MBrace.Azure and MBrace.Flow, by opening extra namespaces? If that were possible it might deal with the problem.

eiriktsarpalis commented 9 years ago

I think this problem essentially boils down to the naming of the Local<_> type. The local expression builder, as well as the Local.* methods, can easily be hidden away in separate namespaces and never be noticed by novice users. The Local<_> type, however, is pervasive and cannot be ignored. Perhaps a good rename of this type would resolve the ambiguity: we could either go for the phantom type approach or just call it LocalCloud<_>.

palladin commented 9 years ago

What if we remove the static typing of local {} and bring back dynamic type-checking? Example:

let mapCloud (f : 'T -> Cloud<'R>) (x : 'T) = cloud {
    let! r = local (f x) // local : Cloud<'T> -> Cloud<'T>
    return r
}

mapCloud (fun x -> cloud {
    let! y = Cloud.Parallel [] // boom: throws, since it runs in a local context
    return y
})
eiriktsarpalis commented 9 years ago

@palladin I'm strongly opposed to such an approach.

dsyme commented 9 years ago

:)

I spiked two possible changes here: https://github.com/mbraceproject/MBrace.Core/compare/master...dsyme:fix-local, using the terminology "single-machine cloud workflow".

isaacabraham commented 9 years ago

Wasn't there a discussion about replacing local with async at some point as the semantics are somewhat similar?

An alternative is to go back to cloud { } but with some way of indicating that some cloud workflows only operate locally. I don't like this idea though.

I'm of the opinion that local { } has real value in letting you reason about your code and where it executes - one of the harder parts of MBrace for beginners. Perhaps this problem can be solved by simply hiding local { } from higher namespaces and putting some XML comments on local for the beginner, saying "just treat this as cloud if you're a beginner" :)

palladin commented 9 years ago

I like the terminology "single-machine cloud workflow"

dsyme commented 9 years ago

The way I look at it is like this

| | CPU | Async I/O + cancellation + single thread | Async I/O + cancellation + multi-thread | Cloud I/O + cancellation + multi-thread + single machine | Cloud I/O + cancellation + multi-thread + multi-machine | Notes |
|---|---|---|---|---|---|---|
| normal F# code | x | | | | | |
| async { ... } | x | x | x | | | |
| cloud0 { ... } | x | x | x | x | | Supported as work specs in MBrace APIs. No scheduling of nested cloud computations; safe to use shared memory and unserializable objects, up to multi-threaded concurrency safety. |
| cloud { ... } | x | x | x | x | x | Supported as work specs in MBrace APIs. Scheduling operations may serialize; dangerous to use shared mutable memory and unserializable objects unless the work is effectively cloud0. |

The advantage of the "cloud0" name is that it implies "it's like cloud { ... }, and it's for cloud programming, but more restrictive than cloud { ... }".

[ As an aside, when looked at this way, you can also imagine there being an async0 that is single-threaded (which can't start new child tasks in the thread pool and only supports StartImmediate). ]

[ As an aside I suppose the distinction between Async I/O and Cloud I/O isn't really very meaningful - indeed Async I/O for web requests is likely to be a stronger effect (= longer delays, more chance of failure) than stores to/from cloud storage in the same data center. ]

dsyme commented 9 years ago

@eiriktsarpalis - FWIW I gather from #117 (and previous discussions) that the cloud0/local row in the table above should not have an x in "Serialized semantics".

This would mean that the only real difference between cloud0/local and async is that cloud0/local carries an extra computation-local data map - do I recall that correctly?

To put it another way, if you could edit the table above until it reflects what's accurate that would be great :)

dsyme commented 9 years ago

Copied from #117:

@dsyme says:

I see, so the entire continuation is serialized, so

cloud {
    let v = ref 0 // or some non-serializable thing
    let! x = SomethingThatMayUseCloudParallel()
    ... reference to v ...
}

results in the possibility of invalid serialization of the continuation.

In this setting I still like the cloud0 name - explained as "a cloud { ... } that starts precisely zero nested tasks, and no part of which gets serialized, and which can consequently use shared memory and unserializable in-memory objects safely"

@palladin says

I was wondering what that cloud0 suffix zero means...

"A cloud { ... } that starts precisely zero nested tasks, no part of which gets serialized, and which can consequently use shared memory and unserializable in-memory objects safely" is exactly the motivation behind local {}.

dsyme commented 9 years ago

(I've edited the table above to reflect my understanding)

palladin commented 9 years ago

I think that "Cloud I/O + cancellation + multi machine" implies "Scheduling operations may serialize (dangerous to use shared mutable memory and unserializable objects)"

dsyme commented 9 years ago

columns consolidated

eiriktsarpalis commented 9 years ago

Slightly related, the master branch of MBrace.Core includes support for local execution that emulates the effects of distribution. See this example.

dsyme commented 9 years ago

I took another less dramatic attempt to make progress on this issue here: https://github.com/mbraceproject/MBrace.Core/pull/119

This tries to replace the language "Remote" by "Cloud" and "Locally" by "Client". Minimally I think we have to be really careful not to use "Local" to mean "Client"

isaacabraham commented 9 years ago

Given today's discussion, can this issue be closed as well?

eiriktsarpalis commented 9 years ago

Yep, closing.