mbraceproject / MBrace.Core

MBrace Core Libraries & Runtime Foundations
http://mbrace.io/
Apache License 2.0
211 stars 46 forks source link

Discussion: cloud workflows and the distribution effect #9

Closed eiriktsarpalis closed 9 years ago

eiriktsarpalis commented 9 years ago

Drawing on the discussion started in this issue, I would like to share a few thoughts on the programming model.

As you may know, cloud workflows are used in every aspect of the MBrace API, from parallel combinators to store operations. For instance, the ICloudDisposable interface has the following signature:

type ICloudDisposable =
    abstract Dispose : unit -> Cloud<unit>

An interesting question that arises here is, how one can know if a dispose implementation does not introduce distribution? While it makes sense that all primitive store operations should not introduce distribution, this cannot be guaranteed by their type signature. A workflow of type Cloud<unit> could either signify asynchronous store operation or it could contain a massively distributed computation. In other words, there is no way to statically detect if a workflow carries the distribution effect.

Currently, this is somewhat mitigated using the following primitive:

Cloud.ToLocal : Cloud<'T> -> Cloud<'T>

This has the effect of evaluating the input workflow with thread pool parallelism semantics, thus giving a dynamic guarantee that the nested computation will never exit the current worker machine. It offers relative sanity, but is hard to reason about and does not work correctly in conjunction with forking operations, like Cloud.StartChild.

My proposal to amending this issue is to introduce two brands of computation expressions for MBrace workflows, for local and distributed computations. A distributed workflow can compose local workflows, but not the other way around. Store operations will be local workflows and the parallelism primitives will necessary return distributed workflows. This would allow to statically reason about the distribution effect, while potentially complicating the programming model.

I have created an experimental branch that attempts to develop these ideas: Workflow definitions Builder declarations Store operations using local workflows Cloud.Parallel primitive

Thoughts?

eiriktsarpalis commented 9 years ago

To give a more concrete example, the proposed change in the programming model would essentially contain two types, Local<'T> and Cloud<'T>, which would both be subtypes of the abstract type Workflow<'T> (names are indicative and subject to revision, of course). Each of the two subtypes would come with distinct expression builders:

let readFile (reader : Stream -> 'T) (f : CloudFile) : Local<'T> = local {
    use! stream = CloudFile.ReadAsStream f
    return reader stream
}

and

let readParallel (reader : Stream -> 'T) (files : CloudFile []) : Cloud<'T []> = cloud {
    return! files |> Array.map (readFile reader) |> Cloud.Parallel
}

The suggested bind operation in local workflows will have type signature Local<'T> -> ('T -> Local<'S>) -> Local<'S>, whereas in cloud workflows it will carry the signature Workflow<'T> -> ('T -> #Workflow<'S>) -> Cloud<'S>. In other words this would mean that local workflows freely compose with other local workflows, whereas they cannot bind to cloud workflows. On the other hand, cloud workflows can freely bind to both local and cloud workflows.

The Cloud.Parallel keyword will have type signature seq<#Workflow<'T>> -> Cloud<'T []>; since the primitive introduces distribution by definition, it will force 'elevation' of any workflow into a cloud workflow. At the same time, any use of Cloud.Parallel inside a local workflow will introduce a type error. For threadpool-based parallelism, a Local.Parallel primitive will be introduced, of type signature seq<Local<'T>> -> Local<'T []>. This will offer static assurance that the this local parallelism combinator cannot compose with child computations that introduce parallelism.

In other words, the proposed amendment to the programming model aspires to constrain distribution effects at compile time.

palladin commented 9 years ago

Given the fact that Cloud<_> and Local<_> are defined as

type Local<'T> = Workflow<Local, 'T>
type Cloud<'T> = Workflow<Distributed, 'T>

I was thinking that CloudBuilder.Bind could be something like

member __.Bind<'Ctx, 'T, 'S> (f : Workflow<'Ctx, 'T>, g : 'T -> Workflow<Distributed, 'S>) : Workflow<Distributed, 'S>
eiriktsarpalis commented 9 years ago

Well, that's essentially identical to the current signature.

palladin commented 9 years ago

I'm referring to this (f : #Workflow<'T>) https://github.com/mbraceproject/MBrace.Core/blob/local-workflows/src/MBrace.Core/Continuation/CloudBuilder.fs#L240 I think it is better for all types of bind to be of the same form Workflow<, >

eiriktsarpalis commented 9 years ago

I have a few updates on the proposed changes. I have pushed a new branch that proposes an alternative representation for local workflows.

The initial branch uses a phantom type approach

[<Abstract>] type Workflow<'T>

type Workflow<'Ctx, 'T when 'Ctx :> SchedulingContext> = inherit Workflow<'T>

type Local<'T> = Workflow<Local, 'T>
type Cloud<'T> = Workflow<Cloud, 'T>

This approach comes with the drawback that phantom types may be confusing to novice users. Type errors when doing incompatible compositions may also be difficult to read. What's more, the type alias declarations are only visible from F# code; a C# library would have to define separate types of wrappers which would possibly hurt interoperability.

The new branch uses the following, more straightforward approach:

[<Abstract>] type Workflow<'T>

type Local<'T> = inherit Workflow<'T>
type Cloud<'T> = inherit Workflow<'T>

This has better readability, but restricts scheduling context definition to the core library. It also does not allow context polymorphism. For instance, the combinator

Cloud.Catch : Workflow<'Ctx, 'T> -> Workflow<'Ctx, Choice<'T, exn>>

has to be rewritten to two concrete implementations:

Local.Catch : Local<'T> -> Local<Choice<'T,exn>>
Cloud.Catch : Cloud<'T> -> Cloud<Choice<'T,exn>>

It is not clear to me which of the two approaches is simplest.

isaacabraham commented 9 years ago

I must admit, I had until now no clue what the term "phantom types" even referred to. Now I think I see what you're saying - basically, in the first implementation both Cloud and Local are both really the same type, but at compile time you have the notion of two distinct types. In this way you can write operations that can apply across both simply by targeting the "real" type, or just one of them by using one of the two phantom types. OK. So this probably has lots of benefits to the implementation I would imagine in terms of code reuse and simplification? So from an F# perspective it's only positives I guess? Could you give examples of "bad" error messages caused from this?

Alternative is to create two concrete types with an inheritance chain - with means from a runtime perspective we get two types but there's a duplication of code etc..

What I like is that both approaches tend to suggest that the beginner can simply use Cloud everywhere and not worry about locality, and as their understanding improves they can adopt Local for efficiency etc.. The C# point is also valid and is important. What's going to be the cost in terms of the wrapper API if you're using phantom types? Is it going to be massively more costly to implement the wrapper then this is something you should think very carefully about. If it's only a little bit more costly to do, I would tend to go down the phantom types approach.

eiriktsarpalis commented 9 years ago

Well, it does result in some code duplication but this really only affects the internals of the core library.

In terms of error messages, given the following piece of code:

local {
    let! x = Cloud.Parallel [ cloud { return 42 }]
    return x
}

The former approach would give:

Error   1   Type mismatch. Expecting a
    Local<'a>    
but given a
    Cloud<int []>    
The type 'Context.Local' does not match the type 'Context.Cloud'

whereas the latter woud give:

Error   1   This expression was expected to have type
    Local<'a>    
but here has type
    Cloud<int []>

The first case is not terrible, but the last line might be a bit confusing.

As regards the C# api, I don't think it's so much an issue of performance as one of composability. It should be possible to compose workflows defined in C# and F# libraries interchangeably, without having to wrap or unwrap types.

isaacabraham commented 9 years ago

OK. The first error isn't actually that bad really, but yes, the second is more what I would be used to. Regarding the C# API, I was more meaning the issue of effort to write wrappers etc. rather than performance.

Yes, it's a nice goal to compose workflows in either language without wrapping but this might also limit your options in terms of what types you expose in F# because you're thinking about how they look to C# e.g. option types, discriminated unions etc.. Wrapping might be the only course to achieve an idiomatic C# library. I believe that Akka .NET have done the same thing albeit the other way around. Orleans on the other hand....

I've not had a look at the C# API yet though so I might be spectacularly off course here :-)

eiriktsarpalis commented 9 years ago

Yes, I think you're right. It still seems to me that phantom types would make a cool F# snippet, but feel a bit strange when used as the centerpiece of a large framework.

isaacabraham commented 9 years ago

If the shoe fits... :-) Until we're at 1.0 there's nothing wrong with putting stuff in and removing it IMHO. Particularly if this is going to speed up development time - as we approach a stage when you feel that the core feature set is ready, that's when you might want to review this stuff.

Again, IMHO.

dsyme commented 9 years ago

@eiriktsarpalis - Can Local workflows wait on cloud I/O actions? e.g. reads from CloudRef, or Cloud.Log?

eiriktsarpalis commented 9 years ago

Yes. Only Cloud.Parallel/Choice/StartAsCloudTask require explicit use of cloud {}

dsyme commented 9 years ago

So what happens if Local<T> and Cloud<T> are completely separate types and Workflow<T> is completely hidden? It sounds like there is very little code that operates over Workflow<T>, i.e. generic w.r.t. annoation?

eiriktsarpalis commented 9 years ago

Most combinators that return Cloud<> take Workflow<> as input. For instance

Cloud.Parallel : seq<#Workflow<'T>> -> Cloud<'T>

and

cloud.Bind : Workflow<'T> -> ('T -> Cloud<'S>) -> Cloud<'S>

Having a base type essentially simplifies APIs that take any workflow as input.

dsyme commented 9 years ago

OK, I see. Tricky choice!

eiriktsarpalis commented 9 years ago

Plus, it would be impossible to have this otherwise reasonable expression

Cloud.Parallel [ cloud { return 1 } ; local { return 2 } ]
isaacabraham commented 9 years ago

@eiriktsarpalis Hmmm. Does that mean "run these two workflows in parallel across the cluster - first one is a cloud workflow, ok. second one is a local workflow, promote it to a cloud workflow"?

eiriktsarpalis commented 9 years ago

@isaacabraham Not exactly. It means that both types can be run remotely, however the first one may encapsulate further distribution, while the second comes with a static guarantee that it will be constrained to a single machine if run. For a more illustrative example:

Cloud.Parallel [ Cloud.Parallel ... ; local { return! CloudRef.New 42 }]

The first argument and left argument are of type cloud since their execution might span multiple machines, while the right argument will not. Attempting to invert things, for instance

Local.Parallel [ Cloud.Parallel ... ]

will result in a type error. Having type local is a guarantee that no distribution will occur as far as MBrace is concerned inside that particular workflow. However, having type cloud does not necessarily imply distribution, for instance cloud { return 1 }.

eiriktsarpalis commented 9 years ago

This addition is merely a type-level safety blanket. It does not affect the underlying implementation or execution model. It has actually helped uncover a few bugs, in some cases where distribution had been accidentally introduced in store APIs. Overall it has increased my confidence in working with the model, my only hope is that it has not been rendered incomprehensible to outsiders in the process.

isaacabraham commented 9 years ago

OK, yes, got it. It's not incomprehensible IMHO, but it needs to be introduced carefully within tutorials etc. so that users understand how to reason about local { } versus cloud { } - it's definitely not a beginner "feature". The beauty is that because everything works under the Cloud expr., there's no need to introduce it immediately IMHO.

palladin commented 9 years ago

Based on my experience, is very helpful to have the static guarantee but also the visual cues (cloud {} vs local {}) which is necessary to understand the dynamics of your code.

eiriktsarpalis commented 9 years ago

Exactly! The novice need not even notice that there is a local expression builder. It could be introduced further on, in a 'constraining distribution' chapter.

isaacabraham commented 9 years ago

@eiriktsarpalis the only thing I would suggest you think about is namespacing of this expr. - if cloud { } is at the top level of MBrace etc., I would suggest putting local { } in a sub-namespace. The novice should (ideally) be able hit "." in intellisense and just see the constructs that would be useful to him for 80% of his or her needs etc.

eiriktsarpalis commented 9 years ago

Great! Let's see if we can settle the phantom type dispute then and merge the 'correct' branch to master. @dsyme ?

dsyme commented 9 years ago

The final position seems reasonable. This is the sort of addition which you have to be prepared to unroll if usability evidence indicates that's necessary, or if it collides with some other concern, but what you've got is worth running with for now.

dsyme commented 9 years ago

Some feedback on this design...

I was teaching fresh users for the first time today with this enabled (experienced F# programmers).

I still think the additional safety is worth it, but do wonder why "Local" and "Async" aren't the same thing.

eiriktsarpalis commented 9 years ago

Good question. Let me first clarify that Async values are perfectly serializable and can be composed/executed in MBrace workflows with the expected shared memory semantics.

In fact, Async used to play the role of Local until very recently. However, Async lacks the dependency resolution mechanism used by MBrace workflows. For example:

type CloudFile = { Path : string }
    member file.ReadAllLines() = local {
        // resolve file store implementation particular to current execution context;
        // in azure it would be a blob store client bound to the user's connection string
        let! fileStore = Workflow.GetResource<ICloudFileStore> ()
        return fileStore.ReadAllLines(file.Path)
    }

Dependencies are threaded through the workflow in a reader monad fashion. The CloudFile instance contains only its path and connects to it using dependencies from the execution context. It would be impossible to do the same with Async without somehow encapsulating store logic.

With the update to the programming model, there are many things that need to brushed up. For example, the Cloud.OfAsync : Async<'T> -> Local<'T> primitive has a type signature that seems to contradict its name. We have kept it like this out of backwards compatibility for now, but we would love to have more feedback on how to improve things.

palladin commented 9 years ago

I think that maybe the phantom type design is more clear for experienced and "maybe" to newbies as well. One possible simplification is to rename Workflow to Cloud<'Ctx, 'T> Some examples:

Cloud.OfAsync : Async<'T> -> Cloud<Local, 'T>
Cloud.Parallel : seq<Cloud<'Ctx, 'T>> -> Cloud<Distrib, 'T []>
CloudBuilder.Bind : Cloud<'Ctx, 'T> -> ('T -> Cloud<Distrib, 'R>) -> Cloud<Distrib, 'R>
LocalBuilder.Bind : Cloud<Local, T> -> ('T -> Cloud<Local, 'R>) -> Cloud<Local, 'R> 
eiriktsarpalis commented 9 years ago

But you would still use the Local<'T> alias for Cloud<Local, 'T> right? If yes, then confusion persists. If not, you confuse even more by forcing people to explicitly work with phantom types.

palladin commented 9 years ago

My idea is to remove the type aliases, I think that with the phantom types exposed is more clear (to me at least).

dsyme commented 9 years ago

One concept is dropped (Workflow) at the expense of the phantom type parameter

Cloud.Parallel : seq<Cloud<'Ctx, 'T>> -> Cloud< 'T []>
eiriktsarpalis commented 9 years ago

We would also need to have a Cloud<'T> base type (to replace Workflow<'T>). Thus:

Cloud.Parallel : seq<#Cloud<'T>> -> Cloud<Distrib, 'T>

Not sure which is less confusing TBH.

palladin commented 9 years ago

The confusion arises because we hide the context. Maybe we can keep the Cloud<'T> as a type alias of

Cloud<Distrib, 'T>

and delete the Local<'T> completely.

palladin commented 9 years ago

For the scenario of non-uniform Cloud.Parallel

Cloud.Parallel [ cloud { return 1 } ; local { return 2 } ]

we can provide two cast operators

Cloud.ToLocal : Cloud<Distrib, 'T> -> Cloud<Local, T>
Cloud.AsDistrib : Cloud<Local, 'T> -> Cloud<Distrib, T> 

and we can keep one uniform Cloud.Parallel

Cloud.Parallel [ cloud { return 1 } ; Cloud.AsDistrib <| local { return 2 } ]
Cloud.Parallel : seq<Cloud<'Ctx, 'T>> -> Cloud<Distrib, 'T []>
palladin commented 9 years ago

Another possible idea is to have something like

type Context
type Distrib = inherit Context
type Local = inherit Context

type Cloud<'Ctx, 'T when 'Ctx :> Context>

Cloud.Parallel<'Ctx when 'Ctx :> Context> :  seq<Cloud<'Ctx, 'T>> -> Cloud<Distrib, 'T []>

let first : Cloud<Context, int> = cloud { return 1 }
let second : Cloud<Context, int> = local { return 2 }
Cloud.Parallel [  first; second ]
dsyme commented 9 years ago

FWIW I'm not at all sure the current situation is a critical problem as such - people did notice "Workflow" and "Local", and asked about "Async", but equally they understood my explanation and saw that the distinctions brought value.

The real test will come when

In both these situations you will notice that the complexity of what can be communicated is lower than you expect, and that simplicity is a total virtue. To be successful, MBrace must be virally communicable as a simple-but-powerful big data technology - it's not just whether we can teach it, but whether the people we teach can teach it.

palladin commented 9 years ago

Valid point. The teaching aspect is the critical one.

eiriktsarpalis commented 9 years ago

What if I made Cloud<> the base type and had Local<> be its subtype? We could probably get rid of Workflow<> without changing much else.

On Friday, March 13, 2015, Nick Palladinos notifications@github.com wrote:

Valid point. The teaching aspect is the critical one.

— Reply to this email directly or view it on GitHub https://github.com/mbraceproject/MBrace.Core/issues/9#issuecomment-78690275 .

Sent from Gmail Mobile

palladin commented 9 years ago

Simplification is always a good thing, one less thing to worry about.

dsyme commented 9 years ago

But then a Local could be used where a Cloud is expected?

eiriktsarpalis commented 9 years ago

Yes, that is in fact expected behaviour, even now. It is the opposite situation that we are looking to avoid.

Actually, my original planning was to do it like this, but then I went for the phantom type approach out of convenience. I complete forgot about this afterwards. The more I think about this, the more it makes sense to do it like this.

dsyme commented 9 years ago

Yes, sorry, brain fail :)

eiriktsarpalis commented 9 years ago

Ok, this refactoring worked out smoothly: https://github.com/eiriktsarpalis/MBrace.Core/commit/ddb77c29c65c055408b22291f560f269c42cb901

If everybody agrees, I'll proceed with a PR.

eiriktsarpalis commented 9 years ago

As an added bonus, this is now possible:

Cloud.Parallel [ cloud { return 1 } ; local { return 2 } :> _ ]
palladin commented 9 years ago

Yep, I think it works elegantly!

dsyme commented 9 years ago

Feedback time:

First the very good :) I just took a non-PL F# user through MBrace - we concluded the session by running the scale-out comparison of feature extraction sets for 700,000 images over 200 Azure machines (currently running). It was really fun to use the system in anger over data. We'll write up more details in due course.

However, I do have to mention that both Local<_> and the # in #Cloud<_> in signatures are definitely noticed they add non-trivial learning burden for users. It is really quite hard for people to grok why subtyping suddenly makes an appearance, especially since they are in the process of learning an already abstract notion like Cloud.

When someone knowledgeable is on hand to get them over this hurdle with confidence, it is ok. But if they are on their own, I think most entry-level F# users are going to struggle with Cloud v. Local v. Async v. CloudFlow. It's just hard for people to learn these things and understand why so many new control constructs are needed.

So the feedback is similar to before: Local<_> reduces bugs, I like it. I'm not saying we should remove it. But Local<_> adds a non-trivial burden to the learning curve, even for F# programmers, and I'm concerned about that. People accept an explanation when it is given but if left to their own devices they might not like it. It's quite possible people will reject MBrace as "too complicated" when they see this alone.

One option may be to keep pushing Local<_> and #Cloud<_> to the fringe so you don't really notice it until you need it. The specific place where #Cloud<_> gets noticed is in the signature to Cloud.Parallel and Cloud.Choice. To be honest we could just make this accept type Cloud<_> and have a Local.Parallel and Local.Choice as well.

The specific place where Local<_> gets noticed is in the return type to CloudCell.New<_>. This is harder to hide.

isaacabraham commented 9 years ago

Where is there any details on CloudFlow?

dsyme commented 9 years ago

CloudFlow is the new name for CloudStream

eiriktsarpalis commented 9 years ago

Perhaps the root of problem lies in the naming of Local<_>. Just by name, it would be impossible to guess that this could be a subtype of Cloud<_>. What if we called it LocalCloud<_> or Cloudlet<_> or Vapor<_>? We would probably have to rename local { } which I have grown incredibly fond of.

dsyme commented 9 years ago

That's certainly one issue, though the root of the problem is simply the existence of more concepts for people to learn rather than less. 2 concepts are just harder to learn than 1 (while equally, 2 concepts may catch more bugs than 1 concept).

I assume a C# version of all this would likely just recycle Task<T> = Local<T> = Cloud<T> and be called "task execution", no matter what the correctness problems with that. But people would probably flock to it - partly because it's C# and partly because it's perceived to be simpler. Then a few people would realize the problems with it and start to refine the types used, but most wouldn't....

Another POV is that the Local v. Cloud distinction may be of more use to us as implementors (and also to power uses) than it is to 90% of potential users. If that were deemed to be the case then perhaps it's something that could be present in debug mode (when building the library), and absent in release mode, though that's somewhat tricky to arrange.

Still another POV is that you could opt-in to a "stricter" API using "open MBrace.MoreExactTypes" or something, and without that you just get Cloud<T>

But I'm really not sure if we need a change here - I'm really just giving feedback from 1:1 learning sessions. We'll learn more as we teach it more. However I've definitely noticed that some F# frameworks suffer from "too much type-level complexity" and "too many things to learn", and I believes it can hinder their adoption.

For the names above - I think local is better than those - people learn "local" in juxtaposition to "cloud" so they get the difference.