mbraceproject / MBrace.Azure

Microsoft Azure PaaS implementation for MBrace
http://mbrace.io
Apache License 2.0

Discussion thread: Version tolerance #53

Closed: krontogiannis closed this issue 9 years ago

krontogiannis commented 9 years ago

Edit :

As of 0.6.5, runtime state will be created in new, version-specific containers. This means older state (processes, logs, etc.) will not be accessible.

Cleaning up the older state can be done using:

let cluster = Runtime.GetHandle({ config with Version = "0.6.4.0" })
cluster.Reset(reactivate = false) // also check the other optional arguments

When using this version override, only a limited set of APIs can be used, such as Reset, ShowWorkers, ShowProcesses, etc.


Initial post (ignore this): As of 0.6.5 workers/clients perform version checks, so running jobs from older runtimes will not be allowed.

Workaround 1: Use a Configuration.Id value different from the default (0). This value should be the same for both workers and clients.

Workaround 2:

  1. Stop workers.
  2. let runtime = Runtime.GetHandle(config, ignoreVersionCompatibility = true)
  3. runtime.Reset(): clears ALL state. Cleaning only the queues and tables should also work if you need to keep older CloudFiles etc. in the config.UserData folder, but do so at your own risk. (A consolidated sketch follows this list.)
  4. Start workers.
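
A consolidated sketch of steps 2-3 above, assuming an existing config value and that the workers have already been stopped out of band:

// WARNING: Reset() deletes ALL runtime state registered under this configuration.
let runtime = Runtime.GetHandle(config, ignoreVersionCompatibility = true)
runtime.Reset()
// Restart the workers once the reset has completed.
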
isaacabraham commented 9 years ago

Is it worth making this a single call e.g. Runtime.Upgrade(config)?


eiriktsarpalis commented 9 years ago

I think I would prefer it if runtime state were isolated by version number. Naming of runtime tables/queues could be prefixed by a version or even build number. In this way clusters of different versions could operate on the same accounts simultaneously. This would eliminate bugs occurring in extreme cases where multiple parties using different versions act on the same Azure accounts.

This can be easily achieved by changing the fields in the Azure.Configuration record. Perhaps these values should be specified in the Default instance.
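
For illustration only, a minimal sketch of how such a version/build tag could be derived and appended to resource names; the hard-coded version string and the scopedName helper are placeholders rather than actual MBrace.Azure API:

open System

// The tag would be read from the MBrace.Azure assembly version in practice.
let runtimeVersion = Version "0.6.5.0"
let versionTag = sprintf "v%d%d%d" runtimeVersion.Major runtimeVersion.Minor runtimeVersion.Build

// Hypothetical helper: suffix whatever table/queue names the Azure.Configuration
// record holds, e.g. "mbraceruntime" becomes "mbraceruntimev065", so 0.6.4 and
// 0.6.5 clusters never touch each other's state.
let scopedName (baseName : string) = baseName + versionTag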

isaacabraham commented 9 years ago

Isn't the way to have several versions operating in isolation just to change the Id field of the config, though?

krontogiannis commented 9 years ago

Yeah, using a different Configuration.Id is a nice (and easier) way to run different runtime versions against the same storage/Service Bus accounts.

krontogiannis commented 9 years ago

I think I prefer changing the configuration Id as a solution to this, at least for now while we are in prerelease mode. How about changing the default Configuration.Id from 0 to something like Minor*1000 + Major*100 + Build?
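
A worked sketch of that proposal, assuming Configuration.Id accepts a numeric value; the formula is quoted from this comment, so for 0.6.5 the default would change from 0 to 6005:

// The version would be read from the MBrace.Azure assembly in practice.
let v = System.Version "0.6.5.0"
let versionedId = v.Minor * 1000 + v.Major * 100 + v.Build   // 6*1000 + 0*100 + 5 = 6005
// This value would then replace 0 as the default Configuration.Id.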

isaacabraham commented 9 years ago

I'm a little uncomfortable with that for two reasons. Firstly it seems a bit of a smell to try to encode this stuff into a string which is designed for one thing and one thing only - isolating two clusters, irrespective of version. Second, are you suggesting that this is just some convention that will be adopted or that the config type itself will generate this value?

Also how will we isolate two clusters of the same version on the same storage account + service bus?

eiriktsarpalis commented 9 years ago

But a cluster is essentially its storage and Service Bus state. In that respect, it is not possible to have two clusters of the same version running on top of the same accounts.

I think that enforcing isolation in this way is sensible, however we need to be careful w.r.t. client code. If we implemented this naively, incompatible workers would be invisible to the client, leading to almost certain confusion.

isaacabraham commented 9 years ago

But it is, since the last release introduced multi-tenancy on top of storage accounts via the Id property.

With the cluster nodes themselves pushing back when there's a version mismatch, I think that should be enough, at least to start with. Wait for feedback; if it proves to be an issue, then change it.

eiriktsarpalis commented 9 years ago

This has already proven to be an issue. Current behaviour will cause provisioned workers to crash if registered against a configuration used by previous versions in the past, unless the user resets the state manually beforehand. This is both undesirable and all too common; it is bound to confuse users.

This is why I think that resources should be prefixed by version number, with this being non-configurable. Multi-tenancy could be achieved by optionally appending a user-specified prefix after the mandatory version number. I understand that there are restrictions on the sizes of paths, but I reckon we can work out something in this direction.

dsyme commented 9 years ago

Current behaviour will cause provisioned workers to crash ...

Will it be possible to achieve version-tolerant worker upgrade, so that, for example, at worst the jobs queued into the previous MBrace fail gracefully? And a reset and/or new storage is needed on major/minor upgrades 0.6 --> 0.7, but not 0.6.4 -> 0.6.5?

Likewise achieving version-tolerant client upgrade would be great, so you can submit jobs to 0.6.x workers using 0.6.x clients :)

On paths, I don't have strong opinions. Perhaps version-tolerance would mean you prefix 0.6 to the paths, but not 0.6.4

krontogiannis commented 9 years ago

Renaming this to Runtime.Upgrade or something similar will cause confusion, I think. Users might expect something like a migration from the old runtime state to the new one. At least calling it Reset is pretty honest about what it does: it deletes the current state.

dsyme commented 9 years ago

BTW, what error message is given on a version mismatch? Do you get a nice error suggesting you run "cluster.Reset()" and/or upgrade your client software?

eiriktsarpalis commented 9 years ago

@dsyme As it is, I do not think it is possible to have MBrace nodes interoperate across minor upgrades, or even different builds of the same version. This happens because of the nondeterministic assignment of numerical suffixes to closure types by the F# compiler. But even if this were not the case, I do not think that there can be reconciliation between closure serialization and version tolerance in principle.

Given the extremely sensitive nature of MBrace, we must be very careful as to how such incompatibilities are handled, since they are going to be common occurrences. I think that build-number isolation plus some augmented administration capabilities in the client API is a good direction to consider.

krontogiannis commented 9 years ago

A solution we are thinking of is to isolate runtime state in new, version-specific containers (as described in the edit at the top of this thread). This way, after 0.6.5 a new clean state will be created by default, without any errors on worker provisioning or GetHandle.

Sounds good?

/cc @dsyme @palladin @eiriktsarpalis @isaacabraham

isaacabraham commented 9 years ago

What will happen to the old state - is it up to the user to manually clear it up?

isaacabraham commented 9 years ago

I see the benefit of guaranteeing version isolation on a storage account. I'm just not sure users will expect this behaviour. Any user data will be lost etc., and every time you upgrade the old stuff will be left floating about. I would prefer to see an upgrade routine which just took the old storage account details and cleaned out anything that wasn't compatible. Or perhaps separate out the user data folder so it is not bound to a cluster.

dsyme commented 9 years ago

@eiriktsarpalis

As it is, I do not think it is possible to have MBrace nodes interoperate across minor upgrades, or even different builds of the same version. This happens because of the nondeterministic assignment of numerical suffixes to closure types by the F# compiler. But even if this were not the case, I do not think that there can be reconciliation between closure serialization and version tolerance in principle.

Hmmmm... I'm more optimistic we can get a strong story here, at least in the long run.

I'm sure that there is a reasonable mode of use of MBrace where process specifications (as opposed to cloud task specifications, which may include more closures) are easily specified (and serialized) against a purely public surface area of the FSharp.Core and MBrace DLLs (this surface area can be mapped forward to future versions of DLLs using binding redirects as normal). I'd even be willing to switch a flag to require this, so serialization of CreateProcess would fail-fast if private closures are serialized as part of the process specification.

Once that surface area settles (that will come naturally over time) then this issue becomes much easier to make progress on.

eiriktsarpalis commented 9 years ago

It goes without saying that a subset of the client API can be used throughout a reasonable range of versions of MBrace. However, I believe these will be restricted to administrative actions, like listing processes or resetting state.


dsyme commented 9 years ago

Can we go through some examples? For example, it seems reasonable to expect the serialized version of this to be version resilient?

let quickText = 
    cloud { return "Hello world!" } 
    |> cluster.Run

thanks

eiriktsarpalis commented 9 years ago

I think it is reasonable to assume that delayed computation expressions are not version resilient. They tend to encapsulate closure objects and are sensitive to both the build of MBrace.Core (which defines the builder) and builds of the code that consumes it. I'll run some experiments and come back with more concrete data on this.


eiriktsarpalis commented 9 years ago

@dsyme I ran the following from the MBrace demo solution

open Nessos.FsPickler
let types = FsPickler.GatherTypesInObjectGraph (cloud { return 42 }) // walk the object graph gathering type occurrences

which yields

val types : Type [] =
  [|MBrace.Cloud`1[System.Int32]; MBrace.Cloud`1[T]; System.Type[];
    System.Int32; MBrace.Builders+Delay@287[System.Int32];
    MBrace.Builders+Delay@287[T]; FSI_0006+types@65|]

As you can see, it encapsulates closure types both from the core assembly and from FSI. I would say this makes serialization extremely brittle w.r.t. version tolerance: even adding an empty line above the builder implementation would break it.

dsyme commented 9 years ago

Interesting.

The ones like types@65 from FSI are in user code (non-core assemblies), so they aren't a problem. In those cases the closure classes are necessarily part of the deliverable package and there's no versioning consideration.

In theory the other closure names like Delay@287 can be "stabilized" (by giving them names, e.g. by using non-function class types only ever implemented by public classes), though that's somewhat ugly. Another technique is to "lift the closure classes into user code" by inlining the implementation. However that's not always technically feasible (or may cause leaking of other implementation details by forcing them to become public).

I guess what I'm saying is that it's (clearly) not impossible to have an API whose use only results in objects whose serialization is version-stable w.r.t. future binary-compatible updates of the DLL: many simple APIs have this property.

But equally it is not easy or particularly well supported by F#, and may require special techniques (such as giving strong public names to closure implementations, which become part of the public API), or inlining (which can leak implementation representations).

Either way, a separate code analysis or a disciplined change in the implementation style of the core DLLs would certainly be needed.

It's an interesting problem and something I'll consider more. I think we all agree that a code serialization format that is robust w.r.t. version changes would be useful to achieve; it likely depends on the cost to the codebase and the potential explosion of the public surface area.

eiriktsarpalis commented 9 years ago

The problem is that most of the API the core library exposes makes use of builders. So in essence, we would have to desugar everything (lambda implementations and all). The same is true for MBrace.Flow or any third-party library that comes in multiple versions. This is bound to turn the libraries into unmaintainable spaghetti code.

I think the best way this could be partially fixed is to contribute improvements to closure naming in the F# compiler. If I remember correctly, compiler generated types are suffixed by line number and an additional incremental number in the case of conflicts. The incremental number is generated by a counter that is global to the given source file. These rules make naming extremely sensitive to superficial, peripheral changes to source code. What if we adopted a naming scheme that only reflects localized changes to code? We could for example eliminate the line number and use a counter that only spans the current type definition/method body.

dsyme commented 9 years ago

I know you know this, but adjusting F# is not really a great fix partly because it would take so much time to propagate - MBrace needs to be usable with VS2013 for a long time. And we may not get a fix for this into VS2015 at this point. Also the F# compiler still has a habit of inserting and removing closure implementations which may defeat this I suppose. (BTW if we added an F# language feature, perhaps it should be to allow explicit naming of closure implementations)

I went through MBrace.Core and looked at what it would take to stabilize the core closure implementations by changing the core type definitions like this:

type SuccessCont<'T> = StabilizeSuccessCont of (ExecutionContext -> 'T -> unit)
type ExceptionCont = StabilizeExceptionCont of (ExecutionContext -> ExceptionDispatchInfo -> unit)
type CancellationCont = StabilizeCancellationCont of (ExecutionContext -> OperationCanceledException -> unit)
type internal Body<'T> = StabilizeBody of (ExecutionContext -> Continuation<'T> -> unit)

Right now there are 32 closure implementations (there may be more in the rest of the library). All of these would need to become class definitions with stable names, implementing the objectization of the function types above:

type SuccessCont<'T> = 
    abstract Call : ExecutionContext -> 'T -> unit
type ExceptionCont = 
    abstract Call : ExecutionContext -> ExceptionDispatchInfo -> unit
type CancellationCont = 
    abstract Call : ExecutionContext -> OperationCanceledException -> unit
type internal Body<'T> = 
    abstract Call : ExecutionContext -> Continuation<'T> -> unit

So stabilizing these is certainly invasive but not impossible. Note that objectizing in this way can have some minor perf benefits (curried invokes can become a little faster) so it's not all bad.

The closure implementations could still be internal as long as people making modifications understand that they are effectively a logical part of the serialization format of the computations.
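
For illustration, a sketch of what one such stably-named implementation might look like against the interface form above; ResultCellSuccessCont and its TaskCompletionSource-backed result cell are hypothetical, not taken from MBrace.Core:

open System.Threading.Tasks

// Hypothetical example: a success continuation that completes a result cell,
// written as a named class (and therefore with a serialization-stable type name)
// instead of a compiler-generated lambda.
type internal ResultCellSuccessCont<'T>(resultCell : TaskCompletionSource<'T>) =
    interface SuccessCont<'T> with
        member __.Call ctx t = resultCell.TrySetResult t |> ignore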

eiriktsarpalis commented 9 years ago

The problem is that even if we went with concrete continuation implementations, compiler-generated types would still leak on account of computation expressions. Let me illustrate with an example:

type Cont<'T> =
    abstract RunWith : 'T -> unit

type CM<'T> =
    abstract Call : Cont<'T> -> unit

type Return<'T>(t : 'T) =
    interface CM<'T> with
        member __.Call c = c.RunWith t

type Bind<'T, 'S>(f : CM<'T>, g : 'T -> CM<'S>) =
    interface CM<'S> with
        member __.Call(c : Cont<'S>) =
            let bc = new BindCont<'T, 'S>(c, g)
            f.Call bc

and BindCont<'T, 'S>(c : Cont<'S>, g : 'T -> CM<'S>) =
    interface Cont<'T> with
        member __.RunWith(t : 'T) = let m = g t in m.Call c

type ContBuilder() =
    member __.Return(x : 'T) : CM<'T> = new Return<'T>(x) :> _
    member __.Bind(f : CM<'T>, g : 'T -> CM<'S>) : CM<'S> =
        new Bind<'T, 'S>(f, g) :> CM<'S>

let cont = new ContBuilder()

let workflow = cont {
    let! y = cont { return 1 + 1 }
    return y
}

Then as before we run

open Nessos.FsPickler
let types = FsPickler.GatherTypesInObjectGraph workflow

receiving

val types : System.Type [] =
  [|FSI_0020+Bind`2[System.Int32,System.Int32]; FSI_0020+Bind`2[T,S];
    System.Type[]; System.Int32; FSI_0028+workflow@91;
    FSI_0020+Return`1[System.Int32]; FSI_0020+Return`1[T]|]

The leak occurs because of the type signature of .Bind, which takes an F# function as argument. Computation expressions desugar let! bindings by passing a local compiler-generated lambda to the combinator. The same is true for the .Delay combinator. In order for this scheme to work, we would have to banish explicit use of computation expressions from all the core libraries.

dsyme commented 9 years ago

@eiriktsarpalis Note that if you add

type Delay<'T>(f: unit -> CM<'T>) =
  interface CM<'T> with member x.Call(cont) = f().Call(cont) 

...

member __.Delay(f : (unit -> CM<'T>)) : CM<'T> = new Delay<'T>(f) :> _

then you get

[|FSI_0017+Delay`1[System.Int32]; FSI_0019+workflow@90-4|]

which is progress since the only actual closure is associated with the workflow delay itself

dsyme commented 9 years ago

Here's a full sample that shows how you can strongly name the workflow closures for all cases where a public API function returns a workflow:

type Cont<'T> =
    abstract RunWith : 'T -> unit

type CM<'T> =
    abstract Call : Cont<'T> -> unit

let call contf (c:CM<'T>) = c.Call(contf)

type Return<'T>(t : 'T) =
    interface CM<'T> with
        member __.Call c = c.RunWith t

type Delay<'T>(f: unit -> CM<'T>) =
  interface CM<'T> with member x.Call(cont) = f().Call(cont) 

type Bind<'T, 'S>(f : CM<'T>, g : 'T -> CM<'S>) =
    interface CM<'S> with
        member __.Call(c : Cont<'S>) =
            let bc = new BindCont<'T, 'S>(c, g)
            f.Call bc

and BindCont<'T, 'S>(c : Cont<'S>, g : 'T -> CM<'S>) =
    interface Cont<'T> with
        member __.RunWith(t : 'T) = let m = g t in m.Call c

type ContBuilder() =
    member __.Delay(f : (unit -> CM<'T>)) : CM<'T> = new Delay<'T>(f) :> _
    member __.Return(x : 'T) : CM<'T> = new Return<'T>(x) :> _
    member __.Bind(f : CM<'T>, g : 'T -> CM<'S>) : CM<'S> =
        new Bind<'T, 'S>(f, g) :> CM<'S>

let cont = new ContBuilder()

[<AbstractClass>]
type APIWorkflowBase<'T>() = 
   interface CM<'T> with 
       member x.Call(contf) =
         x.Workflow |> call contf

   abstract Workflow : CM<'T>

type APIWorkflow1(someArg:int) = 
    inherit APIWorkflowBase<int>() 
    override x.Workflow =
        cont { 
            let! y = cont { return someArg + someArg }
            return y
        } 

let APIFunction1(someArg) = APIWorkflow1(someArg)

open Nessos.FsPickler
let types = FsPickler.GatherTypesInObjectGraph (APIFunction1(3))

giving

val types : Type [] = [|FSI_0021+APIWorkflow1; System.Int32|]

dsyme commented 9 years ago

(I've used an abstract base type just to make the job of writing a strongly-named workflow closure type a little more palatable)

dsyme commented 9 years ago

Note - we also have version dependencies on FSharp.Core, see https://github.com/mbraceproject/MBrace.Azure/issues/61#issuecomment-88615805 - the client and the server must match w.r.t. FSharp.Core

eiriktsarpalis commented 9 years ago

True, however this will require a significant refactoring effort on MBrace.Core and MBrace.Flow (and every other general-purpose library written on top of cloud workflows), which will possibly render them highly complicated, difficult to modify and virtually unreadable.

In the end, we might not need to take such drastic measures, given that the problem can be easily worked around. As you pointed out earlier, what defines almost every closure generated by cloud workflows is the delay lambda. Let me illustrate

open MBrace.Core
open MBrace.Flow
open Nessos.FsPickler

let getInternalClosureTypes(graph:'T) =
    FsPickler.GatherTypesInObjectGraph graph
    |> Array.filter (fun t -> t.Name.Contains "@")
    |> Array.map (fun t -> t.Name, t.Assembly.GetName().Name)

Suppose now we define the expression

let workflow = [|1 .. 100|] |> CloudFlow.OfArray |> CloudFlow.length

Passing this to getInternalClosureTypes yields

val it : (string * string) [] =
  [|("Delay@289", "MBrace.Core"); ("foldGen@50", "MBrace.Flow");
    ("length@451-4", "MBrace.Flow"); ("length@451-1", "MBrace.Flow");
    ("length@451-5", "MBrace.Flow"); ("length@451-2", "MBrace.Flow");
    ("ToCloudFlow@20", "MBrace.Flow"); ("collectorf@30-4", "MBrace.Flow");
    ("length@451-3", "MBrace.Flow"); ("length@451", "MBrace.Flow");
    ("zero@55", "MBrace.Core")|]

which is pretty much the definition of a closure that is not version tolerant. However, if we perform this seemingly trivial transformation of the workflow on the fsi side

cloud { return! workflow }

The closure dependencies are immediately reduced to

val it : (string * string) [] =
  [|("Delay@289", "MBrace.Core"); ("it@81-1", "FSI-ASSEMBLY")|]

So solving the version tolerance problem essentially boils down to giving an explicit closure implementation for just the delay combinator. This would then work seamlessly, provided that MBrace.Core and FSharp.Core binding redirects are set properly.

Provided that this is done, we could preempt any "internals leak" by making checks on the client side:

let tryFindLeakedLibraryClosure(wf:Cloud<'T>) = 
    FsPickler.GatherTypesInObjectGraph wf
    |> Array.tryFind(fun t -> t.Name.Contains "@" && not <| t.Assembly.IsDynamic)

type MBraceThespian with
    member runtime.Run'(wf:Cloud<'T>) =
        match tryFindLeakedLibraryClosure wf with
        | Some clo -> invalidArg "wf" <| sprintf "Workflow contains leaked internal closure type '%s' from assembly '%O'.\n Please wrap workflow in a cloud { ... } expression." clo.Name clo.Assembly
        | None -> runtime.Run wf
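
A hypothetical usage note for the Run' guard above, reusing the CloudFlow example from earlier in the thread (runtime is assumed to be an instance exposing the extension):

let flow = [|1 .. 100|] |> CloudFlow.OfArray |> CloudFlow.length

// Rejected: the raw flow drags in internal MBrace.Flow closure types (length@451 and friends).
runtime.Run'(flow)

// Accepted once MBrace.Core's Delay has a stable name (see the follow-up below):
// only the dynamic FSI-generated closure remains in the object graph.
runtime.Run'(cloud { return! flow })
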
eiriktsarpalis commented 9 years ago

I just pushed a few changes to MBrace.Core (https://github.com/mbraceproject/MBrace.Core/commit/ba564dae2c2119a6394f9e0f7ea5e8ab99c29ade) that provide an explicit delay lambda implementation.

dsyme commented 9 years ago

This is good progress.

(There are presumably still numerous delay closures - the functions passed to Delay - in the MBrace.Core and MBrace.Azure library which can be given stable names when we decide that's necessary?)

eiriktsarpalis commented 9 years ago

Yes, pretty much most of the public API. BTW, I should point out that these remedies only affect version discrepancies between client and cluster. In the unfortunate case of having worker instances of different FSharp/MBrace.Core builds working together, we would have to start worrying about continuation lambda naming too.

Thinking out loud here, perhaps one possible solution would be to build MBrace releases with a modified F# compiler, one that generates internal type names differently.

dsyme commented 9 years ago

Closing this old discussion thread; we can reopen it when we need to discuss this again.