polydawn / repeatr

Repeatr: Reproducible, hermetic Computation. Provision containers from Content-Addressable snapshots; run using familiar containers (e.g. runc); store outputs in Content-Addressable form too! JSON API; connect your own pipelines! (Or, use github.com/polydawn/stellar for pipelines!)
https://repeatr.io
Apache License 2.0
68 stars · 5 forks

Create a repeatr formula transmat #64

Closed timthelion closed 7 years ago

timthelion commented 9 years ago

It would be nice to be able to chain formulas like:

inputs:
    "/":
        type: "formula"
        hash: "lzcqJKln2_H4TIoizNBCr0qoh8u_Nb_LRwARTZL2RumfbChX031pVl46dcSCG4q3"
        silo: "file://./debian-base.frm"

Questions are:

warpfork commented 8 years ago

I have an update on this! :D Specifying which output to use has pretty solid answers now -- a set of changes just landed so that when repeatr run finishes a task, it prints out the same formula structure it was given, plus hashes on the outputs. The outputs stay in the same map and keep the same labels you gave them on the way in, so you can parse the result as JSON and look them up by name.

So, we can build this kind of chaining on top of this API. Say you have a formula like this...

inputs: [...]
action: [...]
outputs:
  "myproduct":
    type: "tar"
    mount: "/task/build/bin/"
    [...]
  "logs":
    type: "tar"
    [...]

When that's run, we'll get json output where the "outputs" map contains "myproduct" and "logs". Here's an example of yanking out the hashes from the result (using jq for handy access to json from a shell script):

repeatr run -i the_formula.frm | jq .outputs.myproduct.hash

You can then template that hash into another formula. (I'm doing this with bash and jq right now, which is why the example is so handy :+1: )
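As a minimal sketch of that extract-and-template step in Python instead of jq (the JSON here is a hand-written stand-in for real repeatr run output, and the hash strings are invented placeholders):

```python
import json

# Hypothetical stage-3 result as printed by `repeatr run` -- the outputs map
# keeps the labels from the input formula, with a hash filled in per output.
# (The hash values here are made up for illustration.)
result_json = """
{
  "outputs": {
    "myproduct": {"type": "tar", "mount": "/task/build/bin/", "hash": "aGVsbG8"},
    "logs":      {"type": "tar", "hash": "bG9ncw"}
  }
}
"""

result = json.loads(result_json)
product_hash = result["outputs"]["myproduct"]["hash"]

# Template that hash into the inputs of a follow-on formula.
next_formula = {
    "inputs": {
        "/task/input/": {"type": "tar", "hash": product_hash},
    },
}
print(json.dumps(next_formula, indent=2))
```

In a shell pipeline this is the same move as the jq one-liner above: pull `.outputs.<label>.hash` out of the result, then write it into the next formula's inputs.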


Repeatr may add more direct support for formula chaining in the future; for now, here's where the thinking is...


    Stage 1:                           Stage 2:                          Stage 3:
  just outlines                     concrete plan                  everything computed

------------------                ------------------                ------------------

                     \                                 \                              
                      \                                 \                             
[                ]     \          [                ]     \          [                ]
[     update     ]      \         [     pinned     ]      \         [     pinned     ]
[    trackers    ]       \        [     inputs     ]       \        [     inputs     ]
[                ]        \       [                ]        \       [                ]
                           \                                 \                        
[                ]          \     [                ]          \     [                ]
[     action     ]           \    [     action     ]           \    [     action     ]
[                ]           /    [                ]           /    [                ]
[                ]          /     [                ]          /     [                ]
                           /                                 /                        
[                ]        /       [                ]        /       [                ]
[     output     ]       /        [     output     ]       /        [     pinned     ]
[  capture plan  ]      /         [  capture plan  ]      /         [     outputs    ]
[                ]     /          [                ]     /          [                ]
                      /                                 /                             
                     /                                 /                              

------------------                ------------------                ------------------

    x + y = z                         1 + 2 = z                         1 + 2 = 3

No fixed constants yet.       All the inputs are resolved.     When we actually execute,
Describes a thing to do,      This isn't as general a plan     *now* we get concrete
but isn't repeatable --       anymore -- the previous form     answers.  We can feed this
it's just a bunch of plans.   is better to describe updates    (checkable!) result forward
                              -- but it *is* very precise.     into new formulas.

Formulas can be filled in to various degrees. (These stages should maybe get some clearer, more exciting names later!) At first there's just a rough sketch -- a prototype for the real deal. Then there's filling in the values. Then there's running the whole process and seeing the results.

The Stage3 in this diagram is what you get from repeatr run when it's done. You can see how those can be uniquely identified, and are basically immutable data -- they're like git commits, everything's pinned. So, to point to a Stage3 formula is exactly as flexible as pointing directly to a data hash (i.e. not at all).

The Stage2 in this diagram is what repeatr talks about most of the time: ready to run. What's interesting here is that IF you have a deterministic process, the same story on flexibility goes for Stage2: 1 + 2 = _ is still obviously 3, right? So the only time pointing to a Stage2 formula differs from pointing directly at a data hash is when the process isn't deterministic (and also, it's slower, because you have to run it to find out). We need a system for building pipelines of work that accounts for this -- pragmatically speaking, nondeterminism is absolutely important to plan for, and it happens all the time in legacy stuff... but we can't forget to do right by the deterministic case either! Which means pointing to some other formula with all its input hashes filled in doesn't always describe a useful thing.

So it seems like we need some new kind of references, so we can build a Stage1 to be really excellently useful and help us account for intentional updates to deterministic processes. I think when we talk about chaining results of one formula into another, that's conceptually the same thing as the Stage1 in this picture. If Stage2/3 are like git commits, the input pickers in Stage1 need to point to something like git branches -- they're effectively mutable, and we have to go look up a moving target to fully make sense of them.

Scaling this, and doing it securely when builds are distributed among multiple folks, is another part of the design we'll want to get right in the long run. (e.g. Gentoo is amazing, but we want to encourage and enable building from source, not force everyone to melt their CPU rebuilding the world every day.) So we may want to build some way of referencing updatable inputs which looks a lot like TUF.

And there's probably going to be more than one way to do this. TUF to track major open collaborations makes a lot of sense; but so does treating git branches themselves as a form of input we can hoover up; and working to string together stages in a local process probably doesn't need the overhead of either of those. A plugin system here to make chaining builds easy to use for all those situations seems likely.

So this is coming, but the design isn't cut in stone and will likely face several iterations. In the meanwhile, it's totally reasonable to template formulas in a script.


This isn't a direct part of an answer to your question, but it's also worth mentioning some other planned tooling that could help when building a bigger picture:

We can keep formulas and search back through them later: e.g. a repeatr explain [hash] command will look up the (stage3) formulas that produced that thing and return them; repeatr explain [formula] can go through all the inputs and do the same thing, drawing a whole graph if desired! Formulas can pretty much be seen as fitting into a relational database where the inputs and outputs are keys we can use to index into our records of what formulas we've run.
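A toy sketch of that relational indexing idea (this is not repeatr's actual storage, just an illustration of the lookup direction an explain-style command implies; record shapes and hashes are invented):

```python
# Toy records of stage-3 formula runs: each maps input hashes to output hashes.
records = [
    {"id": "frm-1", "inputs": {"/": "hashA"}, "outputs": {"myproduct": "hashB"}},
    {"id": "frm-2", "inputs": {"/": "hashB"}, "outputs": {"myproduct": "hashC"}},
]

# Index each output hash back to the formula(s) that produced it, so an
# explain-style lookup is just a dict get.
produced_by = {}
for rec in records:
    for out_hash in rec["outputs"].values():
        produced_by.setdefault(out_hash, []).append(rec)

def explain(data_hash):
    """Return the formulas whose outputs include this hash."""
    return produced_by.get(data_hash, [])

# Walk back from a final artifact to the whole chain that built it.
chain = []
frontier = ["hashC"]
while frontier:
    h = frontier.pop()
    for rec in explain(h):
        chain.append(rec["id"])
        frontier.extend(rec["inputs"].values())

print(chain)  # frm-2 first, then frm-1 (reached via frm-2's input hashB)
```

The same walk in the other direction (inputs to outputs) is what lets a whole dependency graph be drawn from nothing but stored formulas.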

And this works regardless of what kind of third party tooling drove the original work. The explanations part is still always portable. Which means doing construction of bigger pipelines by manipulating formulas directly is always going to be supported -- we can always make it "make sense" when integrating with other stuff.


tl;dr We'll keep working on better tools for this, but for now, scripting brutally simple things around JSON is an excellent answer, and it's 100% planned that this will always fit into the ecosystem in a unix-y, composable, interchangeable way.

Hope this is helpful!

timthelion commented 8 years ago

A formula has 4 things that can be filled in:

(Input hash)(Input data) → (Output data)(Output hash)

Key:

? = hash unknown
# = hash known
_ = data not yet acquired
D = data present

A human writes a formula, entering the required URIs. But the data has yet to be acquired, and the human does not know its hash.

?_ → _?

The the human then issues a command to fill in the first half of the formula.

$ repeatr scan-inputs formula.frm

Repeatr downloads the input data.

?D → _?

Then, repeatr computes a hash for that data.

#D → _?

Now we can run the formula.

$ repeatr run formula.frm

Repeatr runs the formula, thus acquiring the output data.

#D → D?

Now repeatr hashes the output data.

#D → D#

The human now has a complete formula and can send it to a friend.

The friend then has a formula that looks like this:

#_ → _#
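The progression above can be sketched as a tiny state model (the slot names and placeholder values here are mine, not repeatr's, and `scan-inputs` is the proposed command, not an existing one):

```python
# Model a formula as four slots, matching the "?_ -> _?" notation:
# (input hash)(input data) -> (output data)(output hash).
formula = {"in_hash": None, "in_data": None, "out_data": None, "out_hash": None}

def show(f):
    h = lambda x: "#" if x else "?"   # hash known / unknown
    d = lambda x: "D" if x else "_"   # data present / not yet acquired
    return f"{h(f['in_hash'])}{d(f['in_data'])} -> {d(f['out_data'])}{h(f['out_hash'])}"

states = [show(formula)]                  # human wrote URIs only
formula["in_data"] = "input bytes"        # `repeatr scan-inputs` fetches data...
states.append(show(formula))
formula["in_hash"] = "input-hash"         # ...then hashes it
states.append(show(formula))
formula["out_data"] = "build artifacts"   # `repeatr run` produces output data...
states.append(show(formula))
formula["out_hash"] = "output-hash"       # ...then hashes it
states.append(show(formula))

print("\n".join(states))
# ?_ -> _?
# ?D -> _?
# #D -> _?
# #D -> D?
# #D -> D#
```

The friend's view (#_ → _#) is then just this final state with the data slots emptied out again: the hashes travel, the bytes don't.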

It seems to me that at the data acquisition phase, it doesn't matter how repeatr acquires that data. Indeed, perhaps repeatr shouldn't have a set of built-in methods for doing so! Maybe repeatr should always dole out that task to some other program...

But it seems to me that, in the data acquisition phase, there is no real difference between downloading data, copying data, or generating that data: perhaps by running a repeatr formula.

Indeed, it occurs to me that it would be possible to remove all transmats but tar and ONLY allow chaining. The AWS transmat would then become a (non-pure, internet-connected) formula that would be run and would output the downloaded content... But now that I think about it a bit, I would not be in favor of such an architecture, because I am still hoping for some nice CAS deduplication that would require far more shared caching than such a system could possibly allow.

That said, your proposed chaining method will work for subuser without problems. Subuser honestly doesn't care whether it's built in, since your proposed method is trivial to implement.

timthelion commented 7 years ago

Closing this because it's irrelevant given the current reppl work.

warpfork commented 7 years ago

:+1: