snowleopard / build

Build Systems à la Carte

Building tasks with dynamic outputs using the restarting scheduler #19

Open · snowleopard opened 5 years ago

snowleopard commented 5 years ago

This is an issue to discuss how the restarting scheduler can be used to build tasks with dynamic inputs and outputs, as opposed to tasks with dynamic inputs that are covered by the Build Systems à la Carte paper.

I'll start by sketching a proof that the restarting scheduler works for tasks with dynamic outputs. First of all, we need to assume that all build tasks are finite, i.e. that they terminate and have a finite number of input dependencies, which in turn guarantees that the restarting scheduler terminates. Why? Because every iteration makes progress: it either removes a task from the working queue or unblocks one of the blocked tasks, bringing the latter one step closer to completion.

Let's run the restarting algorithm with a working queue containing all build tasks.

When it terminates, we have one of two cases:

  1. All tasks have completed, and the target key has been built.
  2. The target key has not been built; let T denote the set of tasks that remain blocked.

Now we can argue that in the second case the target key cannot be built for one of two reasons:

  1. There is a dependency cycle reachable from it.
  2. The target, or one of the keys it transitively depends on, is not an input and has no task to build it.

Let k denote the target key. All tasks that are not in T have completed and did not produce k (the build failed), hence all tasks that could possibly build k must be in T. Let t denote one such task (if there is no such t, then we are in case (2) above). Since t ∈ T, it is blocked on some key b, and we can repeat the argument with b as the target key: eventually we either hit case (2), or circle back to a key we have already examined (since T is finite), which indicates case (1).

The proof is non-constructive in the sense that we don't know which t could actually produce k and hence lead to a cycle. All we can say is that either such t does not exist (2), or it does exist but will inevitably either lead to a cycle (1) or hit a dead end (2).
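
To make the progress argument concrete, here is a minimal sketch of the restarting loop (not the actual implementation in this repository). I assume a hypothetical function attempt that either completes a task, yielding the keys it produced, or reports that the task is blocked. Each round over the queue either builds at least one new key or builds none, in which case the remaining blocked tasks can never unblock, so the loop terminates, and what is left over is exactly the blocked set T from the argument above.

import qualified Data.Set as Set

-- Outcome of attempting a task, given the keys built so far.
data Attempt k = Completed [k] | Blocked

restartingLoop :: Ord k
               => (Set.Set k -> t -> Attempt k) -- hypothetical: run one task
               -> Set.Set k                     -- keys built so far
               -> [t]                           -- working queue
               -> (Set.Set k, [t])              -- (built keys, blocked set T)
restartingLoop attempt = go
  where
    go built queue
        | Set.size built' > Set.size built = go built' blocked  -- progress: restart
        | otherwise                        = (built', blocked)  -- fixed point: T = blocked
      where
        (built', blocked) = foldl step (built, []) queue
        step (done, bs) task = case attempt done task of
            Completed ks -> (foldr Set.insert done ks, bs) -- task done, record its keys
            Blocked      -> (done, bs ++ [task])           -- requeue for the next round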

snowleopard commented 5 years ago

A note about so-called forward-defined build systems, like Fabricate:

If you have a forward-defined build system, it means you have a total order on build tasks, which automatically prevents cyclic dependencies. Furthermore, it means that you never need to block a task, because by the time you reach it in the working queue, all of its dependencies must have been either skipped or rebuilt, so you can decide on the spot whether to rebuild it. In this way, build systems like Fabricate can be thought of as a special (trivial, really) case where the scheduler is just const [1..], i.e. it runs the tasks in the specified order, and we only need a rebuilder!
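
As a toy illustration (a sketch, not Fabricate's actual API): with a total order on tasks, the "scheduler" degenerates into running the list front to back, and the only remaining decision belongs to the rebuilder, here a hypothetical needsRebuild predicate.

-- Forward-defined building: run tasks in the given order; by the time
-- a task is reached, all its dependencies are already up to date, so
-- we only consult a rebuilder and never block.
forwardBuild :: Monad m => (t -> m Bool) -> (t -> m ()) -> [t] -> m ()
forwardBuild needsRebuild run = mapM_ $ \task -> do
    dirty <- needsRebuild task
    if dirty then run task else pure () -- skip up-to-date tasks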

ndmitchell commented 5 years ago

Interesting, my simpler and less formal way of saying it is:

The interesting thing here is that it assumes a finite set of rules. Note that Shake doesn't have a finite set of rules as something like a rule producing *.o is really an infinite set of rules. However, if you treat *.o from *.c as a single rule, somehow "figuring out" the universe of possible rules from the set of produced values, perhaps it becomes feasible in a more practical setting.

ndmitchell commented 5 years ago

Agreed on the Fabricate remark. It feels like the degenerate case of this example, but it does capture the essence to some degree, which all our previous models failed to do. Maybe the powerful thing about Fabricate is actually that the rules are "finite", or more precisely calculated from the set of produced files?

snowleopard commented 5 years ago

The interesting thing here is that it assumes a finite set of rules. Note that Shake doesn't have a finite set of rules as something like a rule producing *.o is really an infinite set of rules.

@ndmitchell Agreed, this is an interesting aspect that I'm not sure how to deal with yet. In our current model, the map k -> Maybe (Task c k v) can be used to represent an infinite set of rules, like in Shake, but if we move towards a list of tasks whose outputs are not known statically, it becomes unclear how we can express a "template rule" for compiling any *.c file into *.o. Perhaps we could limit such tasks to being Applicative-only with respect to writes? It seems that combining template rules and dynamic outputs is not going to work.
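
For comparison, here is a sketch of how the current model's k -> Maybe (Task c k v) encodes an infinite rule set. Task0 is the paper's original single-output task shape (dynamic reads, one statically known output), renamed to avoid clashing with the read/write Task used in this thread; it is precisely the write side that has no obvious analogue here.

{-# LANGUAGE ConstraintKinds, RankNTypes #-}

import Data.List (isSuffixOf)

-- The paper's original task shape: dynamic reads, one output.
type Task0 c k v = forall f. c f => (k -> f v) -> f v

-- An infinite family of rules, Shake-style: every key ending in ".o"
-- has a task; all other keys are treated as inputs.
templateRules :: FilePath -> Maybe (Task0 Monad FilePath String)
templateRules key
    | ".o" `isSuffixOf` key = Just $ \fetch -> do
        let cFile = take (length key - 2) key ++ ".c"
        src <- fetch cFile        -- dynamic input: the matching .c file
        pure (compile src)
    | otherwise = Nothing         -- not buildable: treated as an input
  where
    compile = id -- insert a C compiler here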

Maybe the powerful thing about Fabricate is actually that the rules are "finite", or more precisely calculated from the set of produced files?

Indeed, Fabricate seems to be different from other build systems in that all build rules are "singular" (i.e. not "template") and are given exactly as a finite list (if we assume one can't write an infinite Fabricate script with some kind of recursion).

ndmitchell commented 5 years ago

Even if they were applicative, the fact that you have an infinite number of them still seems problematic. And if they are finite, you don't need the applicative.

You could write an infinite Fabricate script, but assuming it terminates, it will only be able to go down one path. It's a weird kind of finite, but definitely related.

snowleopard commented 5 years ago

Here is an example of how one could go about compiling a collection of files with read/write tasks:

{-# LANGUAGE ConstraintKinds, GADTs, RankNTypes #-}

import Data.Functor (void)

type Get k f = forall a. k a -> f a
type Put k f = forall a. k a -> f a -> f a

-- A task reads keys via Get and writes keys via Put, in any f satisfying c
type Task c k a = forall f. c f => Get k f -> Put k f -> f a

data Key a where
    Dir  :: FilePath -> Key [FilePath] -- a directory listing
    File :: FilePath -> Key String     -- a file's contents

compileAllCFiles :: Task Monad Key ()
compileAllCFiles get put = do
    files <- get (Dir "src/c/")
    srcs  <- traverse (get . File) files
    let objs = [ (File (file ++ ".o"), compileC src)
               | (file, src) <- zip files srcs ]
    void $ traverse (uncurry put) objs
  where
    compileC = pure -- insert a C compiler here

An important aspect here is that traverse requires Applicative f, which means that if we use a Haxl-like approach to inspecting computation trees, we get independent dependency tracking for each source/object pair: if one changes foo.c, only the file foo.o will be rebuilt.
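
To make "inspecting computation trees" concrete: in the spirit of the paper's dependencies function, running a task in the Const functor records its reads and writes without doing any work. This only works for the Applicative fragments (Const is not a Monad), which is exactly why the traverse above is the inspectable part. A sketch, with keyName as a hypothetical helper:

import Data.Functor.Const (Const (..))

-- Hypothetical helper: the file path underlying a key.
keyName :: Key a -> FilePath
keyName (Dir  p) = p
keyName (File p) = p

-- Run an Applicative task in Const to collect (reads, writes)
-- statically, without executing the compiler.
readsWrites :: Task Applicative Key a -> ([FilePath], [FilePath])
readsWrites task = getConst $ task
    (\k -> Const ([keyName k], []))                        -- record a read
    (\k (Const (rs, ws)) -> Const (rs, ws ++ [keyName k])) -- record a write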

Note also that compileC has type Monad f => FilePath -> f String, i.e. it is free to introduce intermediate dependencies on its own (for example, on *.d files with dynamic #include dependencies). This looks nicely compositional.

We could put this compileAllCFiles task into the list of all tasks and keep trying to run it. As soon as source files are available (i.e. some of them may be generated), it will succeed.

snowleopard commented 5 years ago

To elaborate the above example a bit more:

type Get k f = forall a. k a -> f a
type Put k f = forall a. k a -> f a -> f a

type Task c k a = forall f. c f => Get k f -> Put k f -> f a

data Key a where
    Dir  :: FilePath -> Key [FilePath]
    File :: FilePath -> Key String

compileAllCFiles :: Task Monad Key ()
compileAllCFiles get put = do
    srcs <- get (Dir "src/c/")
    void $ traverse (\src -> compileC src get put) srcs -- independent/parallel

compileC :: FilePath -> Task Monad Key ()
compileC cFile get put = do
    let objFile = cFile ++ ".o"
    src  <- get (File cFile)
    deps <- traverse (get . File) (cDependencies src)
    void $ put (File objFile) (pure $ compile src deps)
  where
    cDependencies _src = []  -- insert dependency analysis here
    compile src _deps  = src -- insert a C compiler here

ndmitchell commented 5 years ago

So the claim is that if the final step in a monadic dependency chain is an Applicative, we can separate it and do partial recomputation? I'm not convinced that's true. Imagine we did a traverse with an index, so compiled files could see if they were the first/last file in the directory. Now you have dependencies that aren't fine-grained. There is some level of isolation, but it's a lot more subtle.

What if you keep running compileAllCFiles and it keeps doing different things, e.g. adding a single .c file makes all outputs change? Where are we going to find a fixed point?

snowleopard commented 5 years ago

So the claim is that if the final step in a monadic dependency chain is an Applicative we can separate it and do partial recomputation?

Yes!

I'm not convinced that's true. Imagine we did a traverse with an index, so compiled files could see if they were the first/last file in the directory.

Not sure what exactly you mean. Something like this?

data Key a where
    Dir  :: FilePath -> Key [(FilePath, Int)] -- We need to depend on index
    File :: FilePath -> Key String

compileAllCFiles :: Task Monad Key ()
compileAllCFiles get put = do
    srcs <- get (Dir "src/c/")
    void $ traverse (\src -> compileC src get put) srcs -- independent/parallel

compileC :: (FilePath, Int) -> Task Monad Key ()
compileC (cFile, index) get put = do
    let objFile = cFile ++ ".o"
    src  <- get (File cFile)
    deps <- traverse (get . File) (cDependencies src)
    void $ put (File objFile) (pure $ compile src index deps)
  where
    cDependencies _src        = []  -- insert dependency analysis here
    compile src _index _deps  = src -- insert a C compiler here

This doesn't seem to change anything. If this is not what you meant, could you give an example?

What if you keep running the compileAllCFiles and it keeps doing different things? e.g. adding a single .c file makes all outputs change?

I think in this case the corresponding Dir key is not an input anymore, so the compileAllCFiles task will be aborted because one of its dependencies is not yet ready. There will be some task that actually writes this key (when all the .c files are finally in place), which will let compileAllCFiles finally succeed.
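
For instance, here is a hedged sketch of such a task, using the original (unindexed) Key and Task types from the earlier snippet: a generator that writes a .c file and only then writes the Dir listing, at which point compileAllCFiles can run to completion.

-- Writes a generated source file, then publishes the directory
-- listing; until this runs, Dir "src/c/" is unavailable and
-- compileAllCFiles stays blocked.
generateSources :: Task Monad Key ()
generateSources _get put = do
    void $ put (File "src/c/gen.c") (pure "int main() { return 0; }")
    void $ put (Dir "src/c/") (pure ["src/c/gen.c", "src/c/main.c"])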