ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License

Option to build longest dependency chain first (depth first mode?) #10435

Open Janno opened 6 months ago

Janno commented 6 months ago

Desired Behavior

It seems dune currently schedules the easiest-to-reach files first. I would like a depth-first mode that assumes the longest dependency chain is also the costliest one, and therefore the first one to work on.


I work on a Coq project in which most proofs usually have dozens or hundreds of "flat" dependencies (files that can be built completely independently) and one very long, almost linear chain of more interesting dependencies. When building these proofs, dune spends a lot of time building the trivial files and only then starts working on the long chain. In practice this can double the time it takes to get to the file I actually care about, since all the smaller files could have been built on idle cores while the long chain was running.

rgrinberg commented 6 months ago

From a first glance, it sounds like a reasonable request. Though I have no idea what a potential feature that addresses this problem might even look like. cc @snowleopard

snowleopard commented 6 months ago

Implementation-wise, switching to a depth-first search ordering seems easy: replace the queue of jobs with a stack of jobs. However, it's unclear to me that this will actually speed up this particular scenario.

@Janno Could you give this implementation a try and check if it speeds things up for you?

Janno commented 6 months ago

I can probably give it a try but I will need some guidance. So far I have only found Scheduler.Event.Queue but that doesn't look like a job queue to me. Is it the right thing to change?

snowleopard commented 6 months ago

For a prototype, it's probably easiest to replace this queue with a stack:

https://github.com/ocaml/dune/blob/b074bbef23ba2016650c9cfff0a17948b8da64a3/vendor/fiber/src/throttle.ml#L5-L9
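To make the intended behavioural difference concrete, here is a standalone OCaml sketch (not dune's actual throttle code): `Stdlib.Queue` hands jobs back in discovery order, while `Stdlib.Stack` hands back the most recently discovered job first, which approximates a depth-first walk of the dependency graph. The job names are made up for illustration.

```ocaml
(* Standalone sketch, not dune code: FIFO vs LIFO job selection. *)
let () =
  let discovered = [ "aaa1"; "aaa2"; "nested0"; "nested1" ] in

  (* FIFO (current behaviour): jobs run in the order they were found. *)
  let q = Queue.create () in
  List.iter (fun j -> Queue.add j q) discovered;
  print_string "queue: ";
  Queue.iter (Printf.printf "%s ") q;
  print_newline ();

  (* LIFO (the experiment): the most recently found job runs first. *)
  let s = Stack.create () in
  List.iter (fun j -> Stack.push j s) discovered;
  print_string "stack: ";
  Stack.iter (Printf.printf "%s ") s;
  print_newline ()
```

Since `Queue.add`/`Queue.iter` and `Stack.push`/`Stack.iter` have matching shapes, the mechanical part of the swap should be small; the interesting part is whether it changes build times.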

Janno commented 6 months ago

Thanks for the pointer! It took me a while to set everything up but I can report some results now. The first result is that I have no idea what I am doing, though I suspected as much going in, so treat the rest with a big grain of salt.

My test setup is dunetest.tar.gz: 9 slow files (aaa1.v through aaa9.v) with no dependencies that take about 5s each; 10 nested files that take about 1s each (nested0.v has no dependencies, and each of nested1.v through nested9.v depends on the previous one); and all.v, which imports all of it.

Without exchanging the queue for a stack, this project builds the same way regardless of how many jobs run in parallel: dune builds the aaa*.v files first, then nested0.v and the rest of the nested files once it runs out of aaa files to fill the slots. Here's the output I get:

Output using `Queue.t` (dune master) and `-j2`:

```
coqdep test/.dunetest.theory.d
coqc test/aaa1.{glob,vo}
coqc test/aaa2.{glob,vo}
coqc test/aaa3.{glob,vo}
coqc test/aaa4.{glob,vo}
coqc test/aaa6.{glob,vo}
coqc test/aaa5.{glob,vo}
coqc test/aaa8.{glob,vo}
coqc test/aaa7.{glob,vo}
coqc test/nested0.{glob,vo}
coqc test/aaa9.{glob,vo}
coqc test/nested1.{glob,vo}
coqc test/nested2.{glob,vo}
coqc test/nested3.{glob,vo}
coqc test/nested4.{glob,vo}
coqc test/nested5.{glob,vo}
coqc test/nested6.{glob,vo}
coqc test/nested7.{glob,vo}
coqc test/nested8.{glob,vo}
coqc test/nested9.{glob,vo}
coqc test/all.{glob,vo}
```

With the stack, things look a little better, but only marginally so: dune builds as many aaa files as there are job slots, then picks up one nested file (plus enough aaa files to fill the remaining slots), then another full round of aaa files, and the pattern repeats.

Output using `Stack.t` with `-j2`:

```
coqdep test/.dunetest.theory.d
coqc test/aaa2.{glob,vo}
coqc test/aaa1.{glob,vo}
coqc test/nested0.{glob,vo}
coqc test/aaa9.{glob,vo}
coqc test/aaa8.{glob,vo}
coqc test/nested1.{glob,vo}
coqc test/aaa7.{glob,vo}
coqc test/aaa6.{glob,vo}
coqc test/nested2.{glob,vo}
coqc test/aaa5.{glob,vo}
coqc test/aaa4.{glob,vo}
coqc test/nested3.{glob,vo}
coqc test/aaa3.{glob,vo}
coqc test/nested4.{glob,vo}
coqc test/nested5.{glob,vo}
coqc test/nested6.{glob,vo}
coqc test/nested7.{glob,vo}
coqc test/nested8.{glob,vo}
coqc test/nested9.{glob,vo}
coqc test/all.{glob,vo}
```

This might look a whole lot better, since only half the nested files are built in a wasteful way. Unfortunately the effect diminishes with greater parallelism: the gaps between nested files are exactly as large as the number of job slots, and with -j9 we are back to the original trace.

This change could still help a bit in practice, since the number of aaa files in my work tends to exceed the number of cores I have, so I should end up building at least some of the nested files earlier than I would otherwise. But it would still be much better if no aaa file were favored over a nested file before we reach nested9.v (which has the same number of reverse dependencies as the aaa files).

snowleopard commented 6 months ago

@Janno Cool :) How about trying random ordering instead? I'd expect it to do best on average. (Resource management is tricky and doing it optimally is hard. But randomness can often find a way!)

Janno commented 6 months ago

I don't see how random ordering would be a good fit for this particular use case. The number of flat but slow files always outweighs the number of nested files that are ready to be worked on. I fear that if we select jobs randomly we will very likely delay the important files. At least with the stack we end up making guaranteed(?) gradual progress.

snowleopard commented 5 months ago

I wonder if one could formulate some simple build ordering strategy that's guaranteed to be within a constant factor of the optimal one. (Also, do we have a definition of the optimal one?)

Janno commented 5 months ago

Okay, I am 30 papers deep into a rabbit hole that's unfortunately not at all related to my field of research. So my understanding is still pretty limited. I'll try to summarize what I've learned.

Finding a makespan-optimal schedule (i.e. a schedule that minimizes the latest completion time of any job, a timestamp rather than a duration) for a set of tasks with dependencies ("precedences"), even in the deterministic offline setting (i.e. all jobs are known beforehand, as are their durations and dependencies), on m identical machines (think "CPU cores"), a setting usually denoted by the triple P_m|prec|C_max, is NP-hard for m >= 2. Graham's list scheduling algorithm is a greedy (2 - 1/m)-approximation, i.e. its makespan is never worse than the optimal schedule's by a factor greater than 2 - 1/m. The algorithm is as simple as can be: it takes a linearization (i.e. a list) of the tasks as input and always picks the first task in the list whose dependencies have already completed. The specific list used is irrelevant to the bounds derived for this algorithm, and the bound is tight no matter which list is chosen. No known polynomial-time algorithm for this scenario improves on the 2 - 1/m factor.
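To make the algorithm concrete, here is a small OCaml sketch of list scheduling under the offline assumptions above. The `task` record and all names are made up for illustration, and the input list is assumed to be a topological order of a DAG.

```ocaml
(* Sketch of Graham's list scheduling on [m] identical machines.
   [order] must be a linearization (topological order) of the tasks. *)
type task = { name : string; dur : int; deps : string list }

let list_schedule ~m (order : task list) : (string * int * int) list =
  let finish = Hashtbl.create 16 in  (* task name -> completion time *)
  let free = Array.make m 0 in       (* machine -> time it becomes idle *)
  let schedule = ref [] in           (* (name, start, finish), reversed *)
  let pending = ref order in
  while !pending <> [] do
    (* The earliest-idle machine asks for work. *)
    let i = ref 0 in
    Array.iteri (fun j t -> if t < free.(!i) then i := j) free;
    let now = free.(!i) in
    let ready t =
      List.for_all
        (fun d ->
          match Hashtbl.find_opt finish d with
          | Some f -> f <= now
          | None -> false)
        t.deps
    in
    (* Run the first ready task in list order; all the cleverness, if
       any, lives in how [order] was built. *)
    (match List.find_opt ready !pending with
     | Some t ->
       Hashtbl.replace finish t.name (now + t.dur);
       free.(!i) <- now + t.dur;
       schedule := (t.name, now, now + t.dur) :: !schedule;
       pending := List.filter (fun u -> u.name <> t.name) !pending
     | None ->
       (* Nothing is ready: idle this machine until the next task
          running elsewhere completes. *)
       let next =
         Hashtbl.fold
           (fun _ f acc -> if f > now then min f acc else acc)
           finish max_int
       in
       free.(!i) <- next)
  done;
  List.rev !schedule

let makespan sched =
  List.fold_left (fun acc (_, _, f) -> max acc f) 0 sched
```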

If we ignore the fact that dune doesn't know how long tasks will take and that tasks can dynamically spawn new tasks (I think), dune seems to essentially already implement this algorithm. The list that Graham's algorithm presupposes is never fully materialized, but it can be read off the trace of files being chosen. That laziness adds no potential for theoretical performance improvements, as the theoretical framework already allows the list to be constructed from all possible information, including the full set of tasks, their dependencies, and their durations. This also implies that the choice between stack and queue, or between shortest and longest paths, makes no difference to the approximation factor: both choices admit scenarios in which the 2 - 1/m bound is tight.
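For illustration, driving the sketch above on the shape of my test project (durations in seconds, all.v rounded up to 1s) shows how much the list order alone can matter once there are enough cores: with m = 9, the queue-like order yields a makespan of 16 while the chain-first order yields 11.

```ocaml
(* Rebuilding the earlier test setup as task records (illustrative). *)
let () =
  let aaa =
    List.init 9 (fun i ->
      { name = Printf.sprintf "aaa%d" (i + 1); dur = 5; deps = [] })
  in
  let nested =
    List.init 10 (fun i ->
      { name = Printf.sprintf "nested%d" i
      ; dur = 1
      ; deps = (if i = 0 then [] else [ Printf.sprintf "nested%d" (i - 1) ])
      })
  in
  let all =
    { name = "all"; dur = 1; deps = List.map (fun t -> t.name) (aaa @ nested) }
  in
  (* aaa files first, mirroring the queue behaviour... *)
  Printf.printf "queue-like:  %d\n"
    (makespan (list_schedule ~m:9 (aaa @ nested @ [ all ])));
  (* ...versus starting the critical chain immediately. *)
  Printf.printf "chain-first: %d\n"
    (makespan (list_schedule ~m:9 (nested @ aaa @ [ all ])))
```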

If we try to factor in the added complexity of the exact scenario that dune faces, it becomes much harder to find published results. There is a bit of work on precedence scheduling in the stochastic setting, but I don't think we will find a reasonable family of probability distributions that affords us any benefit over simply not assuming anything. For truly unknown task durations there is very little research, and most of it takes the uncertainty further by also assuming that the full task set and its dependencies are unknown and only revealed once earlier tasks finish. I could not find useful algorithmic results here.

The takeaway for me is that we are unlikely to find a generic, (close-to) optimal algorithm that significantly improves on what is implemented right now in the worst case. Clearly, in practice and for my particular use case, preferring critical paths leads to favourable outcomes, but counterexamples can be constructed that still produce the dreaded 2 - 1/m slowdown over the optimal schedule with that strategy. It might be justified, though, to assume that most Coq developments would still benefit from this strategy, given that their dependency graphs often have very few leaf nodes at high levels (i.e. deep into the graph). In fact, since many developments formalize individual theorems or collections thereof, the makespan of their dependency graph tends to be dominated by critical paths ending in one or maybe a handful of files.

It also occurred to me that it would probably be very easy for dune to collect a bit of transient timing information somewhere, based on previous (successful) builds. Tying that information purely to the file's path (i.e. not invalidating it when files change) would allow us to estimate the true cost of critical paths and thus to favor truly critical paths over those that are merely long. I don't know how most of dune is implemented, so maybe this information is not easy to gather or to exploit, but it could be worth a shot. If this doesn't work, I would still go with the longest-path criterion as an approximation (assuming that is actually implementable).
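As a sketch of the priority I have in mind (hypothetical names throughout; the recorded durations and a reverse-dependency function are assumed inputs, and none of this reflects dune internals):

```ocaml
(* Rank each file by the heaviest chain of reverse dependencies hanging
   off it; when several files are ready, prefer the highest rank. *)
let critical_weight
    ~(durations : (string, float) Hashtbl.t)  (* from previous builds *)
    ~(rdeps : string -> string list)          (* reverse dependencies *)
  : string -> float =
  let memo = Hashtbl.create 16 in
  let rec weight file =
    match Hashtbl.find_opt memo file with
    | Some w -> w
    | None ->
      (* Fall back to a unit cost for files we have never built, which
         degrades gracefully into the plain longest-path criterion. *)
      let dur = Option.value (Hashtbl.find_opt durations file) ~default:1.0 in
      let tail =
        List.fold_left (fun acc d -> max acc (weight d)) 0.0 (rdeps file)
      in
      let w = dur +. tail in
      Hashtbl.add memo file w;
      w
  in
  weight
```

With stale or missing timings this degrades toward the longest-path criterion, which is exactly the fallback I would pick anyway.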

Janno commented 4 months ago

What would it take to truly implement the "longest dependency chain first" feature? From the discussion so far, my takeaway is that there is no global scheduler per se, so it seems like some refactoring is in order? I am happy to have a go at it. But I am not yet familiar with the dune code base and I would appreciate some pointers to figure out what is involved in a change like that.

snowleopard commented 3 months ago

Thanks for the literature review!

It's hard for me to describe a good way to approach this change without trying, and sadly, I failed to find time for such an experiment myself in the last few weeks. (Hence my silence: I wanted to try but I'm giving up for now.)

I'd like to note, though, that at Jane Street we are considering changes to Dune that would make it record historical information about the time it takes to execute each build rule. As you point out, this information can be useful for optimal scheduling. Maybe we should come back to this discussion when that change lands? Happy to ping this thread when that happens.

Janno commented 2 months ago

Sorry, I lost my pinned tabs and thus also lost track of this issue.

> Maybe we should come back to this discussion when that change lands? Happy to ping this thread when that happens.

Sounds good!