
Optimize codegen scheduling for memory usage and compile time #82685

Closed tgnottingham closed 7 months ago

tgnottingham commented 3 years ago

The codegen process keeps a limited number of LLVM modules in memory simultaneously. Specifically, it holds the modules currently being optimized by LLVM workers, plus a queue of modules ready to be optimized, kept on hand to meet demand as workers finish their optimization jobs.

The process also codegens CGUs in decreasing order of size (basically*). This means that the largest LLVM modules tend to be resident in memory at the same time, which can increase peak memory usage when there are outsized CGUs. Also, this order isn't always ideal for total processing time (it's basically the LPT, or Longest Processing Time, solution to the minimum makespan problem, which is just okay). The ThinLTO phase likewise processes modules in decreasing order of size and is susceptible to the same issues.

So, there is room for improvement here both in terms of memory usage and compilation time.

@rustbot label T-compiler A-codegen I-compiletime I-compilemem

(*I changed this in #81736, before I understood the codegen process very well. That change isn't really effective, and I'll probably revert it. Even with that change, the largest LLVM modules still end up being in memory simultaneously.)

tgnottingham commented 3 years ago

I'm just going to brain dump over a few posts. :)

Academic Work

Scheduling is a heavily studied problem, so we might look to academic work for solutions.

The largest CGUs are codegen'd first in hopes that this does a decent job of minimizing the total time taken to codegen and optimize. It avoids the case where we codegen and optimize a large CGU last, and there isn't any other work to do in parallel. As I mentioned, this roughly implements the LPT (Longest Processing Time) solution to the minimum makespan problem, which our problem bears some relation to. This doesn't always produce the best schedule in terms of processing time, but it does an okay job.
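To make that concrete, here's a minimal sketch of the LPT greedy (not rustc's actual scheduler code; the job costs are invented). It also shows the suboptimality: for the input below, LPT yields a makespan of 7, while the optimal split {3,3} / {2,2,2} achieves 6.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Greedy LPT: sort jobs by decreasing cost, then always hand the
/// next job to the machine with the least total work so far. This is
/// guaranteed to be within 4/3 - 1/(3m) of the optimal makespan for
/// m machines, but it is not optimal in general.
fn lpt_makespan(mut job_costs: Vec<u64>, machines: usize) -> u64 {
    job_costs.sort_unstable_by(|a, b| b.cmp(a)); // largest first

    // Min-heap of per-machine loads (Reverse flips BinaryHeap's
    // max-heap into a min-heap).
    let mut loads: BinaryHeap<Reverse<u64>> =
        (0..machines).map(|_| Reverse(0)).collect();

    for cost in job_costs {
        let Reverse(least) = loads.pop().unwrap();
        loads.push(Reverse(least + cost));
    }

    loads.into_iter().map(|Reverse(load)| load).max().unwrap()
}

fn main() {
    // Classic worst case: LPT schedules this as {3,2,2} and {3,2},
    // makespan 7; the optimal is {3,3} and {2,2,2}, makespan 6.
    println!("LPT makespan: {}", lpt_makespan(vec![3, 3, 2, 2, 2], 2));
}
```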

The classic minimum makespan problem doesn't care about memory usage, though. We'd want to look at prior work that specifically addresses that.

Our problem is a bit closer to optimizing the schedule of a set of tasks forming a DAG, while limiting memory usage. Our tasks might include, for example, codegenning a CGU to an LLVM module, optimizing an LLVM module, generating LTO work, and performing LTO on a module. Limiting the memory footprint when dynamically scheduling DAGs on shared-memory platforms is an example of a paper that tackles this type of problem.

However, our problem has some unique characteristics. In particular, we have the limitation that codegen can only happen on the main thread (at least until the parallel compiler is enabled). Also, scheduling is complicated by the fact that the amount of concurrency available to a rustc instance depends on external factors, like what other rustc instances spawned by the same cargo instance are doing, since available concurrency is shared among them. Papers such as the one above don't account for constraints like this.

Furthermore, many solutions require relative estimates of task memory usage and running time. Our estimates are unlikely to be very accurate, though I'm sure we could improve them. Estimate accuracy is also likely to depend on hardware, platform, and allocator behavior. And estimates will always be a moving target as rustc and LLVM characteristics change.

It looks like it would be hard to adapt academic solutions to our problem. Still, it may be worth surveying them for inspiration.

tgnottingham commented 3 years ago

Heuristic Solutions

Any decent solution is going to be based on heuristics. We could develop and evaluate heuristics via simulation.

If we assume the basic structure of codegen scheduling doesn't change, there are roughly two places where it would be natural to use a heuristic to decide what to do.

1) At one point, we ask whether the main thread should codegen a module or optimize a module. A heuristic could be used to answer that question, as well as determine which module should be codegened or optimized.

2) At another point, we peel work items off the queue and set LLVM workers to work on them. A heuristic could be used to determine how many and which work items should be worked on.
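To make those two decision points concrete, a hypothetical interface might look like the sketch below (all names here are invented for illustration; rustc's actual scheduler isn't structured this way):

```rust
// Snapshot of scheduler state a heuristic could consult.
struct SchedulerState {
    queued_module_costs: Vec<u64>, // optimization cost estimates
    remaining_cgu_costs: Vec<u64>, // codegen cost estimates
    idle_workers: usize,
    estimated_memory_use: u64,
}

enum MainThreadAction {
    CodegenCgu(usize),     // index of the CGU to codegen next
    OptimizeModule(usize), // index of the queued module to optimize
    Wait,                  // block until a worker finishes
}

trait ScheduleHeuristic {
    /// Decision point 1: should the main thread codegen, optimize,
    /// or wait -- and on which module?
    fn main_thread_action(&self, state: &SchedulerState) -> MainThreadAction;

    /// Decision point 2: which queued work items (by index) should
    /// be handed to idle LLVM workers right now, and how many?
    fn pick_work_items(&self, state: &SchedulerState) -> Vec<usize>;
}
```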

Information available to the heuristic functions might include:

Simulation parameters might include:

The simulation would obviously operate very similarly to the rustc scheduler. Additionally, it would probably maintain a priority queue of active tasks, ordered by finishing time. On starting a task, it would know its end time based on the task type and cost estimate, and the simulation parameters describing the relationship between the cost and actual run time. Instead of blocking on a message queue and waiting for tasks to finish, as it basically does in rustc, it would just pull the next task that would finish from the queue and act accordingly. It would adjust and record the memory usage at each step, and track total run time. And of course, it would consult the heuristic functions as appropriate. I probably wouldn't simulate the effects of the jobserver unless I had to.

Obviously hand-waving a lot here, but you get the idea.
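To sketch that event loop (purely illustrative: the cost-to-time and memory models below are invented, and the heuristic hooks are reduced to a fixed largest-cost-first order):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy event-driven simulation: run `workers` tasks in parallel,
/// tracking total run time and peak memory. Returns (makespan, peak).
fn simulate(mut pending_costs: Vec<u64>, workers: usize) -> (u64, u64) {
    // Largest cost first, mirroring the current LPT-ish order.
    pending_costs.sort_unstable_by(|a, b| b.cmp(a));
    let mut pending = pending_costs.into_iter();

    // Priority queue of active tasks, ordered by finish time
    // (Reverse makes it a min-heap). Entries are (finish, memory).
    let mut active: BinaryHeap<Reverse<(u64, u64)>> = BinaryHeap::new();

    let (mut now, mut mem, mut peak) = (0u64, 0u64, 0u64);

    loop {
        // Start tasks while a worker is free. Toy cost model:
        // run time == cost, memory == 2 * cost.
        while active.len() < workers {
            match pending.next() {
                Some(cost) => {
                    mem += 2 * cost;
                    peak = peak.max(mem);
                    active.push(Reverse((now + cost, 2 * cost)));
                }
                None => break,
            }
        }

        // Pull the next task that would finish off the queue,
        // advance the clock, and release its memory.
        match active.pop() {
            Some(Reverse((finish, memory))) => {
                now = finish;
                mem -= memory;
            }
            None => break, // nothing active, nothing pending: done
        }
    }

    (now, peak)
}

fn main() {
    // Hypothetical CGU cost estimates.
    let (makespan, peak) = simulate(vec![900, 400, 350, 300, 150, 100], 3);
    println!("makespan: {makespan}, peak memory: {peak}");
}
```

A real version would add the codegen-only main thread, separate task types for codegen/optimize/LTO, and pluggable heuristic functions, then compare candidate heuristics on recorded cost profiles.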

tgnottingham commented 3 years ago

Partitioning Solutions

We could just dodge the scheduling issue entirely. For example, if codegen units are all close to the same size, then these scheduling concerns go away. We might:

- Balance CGU sizes better during merging

wesleywiser commented 3 years ago

> Balance CGU sizes better during merging

Just a note, but I believe this is a multiway number partitioning problem:

> In computer science, multiway number partitioning is the problem of partitioning a multiset of numbers into a fixed number of subsets, such that the sums of the subsets are as similar as possible.
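For illustration, the simplest greedy approach to this partitioning problem (the same idea as LPT, applied at CGU-merge time; the sizes are invented) might look like:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Greedy multiway number partitioning: place each item, largest
/// first, into the currently smallest subset. Fast and simple, but
/// not optimal; differencing methods like Karmarkar-Karp do better.
fn greedy_partition(mut sizes: Vec<u64>, k: usize) -> Vec<Vec<u64>> {
    sizes.sort_unstable_by(|a, b| b.cmp(a));

    // Min-heap keyed on subset total; payload is the subset's index.
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> =
        (0..k).map(|i| Reverse((0, i))).collect();
    let mut subsets = vec![Vec::new(); k];

    for size in sizes {
        let Reverse((total, i)) = heap.pop().unwrap();
        subsets[i].push(size);
        heap.push(Reverse((total + size, i)));
    }
    subsets
}

fn main() {
    // Hypothetical CGU size estimates, merged down to 3 CGUs.
    let merged = greedy_partition(vec![900, 400, 350, 300, 150, 100], 3);
    for (i, s) in merged.iter().enumerate() {
        println!("CGU {i}: total {} from {:?}", s.iter().sum::<u64>(), s);
    }
}
```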

philipaconrad commented 3 years ago

What specific metrics would you want to track for determining the "cost" of a given CGU? Or do you think it'd be better to empirically derive a metric from memory usage / compilation time over a wide number of crates or something?

tgnottingham commented 3 years ago

> What specific metrics would you want to track for determining the "cost" of a given CGU? Or do you think it'd be better to empirically derive a metric from memory usage / compilation time over a wide number of crates or something?

Right now, we're using the number of MIR statements in the CGU as an initial codegen cost estimate (though IIRC, we may not be counting the terminators of basic blocks, which isn't ideal).

After codegen to an LLVM module, we use the length of time that codegen took as the cost estimate for optimization. That may not be the best estimate, since the elapsed time is likely to be sensitive to whatever else is going on in the system. A better estimate might be the number of LLVM instructions in the module. But that's not fully accurate either, since the relationship between instruction count and actual time/space cost isn't always linear. Different code constructs have different optimization costs in time and space, and those costs can even differ between targets (e.g. see #81757). All that being said, it doesn't have to be perfect. These estimates may be totally fine for our purposes.

Finally, for thin LTO, we use the number of bytes in the bytecode of the module as the cost estimate for the LTO processing.
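Putting those together: a sketch of the stage-dependent cost proxies (the names and structure here are invented for illustration; rustc's real code is organized differently):

```rust
// One cost proxy per pipeline stage, as described above.
enum Stage {
    Codegen { mir_statements: u64 },   // pre-codegen: MIR statement count
    Optimize { codegen_time_ms: u64 }, // post-codegen: time codegen took
    ThinLto { bytecode_bytes: u64 },   // LTO: size of module bytecode
}

fn cost_estimate(stage: &Stage) -> u64 {
    match stage {
        Stage::Codegen { mir_statements } => *mir_statements,
        Stage::Optimize { codegen_time_ms } => *codegen_time_ms,
        Stage::ThinLto { bytecode_bytes } => *bytecode_bytes,
    }
}

fn main() {
    let stages = [
        Stage::Codegen { mir_statements: 12_000 },
        Stage::Optimize { codegen_time_ms: 450 },
        Stage::ThinLto { bytecode_bytes: 2_000_000 },
    ];
    for s in &stages {
        println!("cost estimate: {}", cost_estimate(s));
    }
}
```

Note that the units differ per stage; these estimates are only ever compared within a stage, not across stages.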

Dylan-DPC commented 7 months ago

Closing this, as it's being tracked in https://github.com/rust-lang/rust/issues/89281; better to keep things in one place.