
Optimize codegen scheduling for memory usage and compile time #82685

Closed tgnottingham closed 7 months ago

tgnottingham commented 3 years ago

The codegen process keeps a limited number of LLVM modules in memory simultaneously. Specifically, it holds the modules currently being optimized by LLVM workers, plus a queue of modules ready to be optimized, kept on hand to meet demand as workers finish their optimization jobs.

The process also codegens CGUs in decreasing order of size (basically*). This means that the largest LLVM modules tend to be resident in memory at the same time, which can increase peak memory usage when there are outsized CGUs. Also, this order isn't always ideal for total processing time (it's basically the LPT, or Longest Processing Time, solution to the minimum makespan problem, which is just okay). The ThinLTO phase likewise processes modules in decreasing order of size and is susceptible to the same issues.

So, there is room for improvement here both in terms of memory usage and compilation time.

@rustbot label T-compiler A-codegen I-compiletime I-compilemem

(*I changed this in #81736, before I understood the codegen process very well. That change isn't really effective, and I'll probably revert it. Even with that change, the largest LLVM modules still end up being in memory simultaneously.)

tgnottingham commented 3 years ago

I'm just going to brain dump over a few posts. :)

Academic Work

Scheduling is a heavily studied problem, so we might look to academic work for solutions.

The largest CGUs are codegen'd first in hopes that this does a decent job of minimizing the total time taken to codegen and optimize. It avoids the case where we codegen and optimize a large CGU last, and there isn't any other work to do in parallel. As I mentioned, this roughly implements the LPT (Longest Processing Time) solution to the minimum makespan problem, which our problem bears some relation to. This doesn't always produce the best schedule in terms of processing time, but it does an okay job.
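To make that concrete, here's a minimal sketch of the LPT greedy (not rustc's actual scheduler code; the job costs are invented). It also shows the suboptimality: for the input below, LPT yields a makespan of 7, while the optimal split {3,3} / {2,2,2} achieves 6.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Greedy LPT: sort jobs by decreasing cost, then always hand the
/// next job to the machine with the least total work so far. This is
/// guaranteed to be within 4/3 - 1/(3m) of the optimal makespan for
/// m machines, but it is not optimal in general.
fn lpt_makespan(mut job_costs: Vec<u64>, machines: usize) -> u64 {
    job_costs.sort_unstable_by(|a, b| b.cmp(a)); // largest first

    // Min-heap of per-machine loads (Reverse flips BinaryHeap's
    // max-heap into a min-heap).
    let mut loads: BinaryHeap<Reverse<u64>> =
        (0..machines).map(|_| Reverse(0)).collect();

    for cost in job_costs {
        let Reverse(least) = loads.pop().unwrap();
        loads.push(Reverse(least + cost));
    }

    loads.into_iter().map(|Reverse(load)| load).max().unwrap()
}

fn main() {
    // Classic worst case: LPT schedules this as {3,2,2} and {3,2},
    // makespan 7; the optimal is {3,3} and {2,2,2}, makespan 6.
    println!("LPT makespan: {}", lpt_makespan(vec![3, 3, 2, 2, 2], 2));
}
```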

The classic minimum makespan problem doesn't care about memory usage, though. We'd want to look at prior work that specifically addresses that.

Our problem is a bit closer to optimizing the schedule of a set of tasks forming a DAG, while limiting memory usage. Our tasks might include, for example, codegenning a CGU to an LLVM module, optimizing an LLVM module, generating LTO work, and performing LTO on a module. Limiting the memory footprint when dynamically scheduling DAGs on shared-memory platforms is an example of a paper that tackles this type of problem.

However, our problem has some unique characteristics. In particular, we have the limitation that codegen can only happen on the main thread (at least until the parallel compiler is enabled). Also, scheduling is complicated by the fact that the amount of concurrency available to a rustc instance depends on external factors, like what other rustc instances spawned by the same cargo instance are doing, since available concurrency is shared among them. Papers such as the one above don't account for constraints like this.

Furthermore, many solutions require relative estimates of task memory usage and running time. Our estimates are unlikely to be very accurate, though I'm sure we could improve them. Estimate accuracy is also likely to depend on hardware, platform, and allocator behavior. And estimates will always be a moving target as rustc and LLVM characteristics change.

It looks like it would be hard to adapt academic solutions to our problem. Still, it may be worth surveying them for inspiration.

tgnottingham commented 3 years ago

Heuristic Solutions

Any decent solution is going to be based on heuristics. We could develop and evaluate heuristics via simulation.

If we assume the basic structure of codegen scheduling doesn't change, there are roughly two places where it would be natural to use a heuristic to decide what to do.

1) At one point, we ask whether the main thread should codegen a module or optimize a module. A heuristic could be used to answer that question, as well as determine which module should be codegened or optimized.

2) At another point, we peel work items off the queue and set LLVM workers to work on them. A heuristic could be used to determine how many and which work items should be worked on.
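To make those two decision points concrete, a hypothetical interface might look like the sketch below (all names here are invented for illustration; rustc's actual scheduler isn't structured this way):

```rust
// Snapshot of scheduler state a heuristic could consult.
struct SchedulerState {
    queued_module_costs: Vec<u64>, // optimization cost estimates
    remaining_cgu_costs: Vec<u64>, // codegen cost estimates
    idle_workers: usize,
    estimated_memory_use: u64,
}

enum MainThreadAction {
    CodegenCgu(usize),     // index of the CGU to codegen next
    OptimizeModule(usize), // index of the queued module to optimize
    Wait,                  // block until a worker finishes
}

trait ScheduleHeuristic {
    /// Decision point 1: should the main thread codegen, optimize,
    /// or wait -- and on which module?
    fn main_thread_action(&self, state: &SchedulerState) -> MainThreadAction;

    /// Decision point 2: which queued work items (by index) should
    /// be handed to idle LLVM workers right now, and how many?
    fn pick_work_items(&self, state: &SchedulerState) -> Vec<usize>;
}
```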

Information available to the heuristic functions might include:

Simulation parameters might include:

The simulation would obviously operate very similarly to the rustc scheduler. Additionally, it would probably maintain a priority queue of active tasks, ordered by finishing time. On starting a task, it would know its end time based on the task type and cost estimate, and the simulation parameters describing the relationship between the cost and actual run time. Instead of blocking on a message queue and waiting for tasks to finish, as it basically does in rustc, it would just pull the next task that would finish from the queue and act accordingly. It would adjust and record the memory usage at each step, and track total run time. And of course, it would consult the heuristic functions as appropriate. I probably wouldn't simulate the effects of the jobserver unless I had to.

Obviously hand-waving a lot here, but you get the idea.
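To sketch that event loop (purely illustrative: the cost-to-time and memory models below are invented, and the heuristic hooks are reduced to a fixed largest-cost-first order):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy event-driven simulation: run `workers` tasks in parallel,
/// tracking total run time and peak memory. Returns (makespan, peak).
fn simulate(mut pending_costs: Vec<u64>, workers: usize) -> (u64, u64) {
    // Largest cost first, mirroring the current LPT-ish order.
    pending_costs.sort_unstable_by(|a, b| b.cmp(a));
    let mut pending = pending_costs.into_iter();

    // Priority queue of active tasks, ordered by finish time
    // (Reverse makes it a min-heap). Entries are (finish, memory).
    let mut active: BinaryHeap<Reverse<(u64, u64)>> = BinaryHeap::new();

    let (mut now, mut mem, mut peak) = (0u64, 0u64, 0u64);

    loop {
        // Start tasks while a worker is free. Toy cost model:
        // run time == cost, memory == 2 * cost.
        while active.len() < workers {
            match pending.next() {
                Some(cost) => {
                    mem += 2 * cost;
                    peak = peak.max(mem);
                    active.push(Reverse((now + cost, 2 * cost)));
                }
                None => break,
            }
        }

        // Pull the next task that would finish off the queue,
        // advance the clock, and release its memory.
        match active.pop() {
            Some(Reverse((finish, memory))) => {
                now = finish;
                mem -= memory;
            }
            None => break, // nothing active, nothing pending: done
        }
    }

    (now, peak)
}

fn main() {
    // Hypothetical CGU cost estimates.
    let (makespan, peak) = simulate(vec![900, 400, 350, 300, 150, 100], 3);
    println!("makespan: {makespan}, peak memory: {peak}");
}
```

A real version would add the codegen-only main thread, separate task types for codegen/optimize/LTO, and pluggable heuristic functions, then compare candidate heuristics on recorded cost profiles.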

tgnottingham commented 3 years ago

Partitioning Solutions

We could just dodge the scheduling issue entirely. For example, if codegen units are all close to the same size, then these scheduling concerns go away. We might:

- Balance CGU sizes better during merging

wesleywiser commented 3 years ago

> Balance CGU sizes better during merging

Just a note, but I believe this is a multiway number partitioning problem:

> In computer science, multiway number partitioning is the problem of partitioning a multiset of numbers into a fixed number of subsets, such that the sums of the subsets are as similar as possible.
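For illustration, the simplest greedy approach to this partitioning problem (the same idea as LPT, applied at CGU-merge time; the sizes are invented) might look like:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Greedy multiway number partitioning: place each item, largest
/// first, into the currently smallest subset. Fast and simple, but
/// not optimal; differencing methods like Karmarkar-Karp do better.
fn greedy_partition(mut sizes: Vec<u64>, k: usize) -> Vec<Vec<u64>> {
    sizes.sort_unstable_by(|a, b| b.cmp(a));

    // Min-heap keyed on subset total; payload is the subset's index.
    let mut heap: BinaryHeap<Reverse<(u64, usize)>> =
        (0..k).map(|i| Reverse((0, i))).collect();
    let mut subsets = vec![Vec::new(); k];

    for size in sizes {
        let Reverse((total, i)) = heap.pop().unwrap();
        subsets[i].push(size);
        heap.push(Reverse((total + size, i)));
    }
    subsets
}

fn main() {
    // Hypothetical CGU size estimates, merged down to 3 CGUs.
    let merged = greedy_partition(vec![900, 400, 350, 300, 150, 100], 3);
    for (i, s) in merged.iter().enumerate() {
        println!("CGU {i}: total {} from {:?}", s.iter().sum::<u64>(), s);
    }
}
```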

philipaconrad commented 3 years ago

What specific metrics would you want to track for determining the "cost" of a given CGU? Or do you think it'd be better to empirically derive a metric from memory usage / compilation time over a wide number of crates or something?

tgnottingham commented 3 years ago

> What specific metrics would you want to track for determining the "cost" of a given CGU? Or do you think it'd be better to empirically derive a metric from memory usage / compilation time over a wide number of crates or something?

Right now, we're using the number of MIR statements in the CGU as an initial codegen cost estimate (though IIRC, we may not be counting the terminators of basic blocks, which isn't ideal).

After codegen to an LLVM module, we use the length of time that codegen took as the cost estimate for optimization. That may not be the best estimate, since the elapsed time is likely to be sensitive to whatever else is going on in the system. A better estimate might be the number of LLVM instructions in the module. But that's not fully accurate either, since the relationship between instruction count and actual time/space cost isn't always linear. Different code constructs have different optimization costs in time and space, and those costs can even differ between targets (e.g. see #81757). All that being said, it doesn't have to be perfect. These estimates may be totally fine for our purposes.

Finally, for thin LTO, we use the number of bytes in the bytecode of the module as the cost estimate for the LTO processing.
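Putting those together: a sketch of the stage-dependent cost proxies (the names and structure here are invented for illustration; rustc's real code is organized differently):

```rust
// One cost proxy per pipeline stage, as described above.
enum Stage {
    Codegen { mir_statements: u64 },   // pre-codegen: MIR statement count
    Optimize { codegen_time_ms: u64 }, // post-codegen: time codegen took
    ThinLto { bytecode_bytes: u64 },   // LTO: size of module bytecode
}

fn cost_estimate(stage: &Stage) -> u64 {
    match stage {
        Stage::Codegen { mir_statements } => *mir_statements,
        Stage::Optimize { codegen_time_ms } => *codegen_time_ms,
        Stage::ThinLto { bytecode_bytes } => *bytecode_bytes,
    }
}

fn main() {
    let stages = [
        Stage::Codegen { mir_statements: 12_000 },
        Stage::Optimize { codegen_time_ms: 450 },
        Stage::ThinLto { bytecode_bytes: 2_000_000 },
    ];
    for s in &stages {
        println!("cost estimate: {}", cost_estimate(s));
    }
}
```

Note that the units differ per stage; these estimates are only ever compared within a stage, not across stages.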

Dylan-DPC commented 7 months ago

Closing this, as it's being tracked in https://github.com/rust-lang/rust/issues/89281; better to keep things in one place.