rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

Limiting the parallelism automatically #12912

Open weihanglo opened 9 months ago

weihanglo commented 9 months ago

tl;dr: Introduce a simple mechanism for limiting parallelism automatically in Cargo, to avoid consuming all system resources during the compilation.

Problem

Cargo by default uses all cores (std::thread::available_parallelism) and spawns rustc processes or build scripts onto each of them. This is not an issue when compiling on a decent machine. When working on low-end machines or large-scale codebases, developers often encounter issues like extremely high CPU load or out-of-memory errors.
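For reference, the default job count is derived from the standard library roughly like this (a minimal sketch, not Cargo's actual scheduling code):

```rust
use std::thread;

fn main() {
    // The default: one job per logical CPU, as reported by the OS.
    // available_parallelism() returns io::Result<NonZeroUsize>, so
    // fall back to 1 if the value cannot be determined.
    let default_jobs = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("default parallelism: {default_jobs}");
}
```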

To solve these problems, developers can set --jobs from the command line, or build.jobs in .cargo/config.toml, to control the maximum parallelism Cargo can use. This is not ideal because

An "ideal" approach (but not now)

There are a couple of existing proposals trying to improve the situation. Some of them want to assign a weight to a certain kind of job, or tag jobs into groups. With weights and tags, the job scheduler understands whether it should allocate a job. This is pretty much the ideal solution, as it maximizes developers' control over parallelism, and the system could later be extended toward job-scheduling optimization.

However, such a system requires developers to fully understand the entire compilation of their projects. For now, that data is either missing or hard to get out of Cargo. To incrementally build the system, there are prerequisites:

Start small

We should start small: focus on monitoring resource usage, and additionally limit the parallelism when usage exceeds a threshold.

Some options we could pursue:

To minimize the impact of bad data points, these metrics would be sampled and averaged over a period of time.
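A minimal sketch of that smoothing idea (the metric, window size, and threshold are all left open above, so the names here are placeholders):

```rust
use std::collections::VecDeque;

/// Keep the last `cap` samples of some resource metric (CPU usage,
/// free memory, ...) and average them, so that a single bad reading
/// does not trigger throttling on its own.
struct SampleWindow {
    samples: VecDeque<f64>,
    cap: usize,
}

impl SampleWindow {
    fn new(cap: usize) -> Self {
        Self { samples: VecDeque::with_capacity(cap), cap }
    }

    fn push(&mut self, value: f64) {
        if self.samples.len() == self.cap {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }

    fn average(&self) -> Option<f64> {
        if self.samples.is_empty() {
            return None;
        }
        Some(self.samples.iter().sum::<f64>() / self.samples.len() as f64)
    }
}
```

The scheduler would then compare `window.average()` against a threshold before handing out the next job token.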

Instead of "usage", we could also leverage the concept of "load average" from Unix-like systems, which might make more sense for managing computing resource loads.

I honestly don't know whether we want one of these, both, or neither.

Library to use

Both of them introduce an excessive amount of code Cargo doesn't need at this moment.

Alternatively, we could make the relevant syscalls directly to get this info.
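For instance, on Linux the raw numbers are one syscall away; a sketch using the libc crate's sysinfo(2) binding (the helper name is made up, error handling reduced to None):

```rust
/// Free memory in bytes via sysinfo(2) (Linux only).
#[cfg(target_os = "linux")]
fn free_memory_bytes() -> Option<u64> {
    // sysinfo(2) fills in RAM counters in units of `mem_unit` bytes.
    let mut info: libc::sysinfo = unsafe { std::mem::zeroed() };
    if unsafe { libc::sysinfo(&mut info) } == 0 {
        Some(info.freeram as u64 * info.mem_unit as u64)
    } else {
        None
    }
}
```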

Prior art

Related issues

There are more issues regarding scheduling, but I don't want to link them all here. These are issues where people are trying to tell Cargo not to be that greedy.

And sorry for opening a new issue instead. Feel free to close this and move the discussion to any existing one.

epage commented 9 months ago

#9250 is an interesting alternative for the CPU-load aspect. I've not done enough with nice to know how cross-platform the concept is, or whether there are restrictions that might get in the way.

In general, with all of the security and docker-like technologies out there these days, I wonder if there is more we can delegate to the operating system here, which would likely reduce complexity and overhead within cargo.

epage commented 9 months ago

local-memory = "3GiB" # or "95%" or "100% - 200MiB"

On the surface, percentages seem nice because you don't have to worry about the exact configuration of the local system. However, the 10% left over on a 64GB machine keeps the system a lot more usable than 10% of 8GB. I feel like what will be most useful is "all except". We covered this with --jobs by allowing negative numbers. We could do something similar here, where -3GB means "all but 3GB".
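To illustrate the shape of that "all except" idea in configuration terms (hypothetical keys, not an accepted design or an implemented Cargo feature):

```toml
# Hypothetical sketch only -- none of these keys exist in Cargo today.
[build]
local-memory = "3GiB"     # absolute cap
# local-memory = "95%"    # relative to total RAM; risky on small machines
# local-memory = "-3GiB"  # "all but 3GiB", mirroring negative --jobs values
```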

epage commented 9 months ago

To minimize the impact of bad data points, these metrics would be sampled and averaged over a period of time.

The meaning of such an average dramatically changes whenever a job finishes and a new one starts, especially if there are jobs or categories of jobs (e.g. linking) with dramatically different characteristics.

epage commented 9 months ago

With the parallel frontend rearing its head again, we should probably consider how that affects this.

the8472 commented 9 months ago

We should start small: focus on monitoring resource usage, and additionally limit the parallelism when usage exceeds a threshold.

On Linux specifically, it might be better to monitor pressure rather than utilization. The downside is that that's even less portable.
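"Pressure" here refers to Linux's PSI (pressure stall information) files under /proc/pressure. A minimal reader, with a made-up threshold:

```rust
use std::fs;

/// Read the `some` avg10 value from a Linux PSI file such as
/// /proc/pressure/memory or /proc/pressure/cpu. The line format is:
/// "some avg10=0.12 avg60=0.05 avg300=0.01 total=12345"
fn psi_some_avg10(path: &str) -> Option<f64> {
    let text = fs::read_to_string(path).ok()?;
    let line = text.lines().find(|l| l.starts_with("some"))?;
    let field = line.split_whitespace().find(|f| f.starts_with("avg10="))?;
    field.trim_start_matches("avg10=").parse().ok()
}

fn main() {
    // Made-up threshold: back off if tasks were stalled on memory
    // for more than 10% of the last ten seconds.
    if let Some(p) = psi_some_avg10("/proc/pressure/memory") {
        if p > 10.0 {
            println!("memory pressure high ({p}%), hold back new jobs");
        }
    }
}
```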

clarfonthey commented 9 months ago

FWIW, I was personally looking into a cross-platform way of measuring load average when looking for a solution to this, since Linux isn't the only platform that would benefit from that metric. It is doable, but annoying, and I personally don't know or care enough about Windows to formulate a proper solution.

It would be nice if whatever solution is made could be applied generically to projects, since I have a strong feeling more than just Cargo could benefit from it.

epage commented 9 months ago

btw doctests are another place where we can hit this (see #12916). We should keep in mind a way to carry this information forward to those.

clarfonthey commented 9 months ago

Part of the reason why I mention a general solution is that, although the implementation would be complicated, the actual API probably wouldn't be. Right now, things just check the number of concurrent threads the CPU can run and only let that many threads run at a time. The biggest change is that, before spawning a thread, you have to verify both that the number of threads is low enough and that the system load is low enough. The API could be something as simple as fn system_load() -> f64 in theory, and you'd just verify that it's below the desired number.

Of course, the tricky thing is making sure that you compute that number and fetch it quickly, that it's consistent across platforms, and that you set the right load limit, which could very well depend on the system.
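A sketch of what that fn system_load() -> f64 gate could look like on Unix-likes, assuming the libc crate (the normalization and the limit handling are illustrative choices, not a settled design):

```rust
use std::thread;

/// One possible shape for the proposed API: the one-minute load
/// average (Unix-only, via libc::getloadavg) divided by the logical
/// core count, so that 1.0 roughly means "all cores busy".
fn system_load() -> f64 {
    let mut loads = [0f64; 3];
    let n = unsafe { libc::getloadavg(loads.as_mut_ptr(), 3) };
    let cores = thread::available_parallelism().map(|c| c.get()).unwrap_or(1);
    if n >= 1 { loads[0] / cores as f64 } else { 0.0 }
}

/// The gate described above: both the thread budget and the load
/// limit must allow spawning. The limit value is a placeholder.
fn may_spawn(running: usize, max_threads: usize, max_load: f64) -> bool {
    running < max_threads && system_load() < max_load
}
```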

the8472 commented 9 months ago

I don't think an in-process API to query load would be sufficient because builds can involve multiple processes (cargo spawning rustc instances). To throttle entire process trees, potentially after processes have been spawned (e.g. to keep rustc from spawning even more threads after it has already started) we need IPC. So that basically means a more powerful jobserver protocol because currently the jobserver just consists of pouring some tokens (bytes) into a pipe and then having processes contend for them in a decentralized fashion.

If each process connected to the jobserver through a separate connection and signalled intent (spawn thread vs. spawn child process), then the jobserver could dole out tokens more intelligently, withhold them under system load, and even ask child processes to go quiescent for a while. Supporting that and the make jobserver protocol at the same time (by shuffling tokens back and forth between them) would be quite complicated.
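For context, this is roughly all the current token protocol can express, sketched with the jobserver crate that Cargo builds on; there is no channel for intent or for asking a token back:

```rust
// A fixed pool of anonymous tokens with no identity, intent, or
// reclamation attached to them.
fn main() -> std::io::Result<()> {
    // A pool of, say, four tokens.
    let client = jobserver::Client::new(4)?;

    // Any holder of the client blocks until a token is free...
    let token = client.acquire()?;
    // ...does its parallel work, and returns the token by dropping it.
    drop(token);

    // Child processes inherit access through configured descriptors,
    // then contend for tokens in the same decentralized way.
    let mut cmd = std::process::Command::new("rustc");
    client.configure(&mut cmd);
    Ok(())
}
```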

weihanglo commented 9 months ago

A Zulip topic has been opened for a similar discussion as well: https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/parallel.20.28link.29.20jobs.20and.20OOM.20cargo.2312912.

luser commented 9 months ago

Instead of "usage", we could also leverage the concept of "load average" from Unix-like systems, which might make more sense for managing computing resource loads.

FYI GNU make has long supported a basic version of this with its --max-load option: https://www.gnu.org/software/make/manual/make.html#Parallel . However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

the8472 commented 9 months ago

I wonder how thread priorities interact with concurrent pipe reads. I.e. do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure. And rustc could release and reacquire a token every second to see if cargo wants it back.

This would allow more granularity than just not starting a new rustc/link job when load is high.
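A sketch of that "release and reacquire every second" idea from a cooperating worker's side, using the raw token operations the jobserver crate exposes; whether higher-priority readers are actually woken first on a contended pipe is exactly the open question above:

```rust
use std::time::Duration;

fn cooperative_worker(client: &jobserver::Client) -> std::io::Result<()> {
    loop {
        do_about_one_second_of_work();

        // Briefly return our token; if cargo (running at higher
        // priority) wants it back under load, the reacquire below
        // blocks until a token is released again.
        client.release_raw()?;
        client.acquire_raw()?;
    }
}

fn do_about_one_second_of_work() {
    // Placeholder for a slice of real compilation work.
    std::thread::sleep(Duration::from_secs(1));
}
```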

weihanglo commented 9 months ago

do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure.

Thanks for the suggestion. That's definitely something we can look into. I remember people did something similar in the past (https://github.com/rust-lang/rust/pull/67398), but rustc seems to have gone the other way with its parallelism story recently.

I tend to avoid a general mechanism touching every component under the Rust project, as that is harder to move forward. In any case, cargo might still need a way to monitor some indicators to control the overall parallelism from a build-system perspective. The two seem somewhat independent and can be done separately.

weihanglo commented 9 months ago

However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

Okay, load average sounds like a lagging indicator here. @luser, do you know of any other indicators that might help? I've surveyed the major build tools listed in the issue description, but can't see any other interesting indicator they expose in their CLI interfaces. Or if you know that any of their implementations has fancy resource-monitoring logic, please let us know. Personally, I am looking for an automatic way without user intervention first; then we can start thinking about the interface and scheduling issues.

the8472 commented 9 months ago

Memory pressure could work to some extent because it includes page reclaims. If build processes gobble up enough RAM that the kernel is forced to synchronously clean up caches or even start paging, that's an indication that memory reserves are running low some time before OOM conditions are reached. The question is whether it's early enough.

If a single linker job eats half the available memory but only counts as one job token, then even one token too many can be problematic: if a linker job is already running, that remaining token could be used to start another one. Ultimately, job tokens are intended to regulate core utilization, not memory utilization, so there's an impedance mismatch.

Core utilization is kinda easy to regulate and predict. 1 compute-bound process/thread = 1 token.

Memory utilization is more difficult because we lack estimators for that.

Some ideas:

weihanglo commented 9 months ago

Just posting what I found at https://gcc.gnu.org/wiki/DebugFission:

As a rule of thumb, the link job total memory requirements can be estimated at about 200% of the total size of its input files.

This might help predict/analyze possible memory consumption for linking.
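If that heuristic holds, a scheduler could cheaply pre-estimate a link job before granting it a token; a sketch (the function name is made up, and the 2x factor is the quoted rule of thumb, not a measured constant):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Rough link-memory estimate: about 200% of the total size of the
/// linker inputs (objects, rlibs, ...).
fn estimated_link_memory(inputs: &[&Path]) -> io::Result<u64> {
    let mut total = 0u64;
    for path in inputs {
        total += fs::metadata(path)?.len();
    }
    Ok(total.saturating_mul(2))
}
```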

soloturn commented 7 months ago

Here is a ticket against rust-lang from when memory consumption was unpredictable because cargo used all threads to link binaries, causing OOM when compiling COSMIC: https://github.com/rust-lang/rust/issues/114037. Not sure what happened, but it seems the situation improved in August 2023.

Keeping the system responsive is a different matter, and we solve it by using "nice cargo ..." or "nice paru ..." on Arch Linux. Giving lower priority to processes is operating-system specific and, at least in my opinion, needs to stay OUTSIDE of cargo, because what is "nice" on Linux is "start /low" on Windows: https://stackoverflow.com/questions/4208/windows-equivalent-of-nice .

sunshowers commented 4 months ago

Wanted to add that nextest also has several knobs for this:

The context is that in nextest we also wanted to try and avoid test contention in high-core situations (e.g. https://github.com/oxidecomputer/omicron/issues/5380) -- we were looking to see if there was prior art for using an expression language to define concurrency limits, or other static/dynamic behavior. @epage kindly linked me to this thread -- thanks!

sunshowers commented 4 months ago

Memory utilization is more difficult because we lack estimators for that.

A practical approach may be to record and store historical metrics, and use them to predict future performance.
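A sketch of what that could look like at its simplest (the type, the margin, and the default guess are all hypothetical):

```rust
use std::collections::HashMap;

/// Remember the peak memory each unit (crate, linker invocation,
/// test binary, ...) used on previous runs and schedule against
/// the prediction.
struct MemoryHistory {
    peak_bytes: HashMap<String, u64>,
}

impl MemoryHistory {
    /// Predict the next run's peak for a unit, with a safety margin
    /// and a default for units we have never seen.
    fn predict(&self, unit: &str) -> u64 {
        const DEFAULT_GUESS: u64 = 512 * 1024 * 1024; // 512 MiB
        self.peak_bytes
            .get(unit)
            .map(|&b| b + b / 4) // 25% margin for run-to-run variance
            .unwrap_or(DEFAULT_GUESS)
    }

    /// After a run, fold the observed peak back into the history.
    fn record(&mut self, unit: &str, observed_peak: u64) {
        // Keep the worst case seen so far for this unit.
        let entry = self.peak_bytes.entry(unit.to_string()).or_insert(0);
        *entry = (*entry).max(observed_peak);
    }
}
```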