rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

Limiting the parallelism automatically #12912

Open weihanglo opened 9 months ago

weihanglo commented 9 months ago

tl;dr: Introduce a simple mechanism for limiting parallelism automatically in Cargo, to avoid consuming all system resources during the compilation.

Problem

Cargo by default uses all cores (std::thread::available_parallelism) and spawns rustc processes or build scripts onto each of them. This is not an issue when compiling on a decent machine. When working on low-end machines or large-scale codebases, developers often encounter issues like extremely high CPU load or out-of-memory errors.
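For reference, the default job count is derived from the standard library roughly like this (a minimal sketch, not Cargo's actual scheduling code):

```rust
use std::thread;

fn main() {
    // The default: one job per logical CPU, as reported by the OS.
    // available_parallelism() returns io::Result<NonZeroUsize>, so
    // fall back to 1 if the value cannot be determined.
    let default_jobs = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("default parallelism: {default_jobs}");
}
```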

To solve these problems, developers can set --jobs from the command line, or build.jobs in .cargo/config.toml, to control the maximum parallelism Cargo can use. This is not ideal because

An "ideal" approach (but not now)

There are a couple of existing proposals trying to improve the situation. Some of them want to assign a weight to a certain kind of job, or tag jobs into groups. With weights and tags, the job scheduler understands whether it should allocate a job. This is pretty much the ideal solution, as it maximizes developers' control over parallelism, and the system could later be extended toward job-scheduling optimization.

However, such a system requires developers to fully understand the entire compilation of their projects. For now, that data is either missing or hard to get out of Cargo. To incrementally build the system, there are prerequisites:

Start small

We should start small: focus on monitoring resource usage, and additionally limit the parallelism when usage exceeds a threshold.

Some options we could pursue:

To minimize the impact of bad data points, these metrics would be sampled and averaged over a period of time.
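A minimal sketch of that smoothing idea (the metric, window size, and threshold are all left open above, so the names here are placeholders):

```rust
use std::collections::VecDeque;

/// Keep the last `cap` samples of some resource metric (CPU usage,
/// free memory, ...) and average them, so that a single bad reading
/// does not trigger throttling on its own.
struct SampleWindow {
    samples: VecDeque<f64>,
    cap: usize,
}

impl SampleWindow {
    fn new(cap: usize) -> Self {
        Self { samples: VecDeque::with_capacity(cap), cap }
    }

    fn push(&mut self, value: f64) {
        if self.samples.len() == self.cap {
            self.samples.pop_front();
        }
        self.samples.push_back(value);
    }

    fn average(&self) -> Option<f64> {
        if self.samples.is_empty() {
            return None;
        }
        Some(self.samples.iter().sum::<f64>() / self.samples.len() as f64)
    }
}
```

The scheduler would then compare `window.average()` against a threshold before handing out the next job token.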

Instead of "usage", we could also leverage the concept of "load average" from Unix-like systems, which might make more sense for managing computing resource loads.

I honestly don't know whether we want one of these, both, or neither.

Library to use

Both of them introduce an excessive amount of code Cargo doesn't need at this moment.

Alternatively, we could make the relevant syscalls directly to get this info.
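For instance, on Linux the raw numbers are one syscall away; a sketch using the libc crate's sysinfo(2) binding (the helper name is made up, error handling reduced to None):

```rust
/// Free memory in bytes via sysinfo(2) (Linux only).
#[cfg(target_os = "linux")]
fn free_memory_bytes() -> Option<u64> {
    // sysinfo(2) fills in RAM counters in units of `mem_unit` bytes.
    let mut info: libc::sysinfo = unsafe { std::mem::zeroed() };
    if unsafe { libc::sysinfo(&mut info) } == 0 {
        Some(info.freeram as u64 * info.mem_unit as u64)
    } else {
        None
    }
}
```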

Prior art

Related issues

There are more issues regarding scheduling, but I don't want to link them all here. These are issues where people are trying to tell Cargo not to be that greedy.

And sorry for opening a new issue instead. Feel free to close this and move the discussion to any existing one.

epage commented 9 months ago

#9250 is an interesting alternative for the CPU-load aspect. I've not done enough with nice to know how cross-platform the concept is, or whether there are restrictions that might get in the way.

In general, with all of the security and docker-like technologies out there these days, I wonder if there is more we can delegate to the operating system here, which would likely reduce complexity and overhead within cargo.

epage commented 9 months ago

local-memory = "3GiB" # or "95%" or "100% - 200MiB"

On the surface, percentages seem nice because you don't have to worry about the exact configuration of the local system. However, the 10% left over on a 64GB machine keeps the system a lot more usable than 10% of 8GB. I feel like what will be most useful is "all except". We covered this with --jobs by allowing negative numbers. We could do something similar here, where -3GB means "all but 3GB".
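To illustrate the shape of that "all except" idea in configuration terms (hypothetical keys, not an accepted design or an implemented Cargo feature):

```toml
# Hypothetical sketch only -- none of these keys exist in Cargo today.
[build]
local-memory = "3GiB"     # absolute cap
# local-memory = "95%"    # relative to total RAM; risky on small machines
# local-memory = "-3GiB"  # "all but 3GiB", mirroring negative --jobs values
```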

epage commented 9 months ago

To minimize the impact of bad data points, these metrics would be sampled and averaged over a period of time.

The meaning of such an average dramatically changes whenever a job finishes and a new one starts, especially if there are jobs or categories of jobs (e.g. linking) with dramatically different characteristics.

epage commented 9 months ago

With the parallel frontend rearing its head again, we should probably consider how that affects this.

the8472 commented 9 months ago

We should start small: focus on monitoring resource usage, and additionally limit the parallelism when usage exceeds a threshold.

On Linux specifically, it might be better to monitor pressure rather than utilization. The downside is that that's even less portable.
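"Pressure" here refers to Linux's PSI (pressure stall information) files under /proc/pressure. A minimal reader, with a made-up threshold:

```rust
use std::fs;

/// Read the `some` avg10 value from a Linux PSI file such as
/// /proc/pressure/memory or /proc/pressure/cpu. The line format is:
/// "some avg10=0.12 avg60=0.05 avg300=0.01 total=12345"
fn psi_some_avg10(path: &str) -> Option<f64> {
    let text = fs::read_to_string(path).ok()?;
    let line = text.lines().find(|l| l.starts_with("some"))?;
    let field = line.split_whitespace().find(|f| f.starts_with("avg10="))?;
    field.trim_start_matches("avg10=").parse().ok()
}

fn main() {
    // Made-up threshold: back off if tasks were stalled on memory
    // for more than 10% of the last ten seconds.
    if let Some(p) = psi_some_avg10("/proc/pressure/memory") {
        if p > 10.0 {
            println!("memory pressure high ({p}%), hold back new jobs");
        }
    }
}
```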

clarfonthey commented 9 months ago

FWIW, I was personally looking into a cross-platform way of measuring load average when looking for a solution to this, since Linux isn't the only platform that would benefit from that metric. It is doable, but annoying, and I personally don't know or care enough about Windows to formulate a proper solution.

It would be nice if whatever solution is made could be applied generically to projects, since I have a strong feeling more than just Cargo could benefit from it.

epage commented 9 months ago

btw doctests are another place where we can hit this (see #12916). We should keep in mind a way to carry this information forward to those.

clarfonthey commented 9 months ago

Part of the reason why I mention a general solution is that, although the implementation would be complicated, the actual API probably wouldn't be. Right now, things just check the number of concurrent threads the CPU can run and only let that many threads run at a time. The biggest change is that, before spawning a thread, you have to verify both that the number of threads is low enough and that the system load is low enough. The API could be something as simple as fn system_load() -> f64 in theory, and you'd just verify that it's below the desired number.

Of course, the tricky thing is making sure that you compute that number and fetch it quickly, that it's consistent across platforms, and that you set the right load limit, which could very well depend on the system.
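A sketch of what that fn system_load() -> f64 gate could look like on Unix-likes, assuming the libc crate (the normalization and the limit handling are illustrative choices, not a settled design):

```rust
use std::thread;

/// One possible shape for the proposed API: the one-minute load
/// average (Unix-only, via libc::getloadavg) divided by the logical
/// core count, so that 1.0 roughly means "all cores busy".
fn system_load() -> f64 {
    let mut loads = [0f64; 3];
    let n = unsafe { libc::getloadavg(loads.as_mut_ptr(), 3) };
    let cores = thread::available_parallelism().map(|c| c.get()).unwrap_or(1);
    if n >= 1 { loads[0] / cores as f64 } else { 0.0 }
}

/// The gate described above: both the thread budget and the load
/// limit must allow spawning. The limit value is a placeholder.
fn may_spawn(running: usize, max_threads: usize, max_load: f64) -> bool {
    running < max_threads && system_load() < max_load
}
```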

the8472 commented 9 months ago

I don't think an in-process API to query load would be sufficient because builds can involve multiple processes (cargo spawning rustc instances). To throttle entire process trees, potentially after processes have been spawned (e.g. to keep rustc from spawning even more threads after it has already started) we need IPC. So that basically means a more powerful jobserver protocol because currently the jobserver just consists of pouring some tokens (bytes) into a pipe and then having processes contend for them in a decentralized fashion.

If each process connected to the jobserver through a separate connection and signalled intent (spawn thread vs. spawn child process), then the jobserver could dole out tokens more intelligently, withhold them under system load, and even ask child processes to go quiescent for a while. Supporting that and the make jobserver protocol at the same time (by shuffling tokens back and forth between them) would be quite complicated.
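For context, this is roughly all the current token protocol can express, sketched with the jobserver crate that Cargo builds on; there is no channel for intent or for asking a token back:

```rust
// A fixed pool of anonymous tokens with no identity, intent, or
// reclamation attached to them.
fn main() -> std::io::Result<()> {
    // A pool of, say, four tokens.
    let client = jobserver::Client::new(4)?;

    // Any holder of the client blocks until a token is free...
    let token = client.acquire()?;
    // ...does its parallel work, and returns the token by dropping it.
    drop(token);

    // Child processes inherit access through configured descriptors,
    // then contend for tokens in the same decentralized way.
    let mut cmd = std::process::Command::new("rustc");
    client.configure(&mut cmd);
    Ok(())
}
```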

weihanglo commented 9 months ago

A Zulip topic has been opened for a similar discussion as well: https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/parallel.20.28link.29.20jobs.20and.20OOM.20cargo.2312912.

luser commented 9 months ago

Instead of "usage", we could also leverage the concept of "load average" from Unix-like systems, which might make more sense for managing computing resource loads.

FYI GNU make has long supported a basic version of this with its --max-load option: https://www.gnu.org/software/make/manual/make.html#Parallel . However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

the8472 commented 9 months ago

I wonder how thread priorities interact with concurrent pipe reads. I.e. do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure. And rustc could release and reacquire a token every second to see if cargo wants it back.

This would allow more granularity than just not starting a new rustc/link job when load is high.
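A sketch of that "release and reacquire every second" idea from a cooperating worker's side, using the raw token operations the jobserver crate exposes; whether higher-priority readers are actually woken first on a contended pipe is exactly the open question above:

```rust
use std::time::Duration;

fn cooperative_worker(client: &jobserver::Client) -> std::io::Result<()> {
    loop {
        do_about_one_second_of_work();

        // Briefly return our token; if cargo (running at higher
        // priority) wants it back under load, the reacquire below
        // blocks until a token is released again.
        client.release_raw()?;
        client.acquire_raw()?;
    }
}

fn do_about_one_second_of_work() {
    // Placeholder for a slice of real compilation work.
    std::thread::sleep(Duration::from_secs(1));
}
```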

weihanglo commented 9 months ago

do the highest-priority readers get woken up first? If so cargo could run at a higher priority than rustc or the linkers and reclaim tokens under pressure.

Thanks for the suggestion. That's definitely something we can look into. I remember people did something similar in the past (https://github.com/rust-lang/rust/pull/67398), but rustc seems to have gone the other way with its parallelism story recently.

I tend to avoid a general mechanism touching every component under the Rust project, as that is harder to move forward. In any case, cargo might still need a way to monitor some indicators to control the overall parallelism from a build-system perspective. The two seem somewhat independent and can be done separately.

weihanglo commented 9 months ago

However, in practice it never seemed to work well, since by the time you measure that the load average is above your target, it's too late and the system is overloaded.

Okay, load average sounds like a lagging indicator here. @luser, do you know of any other indicators that might help? I've surveyed the major build tools listed in the issue description, but can't see any other interesting indicator they expose in their CLI interfaces. Or if you know that any of their implementations has fancy resource-monitoring logic, please let us know. Personally, I am looking for an automatic way without user intervention first; then we can start thinking about the interface and scheduling issues.

the8472 commented 9 months ago

Memory pressure could work to some extent because it includes page reclaims. If build processes gobble up enough RAM that the kernel is forced to synchronously clean up caches or even start paging, that's an indication that memory reserves are running low some time before OOM conditions are reached. The question is whether it's early enough.

If a single linker job eats half the available memory but only counts as one job token, then even one token too many can be problematic: if a linker job is already running, that remaining token could be used to start another one. Ultimately, job tokens are intended to regulate core utilization, not memory utilization, so there's an impedance mismatch.

Core utilization is kinda easy to regulate and predict. 1 compute-bound process/thread = 1 token.

Memory utilization is more difficult because we lack estimators for that.

Some ideas:

weihanglo commented 9 months ago

Just posting what I found at https://gcc.gnu.org/wiki/DebugFission:

As a rule of thumb, the link job total memory requirements can be estimated at about 200% of the total size of its input files.

This might help predict/analyze possible memory consumption for linking.
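If that heuristic holds, a scheduler could cheaply pre-estimate a link job before granting it a token; a sketch (the function name is made up, and the 2x factor is the quoted rule of thumb, not a measured constant):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Rough link-memory estimate: about 200% of the total size of the
/// linker inputs (objects, rlibs, ...).
fn estimated_link_memory(inputs: &[&Path]) -> io::Result<u64> {
    let mut total = 0u64;
    for path in inputs {
        total += fs::metadata(path)?.len();
    }
    Ok(total.saturating_mul(2))
}
```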

soloturn commented 7 months ago

Here is a ticket against rust-lang from when memory consumption was unpredictable because cargo used all threads to link binaries, causing OOM when compiling COSMIC: https://github.com/rust-lang/rust/issues/114037. Not sure what happened, but it seems the situation improved in August 2023.

Keeping the system responsive is a different matter, and we solve it by using "nice cargo ..." or "nice paru ..." on Arch Linux. Giving lower priority to processes is operating-system specific and, at least in my opinion, needs to stay OUTSIDE of cargo, because what is "nice" on Linux is "start /low" on Windows: https://stackoverflow.com/questions/4208/windows-equivalent-of-nice .

sunshowers commented 4 months ago

Wanted to add that nextest also has several knobs for this:

The context is that in nextest we also wanted to try and avoid test contention in high-core situations (e.g. https://github.com/oxidecomputer/omicron/issues/5380) -- we were looking to see if there was prior art for using an expression language to define concurrency limits, or other static/dynamic behavior. @epage kindly linked me to this thread -- thanks!

sunshowers commented 4 months ago

Memory utilization is more difficult because we lack estimators for that.

A practical approach may be to record and store historical metrics, and use them to predict future performance.
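A sketch of what that could look like at its simplest (the type, the margin, and the default guess are all hypothetical):

```rust
use std::collections::HashMap;

/// Remember the peak memory each unit (crate, linker invocation,
/// test binary, ...) used on previous runs and schedule against
/// the prediction.
struct MemoryHistory {
    peak_bytes: HashMap<String, u64>,
}

impl MemoryHistory {
    /// Predict the next run's peak for a unit, with a safety margin
    /// and a default for units we have never seen.
    fn predict(&self, unit: &str) -> u64 {
        const DEFAULT_GUESS: u64 = 512 * 1024 * 1024; // 512 MiB
        self.peak_bytes
            .get(unit)
            .map(|&b| b + b / 4) // 25% margin for run-to-run variance
            .unwrap_or(DEFAULT_GUESS)
    }

    /// After a run, fold the observed peak back into the history.
    fn record(&mut self, unit: &str, observed_peak: u64) {
        // Keep the worst case seen so far for this unit.
        let entry = self.peak_bytes.entry(unit.to_string()).or_insert(0);
        *entry = (*entry).max(observed_peak);
    }
}
```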