rust-lang / crater

Run experiments across parts of the Rust ecosystem!
https://crater.rust-lang.org

Add a CPU limit, raise example memory limit #599

Closed · saethlin closed this 2 years ago

saethlin commented 2 years ago

Part of addressing https://github.com/rust-lang/crater/issues/589

In short, the combination of no job limit and a low memory limit has been causing a lot of OOMs in crater runs. This PR adds a job limit and enforces it everywhere, including for doctests, which are built by the test runner. If nothing else, setting this CPU limit makes the OOMs reproducible, because many crates have peak memory usage proportional to build parallelism.
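For illustration, the container-level caps involved look roughly like the sketch below; crater applies its limits through rustwide's sandbox rather than by invoking docker directly like this, and the image name and limit values here are placeholders.

```rust
use std::process::Command;

// Illustrative only: a 4 GB memory cap and a 4-CPU cap on a build container,
// expressed with plain `docker run` flags. The image name is a placeholder.
fn run_capped_build() -> std::io::Result<std::process::ExitStatus> {
    Command::new("docker")
        .args([
            "run", "--rm",
            "--memory", "4g", // hard memory limit for the whole build
            "--cpus", "4",    // CPU cap; cargo still *detects* all host cores
            "example-build-env",
            "cargo", "test",
        ])
        .status()
}
```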

I have locally verified that a limit of 4 GB is sufficient to build all tests in the top 1,000 crates. The CPU limit alone fixes the majority of the OOMs, but 10 of the top 1,000 crates have a very memory-hungry link step that breaks past 1.5 GB no matter what other configuration is applied.

This PR contains https://github.com/rust-lang/crater/pull/598; I'll rebase after that is resolved.

Mark-Simulacrum commented 2 years ago

@saethlin Can you say more about how we should pick the CPU and memory limits here for a particular host? In particular, some of our systems have more or less memory, more or fewer worker threads configured, and variable amounts of physical/virtual CPU cores -- it would be nice to have some documentation in the PR description, and perhaps a replacement for https://github.com/rust-lang/crater/blob/master/docs/agent-machine-setup.md or a new file there covering this configuration.

saethlin commented 2 years ago

Somehow I missed that setup document. I'll try to provide some thoughts here, then maybe we can decide exactly what to do with the agent config. I've only run crater locally.

The problem I'm targeting here is that crater sees a lot of "spurious" OOMs. As far as I can tell, these aren't spurious at all; they're driven by the machine configuration. Due to cargo's automatic build parallelism, memory usage scales linearly with core count, albeit with a significant constant offset. So when crater jobs are always distributed with the same memory limit but varying numbers of available CPUs, the ones that see a lot of CPUs and parallelize well tend to OOM.

So the root cause of that problem is that available memory doesn't scale with available CPUs. If you wanted to be clever about it, you could try some scheme that adjusts the sandbox limit based on the properties of the runner. What I'm suggesting in this PR is more like the CI approach: everyone gets a small sandbox. Cargo still detects a lot of CPUs, but by setting --jobs and --test-threads we don't get into a situation where we're trying to link 16 or 32 test executables within just 512 MB. Most importantly, we need to fix the relationship between the number of jobs/CPUs and the amount of memory, because core counts only go up over time and people keep writing more tests; over time, builds involve running more and more ld processes at once.
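A minimal sketch of that "small sandbox" approach (not crater's actual code): cap cargo's build parallelism with `--jobs` and the libtest harness's parallelism with `--test-threads`, regardless of how many cores the host exposes.

```rust
use std::process::Command;

// Run `cargo test` with both kinds of parallelism capped at `jobs`.
// Arguments after `--` are forwarded to the test binaries (libtest).
fn capped_cargo_test(jobs: u32) -> std::io::Result<std::process::ExitStatus> {
    let jobs = jobs.to_string();
    Command::new("cargo")
        .args(["test", "--jobs", jobs.as_str()])
        .args(["--", "--test-threads", jobs.as_str()])
        .status()
}
```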

The second aspect is that the absolute memory requirement for the single link of a big project (during cargo build, or cargo test for the unit tests) has grown. There are a handful of projects in the top 1,000 crates that ld simply can't link within even 3 GB of memory, hence my suggestion of 4 GB. So regardless of build parallelism, the crater sandbox memory limit needs to be raised, and it will probably need to be inched up in the future as well. (memchr is an outlier and will still OOM in crater until my PR or something like it is merged.)

In general, I think 4 CPUs + 4 GB is a minimal setup. It suffices for all of the top 1,000 crates (minus memchr), and 3 GB does not suffice for a handful of the top crates. Providing less than 1 CPU per GB is unlikely to be cost-effective in a cloud environment anyway, and increasing the ratio of CPUs per GB will start to cause OOMs from crates that have a lot of big unit test binaries, such as regex. If possible, I think these 4+4 jobs should be distributed to workers such that (nearly?) all of a host's memory is accounted for by jobs, so a host that has 8 GB of memory should run 2 jobs.

Mark-Simulacrum commented 2 years ago

Right. I think the key bit to document (or perhaps automate) would be the criterion you suggest: for a machine with X GB of memory, run X / (memory limit) workers, right?
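A minimal sketch of that sizing rule (the names and example numbers are illustrative):

```rust
// Derive the worker count from the host's memory and a fixed per-worker
// memory budget, instead of configuring it by hand per machine.
fn worker_count(host_memory_gb: u64, per_worker_gb: u64) -> u64 {
    // Round down so the budgets never oversubscribe the host;
    // always run at least one worker.
    (host_memory_gb / per_worker_gb).max(1)
}

// e.g. worker_count(8, 4) == 2, worker_count(32, 4) == 8
```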

We currently have 4 machines, with largely arbitrary stats in terms of cores/memory/worker threads; worker threads are the easiest to change.

On all of these machines, X / 4 GB would mean pretty few workers, which might still be fine, but I would expect it to mean an increased percentage of idle CPUs, though I could easily be wrong. It's definitely the case that the current CPU-overcommit design is intended to minimize the idle time on any CPU core by ensuring there's some rustc that's probably ready to run (either regular or parallel compilation) and isn't sitting in e.g. cargo network fetches or something else.

Do you think we should update to ~7-8 and ~30 worker threads on the latter two instances, respectively? FWIW, a quick look suggests the first two instances are typically at ~100% CPU utilization, versus 70% and 80% on the latter two. But I don't have as much data for the two larger instances, as they get restarted nightly and we don't currently retain this data across restarts.

Mostly a wall-dump of stats and questions -- curious to hear your opinions here.

saethlin commented 2 years ago

I agree that reserving 4 GB per job makes poor use of resources. I'm also quite selfishly interested in this sort of problem because I run a codebase also built off rustwide, but it runs Miri and thus I currently reserve 1 CPU + 64 GB per job.

You could try running with dramatic memory oversubscription and no parallelism limit similar to or exactly the current config, then pull the list of crates that encountered an OOM and re-run just those in the 4+4 configuration with no oversubscription. As it stands there are only ~1200 crates that OOM per run, which is ~1%. Bumping the default memory limit, even just a little, could probably bring that number down a fair bit so that the re-run isn't too large.
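A sketch of that two-pass idea, using hypothetical result types (crater's real result representation differs):

```rust
use std::collections::BTreeSet;

// Hypothetical shape of a per-crate result from the first, oversubscribed pass.
struct CrateResult {
    name: String,
    oomed: bool,
}

// Collect the crates that OOMed so they can be re-run in the 4 CPU + 4 GB
// configuration with no oversubscription.
fn oom_retry_set(results: &[CrateResult]) -> BTreeSet<String> {
    results
        .iter()
        .filter(|r| r.oomed)
        .map(|r| r.name.clone())
        .collect()
}
```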

You could also add swap (in EC2, this requires attaching a new volume). I don't know how much swap will help on its own, but I have a somewhat crazy idea: a background thread could monitor the free memory in the system and, if it gets tight, docker pause containers, probably starting with the lowest-memory ones.
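A rough sketch of that monitor, assuming a Linux host (it reads MemAvailable from /proc/meminfo) and `docker pause` on PATH; the container list and threshold are supplied by the caller, and the choice of which container to pause is left naive here.

```rust
use std::process::Command;
use std::{fs, thread, time::Duration};

// Parse MemAvailable (in KiB) out of /proc/meminfo.
fn mem_available_kib() -> Option<u64> {
    fs::read_to_string("/proc/meminfo")
        .ok()?
        .lines()
        .find(|l| l.starts_with("MemAvailable:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

// Background thread: if free memory drops below the threshold, pause the
// first container in the list. A smarter policy would start with the
// lowest-memory container, as suggested above.
fn watch_memory(containers: Vec<String>, threshold_kib: u64) {
    thread::spawn(move || loop {
        if let Some(avail) = mem_available_kib() {
            if avail < threshold_kib {
                if let Some(id) = containers.first() {
                    let _ = Command::new("docker").args(["pause", id.as_str()]).status();
                }
            }
        }
        thread::sleep(Duration::from_secs(5));
    });
}
```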

You could also record the peak memory of each crate and use that to inform scheduling crates onto hosts. A good choice would of course be some percentage above the previous usage. But this potentially adds a lot of complexity to running crater and you also have to keep some database of crate memory utilization around.
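As a sketch, with hypothetical data shapes: keep each crate's peak memory from a previous run, pad it with some headroom, and only schedule the crate onto a host with that much memory still unreserved.

```rust
use std::collections::HashMap;

// `peak_memory` maps crate name -> peak bytes observed on the previous run.
// Crates with no history fall back to `default_bytes`.
fn fits_on_host(
    peak_memory: &HashMap<String, u64>,
    krate: &str,
    host_free_bytes: u64,
    default_bytes: u64,
) -> bool {
    let predicted = peak_memory
        .get(krate)
        .map(|&p| p + p / 5) // ~20% headroom over the previous peak
        .unwrap_or(default_bytes);
    predicted <= host_free_bytes
}
```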

Mark-Simulacrum commented 2 years ago

I think anything non-automated isn't going to be viable; the failure rate this is trying to address is high, but it isn't really impeding our typical usage of Crater at this time. I would definitely be very happy to see a revision to Crater's code that automatically retries spurious failures (up to some limit of attempts); presumably we could push those retries towards the end of the run and decrease parallelism and increase worker memory limits for those jobs.
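One possible shape for that retry logic; the config struct, outcome enum, `run_crate`, and the limit values are hypothetical stand-ins, not crater's real types.

```rust
struct SandboxConfig {
    memory_gb: u64,
    jobs: u32,
}

enum Outcome {
    Ok,
    Oom,
    OtherFailure,
}

// Placeholder for the real build/test run of one crate.
fn run_crate(_name: &str, _cfg: &SandboxConfig) -> Outcome {
    unimplemented!()
}

// First pass uses the current oversubscribed defaults; an OOM gets one retry
// with fewer jobs and a dedicated memory budget, as proposed above.
fn run_with_retry(name: &str) -> Outcome {
    let attempts = [
        SandboxConfig { memory_gb: 2, jobs: 8 }, // illustrative "default" pass
        SandboxConfig { memory_gb: 8, jobs: 4 }, // illustrative retry pass
    ];
    let mut last = Outcome::OtherFailure;
    for cfg in &attempts {
        last = run_crate(name, cfg);
        match last {
            Outcome::Ok | Outcome::OtherFailure => return last,
            Outcome::Oom => continue, // only OOMs are worth a retry here
        }
    }
    last
}
```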

We have large disks already associated with these instances (typically in the 100s of GB, at least), so using some of that space for swap would not be terribly difficult, but I'd rather avoid trying to oversubscribe the memory that way. My intuition says that swap or some more fancy memory-pressure sensing scheme is likely too difficult for us to deploy with the current resources -- it's likely to be rather finicky, I suspect, and require a bunch of work to get started with.

I think the automated rescheduling strategy -- with significantly reduced worker count, perhaps such that we have 4-8 GB dedicated memory available -- seems like a good idea to me. It's probably a good bit of work, but it would also likely be fairly reusable for the spurious test failure problem we have (e.g., due to randomness, etc.), so if we can do it that would be great. Would you be up for investing the time to do that?

saethlin commented 2 years ago

I agree with everything you've said. I'll try implementing that soon.

bors commented 2 years ago

:umbrella: The latest upstream changes (presumably #543) made this pull request unmergeable. Please resolve the merge conflicts.

saethlin commented 2 years ago

Closing this to indicate I do not have and don't expect to have time to work on it.