spack / spack-gantry

A Dynamic Resource Allocation System for Spack CI and Kubernetes

Add resource limits #106

Open cmelone opened 2 months ago

cmelone commented 2 months ago

This is the first version of our prediction formulas for max cpu and memory.

This PR also sets SPACK_BUILD_JOBS equal to the CPU request, rounded to the nearest core.
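A minimal sketch of that rounding, not the code in this PR. The function names and the floor of one job are assumptions for illustration, and the CPU request is assumed to be a float number of cores.

```python
import math


def build_jobs_from_cpu_request(cpu_request: float) -> int:
    """Round a CPU request (in cores) to the nearest whole core.

    Assumption: never return fewer than 1 job, so a fractional request
    such as 0.2 cores still yields a usable SPACK_BUILD_JOBS value.
    """
    # floor(x + 0.5) rounds halves up; the built-in round() would round
    # 2.5 down to 2 (banker's rounding), which is not what we want here.
    return max(1, math.floor(cpu_request + 0.5))


def build_job_env(cpu_request: float) -> dict:
    """Return the environment variable to inject into the build job."""
    return {"SPACK_BUILD_JOBS": str(build_jobs_from_cpu_request(cpu_request))}
```

For example, `build_job_env(3.4)` gives `{"SPACK_BUILD_JOBS": "3"}`.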

Using the included simulation script, I ran a scenario where we allocated resources for 8000 specs.


The max memory predictions include a 20% "bump", which avoids the OOM killing of ~1100 jobs.
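As a rough sketch of how such a bump could be applied (the names `MEMORY_BUMP`, `bumped_mem_limit` and `would_be_oom_killed` are hypothetical, and the base prediction is taken as given rather than computed here):

```python
# 20% headroom on top of the predicted peak memory, as described above.
MEMORY_BUMP = 1.2


def bumped_mem_limit(predicted_peak_mb: float, bump: float = MEMORY_BUMP) -> float:
    """Scale a predicted peak memory value (MB) by the safety bump."""
    return predicted_peak_mb * bump


def would_be_oom_killed(actual_peak_mb: float, limit_mb: float) -> bool:
    """A job is killed if its actual peak exceeds the limit it was given."""
    return actual_peak_mb > limit_mb
```

A job predicted to peak at 1000 MB would get a 1200 MB limit, so it survives an actual peak of 1100 MB that would have been killed without the bump.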

The ratio of actual usage/predicted usage (mem) was 0.6963, meaning that about 30% of the memory we allocate goes unused. However, 437 jobs were still killed, representing an OOM rate of 0.055, far higher than we would like.
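For reference, the two metrics can be sketched as below. How the simulation script aggregates the ratio is not stated here, so the mean of per-job ratios is an assumption, and the function names are illustrative.

```python
def usage_ratio(actual: list, predicted: list) -> float:
    """Mean of per-job actual/predicted ratios.

    Assumption: the aggregate is a mean of per-job ratios; a ratio of
    totals would be an equally plausible definition.
    """
    return sum(a / p for a, p in zip(actual, predicted)) / len(actual)


def oom_rate(killed: int, total: int) -> float:
    """Fraction of simulated jobs whose actual peak exceeded their limit."""
    return killed / total


# Figures quoted above: 437 killed jobs out of 8000 simulated specs.
rate = oom_rate(437, 8000)        # 0.054625, i.e. about 0.055
unused_share = 1 - 0.6963         # about 0.30 of allocated memory unused
```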

@alecbcs and I discussed alternative prediction strategies, including factoring in the ratio of memory to cores.

For example, take a look at a job that was predicted to use 3x less memory at peak than it actually used, and the data used to make this prediction:

gmake@4.4.1 ~guile build_system=generic%intel@2021.10.0 gitlab_id=12859608
duration: 1262 cpu_mean: 0.621, cpu_max: 0.956, mem_mean: 2590.126, mem_max: 4448.702

samples:
duration: 181 cpu_mean: 0.169, cpu_max: 0.424, mem_mean: 105.722, mem_max: 168.37
duration: 149 cpu_mean: 0.531, cpu_max: 1.064, mem_mean: 702.054, mem_max: 1033.888
duration: 107 cpu_mean: 0.283, cpu_max: 0.415, mem_mean: 95.556, mem_max: 149.381
duration: 432 cpu_mean: 0.31, cpu_max: 1.051, mem_mean: 100.313, mem_max: 1300.226
duration: 396 cpu_mean: 0.268, cpu_max: 1.023, mem_mean: 172.576, mem_max: 1364.505

This package usually takes 4-5 minutes to build, but this job instead took 21 minutes and peaked at nearly 4x its usual memory.
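A quick way to check how far this job sits from its samples is to parse the lines above and compare. This is only a sketch: the `parse` helper is hypothetical and is not part of the simulation script, and it compares against the raw samples rather than the actual prediction formula.

```python
import re

# Matches "name: number" pairs such as "mem_max: 4448.702".
FIELD = re.compile(r"(\w+): ([\d.]+)")


def parse(line: str) -> dict:
    """Turn one 'duration: ... mem_max: ...' line into a dict of floats."""
    return {key: float(value) for key, value in FIELD.findall(line)}


job = parse("duration: 1262 cpu_mean: 0.621, cpu_max: 0.956, "
            "mem_mean: 2590.126, mem_max: 4448.702")
samples = [parse(s) for s in (
    "duration: 181 cpu_mean: 0.169, cpu_max: 0.424, mem_mean: 105.722, mem_max: 168.37",
    "duration: 149 cpu_mean: 0.531, cpu_max: 1.064, mem_mean: 702.054, mem_max: 1033.888",
    "duration: 107 cpu_mean: 0.283, cpu_max: 0.415, mem_mean: 95.556, mem_max: 149.381",
    "duration: 432 cpu_mean: 0.31, cpu_max: 1.051, mem_mean: 100.313, mem_max: 1300.226",
    "duration: 396 cpu_mean: 0.268, cpu_max: 1.023, mem_mean: 172.576, mem_max: 1364.505",
)]

# Highest peak memory seen in any prior sample, and the mean prior duration.
highest_prior_peak = max(s["mem_max"] for s in samples)
mean_prior_duration = sum(s["duration"] for s in samples) / len(samples)

# Job peak relative to the worst prior sample: about 3.26.
peak_ratio = job["mem_max"] / highest_prior_peak
# Job duration relative to the mean prior duration (253 s): about 5x.
duration_ratio = job["duration"] / mean_prior_duration
```

Even the largest peak among the samples is well under a third of what this job used, so no scaling of these five samples at a 20% margin would have covered it.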

In my opinion, no data available to us would allow an accurate prediction in this scenario, and the same is true of most of the outliers that I've seen. Here, the job in question may have been affected by a noisy neighbor not respecting its allocation.

My vote is to keep the formula as-is, and tweak it once gantry is deployed to the staging cluster with limits in place.


The ratio of actual usage/predicted usage for max cpu was 0.9546.

cmelone commented 1 month ago

will rebase this as well as #93

cmelone commented 1 month ago

past thread on deciding # of build jobs: https://github.com/spack/spack/issues/26242

@HadrienG2 I figured you might be interested to know that we're working on this for our CI; the approach is quite similar to the one in your comment