cmelone opened 2 months ago
will rebase this as well as #93
past thread on deciding # of build jobs: https://github.com/spack/spack/issues/26242
@HadrienG2 I figured you might be interested to know that we're working on this for our CI; the approach is quite similar to your comment.
This is the first version of our prediction formulas for max cpu and memory.
This PR also sets `SPACK_BUILD_JOBS` equal to the CPU request (rounded to the nearest core).

Using the included simulation script, I ran a scenario where we allocated resources for 8,000 specs.
The max memory prediction includes a 20% "bump" to avoid the OOM killing of ~1100 jobs.
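As a rough illustration of the shape of the allocation logic (not the exact formula in this PR; the `samples` input and the use of the historical peak are assumptions for the sketch):

```python
# Rough sketch of the allocation shape described above -- an illustration,
# not the exact formula in this PR. `samples` (per-spec historical build
# samples) and the use of the historical peak are assumptions.
MEM_BUMP = 1.20  # 20% headroom to reduce OOM kills


def predict_allocation(samples):
    """samples: list of (max_mem_bytes, max_cpu_cores) from past builds of a spec."""
    peak_mem = max(mem for mem, _ in samples)
    peak_cpu = max(cpu for _, cpu in samples)

    mem_request = peak_mem * MEM_BUMP        # predicted max memory + 20% bump
    cpu_request = peak_cpu                   # predicted max CPU
    build_jobs = max(1, round(cpu_request))  # SPACK_BUILD_JOBS = CPU request, nearest core

    return mem_request, cpu_request, build_jobs
```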
The ratio of actual usage to predicted usage (mem) was 0.6963, meaning that we are overallocating by roughly 30%. However, 437 jobs were OOM killed, representing an OOM rate of 0.055, far higher than we would like.

@alecbcs and I discussed alternative prediction strategies that include factoring in the ratio of mem/cores.
For example, take a look at a job whose predicted peak memory was roughly 3x lower than what it actually used, along with the data used to make this prediction:
This package usually takes 4-5 minutes to build, but instead took 21 minutes and peaked at nearly 4x memory.
In my opinion, there is no data available to us that would allow an accurate prediction in this scenario, and the same is true for most of the outliers I've seen. In this case, the job in question may have been affected by a noisy neighbor not respecting its allocation.
My vote is to keep the formula as-is and tweak it once we deploy gantry to the staging cluster with limits in place.
The ratio of actual to predicted usage for max CPU was 0.9546.
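For reference, a minimal sketch of how aggregate numbers like these could be derived from the simulation output; the field names and the sum-based aggregation are assumptions, not necessarily what the included script does:

```python
# Sketch of the aggregate metrics quoted above, assuming `results` is a list of
# dicts with predicted/actual peak values per simulated job. The field names and
# sum-based aggregation are assumptions, not necessarily what the script does.
def summarize(results):
    oom_killed = sum(1 for r in results if r["actual_mem"] > r["predicted_mem"])
    mem_ratio = sum(r["actual_mem"] for r in results) / sum(r["predicted_mem"] for r in results)
    cpu_ratio = sum(r["actual_cpu"] for r in results) / sum(r["predicted_cpu"] for r in results)
    return {
        "mem_ratio": mem_ratio,                 # e.g. ~0.70 -> ~30% over-allocation
        "cpu_ratio": cpu_ratio,                 # e.g. ~0.95
        "oom_rate": oom_killed / len(results),  # e.g. 437 / 8000 ~= 0.055
    }
```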