sifive / wake

The SiFive wake build tool

cgroup support for resource budget #1123

Open mmjconolly opened 1 year ago

mmjconolly commented 1 year ago

Wake has a ResourceBudget type exposed via the CLI:

```
$ wake --help | grep %
    --jobs=N   -jN   Schedule local jobs for N cores or N% of CPU (default 90%)
    --memory=M -mM   Schedule local jobs for M bytes or M% of RAM (default 90%)
```

As far as I can tell these aren't enforced limits, just guidance for wake's scheduler when it chooses whether to invoke a job or wait.

We may want to provide a hard limit, possibly implemented by creating cgroups within wake?

JakeSiFive commented 1 year ago

Correct, right now these are only used for scheduling. It is a bit tricky to enforce them exactly with the current scheduling algorithm, however. Right now the scheduler is allowed to go over the limit (and frequently does); it just isn't allowed to schedule any more jobs until usage drops far enough under the limit that some to-be-scheduled job would hypothetically fit within the currently available space. This means that wake actually spends a lot of time just over those limits, or very close to them, especially if we're given bad resource estimates.

We'd need a way to account for that I think.
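
(For illustration, here is a minimal sketch of the admission rule described above; the names `Budget` and `Job` are invented for this comment and are not wake's actual code. The point is that the budget only gates *admission*, so real usage can drift past the limit while jobs are running.)

```cpp
// Hypothetical sketch of the "schedule until over budget" rule.
// Identifiers are illustrative, not taken from wake's source.
#include <cstdint>

struct Job {
  double cpu_estimate;    // estimated fraction of one core, e.g. 0.2
  uint64_t mem_estimate;  // estimated bytes
};

struct Budget {
  double cpu_limit;       // e.g. 0.9 * total cores
  uint64_t mem_limit;     // e.g. 0.9 * total RAM
  double cpu_in_use = 0;
  uint64_t mem_in_use = 0;

  // A job may start only if it hypothetically fits in what is left.
  // Running jobs that exceed their estimates push real usage over the
  // limit anyway; we only stop admitting new jobs, we never evict.
  bool fits(const Job &j) const {
    return cpu_in_use + j.cpu_estimate <= cpu_limit &&
           mem_in_use + j.mem_estimate <= mem_limit;
  }

  void admit(const Job &j)   { cpu_in_use += j.cpu_estimate; mem_in_use += j.mem_estimate; }
  void release(const Job &j) { cpu_in_use -= j.cpu_estimate; mem_in_use -= j.mem_estimate; }
};
```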

mmjconolly commented 1 year ago

If we created user-specified-sized cgroups (for the whole wake invocation), the kernel would invoke the out-of-memory killer on our behalf, rather than wake trying to predict which processes will use how much of what.
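
(Roughly, on a cgroup v2 system that could look like the sketch below. It assumes the unified hierarchy is mounted at /sys/fs/cgroup, that wake is allowed to create a child group there with the memory and cpu controllers delegated, and the group name and limits are placeholders.)

```cpp
// Illustrative only: create a cgroup v2 child group, set hard limits, and
// move this process (and thus every job it forks) into it.
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const std::string &path, const std::string &value) {
  std::ofstream f(path);
  f << value;  // error handling omitted for brevity
}

int main() {
  // Hypothetical group name; the memory/cpu controllers must already be
  // enabled in the parent's cgroup.subtree_control for these knobs to exist.
  const std::string cg = "/sys/fs/cgroup/wake-invocation";
  mkdir(cg.c_str(), 0755);

  // Hard memory cap: the kernel OOM-kills processes in the group past this.
  write_file(cg + "/memory.max", "17179869184");  // e.g. 16 GiB

  // Hard CPU cap: at most 8 cores' worth of time per 100ms period.
  write_file(cg + "/cpu.max", "800000 100000");

  // Put the current process into the group; forked jobs inherit membership.
  write_file(cg + "/cgroup.procs", std::to_string(getpid()));

  // ... launch jobs as usual ...
  return 0;
}
```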

JakeSiFive commented 1 year ago

Yeah, but that's kinda bad, isn't it? We want wake to schedule things so that we don't get OOM errors.

Right now wake actually schedules up to 90% of those values to account for the possible slop, but I fear the slop could be far worse than 10%.

Say someone schedules 10000 jobs that each take 0.1-0.3 GB and consume 10-30% of a CPU, and all of these jobs run in parallel. Wake assumes that each of them takes 0.2 GB and 20% CPU. The expected deviation is on the order of sqrt(10000) * 0.1 GB == 10 GB and sqrt(10000) * 10% == 1000% CPU. I think people would still reasonably expect wake to intelligently schedule these jobs. Right now it just eats the variance to some extent and says "that's ok", but if we added cgroups and the variance was that high, we'd get failing jobs instead.
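
(The sqrt scaling here comes from treating each job's usage as independent of the others, which is an assumption; if each of the N jobs deviates from its estimate with standard deviation sigma, the total deviates with standard deviation sqrt(N) * sigma. With the per-job spread of roughly 0.1 GB and 10% CPU quoted above:)

```latex
% Standard deviation of the sum of N independent per-job usages:
\sigma_{\text{total}} = \sqrt{N}\,\sigma_{\text{job}}
  \approx \sqrt{10000} \times 0.1\,\text{GB} = 10\,\text{GB}
\quad\text{and}\quad
\sqrt{10000} \times 10\%\,\text{CPU} = 1000\%\,\text{CPU}
```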