moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit
Apache License 2.0

proposal: automatic resource controls #2108

Open tonistiigi opened 3 years ago

tonistiigi commented 3 years ago

https://github.com/moby/buildkit/pull/2049 is adding a simple parallelization limit. This proposal describes a more complex follow-up to this problem. The goal is that when a build is scaled up (e.g. take a build that compiles an app now and change it into a build that compiles the same app from 100 different commits), it does not cause the builder to crash and the machine to become unresponsive (or catch fire). This should happen without requiring manual configuration. If you move from a high-powered machine to a low-powered one (e.g. an rpi), the build should slow down linearly without additional bottlenecks from inefficient execution.

In the simplest form, the scheduler should be combined with system state monitoring. When the monitor detects that the machine's resource limits have been reached, it starts blocking new jobs. In some cases, existing jobs may need to be paused.

This is different from cgroup controls that may be applied independently.

I'm only concentrating on CPU and memory. In the future, this could be extended to io/network. It's also probably too early to discuss making initial predictions of possible resource usage based on command arguments.

type SystemStats interface {
  CPUInfo()
  MemoryInfo()
}

NewResourceManager(SystemStats, maxMemory, memoryBuffer) // + existing MaxParallelism semaphore

rm.Init(ctx, id, cpuHint, memoryHint) error
rm.Update(id, cpuLoad, memoryUsage) (wait chan struct{}, err error)
rm.Leave(id) error

type Op interface {
   // current methods
   Acquire(ctx) (release, error)
}

New daemon config values:

MaxMemory - defaults/max to currently available memory
MemoryBuffer - a percentage of MaxMemory
NumCPUs - maybe, but not very related

New values for ExecOp:

MemoryHint - how much memory the process expects to need
MemoryLimit - limit memory under these bounds (with a cgroup)

Every Op is initialized with a shared ResourceManager instance. The solver/scheduler has no knowledge of the limits and only calls the op's Acquire() method, which blocks until the op can start (or is canceled). This is important for the case where we have multiple workers: two vertexes on different workers have different resource managers and don't block each other. Also, the solver/vertex definition is very generic and doesn't fit with very specific linux resources. Hopefully, this doesn't limit us from making smarter scheduling decisions.

The Acquire() method registers the current ID with the ResourceManager. If the system is exhausted, the function blocks. During Exec, the Op monitors its own resources (the cgroup of the containers it created) and calls rm.Update(). The Update call may choose to return a channel. If that happens, the op should pause its execution (e.g. with the freezer cgroup) and wait until that channel is closed.
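To make the lifecycle concrete, here is a minimal sketch of the op side of that contract, assuming the Init/Update/Leave signatures above (Acquire() would call Init() before this loop starts); sampleCgroup, freeze, and thaw are hypothetical helpers, not buildkit APIs:

    package main

    import (
        "context"
        "time"
    )

    // ResourceManager mirrors the Init/Update/Leave sketch above.
    type ResourceManager interface {
        Init(ctx context.Context, id string, cpuHint, memoryHint uint64) error
        Update(id string, cpuLoad, memoryUsage uint64) (wait chan struct{}, err error)
        Leave(id string) error
    }

    // monitor is what an ExecOp could run while its container executes:
    // it periodically reports the container's cgroup stats and, when
    // Update hands back a wait channel, freezes the container until the
    // manager closes the channel.
    func monitor(ctx context.Context, rm ResourceManager, id string) error {
        ticker := time.NewTicker(2 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return rm.Leave(id)
            case <-ticker.C:
                cpu, mem := sampleCgroup(id)
                wait, err := rm.Update(id, cpu, mem)
                if err != nil {
                    return err
                }
                if wait == nil {
                    continue
                }
                freeze(id) // processes keep their memory but stop running
                select {
                case <-wait: // manager says resources are available again
                    thaw(id)
                case <-ctx.Done():
                    thaw(id)
                    return rm.Leave(id)
                }
            }
        }
    }

    func sampleCgroup(id string) (cpu, mem uint64) { return 0, 0 } // stub: read the container's cgroup stats
    func freeze(id string)                         {}             // stub: freezer cgroup FROZEN
    func thaw(id string)                           {}             // stub: freezer cgroup THAWED

Note that the freezer cgroup keeps the processes' memory resident while stopping them from consuming CPU, which is why pausing helps with CPU pressure but not memory pressure.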

The ResourceManager compares system stats with the values sent by the ops, and decides when to block certain ops (when they call Init/Update) and when to restart them. The ResourceManager should be unit-testable with a custom SystemStats provider implementation.

CPU

The main parameter to monitor here is whether the system CPU is exhausted. This can be determined from vmstat (or its /proc equivalent) by checking the length of the run queue compared to the number of CPUs, and the CPU idle time. If the CPU is exhausted, additional ops can't run. Values should be computed as a weighted average over a time period to minimize wrong decisions caused by quick changes. Historical values should at least be taken into account when determining whether the CPU is free again; when the CPU is exhausted, starting new ops can be blocked without historical data. When one op has finished, the algorithm should be smart enough to understand that its CPU time is no longer used.
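As a rough illustration of the run-queue signal only (the idle-time check is omitted), a sketch that smooths procs_running from /proc/stat with an exponentially weighted moving average; the 0.2 weight and 1.5x threshold are invented tunables:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "runtime"
    )

    // cpuMonitor keeps a weighted average of the run-queue length
    // (procs_running in /proc/stat, the /proc equivalent of vmstat's
    // "r" column) relative to the number of CPUs, so short spikes
    // don't trigger wrong decisions.
    type cpuMonitor struct {
        ewma float64 // smoothed run-queue length
    }

    // sample should be called periodically, e.g. once per second.
    func (m *cpuMonitor) sample() error {
        f, err := os.Open("/proc/stat")
        if err != nil {
            return err
        }
        defer f.Close()
        s := bufio.NewScanner(f)
        for s.Scan() {
            var n float64
            if _, err := fmt.Sscanf(s.Text(), "procs_running %f", &n); err == nil {
                const alpha = 0.2 // weight of the newest sample
                m.ewma = alpha*n + (1-alpha)*m.ewma
            }
        }
        return s.Err()
    }

    // exhausted reports whether starting new ops should be blocked.
    func (m *cpuMonitor) exhausted() bool {
        return m.ewma > 1.5*float64(runtime.NumCPU())
    }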

The second problem to avoid is starting too many ops in parallel while the stats are low, only for them to exhaust the CPU as they ramp up. This should be handled by introducing delays when too many processes start at the same time. To predict the delays, I think we need to look at system load as well as the count/speed of the CPUs. E.g. we should be able to detect that an RPi needs longer delays. One way to think about this problem is that we set a prediction of CPU usage on every op we start. Initially, this prediction has a big standard deviation. Over time, as we get actual values via Update(), that stddev gets smaller, and we can trust the data if it says that there is more CPU power left.

Generally, over-using the CPU is not as big a problem as doing the same with memory. The kernel's scheduler balances quite well, and more parallel ops usually give faster build times. So we should not try to be very precise, just avoid the extreme cases.

Although rare, I do think we need some logic to also pause ops when needed. For example, let's say there are a lot of processes running ./configure && make. Configure is usually quite sequential, so CPU load will not be detected. It also takes a long time, so the startup delays have no effect. But make can likely take advantage of the whole CPU and create a big run queue. So if the run queue remains very long for a long time, Update should send a signal to the op to pause it.

We can also look into scheduling priorities to give some ops more CPU than others.

Memory

Memory monitoring is somewhat similar, but we need to be more precise. The manager is configured with a MaxMemory parameter, capped at the maximum free system memory, and the goal is for the ops' total memory usage to never exceed that value.

Unlike CPU, once we pause an op, it does not release the memory it already uses (at least not without checkpoint/restore, which is out of scope atm). This means we need a buffer to allow memory to grow, and we need to predict how much memory an op will take in the future.

For a better prediction, an op can provide MemoryHint and MemoryLimit values with its definition. The hint estimates how much memory will be needed, so we can avoid an overflow when it is known that the process uses lots of resources. The limit sets a cgroup limit and can be used as an upper cap for the prediction.

If no hint was set, the prediction starts with a value based on a constant and some average of memory usage from previous builds (later, command arguments could be used for better historic prediction). Initially, the prediction has a high standard deviation.

Once the process has started, it sends updates about its memory usage. This data can be used for future predictions based on the changes in previous data. As we get more data, we can trust it more and the stddev gets smaller.

Examples:

Initial prediction: 200MB
Process starts: takes 30MB instantly, 31MB in 10s, 32MB in 20s
Prediction: 200MB at start, 100MB after 5s, 40MB after 10s

Initial prediction: 200MB
Process starts: takes 100MB instantly, 200MB in 10s, 300MB in 20s, 320MB in 30s, 325MB in 40s
Prediction: 200MB at start, 300MB after 10s, 1GB after 20s, 500MB after 30s, 400MB after 40s

It's unclear how far into the future the prediction should look. We probably need a lot of tunable parameters to find the best values.
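One naive way to model the shrinking standard deviation in these examples is to blend a prior (the hint, or a historic constant) with a linear extrapolation of the observed samples, weighting the extrapolation more heavily as samples accumulate. The sketch below is illustrative only; the weights, horizon, and the +5 constant are invented and will not reproduce the exact numbers above:

    package main

    // predictMemory blends a prior guess (MemoryHint, or a historic
    // constant) with a linear extrapolation of the observed samples.
    // The weight of the observed trend grows with the sample count,
    // standing in for the shrinking standard deviation; MemoryLimit,
    // if set, caps the result.
    func predictMemory(prior, limit uint64, samples []uint64, horizon int) uint64 {
        pred := prior
        if len(samples) >= 2 {
            last := int64(samples[len(samples)-1])
            growth := last - int64(samples[len(samples)-2]) // bytes per sample interval
            extrapolated := last + growth*int64(horizon)
            if extrapolated < last {
                extrapolated = last // never predict below current usage
            }
            // trust in the observed trend grows with the sample count
            w := float64(len(samples)) / float64(len(samples)+5)
            pred = uint64(w*float64(extrapolated) + (1-w)*float64(prior))
        }
        if limit > 0 && pred > limit {
            pred = limit // the cgroup limit is an upper cap for the prediction
        }
        return pred
    }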

All the ops' memory predictions are added together and compared with the maximum available memory. A buffer is also applied so that the processes left running can grow their memory.

If the prediction shows that the memory limit is about to be reached, one of the ops is paused. It probably makes sense to pause the op that was most aggressively acquiring new memory.
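A sketch of that admission/pause decision under the assumptions above (per-op predictions, a MemoryBuffer percentage, pausing the fastest-growing op); all names here are hypothetical, not buildkit code:

    package main

    type opState struct {
        predicted  uint64        // predicted future memory usage, bytes
        growthRate int64         // observed growth over the last window, bytes/s
        pause      chan struct{} // non-nil while the op is paused
    }

    type resourceManager struct {
        ops       []*opState
        maxMemory uint64 // MaxMemory config value
        bufferPct uint64 // MemoryBuffer config value, percent of MaxMemory
    }

    // check sums the predictions, keeps a buffer so running ops can
    // grow, and pauses the op acquiring memory most aggressively if
    // the budget is about to be exceeded.
    func (rm *resourceManager) check() {
        var total uint64
        var hungriest *opState
        for _, op := range rm.ops {
            total += op.predicted
            if op.pause == nil && (hungriest == nil || op.growthRate > hungriest.growthRate) {
                hungriest = op
            }
        }
        budget := rm.maxMemory - rm.maxMemory*rm.bufferPct/100
        if total > budget && hungriest != nil {
            // the channel is handed back from Update(); the op freezes
            // itself until the manager closes it
            hungriest.pause = make(chan struct{})
        }
    }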


When an Op pauses, we need a way for this to show up in the progress bar. This should be solvable with a new state in the Vertex status structure. This part could be done as a separate step, and should possibly be done first, as there may be backward-compatibility problems with old clients.

As another follow-up, the resource manager should be able to return debug info. By analyzing that info, it should be possible to predict whether a build needs more CPU, iops, memory, etc., and how much faster it would have been on a machine with different capabilities.

@vladaionescu @AkihiroSuda @hinshun @aaronlehmann @crazy-max

vladaionescu commented 3 years ago

I've noticed that io/network is more likely to cause congestion than CPU or memory - especially with all the file-syncing that typically goes on. I feel like that's the more important bottleneck to address. In any case, this feels like the right direction and will be especially useful for managing multitenancy.

tonistiigi commented 3 years ago

@vladaionescu Do you mean the pulls/pushes being too parallel, rather than the io bottlenecks in exec? I had https://github.com/moby/buildkit/pull/1989 separately for that, which really improved some giant builds but unfortunately introduced a deadlock and was reverted; that needs to be figured out before implementation.

aaronlehmann commented 3 years ago

Although rare, I do think we need some logic to also pause ops when needed. For example, let's say there are a lot of processes running ./configure && make. Configure is usually quite sequential, so CPU load will not be detected. It also takes a long time, so the startup delays have no effect. But make can likely take advantage of the whole CPU and create a big run queue. So if the run queue remains very long for a long time, Update should send a signal to the op to pause it.

As you mentioned above, the OS-level scheduler generally does a good job managing CPU contention. I can see why we might want to delay starting ops based on load, since trying to run too many ops at once could lead to suboptimal performance. But I'm curious about the rationale for pausing ops when CPU load gets high - do you think BuildKit would be able to make good decisions about what to pause and when, better overall than the scheduler's timeslicing?

tonistiigi commented 3 years ago

@aaronlehmann

I do think that 100% CPU usage is not something we should avoid. On the contrary, the builder should maximize the time the CPU is fully utilized. But I think there are some extreme cases where it might be better for us to pause.

E.g. conventional wisdom says that make should be started with make -j $(nproc), not with $(nproc) * 10. But the latter is what can easily happen when running make in buildkit atm. Adding these extra threads would hurt more than it speeds things up. Even if the kernel's scheduler had no overhead, I can imagine this reduces CPU cache hits due to many more context switches and increases the chances of an expensive core switch. A very long run queue that does not stabilize might also point to some other bottleneck in the system that is not as easy to detect.

Startup delays would be enough, except that process groups do not have constant CPU usage; there might be no load initially that only picks up later. I agree that we need to be careful with these pauses. I don't want a flickering experience where a process flips between running and paused all the time.

I guess some practical cases would tell for sure how effective pausing would be. I think it at least makes sense to test it.

vladaionescu commented 3 years ago

Do you mean the pulls/pushes being too parallel, rather than the io bottlenecks in exec? I had #1989 separately for that, which really improved some giant builds but unfortunately introduced a deadlock and was reverted; that needs to be figured out before implementation.

Yes, exactly - that kind of limiting is really helpful for large builds. I've been noticing random timeouts due to simple COPY commands.

tonistiigi commented 3 years ago

I've been catching up on cgroupv2/PSI and it looks quite related; memory.low, memory.pressure, cpu.pressure, and io.pressure are interesting. I wonder if we should make it a requirement for this implementation.
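For reference, PSI exposes exactly this kind of smoothed pressure data. A minimal sketch of reading the 10-second memory pressure average from /proc/pressure/memory (the per-cgroup memory.pressure file in cgroupv2 uses the same format):

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    // memoryPressureAvg10 returns the "some" avg10 value from
    // /proc/pressure/memory, i.e. the percentage of the last 10 seconds
    // during which at least one task was stalled waiting on memory.
    // The file contains lines like:
    //   some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
    //   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
    func memoryPressureAvg10() (float64, error) {
        b, err := os.ReadFile("/proc/pressure/memory")
        if err != nil {
            return 0, err // PSI requires kernel 4.20+
        }
        for _, line := range strings.Split(string(b), "\n") {
            var avg10, avg60, avg300 float64
            var total uint64
            _, err := fmt.Sscanf(line, "some avg10=%f avg60=%f avg300=%f total=%d",
                &avg10, &avg60, &avg300, &total)
            if err == nil {
                return avg10, nil
            }
        }
        return 0, fmt.Errorf("no 'some' line in /proc/pressure/memory")
    }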

mllab-nl commented 1 year ago

Hey guys, any plans to continue development on this?

What @tonistiigi is describing here is really nice and complex, but simple bookkeeping of RAM, using something like the MemoryHint on a RUN command, would make a world of difference. Right now I need to ensure through external constraints that 2 HUGE RUNs will not be executed at the same time, to prevent running out of memory. With such a MemoryHint, buildkit would take care of that for me.

This is especially painful because an external constraint does not know whether the RUN is cached or not ...

So maybe it makes sense to take small steps here and implement something relatively simple?

mllab-nl commented 1 year ago

After giving this a bit more thought and looking more into cgroupv2/PSI, I'd suggest this kind of simple implementation for a resource (say memory):

  1. Monitor memory pressure.
  2. Start new runs when pressure is less than a configured start threshold.
  3. When memory pressure exceeds a configured eviction threshold, start canceling runs according to some ranking (those that took the most memory and were running for the least time).
  4. When pressure drops below the start threshold, resume the canceled runs.
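A rough sketch of that loop, reusing the memoryPressureAvg10 helper from the PSI sketch above; the thresholds (avg10 percentages) and the run-ranking are placeholders:

    package main

    import "time"

    // controlLoop implements the four steps above: sample memory
    // pressure once per second, evict when it is too high, and
    // resume/admit runs when it is low enough.
    func controlLoop(startThreshold, evictThreshold float64) {
        for range time.Tick(time.Second) {
            p, err := memoryPressureAvg10() // defined in the PSI sketch above
            if err != nil {
                continue // PSI not available (e.g. pre-4.20 kernel)
            }
            switch {
            case p >= evictThreshold:
                cancelWorstRun() // most memory taken, least time running
            case p < startThreshold:
                resumeOrStartRun() // resume a canceled run or admit a queued one
            }
        }
    }

    func cancelWorstRun() {}   // stub: rank and cancel a run
    func resumeOrStartRun() {} // stub: resume or admit a run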