PID controllers and autoscaling

glyn commented 1 year ago

Hi Stevan

I wasn't sure how to get in touch, so please pardon the use of an issue!

I worked on a project (*) some years back where we experimented with the use of PID controllers to autoscale the number of instances in a cluster in response to workload metrics (e.g. based on a queue of requests). One of my then colleagues @jchester wrote up some of the findings in his book Knative in Action.

We found Feedback Control for Computer Systems by Philipp K Janert a useful introduction to the theory and practice of PID controllers.

We also found testing with a variety of shapes of input workload helped to identify problems with our controllers. We tested with sine waves, square waves, and step functions if I remember correctly. You might like to try that.
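A minimal sketch of the three input shapes mentioned above, expressed as functions of time. The function names, the default period, and the load range are assumptions for illustration, not taken from the riff test suite:

```python
import math

def sine_wave(t, period=60.0, low=0.0, high=100.0):
    """Smoothly varying load that oscillates between low and high."""
    mid = (low + high) / 2.0
    amplitude = (high - low) / 2.0
    return mid + amplitude * math.sin(2.0 * math.pi * t / period)

def square_wave(t, period=60.0, low=0.0, high=100.0):
    """Load that jumps abruptly: high for the first half of each period, low for the second."""
    return high if (t % period) < period / 2.0 else low

def step(t, at=30.0, low=0.0, high=100.0):
    """Load that switches once from low to high at time `at` and stays there."""
    return high if t >= at else low

# One sample per second for two minutes of each shape.
workloads = {
    shape.__name__: [shape(t) for t in range(120)]
    for shape in (sine_wave, square_wave, step)
}
```

Each list can then be fed to the controller under test as the per-tick request rate.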

We experimented with various methods of smoothing and stabilising the resultant behaviour, as described in Janert's book. The results were variable in quality and tuning the PID parameters was tricky and unreliable. A particular issue we faced was the latency in scaling up the number of instances, which could produce some interesting (i.e. unwanted) feedback effects when the input workload varied quickly.
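To make the latency effect concrete, here is a rough sketch of a discrete PID controller driving an instance count, where each scaling decision only takes effect `delay` ticks later. This is not the riff or Knative autoscaler; `PID`, `simulate`, `target_queue`, `per_instance`, and the gain values are all hypothetical choices for illustration:

```python
from collections import deque

class PID:
    """Textbook discrete PID controller with a fixed time step."""

    def __init__(self, kp, ki, kd, dt=1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        self.integral += error * self.dt
        # No derivative term on the first sample, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def simulate(arrivals, delay, target_queue=10.0, per_instance=5.0, pid=None):
    """Return [(queue_length, active_instances), ...], one entry per tick.

    `delay` is the number of ticks between requesting an instance count
    and that count actually serving traffic (assumed constant here).
    """
    pid = pid if pid is not None else PID(kp=0.1, ki=0.02, kd=0.0)
    pending = deque([1] * delay)  # decisions that have not taken effect yet
    active, queue, history = 1, 0.0, []
    for arriving in arrivals:
        # Each active instance drains up to `per_instance` requests per tick.
        queue = max(0.0, queue + arriving - active * per_instance)
        desired = max(0, round(pid.update(queue - target_queue)))
        pending.append(desired)
        active = pending.popleft()  # the count that serves the next tick
        history.append((queue, active))
    return history
```

Running `simulate` on the same workload with `delay=0` and with a larger `delay`, then comparing the queue lengths and instance counts in the returned histories, is one way to observe how delayed actuation changes the controller's behaviour.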

Hope that helps, Glyn

(*) The project was projectriff, which later merged into the Knative project.

stevana commented 1 year ago

Hey Glyn,

Thanks for sharing your experiences and references!

Two things spring to mind:

1. Did you have a look at robust control theory? Supposedly it's better suited for non-linear systems where there's noise (bursty traffic, etc);
2. I wonder if the latency in scaling up in instances can be solved by having a pool of instances ready to go. Perhaps this pool can be run by a third-party which keeps hot instances and charges for a "traffic insurance" premium? I haven't thought much about this, maybe it doesn't make economic sense...

glyn commented 1 year ago

> Hey Glyn,
>
> Thanks for sharing your experiences and references!

You're welcome!

> Two things spring to mind:
>
> 1. Did you have a look at [robust control theory](https://users.ece.cmu.edu/~koopman/des_s99/control_theory/)? Supposedly it's better suited for
>    non-linear systems where there's noise (bursty traffic, etc);

No, I'm afraid we weren't aware of that at the time. Looks like it would have been useful though! Thanks.

> 2. I wonder if the latency in scaling up in instances can be solved by having
>    a pool of instances ready to go. Perhaps this pool can be run by a
>    third-party which keeps hot instances and charges for a "traffic insurance"
>    premium? I haven't thought much about this, maybe it doesn't make economic
>    sense...

Yes, that was one of the approaches we considered. It works well unless/until the pool becomes exhausted, in which case we are back to square one. I like the insurance analogy for addressing the charging issue.
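The exhaustion problem can be sketched in a few lines. `WarmPool` and its `acquire` method are hypothetical names for illustration only; the point is that once the pool is drained, every further instance pays the full cold-start latency again:

```python
class WarmPool:
    """Hypothetical pool of pre-started instances (illustrative only)."""

    def __init__(self, size):
        self.available = size

    def acquire(self, wanted):
        """Return (warm, cold): instances served from the pool vs. cold-started."""
        warm = min(self.available, wanted)
        self.available -= warm
        return warm, wanted - warm

pool = WarmPool(size=5)
first = pool.acquire(3)   # entirely served from the pool
second = pool.acquire(4)  # pool runs dry part-way through
third = pool.acquire(1)   # back to cold starts only
```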

(Closing the issue now so as not to clutter your set of issues. For future reference, you may want to enable discussions on GitHub if you want to encourage feedback (no pun intended).)

stevana commented 1 year ago

> It works well unless/until the pool becomes exhausted, in which case we are back to square one.

My guess is that the more clients (globally) this pool has the less bursty the traffic will be, because it's averaged out over many clients rather than just your one service.
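A quick way to check this intuition is to compare the coefficient of variation (standard deviation divided by mean) of aggregate demand for one client versus many. The sketch below assumes clients burst independently of each other, which is the key assumption: correlated bursts (e.g. everyone spiking at the same time) would not average out. The function name and the traffic parameters are made up for illustration:

```python
import random
import statistics

def aggregate_cv(num_clients, ticks=2000, burst_prob=0.1, burst=100.0, idle=1.0, seed=0):
    """Coefficient of variation of total demand across independent bursty clients."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    totals = [
        # Each client independently either bursts or idles on every tick.
        sum(burst if rng.random() < burst_prob else idle for _ in range(num_clients))
        for _ in range(ticks)
    ]
    return statistics.pstdev(totals) / statistics.mean(totals)

cv_single = aggregate_cv(1)
cv_pooled = aggregate_cv(100)
```

Under the independence assumption, the relative variability of the total shrinks roughly with the square root of the number of clients, so `cv_pooled` should come out well below `cv_single`.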

> For future reference, you may want to enable discussions on GitHub

Thanks, I don't think I've ever used that before. I also don't mind using issues for this; this repo will likely not be very active anyway.

glyn commented 1 year ago

> > It works well unless/until the pool becomes exhausted, in which case we are back to square one.
>
> My guess is that the more clients (globally) this pool has the less bursty the traffic will be, because it's averaged out over many clients rather than just your one service.

That's probably true in general. The riff project needed to be able to scale the instances to zero. The use case was an occasionally used service that shouldn't consume resources when it's not in use. The instances were essentially instances of an application (e.g. packaged as a Docker/OCI image) rather than something reusable across applications.

theOGognf commented 1 year ago

Hey there,

It's not every day I come across a crossover between control theory and software. I like the way you applied the PID controller. I think the PID is a good fit for the application because it generates smooth input and is easily tunable.

Robust control can be beneficial if you can dynamically model your system (e.g., you know demand will change over time in an expected way, like a sine wave) and you have parametric bounds on that model (e.g., you know the max demand over time). Adaptive control helps you build that dynamic model, but you still need some idea of how your system fundamentally behaves in order to instantiate your controller. The better your initial model, the better your adaptive and robust controller will perform. Otherwise, your controller's output won't be as smooth, because it will attempt to compensate for unmodeled errors by applying large changes in inputs and model parameter estimates. Adaptive and robust controllers will generally perform well at tracking some desired state even for poorly modeled systems, but at the cost of large input fluctuations, which may not be good for your application.

Cheers!

stevana commented 1 year ago

Thanks @theOGognf!

Do you happen to have any good resources on robust and/or adaptive control, by the way?

Cheers!

theOGognf commented 1 year ago

> Thanks @theOGognf!
>
> Do you happen to have any good resources on robust and/or adaptive control, by the way?
>
> Cheers!

Here's my favorite. (PDF download warning). That professor has other good materials on adaptive robust control, but I like the slides the best.

stevana / elastically-scalable-thread-pools
