substratusai / lingo

Lightweight ML model proxy and autoscaler for Kubernetes
https://www.substratus.ai
Apache License 2.0

fix #106 improve scale down behavior #107

Closed samos123 closed 2 days ago

samos123 commented 5 days ago

Only scale down by 1 instead of scaling down to 1 when all replicas are actively processing requests.
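
Roughly the behavior this change is going for, sketched with hypothetical names (this is not the exact diff): when the computed target is below the current replica count, clamp it so each autoscaler tick removes at most one replica.

    // Sketch only; clampScaleDown is a hypothetical helper, not code from this PR.
    package main

    import "fmt"

    // clampScaleDown limits a scale-down to one replica per autoscaler tick,
    // while leaving scale-ups untouched.
    func clampScaleDown(current, target int32) int32 {
        if target < current-1 {
            return current - 1
        }
        return target
    }

    func main() {
        fmt.Println(clampScaleDown(10, 1))  // 9: step down one replica instead of dropping to 1
        fmt.Println(clampScaleDown(10, 12)) // 12: scale-ups pass through unchanged
    }

So a deployment at 10 replicas with a computed target of 1 steps 10 -> 9 on this tick, and only keeps shrinking on later ticks if the average stays low.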

nstogner commented 5 days ago

If all replicas are actively processing requests, my understanding is that every active request should be counted in the accounting of in-progress requests (i.e. contribute to the average that drives the scale).

I think this PR would effectively just rate-limit scale-downs to 1 every 3 seconds (the period of the autoscaling loop).

I think a better approach might be to adjust the moving average to cover a longer time window. The time window should probably be configurable. See: https://github.com/substratusai/lingo/blob/55f112224aae5fea090fa13b3bc06e006813b9a5/cmd/lingo/main.go#L188 I will start on a PR for this.
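
Roughly the shape I have in mind, as a sketch (the env var name and defaults here are made up, not existing lingo config): size the moving-average buffer from a configurable duration rather than a hard-coded sample count.

    // Sketch only; SCALE_DOWN_WINDOW is a hypothetical setting.
    package main

    import (
        "fmt"
        "os"
        "time"
    )

    func main() {
        const autoscalingInterval = 3 * time.Second // period of the autoscaling loop

        // Default to a 10 minute window, overridable via env.
        window := 10 * time.Minute
        if v := os.Getenv("SCALE_DOWN_WINDOW"); v != "" {
            if d, err := time.ParseDuration(v); err == nil {
                window = d
            }
        }

        // Number of samples the simple moving average keeps,
        // e.g. 10m / 3s = 200 samples.
        samples := int(window / autoscalingInterval)
        fmt.Println("moving average sample count:", samples)
    }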

nstogner commented 5 days ago

Another possible cause: if this is happening in conjunction with a pubsub integration, it is possible that the flapping is due to error backoff: https://github.com/substratusai/lingo/blob/55f112224aae5fea090fa13b3bc06e006813b9a5/pkg/messenger/messager.go#L105

I think this could cause it because we don't know the total number of in-queue messages; we drive the moving average off of how many messages we pull from the queue over time. I will add an issue to track making this backoff configurable as well.
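
For illustration, the kind of knob I mean, sketched with made-up names and values (the real backoff lives in pkg/messenger): make the base and maximum error backoff configurable so a run of consecutive errors can't stall the pull loop long enough to drag the moving average down.

    // Sketch only; baseBackoff / maxBackoff are hypothetical config values.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const (
            baseBackoff = 100 * time.Millisecond
            maxBackoff  = 3 * time.Second
        )

        backoff := baseBackoff
        for attempt := 1; attempt <= 6; attempt++ {
            // Double the wait after each consecutive error, capped at maxBackoff.
            fmt.Printf("attempt %d: backing off %v\n", attempt, backoff)
            backoff *= 2
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
        }
    }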

samos123 commented 4 days ago

What would you expect this code to do in this scenario?

    for deploymentName, waitCount := range stats.ActiveRequests {
        log.Println("Is leader, autoscaling")
        avg := a.getMovingAvgQueueSize(deploymentName)
        avg.Next(float64(waitCount))
        flt := avg.Calculate()
        normalized := flt / float64(a.ConcurrencyPerReplica)
        ceil := math.Ceil(normalized)
        a.Deployments.SetDesiredScale(deploymentName, int32(ceil))
    }

concurrency per replica = 100
replicas = 10
each replica is constantly processing 90 items
waitCount = ~0-1

My expectation is that the desired scale should be set to 9 or 10. However, the way it is implemented today, it would result in a desired scale of 1, and here is why: avg = ~90, flt = something like 88, so ceil = ceil(88 / 100) = 1.

I think for scale-up it makes sense to look at waitCount. However, for scale-down we should look at the average number of requests being processed by each replica and be less aggressive if all replicas are constantly processing requests.

I think using utilization of the concurrency per replica for scale-down would make more sense. Utilization is defined as: average requests per second / concurrency per replica.
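
As a rough sketch of what I mean (the names are hypothetical, and I'm using the average number of in-flight requests per replica as a stand-in for the request rate):

    // Sketch only; numbers match the scenario above.
    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        const concurrencyPerReplica = 100.0

        replicas := 10.0
        avgActivePerReplica := 90.0 // each replica is constantly processing ~90 requests

        // Utilization: average load per replica divided by the per-replica concurrency target.
        utilization := avgActivePerReplica / concurrencyPerReplica // 0.9

        // Scale down no further than the point where per-replica utilization would exceed 1.0,
        // i.e. ceil(total active / concurrencyPerReplica) = ceil(900 / 100) = 9.
        minReplicas := math.Ceil(avgActivePerReplica * replicas / concurrencyPerReplica)

        fmt.Printf("utilization=%.2f, scale down no lower than %v replicas\n", utilization, minReplicas)
    }

With those numbers the autoscaler would hold at 9 replicas instead of collapsing to 1.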

Let me know what you think.

nstogner commented 4 days ago

"each replica is constantly processing 90 items" --> this would mean that there are 900 total active requests, so I would think:

flt = 900 (10 replicas processing 90 requests each)
concurrencyPerReplica = 100
desiredScale = 900 / 100 = 9

nstogner commented 4 days ago

Note: if you know the timestamp, it looks like we log this info:

log.Printf("Average for deployment: %s: %v (ceil: %v), current wait count: %v", deploymentName, flt, ceil, waitCount)