substratusai / lingo

Lightweight ML model proxy and autoscaler for Kubernetes
https://www.substratus.ai
Apache License 2.0

Improve scaling behavior when there are requests waiting to be queued #106

Closed · samos123 closed this 2 days ago

samos123 commented 5 days ago

When the concurrency per replica is 100 and the average is 100, it should not try to scale down from 29 to 1.

See this state:

```
2024/07/05 07:31:07 Average for deployment: llama-3-8b-instruct-vllm: 102.2 (ceil: 1), current wait count: 90
2024/07/05 07:31:07 SetDesiredScale(1), current: 29, min: 1, max: 500
```

There are currently 29 replicas, each processing roughly 102 requests on average, yet lingo in this case decides to scale back to 1 replica.
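For reference, a back-of-the-envelope check of what the desired scale arguably should be given these numbers. This is a sketch, not lingo's actual code; `desiredReplicas` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas is a hypothetical helper: total in-flight requests
// divided by the per-replica concurrency target, rounded up.
func desiredReplicas(totalInFlight, concurrencyPerReplica float64) int {
	return int(math.Ceil(totalInFlight / concurrencyPerReplica))
}

func main() {
	// From the log above: 29 replicas, each averaging ~102.2 in-flight requests.
	totalInFlight := 29 * 102.2 // ≈ 2964 requests in flight
	fmt.Println(desiredReplicas(totalInFlight, 100)) // prints 30, not the 1 lingo chose
}
```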

I think the logic could be improved here: if the wait count is higher than 50% of the concurrency per replica, it should not try to scale down at all. A sketch of that guard follows below.
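A minimal sketch of that guard, assuming hypothetical names for the autoscaler's state (`desired`, `current`, `waitCount`, and `concurrencyPerReplica` are illustrative, not lingo's actual fields):

```go
// applyScaleDownGuard sketches the proposed rule: if the number of requests
// waiting to be queued exceeds 50% of the per-replica concurrency target,
// hold the current replica count instead of scaling down. All names here
// are illustrative; this is not lingo's actual API.
func applyScaleDownGuard(desired, current, waitCount, concurrencyPerReplica int) int {
	scalingDown := desired < current
	backlogHigh := waitCount*2 > concurrencyPerReplica // waitCount > 50% of the target
	if scalingDown && backlogHigh {
		return current // keep the fleet as-is while there is a significant backlog
	}
	return desired
}
```

With the logged state above, `applyScaleDownGuard(1, 29, 90, 100)` would return 29, since the wait count of 90 exceeds 50% of the 100-request concurrency target.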

samos123 commented 2 days ago

This doesn't seem to happen after increasing the concurrency and making sure the request rate is stable.