When concurrency per replica is 100 and the average is 100, it should not try to scale down from 29 to 1.
See this state:
2024/07/05 07:31:07 Average for deployment: llama-3-8b-instruct-vllm: 102.2 (ceil: 1), current wait count: 90
2024/07/05 07:31:07 SetDesiredScale(1), current: 29, min: 1, max: 500
There are currently 29 replicas, each processing about 102 requests on average. However, lingo in this case decides to scale down to 1 replica.
I think the logic could be improved here: if the wait count is higher than 50% of the concurrency per replica, it should not try to scale down at all.
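To make the proposal concrete, here is a minimal sketch of the guard I have in mind. It is not lingo's actual code: the function and parameter names (`scaleDownBlocked`, `guardedScale`, `waitCount`, `concurrencyPerReplica`) are hypothetical, and the only rule it encodes is the 50% threshold suggested above.

```go
package main

import "fmt"

// scaleDownBlocked reports whether a scale-down should be skipped because the
// wait count is higher than 50% of the per-replica concurrency target.
// Hypothetical helper; the threshold is the one proposed in this issue.
func scaleDownBlocked(waitCount, concurrencyPerReplica int) bool {
	return 2*waitCount > concurrencyPerReplica
}

// guardedScale returns the replica count to apply. A proposed scale-down is
// replaced by the current count when the guard trips; scale-ups and
// scale-downs under a low wait count pass through unchanged.
func guardedScale(proposed, current, waitCount, concurrencyPerReplica int) int {
	if proposed < current && scaleDownBlocked(waitCount, concurrencyPerReplica) {
		return current
	}
	return proposed
}

func main() {
	// Values from the log above: proposed 1, current 29, wait count 90,
	// concurrency per replica 100. The guard keeps the deployment at 29.
	fmt.Println(guardedScale(1, 29, 90, 100))

	// Same proposal with a wait count of 10: the scale-down is allowed.
	fmt.Println(guardedScale(1, 29, 10, 100))

	// Scale-ups are never blocked by the guard.
	fmt.Println(guardedScale(40, 29, 90, 100))
}
```

With the values from the log, the guard would hold the deployment at 29 replicas instead of calling SetDesiredScale(1). Whether such a check belongs before SetDesiredScale or inside the average calculation is something I have not looked into.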