Open sanposhiho opened 7 months ago
For now, I'll implement a feature flag in the Tortoise controller so that we can temporarily disable this feature.
Implemented the alpha `VerticalScalingBasedOnPreferredMaxReplicas` feature gate, disabled by default.
Could we scale the rate of scaling using something similar to TCP exponential backoff, or TCP window scaling as used for slow start, to speed up the vertical scaling? (https://en.wikipedia.org/wiki/Exponential_backoff). For Tortoise, the delay from TCP exponential backoff, or the window increase, would become the vertical scaling factor.
This could reduce the number of scaling iterations at the cost of some over-scaling.
@sanposhiho Could you please elaborate more on this part?
> But, it keeps recreating Pods many times, which is not great, because vertical scaling requires restarting Pods.
How frequent is too frequent? Based on that, we can probably try exponential backoff as @lchavey suggested, but we will need to handle some edge cases: if the backoff window is too long, it may cause Pods to crash or be throttled.
Currently, each Tortoise is reconciled every 15s, meaning Tortoise keeps restarting (scaling up) Pods every 15s until the replica number goes below 30, which is obviously too frequent.
Yup, "exponential backoff" would be a good idea to try out. Actually, a delay in this vertical scaling doesn't cause any problem for services, because HPA still keeps increasing replicas when CPU utilization reaches the threshold. If vertical scaling up is too slow, we can change the factor from 1.1 to something bigger.
Sorry, I may have misunderstood the original post.

> it tries to make Pods vertically bigger. With the current implementation, we just simply keep `resourceRequest.MilliValue() * 1.1`

I was reading this as increasing the vertical size by 1.1x each time, so I was thinking of using "exponential scaling" for the vertical dimension. This got me thinking that we could use both (time and scale).
https://github.com/mercari/tortoise/blob/main/pkg/recommender/recommender.go#L185-L196
To prevent the deployment from creating too many, too-small replicas, Tortoise has a feature: when the replica number goes higher than 30 (this threshold is configurable via `PreferredMaxReplicas`), it tries to make Pods vertically bigger. With the current implementation, we just simply keep applying `resourceRequest.MilliValue() * 1.1` until the replica number goes below 30, and it works to some extent. But it keeps recreating Pods many times, which is not great, because vertical scaling requires restarting Pods. We should consider another way to achieve this that is better, but as simple as the current strategy.