grafana / rollout-operator

Kubernetes Rollout Operator
Apache License 2.0
132 stars 17 forks source link

prepare-downscale: exclude pods in the same zone when finding unavailable pod #99

Closed dimitarvdimitrov closed 9 months ago

dimitarvdimitrov commented 9 months ago

Having unavailable or not up to date pods in the same zone we're trying to scale down shouldn't really stop us from downscaling the zone.

dimitarvdimitrov commented 9 months ago

@56quarters can you take a look at this? I'm not sure if there's an edge case that I didn't acccount for.

I checked what would happen if we try to prepare a non-ready pod for downscale - supposedly the request to it will fail and the whole downscale will be aborted. But in reality this should be rare because we need only the pods being downscaled to be ready - the rest can be non-ready or not up to date.

jhalterman commented 9 months ago

I'm not sure if this restriction was intended for downscales, but resharding and avoiding gaps in data is the concern that comes to mind for me, more than the request succeeding.

dimitarvdimitrov commented 9 months ago

I'm not sure if this restriction was intended for downscales, but resharding and avoiding gaps in data is the concern that comes to mind for me, more than the request succeeding.

I'm not sure I follow @jhalterman. If we are downscaling zone-X, then it shouldn't matter that there are unavailable pods in zone-X. We avoid gaps by making sure that the other zones have available pods.

And resharding is I think a natural consequence of scaling down. Not sure if we can avoid it.

Did I understand your comment correctly?

jhalterman commented 9 months ago

If we are downscaling zone-X, then it shouldn't matter that there are unavailable pods in zone-X. We avoid gaps by making sure that the other zones have available pods.

That's right - I forgot we have that cross-sts check. This change seems safe to me then.

pracucci commented 9 months ago

I reviewed this change too and I also think this change is good. It's not much different from the same check done in findDownscalesDoneMinTimeAgo(), which allows to downscale the same zone even if it was already downscaled less than "min time" ago.