teamhephy / controller

Hephy Workflow Controller (API)
https://teamhephy.com
MIT License

fix(controller): Set replicas to the current value of the deployment. #128

Closed — jfuechsl closed this 4 years ago

jfuechsl commented 4 years ago

This PR sets the replica count during a deployment update to the deployment's current replica count. This prevents disruptive scaling activity when autoscaling is enabled.
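The idea can be illustrated with a minimal sketch (the helper name `build_update_manifest` is hypothetical, not the controller's actual code): before patching the deployment, copy the live replica count into the new manifest so the update leaves scaling untouched.

```python
import copy


def build_update_manifest(new_manifest: dict, live_deployment: dict) -> dict:
    """Return a copy of new_manifest whose spec.replicas is taken from the
    live deployment, so an update never changes the replica count."""
    manifest = copy.deepcopy(new_manifest)
    current = live_deployment.get("spec", {}).get("replicas")
    if current is not None:
        manifest.setdefault("spec", {})["replicas"] = current
    return manifest


# The app's stored config says 1 replica, but the HPA has scaled to 5:
new = {"spec": {"replicas": 1, "template": {"spec": {"containers": []}}}}
live = {"spec": {"replicas": 5}}
patched = build_update_manifest(new, live)
assert patched["spec"]["replicas"] == 5  # live count preserved
```

If the live deployment reports no replica count, the manifest's own value is left as-is, so a first deploy still starts at the configured size.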

Cryptophobia commented 4 years ago

@jfuechsl , looks like some tests are failing here also: https://travis-ci.org/github/teamhephy/controller/builds/698506391

Also, I'm not aware if this will interfere with the way the HPA works...

jfuechsl commented 4 years ago

I'm looking at the tests one by one. First: https://github.com/jfuechsl/controller/blob/a4d0e22530db83125aa462dae3eb1d1930e7efab/rootfs/scheduler/tests/test_deployments.py#L155 IMO this test is incorrect. When we update a deployment, we never (afaik) modify the replica count. The only case where we do is a ps:scale, and there we invoke deployment.scale, not deployment.update. What do you think?

jfuechsl commented 4 years ago

The other tests use deployment.update to change the replicas. IMO that should be done via deployment.scale.

To explain why I suggest this change: we saw that the previous deployment.update implementation interfered with HPAs in undesirable ways. For example, say we deploy an app with 1 replica and then activate autoscaling. We put load on the app and let it scale up. At that point we redeploy the application. The original implementation reset the deployment's replicas back to 1, which caused pods to be removed until the HPA scaled the deployment back up again.
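The failure mode above can be sketched as a toy sequence, with plain dicts standing in for the Kubernetes objects (`redeploy_old` and `redeploy_fixed` are illustrative names, not controller functions):

```python
def redeploy_old(deployment: dict, configured_replicas: int) -> dict:
    # Old behaviour: the update writes the stored config value back,
    # overriding whatever the HPA has scaled the deployment to.
    deployment["replicas"] = configured_replicas
    return deployment


def redeploy_fixed(deployment: dict, configured_replicas: int) -> dict:
    # Fixed behaviour: the update leaves replicas alone (only the pod
    # template, omitted here, would change); scaling stays with the HPA
    # and with explicit ps:scale calls.
    return deployment


deployment = {"replicas": 1}   # deployed with 1 replica
deployment["replicas"] = 4     # HPA scaled up under load

assert redeploy_old(dict(deployment), 1)["replicas"] == 1    # pods removed
assert redeploy_fixed(dict(deployment), 1)["replicas"] == 4  # no disruption
```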

Cryptophobia commented 4 years ago

> IMO this test is incorrect. When we update a deployment we never (afaik) modify the replicas count. The only instance where we do this is when doing a ps:scale. In that case however we are invoking deployment.scale. What do you think?

I generally agree with the above, but then why was the test written this way? Maybe scaling was the main part of a deployment update that the original developers could think of?

> To explain why I suggest this change: we saw that the previous deployment.update implementation did interfere with HPAs in undesirable ways. For example, say we deploy an app and set 1 replica. We then activate autoscaling. We put load on the app and let it scale up. At that point we redeploy the application. The original implementation reset the replicas of the deployment back to 1 which caused pods to be removed until the HPA scaled the deployment back up again.

I agree with the above. Could we change the tests we disagree with and make them pass again? I would be happy to merge this in.

jfuechsl commented 4 years ago

Updated the tests.

Cryptophobia commented 4 years ago

Thank you for this fix for a common issue during autoscaling, and for fixing the tests. It now makes a lot more sense what is actually being tested.

Build passes: https://travis-ci.org/github/teamhephy/controller/builds/701947822

jfuechsl commented 4 years ago

Thank you!