teamhephy / controller

Hephy Workflow Controller (API)
https://teamhephy.com
MIT License

Fix autoscale without breaking manual scaling #108

Closed · kingdonb closed this 5 years ago

kingdonb commented 5 years ago

This PR is a WIP

We've been testing changes to make the Autoscale functionality via HPA work again.

Right now it looks like this image (quay.io/kingdonb/controller:kb-fix-autoscale-take-5) successfully permits a manual scale as well as the creation of an HPA, like:

deis autoscale:set cmd -a testing-hephy --min=2 --max=5 --cpu-percent=10

The concession we had to make is that we don't know what has changed with manual scaling under apps/v1 Deployments, so we have left Deployments on the extensions/v1beta1 API.

The autoscaling/v1 HorizontalPodAutoscaler will only find the Deployment as a match if it is also given a scaleTargetRef.apiVersion that matches the Deployment's API version. I added that here, and the HPA now succeeds, assuming you have metrics-server properly configured so the HPA can get pod metrics to make scale decisions.
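For reference, here is a rough sketch (not the controller's actual code) of the shape of the autoscaling/v1 HPA that the deis autoscale:set example above should produce. The object names are hypothetical; the important bit is that scaleTargetRef.apiVersion matches the Deployment's API group, which this PR leaves at extensions/v1beta1:

```python
# Sketch only: the shape of the autoscaling/v1 HPA manifest the command above
# should produce. scaleTargetRef.apiVersion must match the Deployment's API
# group (still extensions/v1beta1 in this PR) or the HPA cannot find its target.
hpa_manifest = {
    'apiVersion': 'autoscaling/v1',
    'kind': 'HorizontalPodAutoscaler',
    'metadata': {
        'name': 'testing-hephy-cmd',       # hypothetical object name
        'namespace': 'testing-hephy',
    },
    'spec': {
        'minReplicas': 2,
        'maxReplicas': 5,
        'targetCPUUtilizationPercentage': 10,
        'scaleTargetRef': {
            'apiVersion': 'extensions/v1beta1',  # must match the Deployment
            'kind': 'Deployment',
            'name': 'testing-hephy-cmd',         # hypothetical object name
        },
    },
}
```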

I also tested the manual deis ps:scale -a testing-hephy cmd=1 behavior, which was broken by our last attempt to upgrade to autoscaling/v1. We should make any reasonable effort to upgrade the Deployment to apps/v1, but that is not necessary in order to fix the autoscale feature (without, unfortunately, also breaking manual scaling). This PR seems to do both 👍!
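For context on what the manual path boils down to at the Kubernetes API level, here is a minimal sketch: a patch of spec.replicas against the extensions/v1beta1 Deployments endpoint this PR keeps. It uses requests directly purely for illustration; the controller's own HTTP client wraps the equivalent call, and the function and argument names here are hypothetical.

```python
import requests  # illustration only; the controller has its own API client


def scale_deployment(api_server, token, namespace, name, replicas):
    """Sketch: manually scale a Deployment by patching spec.replicas
    on the extensions/v1beta1 endpoint (the API group this PR keeps)."""
    url = (f"{api_server}/apis/extensions/v1beta1/namespaces/"
           f"{namespace}/deployments/{name}")
    resp = requests.patch(
        url,
        headers={
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/strategic-merge-patch+json',
        },
        json={'spec': {'replicas': replicas}},
        verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt',
    )
    resp.raise_for_status()
    return resp.json()
```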

The other, frankly unacceptable, concession I had to make is in rootfs/api/models/app.py, where the HPA is created: I have hardcoded 'apiVersion': 'extensions/v1beta1', as I'm not sure how the App resource is supposed to find out what the Deployment's api_version ought to be.

We discussed maybe adding an API version service that any module could use to determine which API versions are in use for which objects, but more discussion needs to happen before it's clear what that piece ought to look like.
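One rough sketch of the idea (names here are hypothetical, and this is just to illustrate the shape of it, not a proposal for the actual code): ask the cluster's discovery endpoint which version it prefers for a given API group, instead of hardcoding 'extensions/v1beta1'.

```python
import requests  # illustration only; the controller has its own API client


def preferred_api_version(api_server, token, group):
    """Sketch: return the cluster's preferred groupVersion for an API group,
    e.g. preferred_api_version(server, token, 'apps') -> 'apps/v1'."""
    resp = requests.get(
        f"{api_server}/apis",
        headers={'Authorization': f'Bearer {token}'},
        verify='/var/run/secrets/kubernetes.io/serviceaccount/ca.crt',
    )
    resp.raise_for_status()
    # The /apis discovery response is an APIGroupList; each group carries a
    # preferredVersion the cluster would like clients to use.
    for g in resp.json().get('groups', []):
        if g['name'] == group:
            return g['preferredVersion']['groupVersion']
    return None
```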

Also, since upgrading a resource's API version means destroying and recreating the old resource, we are going to need a bigger plan for upgrading Deployments to apps/v1. That step appears to be unavoidable, so we may need to write code that lets a cluster recognize when these Deployment upgrades are needed and handle them in a reasonable way. I'm not sure how this can be done without potentially causing apps to have downtime.

(It should be perfectly safe to delete and recreate HPAs, though, so maybe the upgrade guide gives that a little mention and this can go to press relatively as-is in the next release.)

kingdonb commented 5 years ago

I rebased this with some --onto and --rebase-merges magic, so it wouldn't include any reverse-merges back from master into the WIP branch.

kingdonb commented 5 years ago

We can probably also take this to a new branch and do some commit squashing to improve the history, for the changelog.

ess commented 5 years ago

Tested this on an EKS cluster running 1.13 and verified it as working.

Cryptophobia commented 5 years ago

@kingdonb, I would really like to improve the commit messages. I'm going to take your commits and put them on top of https://github.com/teamhephy/controller/pull/106 and merge them there. I will test again and merge everything once the tests pass. We probably want to squash some of these commits as well, the ones at the beginning (the reverts).