The new controller v2.20.1 release is broken on (some?) GKE clusters

kingdonb commented 5 years ago

GKE clusters that are new today that are launched, are reporting their version numbers with a minor of 11+. We experienced issues with this before, but in #86 we still used this approach again, and it was released as v2.20.1 on Thursday.

It's possible to roll the controller back to v2.20.0 as described in https://blog.teamhephy.info/blog/posts/announcements/rollback-v2-20-1-v2-20-0-controller-GKE-bug.html#the-fallout-from-automatic-upgrading-of-platform-database

Please read the article before rolling back your cluster. The database upgrade from teamhephy/postgres#11 was also merged, so rollbacks from v2.20.1 to v2.20.0 are not possible with an on-cluster database, once the database deployment has been upgraded past v2.7.1.

The upgrade script from teamhephy/postgres#11 is published in teamhephy/database:v2.7.2 (workflow:v2.20.1)

tolstenko commented 5 years ago

As discussed on slack chat, a fast workaround could be something like this on every version() check.

   ...
        '''fast workaround for gcloud wrong semver ie.(1.11+)'''
        ver = parse(self.version().public.replace("+",""))

        if ver >= parse("1.9.0"):
            return 'apps/v1'
   ...

But I agree that the best approach would be checking the API capabilities. @kingdonb found this link, it might be good entry point. https://github.com/kubernetes-client/python/blob/67d30a03fbc1f63051102d4dbf4f5d333127ab0e/kubernetes/client/apis/apis_api.py#L38

Cryptophobia commented 5 years ago

@tolstenko @kingdonb Does anybody want to submit a PR to try to tackle this bug?

kingdonb commented 5 years ago

We're working on it. We need to add a controller method that reaches /apis and figure out the correct logic. I don't think there's been any Controller functionality that looked at "APIs" API / aka "Capabilities" map before.

The only place it's been needed in the past has been in Helm charts, I guess (and Helm already provides a Capabilities map for free, if you can stand to parse through it with the Helm/GoTL templates.) But the diversity of cluster/cloud Kubernetes has increased since the pre-1.7 days, and it looks like it's time to start considering this stuff at runtime, rather than just taking the version number at face value. In some places it's un-parseable with the language-included semver stuff, and in others it's not consistently informative about what capabilities are supported by the cluster.

There are other places where this parse/version logic was added, we'll try to get all of them at once.

Cryptophobia commented 5 years ago

Okay @kingdonb , I should be able to make an emergency release v2.20.2 this weekend to fix this if you guys submit the PRs for it. I can also take a look this weekend as well for what I can find.

kingdonb commented 5 years ago

We haven't got a PR together. I don't want to just roll back the change, but I haven't been able to find a way to change it so that anything else will work.

(I'd like to use the new apps/v1 API where it's available, but on my AKS cluster I get this message about Scale https://github.com/teamhephy/controller/issues/93)

failed to scale Deployment \"sredash-web\" in Namespace \"sredash\":
400 Bad Request Scale in version \"v1\" cannot be handled as a Scale:
no kind \"Scale\" is registered for version \"apps/v1\"',
'Deployment', 'sredash-web', 'sredash'

This above is what happens when I try to scale. I'm afraid something has changed in the API from extensions/v1beta2 to apps/v1

Beginning to think that we should really have a design document that encompasses all of the changes needed to move from pre-v1 APIs to the v1 API. But if we're receptive to a PR that just rolls back that one change, then I'll submit it.

creatorrr commented 5 years ago

Thanks @kingdonb for a fixed image! For those looking for a patch in the interim, this worked for me:

kubectl -n deis patch deployment/deis-controller -p '{"spec": {"template": {"containers": [{"image": "kingdonb/controller:canary"}]}}}'

creatorrr commented 5 years ago

Just to report that while this fixes controller version parse issues, other components may still be affected. For instance, after applying the patch mentioned above, deploying works but running autoscale gives the same original error.

$ deis autoscale:set web --min=2 --max=6 --cpu-percent=75                                                              
Applying autoscale settings for process type web on test-app...
Error: Unknown Error (400): {"detail":"Invalid version: '1.11+'"}

Cryptophobia commented 5 years ago

Thanks for the revert @kingdonb . We will merge the revert in first and fix everything before we do more research to support the new APIs and version parsing.

kingdonb commented 5 years ago

My understanding is that autoscaling on the latest k8s versions was universally broken before the patch in #86, and on some clusters #86 solved it, but simultaneously broke a number of other features. The research to push a proper fix is ongoing. (Meanwhile, there is apparently no version of controller which can do horizontal pod autoscaling for Workflow apps today on GKE.)

Cryptophobia commented 5 years ago

Hephy Workflow v2.20.2 is released with rollback to controller so this should not affect users on latest release anymore. In the meantime, we shoud start working on addressing this API versioning issue.

Cryptophobia commented 3 years ago

This should be resolved in latest version of Hephy!

teamhephy / controller

The new controller v2.20.1 release is broken on (some?) GKE clusters #92