teamhephy / controller

Hephy Workflow Controller (API)
https://teamhephy.com
MIT License

Revert "set deployment API to apps/v1 if k8s >= 1.9" #94

Closed kingdonb closed 5 years ago

kingdonb commented 5 years ago

This reverts commit a226971e4d7bb06850704b7c86ca463ff1b77fec.

Users on GKE experience an issue with this change, because the cluster Minor version is usually reported as "9+" or "11+", which is not parseable as semver. #92
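To illustrate (a sketch, not the controller's actual code – the helper name here is made up), something like this would be needed before the Minor string can be compared numerically:

import re

def parse_minor(minor):
    # GKE reports Minor as "9+" or "11+", which int() and strict
    # semver parsers reject; keep only the leading digits.
    m = re.match(r"\d+", minor)
    if m is None:
        raise ValueError("unparseable minor version: %r" % minor)
    return int(m.group(0))

assert parse_minor("11+") == 11  # GKE-style
assert parse_minor("9") == 9     # plain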

Users on AKS experience an issue with this change that disables deis ps:scale #93 – the cause is unknown, since AKS version numbers are parseable; the error message in the linked issue seems to indicate that "Scale" has moved, so I'm not sure the fix is as easy as this PR makes it out to be. In earlier versions, the Scale resource lived under extensions/v1beta1 alongside the Deployment resource. In this version, it looks like it lives in the autoscaling/v1 API group.

This is a problem because our code assumes that the Scale resource/kind lives in the same API group as the Deployment kind, now apps/v1 – they don't live together anymore, and I wasn't able to find the lines of code representing this assumption, unfortunately. (They're in here somewhere.)
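For reference, you can see the split by asking the API server for a Deployment's scale subresource. A sketch using the official kubernetes Python client (not the controller's own scheduler module, which is what actually makes these calls):

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# The scale subresource of an apps/v1 Deployment comes back as
# kind Scale in the autoscaling/v1 group, not in apps/v1.
scale = apps.read_namespaced_deployment_scale("deis-controller", "deis")
print(scale.api_version)  # "autoscaling/v1"
print(scale.kind)         # "Scale"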

(Has anyone tested ps:scale on any cluster above 1.9 with this change? I understand this change was made to support Horizontal Pod Autoscaling, so it stands to reason that maybe nobody has: if everyone who runs this change uses HPA, then nobody would have needed ps:scale. I'm not using HPA, and I still use ps:scale, so this affected me on my AKS cluster with versions Major:"1", Minor:"11", GitVersion:"v1.11.5", BuildDate:"2018-11-26T14:31:35Z".)

It looks a bit like we didn't do all our homework on this one. I'm trying to find a good document about how the "v1" API itself is defined and exactly which named groups are included in it (apps/v1, autoscaling/v1, etc.) – whether this is actually a well-defined thing, with clear lines between "what is in v1" and "what isn't in v1", or whether more named groups can sneak into v1 in later releases, or whether it depends on the implementation/cloud provider. So far I'm coming up empty.

I added a branch, still incomplete, that provides access to the kubectl api-versions output, or "get_api_versions" as it's sometimes called in k8s client libraries. But it looks like that by itself is not enough, and I'm miserable at Python, so I have no idea how to start interactive debugging. I found a gist which says you can use this line as a surrogate for Pry (I'm a Ruby programmer, so binding.pry is one of the go-to tools in my arsenal):

import code; code.interact(local=dict(globals(), **locals()))
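Presumably you drop that line into whatever controller function you want to inspect, and execution pauses there with a REPL over the local scope – something like this (the function below is just a made-up example):

def scale_app(namespace, name, replicas):
    # execution stops here and opens an interactive shell with
    # access to namespace, name, replicas, and all globals
    import code; code.interact(local=dict(globals(), **locals()))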

Haven't tried that yet; maybe it will help me make more progress. I think I will need a controller running interactively to facilitate that, so I'm stuck for now. I think we should revert the change from #86 and cut a release v2.20.2, so that new users coming in on GKE and AKS (and probably others) aren't stuck with issues until they find the blog post about how to roll the controller back to v2.20.0: https://blog.teamhephy.info/blog/posts/announcements/rollback-v2-20-1-v2-20-0-controller-GKE-bug.html

I would think it makes sense to look for autoscaling/v1 in the APIs endpoint, but it's more complicated than that. I believe later k8s versions omit a lot of the named groups and assume that you know what is in v1. I will find an example of a k8s version that does this, but what I'm expecting is that the controller will have to deal with a return from GET /apis that contains just v1 and doesn't call out any named groups.
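Whatever the detection ends up being, it probably has to treat a group that is missing from GET /apis as unknown rather than unsupported. A sketch with the official Python client (again, not the controller's own scheduler code):

from kubernetes import client, config

config.load_kube_config()

# GET /apis returns an APIGroupList of the named groups only (the core
# v1 group lives at /api); some versions or providers may enumerate
# fewer groups, so absence here doesn't prove the group is unusable.
groups = client.ApisApi().get_api_versions().groups
available = {g.name for g in groups}
print("autoscaling" in available)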

kingdonb commented 5 years ago

Testing... reverting this commit indeed brings back manual scaling

Built the image at kingdonb/controller:canary

$ deis ps:scale web=3
Scaling processes... but first, coffee!
done in 35s
=== rocksolidapp Processes
--- web:
rocksolidapp-web-5cc7fd55bd-44qsd up (v5)
rocksolidapp-web-5cc7fd55bd-8js7h up (v5)
rocksolidapp-web-5cc7fd55bd-sjhkq up (v5)
creatorrr commented 5 years ago

Thanks @kingdonb for the image! For those looking for a patch in the interim, this worked for me:

kubectl -n deis patch deployment/deis-controller -p '{"spec": {"template": {"spec": {"containers": [{"name": "deis-controller", "image": "kingdonb/controller:canary"}]}}}}'
kingdonb commented 5 years ago

Thanks for the feedback! I see this was merged on Sunday; it should land soon on Docker Hub as hephy/controller:v2.20.2 (so you won't need to patch your clusters with something called "canary").

Cryptophobia commented 5 years ago

New release is out tonight. Users can upgrade to Hephy Workflow v2.20.2, which includes the controller rollback. Thank you for this PR, @kingdonb.