teamhephy / controller

Hephy Workflow Controller (API)
https://teamhephy.com
MIT License

Feature/fix deprecated apis 1.16 #110

Closed Cryptophobia closed 3 years ago

Cryptophobia commented 4 years ago

https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/

Per the recent blog post above, we need to migrate some of our APIs to work with the deprecations in k8s 1.16.

I would really appreciate it if other people looked at these changes and tried to fix the broken tests. For now, overriding some URL generation in mock.py should fix the tests. We still have broken tests with these changes.

Steps to get a working development Django/Python environment:

  1. Install a local PostgreSQL database and initialize it with the postgres user, then edit /var/lib/pgsql/data/pg_hba.conf to allow local connections (see the example entries after this list)
  2. Install the local Django and Python dependencies in /rootfs:
       $ pip3 install -r requirements.txt --user
  3. Set these environment variables and make sure the PostgreSQL database is running on port 5432:
     export DEIS_DATABASE_NAME=deis
     export DEIS_DATABASE_PASSWORD=postgres
     export DEIS_DATABASE_USER=postgres
     export DEIS_DATABASE_SERVICE_HOST=localhost
     export DEIS_DATABASE_SERVICE_PORT=5432
  4. To run an individual test, run this command in /rootfs:

    python3 manage.py test --settings=api.settings.testing --noinput scheduler.tests.test_deployments
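
For step 1, a minimal example of what the local-connection entries in /var/lib/pgsql/data/pg_hba.conf might look like (trust authentication is for local development only; your distribution's defaults and paths may differ):

    # TYPE  DATABASE  USER  ADDRESS       METHOD
    local   all       all                 trust
    host    all       all   127.0.0.1/32  trust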
kingdonb commented 4 years ago

I don't think this is the final change; I was not able to see the Deployment API version apps/v1 being used when my upgraded controller created a new namespace and app deploy via Buildpack.

I'm not sure what's wrong, but I revoke my checkmark; sorry for being premature.

felixbuenemann commented 4 years ago

@Cryptophobia if you are running both Postgres and the Python app on your local system, you don't need to modify pg_hba.conf. Just make sure you have a database user that matches the user running the Python app; you can then log in to the db as that user without a password.
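
For example (a minimal sketch, assuming peer authentication is enabled for local sockets, which is the default on many distributions), you can create a matching role and database like this:

    # create a Postgres role matching your login user, then a database for the app
    $ sudo -u postgres createuser --createdb "$USER"
    $ createdb deis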

How do you plan on handling backwards-compatibility for pre-1.9.0 Kubernetes clusters?

The current change contains only one place where the version is checked, and the charts also hardcode the new value.

kingdonb commented 4 years ago

How do you plan on handling backwards-compatibility for pre-1.9.0 Kubernetes clusters?

My understanding, and I'm happy to be proven wrong (although it will make this transition a bit more complicated), is that pre-1.9 K8s clusters didn't have extensions/v1beta1, but some earlier alpha API instead. I'm afraid this might not have been completely homogeneous across all versions, which goes back to the point that we should really be checking with the cluster API to find out which APIs it actually has.
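
For reference, the set of API groups and versions a cluster actually serves can be listed directly; the output will vary by cluster version:

    # list the apps/* and extensions/* API versions this cluster serves
    $ kubectl api-versions | grep -E '^(apps|extensions)/'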

But if this was done right by upstream, for definitions of right that include "they made it easy for us," then we can simply drop support for v1beta1 and substitute the v1 APIs in their place, and everything will be OK for all users on any version. Clusters with K8s versions older than 1.9 will have support for (and be using) deployment APIs that are older than v1beta1, and clusters with support for newer APIs will have no problem serving up the resources they manage with the newer APIs.

But like I said, I'm happy to be proven wrong, and I have only done limited testing to understand the problem; it's entirely possible there are important points I've glossed over (for example: can a 1.12 cluster really serve up extensions/v1beta1 resources from the apps/v1 API, or do you need to upgrade a cluster all the way to 1.16 to get the deprecated resources upgraded automatically?).

There are a lot of test scenarios which I haven't considered other than in passing, and it feels like this will be a good time for a minor version bump and an actual manual testing plan. I've started working on that.

I'm about 50/50 on whether we should go back and test those old versions. I maintained a 1.5 cluster for a long time, and it was ultimately very easy to upgrade it to 1.13 once I got over RBAC; I've kept pace since then and it's on 1.15 now. I have yet to test this new controller on a 1.16 cluster, which has removed the deprecated APIs, and I suspect that in its current form it will not work. (What I'm saying in a roundabout way is that versions <1.12 are no longer supported upstream, and we should not go to great lengths to ensure we maintain support for them permanently.)

So, that's where I'll start testing next, with a 1.16 cluster wherever I can put my hands on one.

felixbuenemann commented 4 years ago

I'm currently stuck with some clusters on 1.5.8 that are hard but not impossible to upgrade (mainly because they are based on an old version of kube-aws, so I would need to start with a fresh cluster and transfer all apps and dbs, or keep backporting stuff from the new kube-aws), so I could test changes in a staging cluster.

felixbuenemann commented 4 years ago

I just tried to change the apiVersion on deploy/deis-router from extensions/v1beta1 to apps/v1 and the change was rejected, so I think some usage of Capabilities.APIVersions.Has is needed inside the templates to use the proper apiVersion.

See: https://helm.sh/docs/chart_template_guide/#built-in-objects
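
A minimal sketch of that kind of capability check in a chart template (the resource shown here is only an example, not the actual router template):

    {{- if .Capabilities.APIVersions.Has "apps/v1" }}
    apiVersion: apps/v1
    {{- else }}
    apiVersion: extensions/v1beta1
    {{- end }}
    kind: Deployment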

Cryptophobia commented 4 years ago

Thank you for your notes, @felixbuenemann. I wanted to add everyone involved in this PR and see how we can best handle the backwards compatibility. I was not sure if we want to support <1.9 clusters in the future anyway, but it is good to get feedback from users running Hephy.

I'm going to take your notes about the charts and versions and make some more changes.

till commented 4 years ago

Do people need to run the kubectl convert commands on existing installations?

kingdonb commented 4 years ago

That's a good question. I think the answer is yes, but more testing is needed. In the past we have decided which API version to use strictly by a lowest-common-denominator strategy, paying attention to the kubectl API version. We have missed some opportunities to upgrade clusters >1.12 and preemptively ensure they are using the non-beta APIs before the deprecation.

I think it would be best if we supported both API versions on clusters that are pre-1.16 and can support both, but the alternative would be to upgrade everyone on clusters that should support it. I'm not sure that we can reliably predict that based on version number, but it is a strategy that has worked for us in the past.
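
As a rough illustration of the kind of version gate being discussed (the function name and signature here are illustrative, not the controller's actual code):

    # Illustrative sketch: pick the Deployment apiVersion from the cluster's
    # reported version. apps/v1 has served Deployments since Kubernetes 1.9;
    # extensions/v1beta1 is removed in 1.16.
    def deployment_api_version(server_major, server_minor):
        if (server_major, server_minor) >= (1, 9):
            return "apps/v1"
        return "extensions/v1beta1"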

It sounds like we could provide a doc for people who are ready to upgrade deployment and other cluster resources to v1 API using kubectl convert, so they can do it at their convenience, and those who would be dragged kicking and screaming may get converted automatically if needed.

I didn't know there was a kubectl convert command, and the documentation for 1.16 has led me to believe that objects in the deprecated APIs will still be available (e.g. they would be automatically converted when the old support is dropped in an upgraded cluster). I don't know if we need to do this in Hephy, but if there are envs which may be able to get upgraded by the controller in a way that is more or less seamless, that would be awesome.

Let's make it clear what happens for cluster operators; this would be a great addition to the documentation once we have worked out exactly what the v1 API transition plan should look like. I like the way we handled the Postgres upgrades, and I don't have a problem with one-way-only transitions if necessary, but we should also consider what it might look like for an upgrade to go badly and how it could be safely reverted, if possible.

felixbuenemann commented 4 years ago

I don't think you need to do anything for a running cluster.

The kubectl convert command is deprecated with the following message:

kubectl convert is DEPRECATED and will be removed in a future version. In order to convert, kubectl apply the object to the cluster, then kubectl get at the desired version.

It's not entirely clear from the comment whether the cluster needs to support both the old and new representations, or if new cluster versions will still accept old resource manifests but convert them on the fly.
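
If I read the deprecation message right, the suggested workflow (sketched here; the manifest filename is just a placeholder) is to apply the existing manifest and then read the object back at the desired version:

    # apply the manifest as-is, then read the same object back at the new API version
    $ kubectl apply -f deis-router-deployment.yaml
    $ kubectl get deployments.v1.apps deis-router -n deis -o yaml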

kingdonb commented 4 years ago

Ah, so I have misunderstood. The API version is just a facade over a singular resource in the back end, and it can be queried with any supported API version. And then I guess, other than the RESTful resource interface itself, which changes shape in some ways, behavior should not be expected to change in any way when upgrading client API versions (only kubelet versions).

Cryptophobia commented 4 years ago

@felixbuenemann @kingdonb @till

The API version is just a facade over a singular resource in the back-end, and can be queried with any supported API version.

This is what my interpretation was from the deprecation messages. Using the Golang client for k8s, I remember querying the same resources on different API endpoints and receiving the same payload/resources. I noticed this when querying the APIs for the Ingress and Service objects. I hope this is still the case, because it will make things much easier for us. :chipmunk:

Cryptophobia commented 4 years ago

I have made quite a few changes now. Can you try testing it out again on a 1.16 cluster if you can? @felixbuenemann @kingdonb

Testing this would require building a new image and deploying it using the new chart updates, so that the controller gets deployed with the right apiVersion: apps/v1 going forward.

felixbuenemann commented 4 years ago

@kingdonb I don't have a 1.16 cluster for testing ATM, do you?

Shouldn't we also test on older clusters to make sure nothing breaks?

Cryptophobia commented 4 years ago

Shouldn't we also test on older clusters to make sure nothing breaks?

Yes, we need to test older versions to make sure backwards compatibility works, and also check that the upgrade path to 1.16 from older versions like 1.15 and 1.14 works without problems, just in case.

kingdonb commented 4 years ago

I do in fact have a Kubeadm cluster running 1.16.2 now, so I can run these tests on the next version.

edit:

And I have an example of the error you will get if you try to install the latest stable Hephy release on it:

$ helm install hephy/workflow --namespace deis --set global.use_rbac=true,router.host_port.enabled=true
Error: validation failed: [unable to recognize "": no matches for kind "DaemonSet" in version "extensions/v1beta1", unable to recognize "": no matches for kind "Deployment" in version "extensions/v1beta1"]

No surprises there. This is using Helm v2.15.0, the last minor version that will handle releases with a Tiller pod. I manually installed this cluster onto some Ubuntu nodes using kubeadm v1.16.2.

kingdonb commented 4 years ago

So, I updated the charts in #111 and tried this updated controller out; there are still some things to work out. Pushing the example-go app to the builder as a test, the slug builds successfully, but the result of the deploy is an error:

remote: 2019/10/27 15:40:59 Error running git receive hook [The controller returned an error when publishing the release: Unknown Error (400): {"detail":"(app::deploy): unorderable types: str() < NoneType()"}]

This is distinct from the previous error, so the ball is moving forward! But there is something wrong here, and I'm not sure how to debug it further by myself.
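
For what it's worth, that message is what Python 3 (before 3.6) raises when an ordering comparison mixes str and None, which suggests a version string is coming back as None somewhere in the deploy path:

    # Python 3.5: TypeError: unorderable types: str() < NoneType()
    # Python 3.6+ words the same failure as:
    #   '<' not supported between instances of 'str' and 'NoneType'
    "extensions/v1beta1" < None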

kingdonb commented 4 years ago

I merged this with #111, but I can't verify that it works. The problems I'm seeing may be caused by an issue in another component.

Cryptophobia commented 3 years ago

Merging this in now that we have done testing and are preparing for the new release, Hephy Workflow v2.22.0!

Cryptophobia commented 3 years ago

Fixes #125 in new release Hephy Workflow v2.22.0 coming soon!