kubernetes-sigs / cluster-api-provider-nested

Cluster API Provider for Nested Clusters
Apache License 2.0
298 stars 65 forks source link

🐛Fix VirtualCluster CRD and reconciler for work in 1.19+1.20 environment #316

Closed m-messiah closed 1 year ago

m-messiah commented 1 year ago

What this PR does / why we need it: When the super cluster has two different versions of API Server (specifically 1.19 and 1.20) it results in errors of vc-manager, as 1.20 added non-nullable nulls https://github.com/kubernetes/kubernetes/pull/95423, while 1.19 can't work with them.

{"level":"error","ts":1664969234.2211943,"logger":"controller-runtime.manager.controller.virtualcluster","msg":"Reconciler error","reconciler group":"tenancy.x-k8s.io","reconciler kind":"VirtualCluster","name":"test","namespace":"default","error":"VirtualCluster.tenancy.x-k8s.io \"test\" is invalid: [status.reason: Invalid value: \"null\": status.reason in body must be of type string: \"null\", status.message: Invalid value: \"null\": status.message in body must be of type string: \"null\", status.phase: Invalid value: \"null\": status.phase in body must be of type string: \"null\"]"}

{"level":"error","ts":1664975149.3489153,"logger":"controller-runtime.manager.controller.virtualcluster","msg":"Reconciler error","reconciler group":"tenancy.x-k8s.io","reconciler kind":"VirtualCluster","name":"test","namespace":"default","error":"VirtualCluster.tenancy.x-k8s.io \"test\" is invalid: status.conditions.status: Invalid value: \"null\": status.conditions.status in body must be of type string: \"null\""}

So the first solution is to set LastTransitionTime to be a pointer (as metav1.Time is not nullable https://github.com/kubernetes/kubernetes/issues/86811 And the second solution is to set status.condition.Status to non-empty string (as it expected initially). I set it always True, as we don't have false conditions, but we could improve it later.

With these two changes vc-manager became working again.

Which issue(s) this PR fixes: The PR is related to https://github.com/kubernetes-sigs/cluster-api-provider-nested/issues/63, but I am not sure it could fix all the things there

k8s-ci-robot commented 1 year ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: m-messiah Once this PR has been reviewed and has the lgtm label, please assign fei-guo for approval by writing /assign @fei-guo in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[virtualcluster/OWNERS](https://github.com/kubernetes-sigs/cluster-api-provider-nested/blob/main/virtualcluster/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
Fei-Guo commented 1 year ago

I am a little confused about the problematic version combination of vc, vc-manager, super. Can you provide the details of the following?

vc apiserver version: ? vc-manager client-go version: ? super apiserver version: ?

m-messiah commented 1 year ago

vc apiserver version: does not matter. Happens with 1.19 and 1.20 as well vc-manager client-go version: latest commit https://github.com/kubernetes-sigs/cluster-api-provider-nested/commit/e55cc6f87f61edc1fa028c4ba11a69c5e5b4330a , v0.21.9 as I see in go.mod super apiserver version: v1.19.13 + v1.20.15 on several nodes

So the main problem here is we have two apiserver versions in the super cluster during migration upgrade, that leads to incorrectly set CRs (by api 1.20), that are needed to be handled by 1.19.13 apiserver.

christopherhein commented 1 year ago

If we don't have mix super cluster apis does this issue still occur? Given this changes the expected types and those are exported this could cause other breakage for users that implement other reconcilers.

@Fei-Guo I assume this is the big concern, if you are using this resource when you update to the latest code it's going to require changing any out-of-tree usage of those status fields, right?

Fei-Guo commented 1 year ago

@christopherhein Yes, this is crd field type change without bumping the version. I am not sure if existing CRs need to be converted to make controller or cr list/get work.

m-messiah commented 1 year ago

We found that the problem was deeper than it looked, especially after additional checks that this PR changes did not help. The root cause of the issue is described here https://github.com/kubernetes/kubernetes/issues/99804 and fixed there https://github.com/kubernetes/kube-openapi/pull/230 and it happened for our cluster where we had couple Kubernetes API Servers built with Golang 1.16, that failed the validation.

The fast and only solution is to rebuilt the server with Go 1.15, as only Kube APIServer 1.21.0+ has the bugfix for 1.16 compatibility included