kubernetes-retired / kubefed

Kubernetes Cluster Federation
Apache License 2.0

A single federated cluster can stop propagation of a type for all clusters if it does not have the specified resource version. #1241

Closed · dangorst1066 closed this issue 3 years ago

dangorst1066 commented 4 years ago

A single federated cluster can stop propagation of a type for all clusters if it does not have a particular resource version.

And a question: are there any good strategies for handling cluster estates that have multiple versions of a resource in circulation (e.g. v1beta1 and v1 CRDs)?

Editing the target type version in the FederatedTypeConfig to v1beta1 (the lowest common denominator) appears to work around this (to be confirmed), but it is still worrying that a single cluster can stop all federation from working; this does not seem like it should be the expected behaviour.
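For reference, a minimal sketch of that workaround, assuming the FederatedTypeConfig for CRDs is named customresourcedefinitions.apiextensions.k8s.io and lives in the kube-federation-system namespace (adjust both to your install):

kubectl -n kube-federation-system patch federatedtypeconfig \
  customresourcedefinitions.apiextensions.k8s.io \
  --type=merge -p '{"spec":{"targetType":{"version":"v1beta1"}}}'

With the target type pinned to v1beta1, the sync controller should list and watch a version that every member cluster can serve.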

What happened:

- Ran a federation control plane at kube version 1.16.
- Enabled federation of CRDs (v1).
- Joined another 1.16 cluster and confirmed CRDs and CRs of that type were being propagated OK.
- Joined a 1.15 cluster (CRDs at version v1beta1): CRDs and CRs were not propagated to the 1.15 cluster.
- All propagation of CRDs and CRs of the same type stopped working for the 1.16 cluster as well.

Logs for the controller manager show msgs like:

E0630 07:13:47.048845       1 reflector.go:153] pkg/mod/k8s.io/client-go@v0.17.3/tools/cache/reflector.go:105: Failed to list apiextensions.k8s.io/v1, Kind=CustomResourceDefinition: the server could not find the requested resource
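A hedged way to pull these messages, assuming the default deployment name kubefed-controller-manager:

kubectl -n kube-federation-system logs deploy/kubefed-controller-manager --tail=100 | grep "Failed to list"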

What you expected to happen:

I expected v1 CRDs not to propagate to the 1.15 cluster; however, I did not expect propagation of all CRDs to all clusters to stop working.

How to reproduce it (as minimally and precisely as possible):

1. Run a federation control plane at kube version 1.16+.
2. Enable federation of v1 CRDs.
3. Create a federated CRD, and a CR of that type, with placement that will match all clusters.
4. Join another 1.16 cluster and confirm the CRD and CR are propagated OK.
5. Join a 1.15 cluster; expect the CRD and CR not to be propagated there.
6. Create a new federated CRD, or a CR of the original type. These should still be propagated to the 1.16 cluster, but I have observed they are not (see the shell sketch below).
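A rough shell sketch of the steps above, assuming kubefedctl is on the PATH, the kubeconfig contexts are named cluster1 (host, 1.16+), cluster2 (1.16+) and cluster3 (1.15), and crontab-crd.yaml / federated-crontab-crd.yaml are placeholder manifests:

# On the host cluster, enable federation of CRDs (targets apiextensions.k8s.io/v1 on a 1.16+ host)
kubectl config use-context cluster1
kubefedctl enable customresourcedefinitions

# Create a CRD plus its federated counterpart with placement matching all clusters
kubectl apply -f crontab-crd.yaml
kubectl apply -f federated-crontab-crd.yaml

# Join a second 1.16 cluster: the CRD and CR propagate as expected
kubefedctl join cluster2 --cluster-context cluster2 --host-cluster-context cluster1

# Join a 1.15 cluster: propagation of the type stalls for every cluster, not just cluster3
kubefedctl join cluster3 --cluster-context cluster3 --host-cluster-context cluster1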

Anything else we need to know?:

Environment:

/kind bug

RainbowMango commented 4 years ago

@dgorst Thanks for your feedback. Let me reproduce it locally and then get back to you.

RainbowMango commented 4 years ago

@dgorst Could you please help confirm whether these are the minimal steps to reproduce?

Prepare clusters:

[root@ecs-d8b6 kubefed]# kubectl -n kube-federation-system get kubefedclusters
NAME       AGE     READY
cluster1   9d      True // v1.17.4  (apiextensions.k8s.io/v1)  `this is the host cluster`
cluster2   9d      True // v1.17.4 (apiextensions.k8s.io/v1)
cluster3   3h10m   True // v1.15.0 (apiextensions.k8s.io/`v1beta1`)

Operation Steps:

Result:

[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster1
NAME                          CREATED AT
crontabs.stable.example.com   2020-07-01T12:50:31Z
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster2
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster3
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found
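To dig a bit further, the federated resource and the member cluster can be inspected from the host; assuming kubefed records propagation status on the federated resource (this may vary by version):

kubectl --context cluster1 get federatedcustomresourcedefinitions crontabs.stable.example.com -o yaml
kubectl --context cluster1 -n kube-federation-system describe kubefedcluster cluster3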

You expected the CRD to be propagated to cluster2 while cluster3 is ignored, right?

dangorst1066 commented 4 years ago

Yes exactly @RainbowMango 👍

It feels like the blast radius from a single (to be fair, misconfigured) cluster should not impact propagation to the good clusters. So in your example, yes, I don't expect a v1 CRD in cluster1 to be propagated to cluster3, but I would expect it to continue to be propagated to cluster2.

I mention a CR of the CRD's type because that also stops propagating at the point the 1.15 cluster is joined. But it's the same issue, I guess: the CRD doesn't get propagated because the controller can't list v1 CRDs, so it can't list that type either.

RainbowMango commented 4 years ago

@dgorst I did some investigation and found that the FederatedCustomResourceDefinition sync controller is completely blocked because one of the informers can't finish its sync process.

The following check keeps failing. https://github.com/kubernetes-sigs/kubefed/blob/bf67d02369e9b2d93281f8224747b94afab3170e/pkg/controller/sync/controller.go#L235-L238

I agree with you that the propagation process should ignore bad clusters. Let's see how to solve this.

dangorst1066 commented 4 years ago

Thanks @RainbowMango for recreating and confirming 👍

Happy to have a stab at resolving this if that would help (caveat: I'm new to the kubefed codebase, so I may need to reach out on Slack with some questions)!

RainbowMango commented 4 years ago

I've tried a workaround locally, but the community has discussed a better solution.

@hectorj2f @jimmidyson @irfanurrehman Could you please take a look? Is the solution that changes FederatedTypeConfigStatus OK for you?

irfanurrehman commented 4 years ago

@RainbowMango thanks for tracking this. IMO the solution proposed by pmorie in the link you mentioned is completely legitimate and can be implemented. As far as I understand, @font might not be available to complete it. @dgorst are you up for taking this on?

RainbowMango commented 4 years ago

Given the implementation is a little complicated (API changes, controller adaptation, testing, etc.), I'd like to set up an umbrella issue, split this into several tasks, and then work through them iteratively. @dgorst you are welcome to pick any of the items you are interested in.

What do you think, @irfanurrehman? And if it's OK with you, can you help review the follow-up PRs?

irfanurrehman commented 4 years ago

Awesome suggestion @RainbowMango. I can certainly review them. If time permits, I will take up some tasks too.

hectorj2f commented 4 years ago

Thanks for taking care of this @RainbowMango. It sounds good to me too. Please share the action items so we can see how we can help.

RainbowMango commented 4 years ago

I've just opened a draft umbrella issue, #1252. I have started some work locally, so I'll take the first task. Thanks for your support @irfanurrehman @hectorj2f.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

jimmidyson commented 4 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

hectorj2f commented 3 years ago

/remove-lifecycle stale

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/kubefed/issues/1241#issuecomment-859598708):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.