Closed: dangorst1066 closed this issue 3 years ago.
@dgorst Thanks for your feedback. Let me reproduce it locally and then get back to you.
@dgorst Could you please help confirm that these are the minimal steps to reproduce?
Prepare clusters:
[root@ecs-d8b6 kubefed]# kubectl -n kube-federation-system get kubefedclusters
NAME AGE READY
cluster1 9d True // v1.17.4 (apiextensions.k8s.io/v1) `this is the host cluster`
cluster2 9d True // v1.17.4 (apiextensions.k8s.io/v1)
cluster3 3h10m True // v1.15.0 (apiextensions.k8s.io/`v1beta1`)
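For completeness, one quick way to check which apiextensions API versions each member cluster actually serves (a small sketch; it assumes the kubeconfig context names match the cluster names listed above):

```sh
# List the apiextensions.k8s.io versions served by each cluster; the v1beta1-only
# cluster (cluster3) is the one that will trip up the sync controller.
for ctx in cluster1 cluster2 cluster3; do
  echo "== ${ctx}"
  kubectl --context "${ctx}" api-versions | grep '^apiextensions.k8s.io'
done
```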
Operation Steps (a concrete sketch follows the list):
1. Create a CRD crontabs.stable.example.com on the host cluster, whose apiVersion is apiextensions.k8s.io/v1.
2. kubefedctl enable customresourcedefinitions
3. kubefedctl federate crd crontabs.stable.example.com
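For reference, a minimal sketch of step 1, based on the standard CronTab example from the Kubernetes docs. The group and resource names come from this thread; the schema itself is illustrative.

```sh
# Create the v1 CRD on the host cluster (cluster1).
cat <<EOF | kubectl --context cluster1 apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string
EOF

# Then run steps 2 and 3 from the list above:
# kubefedctl enable customresourcedefinitions
# kubefedctl federate crd crontabs.stable.example.com
```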
Result:
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster1
NAME CREATED AT
crontabs.stable.example.com 2020-07-01T12:50:31Z
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster2
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found
[root@ecs-d8b6 kubefed]# kubectl get crds crontabs.stable.example.com --context cluster3
Error from server (NotFound): customresourcedefinitions.apiextensions.k8s.io "crontabs.stable.example.com" not found
You expected the CRD to be propagated to cluster2 while cluster3 is ignored, right?
Yes exactly @RainbowMango 👍
It feels like the blast radius from a single (tbf misconfigured) cluster should not impact propagation to the good clusters. So in your example, yes, I don't expect a v1 CRD in cluster1 to be propagated to cluster3, but I would expect it to continue to be propagated to cluster2.
I mentioned a CR of the CRD's type because that would also stop propagating at the point the 1.15 cluster is joined. But it's the same issue, I guess (the CRD doesn't get propagated because it can't list v1 CRDs, so it also can't list that type either).
@dgorst
I did some investigation and found that the FederatedCustomResourceDefinition sync controller is completely blocked because one of its informers can never finish syncing.
The following check keeps failing: https://github.com/kubernetes-sigs/kubefed/blob/bf67d02369e9b2d93281f8224747b94afab3170e/pkg/controller/sync/controller.go#L235-L238
I agree with you that the propagation process should ignore bad clusters. Let's see how to solve this.
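For anyone else hitting this, a rough way to observe the blockage from the outside (the namespace and deployment names below are the kubefed defaults; adjust if your install differs): the sync controller never reports the informer for the v1beta1-only cluster as synced, so reconciliation of the whole type stalls, which you can see in the controller-manager logs.

```sh
# Check joined clusters and inspect the controller-manager for the stalled sync.
kubectl -n kube-federation-system get kubefedclusters
kubectl -n kube-federation-system logs deploy/kubefed-controller-manager --tail=200
```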
Thanks @RainbowMango for recreating and confirming 👍
Happy to have a stab at resolving this if that would help? (Caveat: I'm new to the kubefed codebase, so I may need to reach out on Slack with some questions!)
I've tried a workaround locally, but a better solution has already been discussed in the community.
@hectorj2f @jimmidyson @irfanurrehman
Could you please take a look? Is the solution that changes FederatedTypeConfigStatus OK with you?
@RainbowMango thanks for tracking this. IMO the solution proposed by pmorie as per the link you mentioned is completely legit and can be implemented. As far as I understand @font might not be available to complete it. @dgorst are you up for taking this task up?
Given the implementation is a little bit complicated (API change, controller adoption, testing, etc.), I'd like to set up an umbrella issue, split this into several tasks, and then work through it by iteration. @dgorst, you are welcome to pick up any of the items you are interested in.
What do you think, @irfanurrehman? And if it's OK with you, could you help review the follow-up PRs?
Awesome suggestion @RainbowMango. I can certainly review them. If time permits, I will take up some tasks too.
Thanks for taking care of this @RainbowMango. It sounds good to me too. Share the action items and we'll see how we can help.
I've just filed a draft issue, #1252. I have started some work locally, so I'll take the first task. Thanks for your support @irfanurrehman @hectorj2f.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-contributor-experience at kubernetes/community.
/close
@fejta-bot: Closing this issue.
A single federated cluster can stop propagation of a type for all clusters if it does not serve a particular version of that resource.
And a question: are there any good strategies for handling cluster estates that could have multiple versions of a resource in circulation (e.g. v1beta1 and v1 CRDs)?
Editing the target type version in the federated type config to v1beta1 (the lowest common denominator) appears to work around this OK (tbc), but it's still worrying that a single cluster could stop all federation from working; this doesn't seem like it should be the expected behaviour.
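For reference, a hedged sketch of that workaround: point the FederatedTypeConfig's target type at the lowest-common-denominator version. The resource name and field path below are what a default `kubefedctl enable customresourcedefinitions` setup produces; verify them against your own cluster before applying.

```sh
# Switch the target type of the CRD FederatedTypeConfig from v1 to v1beta1.
kubectl -n kube-federation-system patch federatedtypeconfig \
  customresourcedefinitions.apiextensions.k8s.io \
  --type=merge -p '{"spec":{"targetType":{"version":"v1beta1"}}}'
```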
What happened:
- Ran a federation control plane at kube version 1.16.
- Enabled federation of CRDs (v1).
- Joined another 1.16 cluster; confirmed CRDs and CRs of that type were being propagated OK.
- Joined a 1.15 cluster; CRDs and CRs were not propagated to the 1.15 cluster (its CRDs are at version v1beta1). All propagation of CRDs and CRs of the same type stopped working for the 1.16 cluster as well.
Logs for the controller manager show msgs like:
What you expected to happen:
I expected v1 CRDs not to propagate to the 1.15 cluster, however I did not expect the propagation of all CRDs to all clusters to stop working.
How to reproduce it (as minimally and precisely as possible):
1. Run a federation control plane at kube version 1.16+.
2. Enable federation of v1 CRDs.
3. Create a federated CRD, and a CR of that type with a placement that matches all clusters.
4. Join another 1.16 cluster (see the join sketch below); confirm the CRD and CR are propagated OK.
5. Join a 1.15 cluster; expect the CRD and CR not to be propagated there.
6. Create a new federated CRD, or a CR of the original type; these should still be propagated to the 1.16 cluster, but I have observed they are not.
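A minimal sketch of the join steps (4 and 5); the cluster and context names follow the earlier comment in this thread and are illustrative, with cluster1 hosting the control plane.

```sh
# Join a 1.16 member, then the 1.15 member that only serves apiextensions.k8s.io/v1beta1.
kubefedctl join cluster2 --cluster-context cluster2 --host-cluster-context cluster1
kubefedctl join cluster3 --cluster-context cluster3 --host-cluster-context cluster1
```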
Anything else we need to know?:
Environment:
/kind bug