kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io

`clusterctl alpha topology plan` fails if infrastructure resource version does not match CAPI Contract Version #6220

Closed - apricote closed this issue 2 years ago

apricote commented 2 years ago

What steps did you take and what happened:

I then executed the command and got the following response:

$ clusterctl alpha topology plan -f cluster-resources.yaml -o output/ 
Detected a cluster with Cluster API installed. Will use it to fetch missing objects.
Error: failed to dry run the topology controller: error reading the ClusterClass: failed to get infrastructure cluster template for ClusterClass/default-class: failed to retrieve OpenStackClusterTemplate "default-class" in namespace "default": failed to retrieve OpenStackClusterTemplate external object "default"/"default-class": no matches for kind "OpenStackClusterTemplate" in version "infrastructure.cluster.x-k8s.io/v1beta1"

What did you expect to happen:

I expected clusterctl alpha topology plan to output a valid plan of the required topology changes, as CAPO conforms to the v1beta1 provider contract and I was previously able to successfully create a cluster using ClusterClass with CAPO (not with the example manifests provided here in this issue).

Anything else you would like to add:

The error message is (in itself) correct: there is no "OpenStackClusterTemplate" in version "infrastructure.cluster.x-k8s.io/v1beta1", as CAPO has only released v1alpha4 CRDs so far. util/conversion.UpdateReferenceAPIContract() changes the referenced API version from v1alpha4 to v1beta1.

It looks like util.GetGVKMetadata returns invalid labels in my local test setup with Kind (v1.23.3) and Tilt. According to GetGVKMetadata, the CRD has exactly one label, cluster.x-k8s.io/v1beta1: v1beta1, but the actual labels differ:

$ k get crd openstackclustertemplates.infrastructure.cluster.x-k8s.io -o=jsonpath='{.metadata.labels}' | jq
{
  "cluster.x-k8s.io/provider": "infrastructure-openstack",
  "cluster.x-k8s.io/v1alpha3": "v1alpha3",
  "cluster.x-k8s.io/v1alpha4": "v1alpha4",
  "cluster.x-k8s.io/v1beta1": "v1alpha4"
}
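
For context, this is conceptually how those contract labels are consumed when rewriting references (a sketch only; resolveVersionForContract is an illustrative helper, not the actual util/conversion implementation):

package main

import (
    "fmt"
    "strings"
)

// resolveVersionForContract mimics, conceptually, the lookup that the contract
// machinery performs: given the labels of a CRD, return the provider API
// version advertised for a given CAPI contract version.
func resolveVersionForContract(crdLabels map[string]string, contract string) (string, error) {
    labelKey := "cluster.x-k8s.io/" + contract // e.g. "cluster.x-k8s.io/v1beta1"
    value, ok := crdLabels[labelKey]
    if !ok {
        return "", fmt.Errorf("CRD does not declare compatibility with contract %q", contract)
    }
    // The label value may list several compatible versions; take the last one.
    versions := strings.Split(value, "_")
    return versions[len(versions)-1], nil
}

func main() {
    // The labels reported above for openstackclustertemplates.
    labels := map[string]string{
        "cluster.x-k8s.io/provider": "infrastructure-openstack",
        "cluster.x-k8s.io/v1alpha3": "v1alpha3",
        "cluster.x-k8s.io/v1alpha4": "v1alpha4",
        "cluster.x-k8s.io/v1beta1":  "v1alpha4",
    }
    version, _ := resolveVersionForContract(labels, "v1beta1")
    fmt.Println(version) // "v1alpha4": references should be rewritten to this version
}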

There is some previous discussion of this issue in Slack.

Environment:

/kind bug
/area clusterctl
/area topology

sbueringer commented 2 years ago

Just a short pointer for now (also mentioned in the Slack thread): I think the issue is not GetGVKMetadata; the problem is that we internally (when running the topology plan command) generate a "CRD" which only has a v1beta1 contract => v1beta1 mapping. Because of that we assume that all CRDs compatible with the v1beta1 contract also have the v1beta1 API version.

fabriziopandini commented 2 years ago

/milestone v1.2

... we assume that all CRDs compatible with the v1beta1 contract also have the v1beta1 API version.

v1beta1 contract -> v1beta1 version is a wrong assumption, but I think it is safe to assume that all the objects provided to the topology dry run are compatible with the v1beta1 contract.

So it should be enough to change https://github.com/kubernetes-sigs/cluster-api/blob/07b5e0a41b87e6a100461d6ac4dd46d1172254e7/cmd/clusterctl/client/cluster/topology.go#L513

to something like

clusterv1.GroupVersion.String(): gvk.Version,

But I defer the last call on the proposed solution to @ykakarap, who experimented a lot with this feature (or to @apricote, if he would like to validate the proposed fix). Also, this fix IMO qualifies for a backport.
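
For illustration, with that change the CRD that generateCRDs fakes for a referenced template would carry a contract label pointing at the object's own version, roughly like this (a sketch only; fakeCRDFor and the naive pluralization are made up for the example, not the actual topology.go code):

package sketch

import (
    "fmt"
    "strings"

    apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// fakeCRDFor is an illustrative stand-in for what generateCRDs builds for each
// referenced template: a minimal CRD whose contract label tells the dry-run
// client which API version the provider object actually uses.
func fakeCRDFor(gvk schema.GroupVersionKind) *apiextensionsv1.CustomResourceDefinition {
    return &apiextensionsv1.CustomResourceDefinition{
        ObjectMeta: metav1.ObjectMeta{
            // Naive pluralization, good enough for the example
            // (OpenStackClusterTemplate -> openstackclustertemplates.<group>).
            Name: fmt.Sprintf("%ss.%s", strings.ToLower(gvk.Kind), gvk.Group),
            Labels: map[string]string{
                // The proposed change: map the v1beta1 contract to the version
                // of the provided object instead of hard-coding "v1beta1".
                clusterv1.GroupVersion.String(): gvk.Version,
            },
        },
    }
}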

sbueringer commented 2 years ago

+1 to backport.

The fix sounds good enough, although it looks to me like this can lead to issues if someone uses multiple versions of a CR with the same group/kind (but we have that issue in generateCRDs independently of the fix).

E.g. OpenStackMachineTemplate v1alpha4 + v1alpha3 in the same YAML. In this case we would get two CRDs with the same name; both are added to the map, but eventually one would win.

We could also support this case by collecting a mapping of group/kind => versions and then adding all the infra API versions as the label value.
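
A rough sketch of that approach (helper names are made up; the underscore-joined value follows the format CAPI's contract labels use to list multiple compatible versions):

package sketch

import (
    "sort"
    "strings"

    "k8s.io/apimachinery/pkg/runtime/schema"
)

// versionsByGroupKind collects every API version seen per group/kind in the
// input objects, e.g. OpenStackMachineTemplate -> [v1alpha3 v1alpha4].
func versionsByGroupKind(gvks []schema.GroupVersionKind) map[schema.GroupKind][]string {
    seen := map[schema.GroupKind]map[string]struct{}{}
    for _, gvk := range gvks {
        gk := gvk.GroupKind()
        if seen[gk] == nil {
            seen[gk] = map[string]struct{}{}
        }
        seen[gk][gvk.Version] = struct{}{}
    }
    out := map[schema.GroupKind][]string{}
    for gk, set := range seen {
        for v := range set {
            out[gk] = append(out[gk], v)
        }
        sort.Strings(out[gk])
    }
    return out
}

// contractLabelValue joins the collected versions into a single label value,
// e.g. "v1alpha3_v1alpha4", so a single fake CRD per group/kind can advertise
// all of them for the v1beta1 contract.
func contractLabelValue(versions []string) string {
    return strings.Join(versions, "_")
}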

fabriziopandini commented 2 years ago

Let's keep the fix for this issue as scoped as possible so we can have a clean backport. Then we can iterate on supporting multiple versions if there is a need, or simply document the limitation in case the problem arises.

sbueringer commented 2 years ago

Sounds good to me. I don't think that would require many more changes, maybe 4-5 lines instead of 1. But I'm fine with both solutions.

apricote commented 2 years ago

So it should be enough to change https://github.com/kubernetes-sigs/cluster-api/blob/07b5e0a41b87e6a100461d6ac4dd46d1172254e7/cmd/clusterctl/client/cluster/topology.go#L513

to something like

clusterv1.GroupVersion.String(): gvk.Version,

This works great; the output is now a valid plan.

I am also +1 for a backport. Even though the command is marked as alpha, it is still really helpful (even required, IMO) for the adoption of ClusterClass.

apricote commented 2 years ago

Just tested multiple versions (OpenStackMachineTemplate as v1alpha4 for the workers and v1alpha3 for the control plane) and it errored out:

panic: failed to add object &CustomResourceDefinition{ObjectMeta:{openstackmachinetemplates.infrastructure.cluster.x-k8s.io     999 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[cluster.x-k8s.io/v1beta1:v1alpha4] map[] [] []  []},Spec:CustomResourceDefinitionSpec{Group:,Names:CustomResourceDefinitionNames{Plural:,Singular:,ShortNames:[],Kind:,ListKind:,Categories:[],},Scope:,Versions:[]CustomResourceDefinitionVersion{},Conversion:nil,PreserveUnknownFields:false,},Status:CustomResourceDefinitionStatus{Conditions:[]CustomResourceDefinitionCondition{},AcceptedNames:CustomResourceDefinitionNames{Plural:,Singular:,ShortNames:[],Kind:,ListKind:,Categories:[],},StoredVersions:[],},} to fake client: customresourcedefinitions.apiextensions.k8s.io "openstackmachinetemplates.infrastructure.cluster.x-k8s.io" already exists

goroutine 1 [running]:
sigs.k8s.io/controller-runtime/pkg/client/fake.(*ClientBuilder).Build(0xc000cf59f0)
        sigs.k8s.io/controller-runtime@v0.11.1/pkg/client/fake/client.go:143 +0x713
sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster/internal/dryrun.NewClient({0x7fa7cc06f860, 0xc0005b08c0}, {0xc000202000, 0x15, 0x7fa7cc06f860})
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster/internal/dryrun/client.go:117 +0x11a
sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster.(*topologyClient).Plan(0xc0007afd40, 0xc00015dcc0)
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/cluster/topology.go:156 +0x4a7
sigs.k8s.io/cluster-api/cmd/clusterctl/client.(*clusterctlClient).TopologyPlan(0x1c2605f, {{{0x0, 0x0}, {0x0, 0x0}}, {0xc00013c100, 0xf, 0x10}, {0x0, 0x0}, ...})
        sigs.k8s.io/cluster-api/cmd/clusterctl/client/topology.go:55 +0x14a
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.runTopologyPlan()
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/topology_plan.go:126 +0x1aa
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.glob..func15(0x2e86ae0, {0xc00040fac0, 0x4, 0x4})
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/topology_plan.go:82 +0x17
github.com/spf13/cobra.(*Command).execute(0x2e86ae0, {0xc00040fa80, 0x4, 0x4})
        github.com/spf13/cobra@v1.2.1/command.go:856 +0x60e
github.com/spf13/cobra.(*Command).ExecuteC(0x2e87760)
        github.com/spf13/cobra@v1.2.1/command.go:974 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v1.2.1/command.go:902
sigs.k8s.io/cluster-api/cmd/clusterctl/cmd.Execute()
        sigs.k8s.io/cluster-api/cmd/clusterctl/cmd/root.go:94 +0x25
main.main()
        sigs.k8s.io/cluster-api/cmd/clusterctl/main.go:26 +0x17

Still an improvement over the current situation.

sbueringer commented 2 years ago

I think this case will also be fixed if we implement my suggestion :)

fabriziopandini commented 2 years ago

/milestone v1.2

ykakarap commented 2 years ago

So it should be enough to change

https://github.com/kubernetes-sigs/cluster-api/blob/07b5e0a41b87e6a100461d6ac4dd46d1172254e7/cmd/clusterctl/client/cluster/topology.go#L513

to something like

clusterv1.GroupVersion.String(): gvk.Version,

But I defer the last call on the proposed solution to @ykakarap, who experimented a lot with this feature (or to @apricote, if he would like to validate the proposed fix). Also, this fix IMO qualifies for a backport.

Yes, the problem is caused by the CRDs we are installing during the topology plan command. (I briefly considered only installing CRDs that are not available in the backing cluster - that would solve the problem too - but later dropped that idea because we have to support the case with no backing cluster anyway, and the problem would persist there.)

This suggested fix would address the problem in the issue. Adding support for multiple versions should also be straightforward. I will open a PR for this.

ykakarap commented 2 years ago

/assign