kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

RELEASE NOTE: ensure MutatingWebhookConfiguration is deleted on upgrade #1927

Closed · huxcrux closed this 1 month ago

huxcrux commented 4 months ago

/kind bug

What steps did you take and what happened: It seems like https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1920 broke the defaulting webhook for OpenStackCluster in v1beta1.

Error during apply:

[2024-03-05T18:41:23+0100]: {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: failed calling webhook "default.openstackcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: the server could not find the requested resource', 'reason': 'InternalError', 'details': {'causes': [{'message': 'failed calling webhook "default.openstackcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: the server could not find the requested resource'}]}, 'code': 500}

I suspect some changes are needed in webhook generation, since the webhook appears to have been removed completely; see the following diff: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/commit/e6bb34f3bd6fa141c7c094f62b537891f716b4a7#diff-369b61dd1f2f1f60722927fda70e5a51e130cea68336e6d536086b522ffef0c6
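For anyone reproducing this: any create or update of an OpenStackCluster should trigger the failure, because the request is still routed through a webhook entry that no longer has a server behind it. A hypothetical minimal trigger (the manifest name is a placeholder):

$ kubectl apply -f my-openstackcluster.yaml
# Fails at admission with the "failed calling webhook" error shown above,
# before the object is ever persisted.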

Environment:

mdbooth commented 4 months ago

How are you triggering this?

mdbooth commented 4 months ago

I deliberately removed the defaulting webhooks in #1920 because none of them had an implementation.

Ah... I'll bet I know what happened. Note that #1920 removes the MutatingWebhookConfiguration from the manifests which references the defaulting webhook. However, if you're upgrading and you just apply the new manifests, there's going to be nothing to remove it.

I'm not sure how to handle that without an operator, tbh. Definitely needs a release note!

To start with, please can you confirm that deleting CAPO's MutatingWebhookConfiguration fixes the problem. If we can confirm that, we can think about how to automate it, or at least ensure folks know about it.
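A minimal sketch of that check (the configuration name matches what the default CAPO manifests create, as seen in the listing further down the thread):

$ kubectl get mutatingwebhookconfigurations
$ kubectl delete mutatingwebhookconfiguration capo-mutating-webhook-configuration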

huxcrux commented 4 months ago

> I deliberately removed the defaulting webhooks in #1920 because none of them had an implementation.
>
> Ah... I'll bet I know what happened. Note that #1920 removes the MutatingWebhookConfiguration from the manifests which references the defaulting webhook. However, if you're upgrading and you just apply the new manifests, there's going to be nothing to remove it.
>
> I'm not sure how to handle that without an operator, tbh. Definitely needs a release note!
>
> To start with, please can you confirm that deleting CAPO's MutatingWebhookConfiguration fixes the problem. If we can confirm that, we can think about how to automate it, or at least ensure folks know about it.

If I remove the mutating webhook, it solves the problem. Also, installing a new management cluster directly from main "fixes" the problem since the webhook is never created, meaning this is only a problem when upgrading CAPO to main at the moment.

lentzi90 commented 4 months ago

It seems that clusterctl upgrade is smart about this and that is why the upgrade test was successful. That means it will "only" be an issue when using kubectl or other tools instead of clusterctl.

$ k get mutatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                                                        WEBHOOKS   AGE
capi-kubeadm-bootstrap-mutating-webhook-configuration       2          14s
capi-kubeadm-control-plane-mutating-webhook-configuration   2          14s
capi-mutating-webhook-configuration                         9          14s
capo-mutating-webhook-configuration                         3          13s
cert-manager-webhook                                        1          36s
$ clusterctl upgrade apply --contract=v1beta1 --config=clusterctl.yaml
Checking cert-manager version...
Cert-manager is already up to date
Performing upgrade...
Scaling down Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
Deleting Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
Installing Provider="infrastructure-openstack" Version="v0.9.99" TargetNamespace="capo-system"

sigs.k8s.io/cluster-api
$ k get mutatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                                                        WEBHOOKS   AGE
capi-kubeadm-bootstrap-mutating-webhook-configuration       2          73s
capi-kubeadm-control-plane-mutating-webhook-configuration   2          73s
capi-mutating-webhook-configuration                         9          73s
cert-manager-webhook                                        1          95s
huxcrux commented 4 months ago

> It seems that clusterctl upgrade is smart about this and that is why the upgrade test was successful. That means it will "only" be an issue when using kubectl or other tools instead of clusterctl.
>
> $ k get mutatingwebhookconfigurations.admissionregistration.k8s.io
> NAME                                                        WEBHOOKS   AGE
> capi-kubeadm-bootstrap-mutating-webhook-configuration       2          14s
> capi-kubeadm-control-plane-mutating-webhook-configuration   2          14s
> capi-mutating-webhook-configuration                         9          14s
> capo-mutating-webhook-configuration                         3          13s
> cert-manager-webhook                                        1          36s
> $ clusterctl upgrade apply --contract=v1beta1 --config=clusterctl.yaml
> Checking cert-manager version...
> Cert-manager is already up to date
> Performing upgrade...
> Scaling down Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
> Deleting Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
> Installing Provider="infrastructure-openstack" Version="v0.9.99" TargetNamespace="capo-system"
>
> sigs.k8s.io/cluster-api
> $ k get mutatingwebhookconfigurations.admissionregistration.k8s.io
> NAME                                                        WEBHOOKS   AGE
> capi-kubeadm-bootstrap-mutating-webhook-configuration       2          73s
> capi-kubeadm-control-plane-mutating-webhook-configuration   2          73s
> capi-mutating-webhook-configuration                         9          73s
> cert-manager-webhook                                        1          95s

This indeed is a problem when just "applying" the new manifests. clusterctl upgrade gets around it by removing the provider's resources and then adding the new ones. We should add something to the release notes, since this is breaking for anyone not using clusterctl to upgrade CAPO.
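For anyone upgrading with plain kubectl, a minimal workaround sketch; the manifest filename here is an assumption based on the usual CAPO release asset name:

$ kubectl apply -f infrastructure-components.yaml
$ kubectl delete mutatingwebhookconfiguration capo-mutating-webhook-configuration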

For reference we have discussed this on slack where more details can be found: https://kubernetes.slack.com/archives/C8TSNPY4T/p1709672361982489

mdbooth commented 4 months ago

So this is 'not a bug', but it's a serious 'not a bug', so let's keep this open until we've ensured there is a release note about it.

@jichenjc I'm again thinking it would be really nice to have a workflow from GitHub which results in a Release Note. I think CPO has this. Any idea how it works?

jichenjc commented 4 months ago

@mdbooth you mean https://github.com/kubernetes/cloud-provider-openstack/blob/master/.github/PULL_REQUEST_TEMPLATE.md ?
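For context, that template follows the standard Kubernetes convention of a fenced release-note block in the PR description, which release tooling collects into the changelog. A sketch of what such a block could look like for this change (wording is illustrative only):

```release-note
ACTION REQUIRED: if you upgrade CAPO without clusterctl, delete the stale
capo-mutating-webhook-configuration left behind by the previous release.
```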

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

EmilienM commented 1 month ago

/remove-lifecycle stale

mdbooth commented 1 month ago

This was done.