kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

RELEASE NOTE: ensure MutatingWebhookConfiguration is deleted on upgrade #1927

Closed · huxcrux closed this 1 month ago

huxcrux commented 4 months ago

/kind bug

What steps did you take and what happened: It seems like https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1920 broke the defaulting webhook for OpenStackCluster in v1beta1.

Error during apply:

[2024-03-05T18:41:23+0100]: {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'Internal error occurred: failed calling webhook "default.openstackcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: the server could not find the requested resource', 'reason': 'InternalError', 'details': {'causes': [{'message': 'failed calling webhook "default.openstackcluster.infrastructure.cluster.x-k8s.io": failed to call webhook: the server could not find the requested resource'}]}, 'code': 500}

I suspect some changes are needed in webhook generation, since the webhook appears to have been removed completely; see the following diff: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/commit/e6bb34f3bd6fa141c7c094f62b537891f716b4a7#diff-369b61dd1f2f1f60722927fda70e5a51e130cea68336e6d536086b522ffef0c6
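For anyone reproducing this: any create or update of an OpenStackCluster should trigger the failure, because the request is still routed through a webhook entry that no longer has a server behind it. A hypothetical minimal trigger (the manifest name is a placeholder):

$ kubectl apply -f my-openstackcluster.yaml
# Fails at admission with the "failed calling webhook" error shown above,
# before the object is ever persisted.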

Environment:

mdbooth commented 4 months ago

How are you triggering this?

mdbooth commented 4 months ago

I deliberately removed the defaulting webhooks in #1920 because none of them had an implementation.

Ah... I'll bet I know what happened. Note that #1920 removes the MutatingWebhookConfiguration from the manifests which references the defaulting webhook. However, if you're upgrading and you just apply the new manifests, there's going to be nothing to remove it.

I'm not sure how to handle that without an operator, tbh. Definitely needs a release note!

To start with, please can you confirm that deleting CAPO's MutatingWebhookConfiguration fixes the problem. If we can confirm that, we can think about how to automate it, or at least ensure folks know about it.
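A minimal sketch of that check (the configuration name matches what the default CAPO manifests create, as seen in the listing further down the thread):

$ kubectl get mutatingwebhookconfigurations
$ kubectl delete mutatingwebhookconfiguration capo-mutating-webhook-configuration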

huxcrux commented 4 months ago

> I deliberately removed the defaulting webhooks in #1920 because none of them had an implementation.
>
> Ah... I'll bet I know what happened. Note that #1920 removes the MutatingWebhookConfiguration from the manifests which references the defaulting webhook. However, if you're upgrading and you just apply the new manifests, there's going to be nothing to remove it.
>
> I'm not sure how to handle that without an operator, tbh. Definitely needs a release note!
>
> To start with, please can you confirm that deleting CAPO's MutatingWebhookConfiguration fixes the problem. If we can confirm that, we can think about how to automate it, or at least ensure folks know about it.

If I remove the mutating webhook, it solves the problem. Also, installing a new management cluster directly from main "fixes" the problem since the webhook is never created, meaning this is only a problem when upgrading CAPO to main at the moment.

lentzi90 commented 4 months ago

It seems that clusterctl upgrade is smart about this and that is why the upgrade test was successful. That means it will "only" be an issue when using kubectl or other tools instead of clusterctl.

$ k get mutatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                                                        WEBHOOKS   AGE
capi-kubeadm-bootstrap-mutating-webhook-configuration       2          14s
capi-kubeadm-control-plane-mutating-webhook-configuration   2          14s
capi-mutating-webhook-configuration                         9          14s
capo-mutating-webhook-configuration                         3          13s
cert-manager-webhook                                        1          36s
$ clusterctl upgrade apply --contract=v1beta1 --config=clusterctl.yaml
Checking cert-manager version...
Cert-manager is already up to date
Performing upgrade...
Scaling down Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
Deleting Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
Installing Provider="infrastructure-openstack" Version="v0.9.99" TargetNamespace="capo-system"

sigs.k8s.io/cluster-api
$ k get mutatingwebhookconfigurations.admissionregistration.k8s.io 
NAME                                                        WEBHOOKS   AGE
capi-kubeadm-bootstrap-mutating-webhook-configuration       2          73s
capi-kubeadm-control-plane-mutating-webhook-configuration   2          73s
capi-mutating-webhook-configuration                         9          73s
cert-manager-webhook                                        1          95s
huxcrux commented 4 months ago

> It seems that clusterctl upgrade is smart about this and that is why the upgrade test was successful. That means it will "only" be an issue when using kubectl or other tools instead of clusterctl.
>
> $ k get mutatingwebhookconfigurations.admissionregistration.k8s.io
> NAME                                                        WEBHOOKS   AGE
> capi-kubeadm-bootstrap-mutating-webhook-configuration       2          14s
> capi-kubeadm-control-plane-mutating-webhook-configuration   2          14s
> capi-mutating-webhook-configuration                         9          14s
> capo-mutating-webhook-configuration                         3          13s
> cert-manager-webhook                                        1          36s
> $ clusterctl upgrade apply --contract=v1beta1 --config=clusterctl.yaml
> Checking cert-manager version...
> Cert-manager is already up to date
> Performing upgrade...
> Scaling down Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
> Deleting Provider="infrastructure-openstack" Version="v0.9.0" Namespace="capo-system"
> Installing Provider="infrastructure-openstack" Version="v0.9.99" TargetNamespace="capo-system"
>
> sigs.k8s.io/cluster-api
> $ k get mutatingwebhookconfigurations.admissionregistration.k8s.io
> NAME                                                        WEBHOOKS   AGE
> capi-kubeadm-bootstrap-mutating-webhook-configuration       2          73s
> capi-kubeadm-control-plane-mutating-webhook-configuration   2          73s
> capi-mutating-webhook-configuration                         9          73s
> cert-manager-webhook                                        1          95s

This indeed is a problem when just "applying" the new manifests. clusterctl upgrade gets around it by removing the provider's resources and then adding the new ones. We should add something to the release notes, since this is breaking for anyone not using clusterctl to upgrade CAPO.
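For anyone upgrading with plain kubectl, a minimal workaround sketch; the manifest filename here is an assumption based on the usual CAPO release asset name:

$ kubectl apply -f infrastructure-components.yaml
$ kubectl delete mutatingwebhookconfiguration capo-mutating-webhook-configuration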

For reference we have discussed this on slack where more details can be found: https://kubernetes.slack.com/archives/C8TSNPY4T/p1709672361982489

mdbooth commented 4 months ago

So this is 'not a bug', but it's a serious 'not a bug', so let's keep this open until we've ensured there is a release note about it.

@jichenjc I'm again thinking it would be really nice to have a workflow from GitHub which results in a Release Note. I think CPO has this. Any idea how it works?

jichenjc commented 4 months ago

@mdbooth you mean https://github.com/kubernetes/cloud-provider-openstack/blob/master/.github/PULL_REQUEST_TEMPLATE.md ?
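For context, that template follows the standard Kubernetes convention of a fenced release-note block in the PR description, which release tooling collects into the changelog. A sketch of what such a block could look like for this change (wording is illustrative only):

```release-note
ACTION REQUIRED: if you upgrade CAPO without clusterctl, delete the stale
capo-mutating-webhook-configuration left behind by the previous release.
```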

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

EmilienM commented 1 month ago

/remove-lifecycle stale

mdbooth commented 1 month ago

This was done.