kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0
279 stars 252 forks source link

Cluster occasionally stops reconciling prior to completion #1954

Closed mdbooth closed 4 months ago

mdbooth commented 4 months ago

/kind bug

I've seen this several times recently, normally on the upgrade test. An example:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1951/pull-cluster-api-provider-openstack-e2e-full-test/1769057441917440000

In the logs we see:

I0316 18:33:56.607682       1 port.go:599] "Adopted previously created port which was not in status" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="self-hosted-xudrlo/self-hosted-0wwq91" namespace="self-hosted-xudrlo" name="self-hosted-0wwq91" reconcileID="16545edd-8004-4a0d-b4ef-7a73fcf69301" cluster="self-hosted-0wwq91" port index=0 portID="779b61f4-3ce4-4319-9ece-213726d71430"

and then nothing further from the cluster controller. The above log message is always associated with an update to the status, so however the reconcile ended we should have been reconciled again.

The root cause was this predicate, which explicitly filters updates which only change the status:

https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/c1fc60d989552246c230eeb0cde1862137280a44/controllers/openstackcluster_controller.go#L829-L838

This predicate was originally added in commit 4b368a68676c766c4f39a7eab89fd60a754c2201, which notes that it was copying a pattern from CAPA.

FWIW, CAPA also still has this predicate for both AWSMachine and AWSCluster: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/068aba046a2cae6c7a6de6c295604d5cb863d5ce/controllers/awsmachine_controller.go#L260-L277

We should remove this predicate in both OpenStackCluster and OpenStackMachine, as it results in a race when exiting the reconcile after a status update without returning an error.