kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

additionalPorts combined with allowedCidrs causes a loop and LB provisioning never completes #1687

Open huxcrux opened 10 months ago

huxcrux commented 10 months ago

/kind bug

What steps did you take and what happened: When creating a cluster that defines both one or more additionalPorts and at least one CIDR under allowedCidrs, the LB never becomes fully functional and the cluster is never created.

It seems a check is missing for whether the LB has left the PENDING_UPDATE state, causing the API to respond with:

capo-system/capo-controller-manager-7c78d95c95-7qphw[manager]: I0926 13:49:42.720981 1 recorder.go:104] "events: Failed to create listener k8s-clusterapi-cluster-default-hux-lab1-kubeapi-31991: Expected HTTP response code [201 202] when accessing [POST https://:9876/v2.0/lbaas/listeners], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Load Balancer 7d690fcb-ea67-4462-a49e-dda806c8f792 is immutable and cannot be updated.\", \"debuginfo\": null}" type="Warning" object={Kind:OpenStackCluster Namespace:default Name:hux-lab1 UID:d973908d-ccd0-4f5b-9f9f-d628143d021a APIVersion:infrastructure.cluster.x-k8s.io/v1alpha7 ResourceVersion:33567788 FieldPath:} reason="Failedcreatelistener"

I have also tested additionalPorts and allowedCidrs each on their own and they work as intended; the problem only occurs when both are used at the same time.

Also worth noting: the allowedCidrs option is being set on the LB, and I have verified that the LB's security groups contain the correct rules, meaning this is just a race condition during provisioning.

What did you expect to happen: The flow seems to be:

  1. Create the LB.
  2. Create the listener.
  3. If allowedCidrs is defined, update the listener with the allowedCidrs list.
  4. Create the 2nd listener. This is done without checking whether the LB has left the PENDING_UPDATE state, causing an API failure; CAPO then allocates a new FIP for the LB and tries reconciling the listeners again, but since the LB already has a FIP attached it keeps looping in this state. (See the wait sketch after this list.)
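
A minimal sketch of the missing wait, assuming gophercloud's Octavia bindings and illustrative timeout values: after any mutating call on the LB or one of its children, poll the parent load balancer until its provisioning_status returns to ACTIVE before issuing the next call.

```go
package loadbalancer

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// waitForLoadBalancerActive polls the parent LB until Octavia reports it
// ACTIVE again, or fails on ERROR or timeout.
func waitForLoadBalancerActive(client *gophercloud.ServiceClient, lbID string) error {
	const (
		interval = 5 * time.Second
		timeout  = 5 * time.Minute
	)
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		lb, err := loadbalancers.Get(client, lbID).Extract()
		if err != nil {
			return err
		}
		switch lb.ProvisioningStatus {
		case "ACTIVE":
			return nil
		case "ERROR":
			return fmt.Errorf("load balancer %s is in ERROR state", lbID)
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("timed out waiting for load balancer %s to become ACTIVE", lbID)
}
```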

Anything else you would like to add: A gist with the complete output: https://gist.github.com/huxcrux/7c288f1c0b045de67eac1beaf3211e6e I redacted IPs and OpenStack API URIs. Every time FIP attachment fails, it's a new FIP that has been allocated.

Environment:

mdbooth commented 10 months ago

Thanks. Does it succeed eventually when it retries?

mdbooth commented 10 months ago

We're definitely leaking floating IPs every time this fails, and in a way which looks like it might prevent the cluster ever coming up. I think that's the most serious issue here.

mdbooth commented 10 months ago

We need to robustify this entire method against incremental failures. Unfortunately I don't have time to work on this myself right now, but if you are able to work on it or can find somebody else, I can help.

My initial thoughts:

openStackCluster.Status.APIServerLoadBalancer is populated in one shot at the end of the method. It could/should be populated incrementally as the LB and floating IP are created.

Additionally (due to a potential failure to write the status update) we should take some measure to check whether a FIP was previously created before creating a new one.
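
A minimal sketch of that check, assuming gophercloud's Neutron bindings and assuming the controller stamps a cluster-specific description on each FIP it creates (the description convention here is illustrative, not necessarily CAPO's): look the FIP up by that key first, so the create path stays idempotent even if the status update that would have recorded it was lost.

```go
package loadbalancer

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/floatingips"
)

// getOrCreateFIP reuses a FIP created by a previous (possibly failed)
// reconcile instead of leaking a new one on every retry.
func getOrCreateFIP(client *gophercloud.ServiceClient, networkID, description string) (*floatingips.FloatingIP, error) {
	// Look for a FIP we already created for this cluster.
	page, err := floatingips.List(client, floatingips.ListOpts{
		FloatingNetworkID: networkID,
		Description:       description,
	}).AllPages()
	if err != nil {
		return nil, err
	}
	existing, err := floatingips.ExtractFloatingIPs(page)
	if err != nil {
		return nil, err
	}
	if len(existing) > 0 {
		return &existing[0], nil // reuse instead of allocating again
	}
	// Nothing found: allocate a fresh one, recording the description so a
	// later reconcile can find it even if the status update is lost.
	return floatingips.Create(client, floatingips.CreateOpts{
		FloatingNetworkID: networkID,
		Description:       description,
	}).Extract()
}
```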

huxcrux commented 10 months ago

> Thanks. Does it succeed eventually when it retries?

No, after the first failure it never recovers without me manually modifying the LB.

huxcrux commented 10 months ago

> We need to robustify this entire method against incremental failures. Unfortunately I don't have time to work on this myself right now, but if you are able to work on it or can find somebody else, I can help.
>
> My initial thoughts:
>
> openStackCluster.Status.APIServerLoadBalancer is populated in one shot at the end of the method. It could/should be populated incrementally as the LB and floating IP are created.
>
> Additionally (due to a potential failure to write the status update) we should take some measure to check whether a FIP was previously created before creating a new one.

Sounds like a good way forward. Sadly I do not have much time during the upcoming week or so; if things change I'll post an update.

seanschneeweiss commented 9 months ago

Possibly related issue that is also about leaking floating IPs: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1632

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mdbooth commented 6 months ago

Is there any chance this was fixed by https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1829?

huxcrux commented 6 months ago

> Is there any chance this was fixed by #1829?

I think it's related. I will verify this tomorrow :)

huxcrux commented 6 months ago

> Is there any chance this was fixed by #1829?

Sadly this does not seem to fix the problem. I think the code change in the PR might contain a bug where it fails to patch because the status object does not exist.

EDIT: After more testing I think this is the same problem that https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1842 is trying to resolve. If I manually set ready to false, this code seems to work.

I0201 10:05:29.343878       1 recorder.go:104] "events: Failed to create listener k8s-clusterapi-cluster-default-hux-lab1-kubeapi-31991: Expected HTTP response code [201 202] when accessing [POST https://ops.elastx.cloud:9876/v2.0/lbaas/listeners], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Load Balancer cb8e830a-6cc8-4a0f-b854-270905bc43d1 is immutable and cannot be updated.\", \"debuginfo\": null}" type="Warning" object={"kind":"OpenStackCluster","namespace":"default","name":"hux-lab1","uid":"f9602d51-b6a4-4a38-be29-ca7dff70cd01","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha8","resourceVersion":"19401"} reason="Failedcreatelistener"
E0201 10:05:29.354438       1 controller.go:329] "Reconciler error" err=<
    [failed to reconcile load balancer: Expected HTTP response code [201 202] when accessing [POST https://ops.elastx.cloud:9876/v2.0/lbaas/listeners], but got 409 instead
    {"faultcode": "Client", "faultstring": "Load Balancer cb8e830a-6cc8-4a0f-b854-270905bc43d1 is immutable and cannot be updated.", "debuginfo": null}, error patching OpenStackCluster default/hux-lab1: OpenStackCluster.infrastructure.cluster.x-k8s.io "hux-lab1" is invalid: status.ready: Required value]
 > controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="default/hux-lab1" namespace="default" name="hux-lab1" reconcileID="edb06052-e2a4-4485-9bd9-01a30caa8e47"
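
For illustration, a minimal sketch of the defaulting that would avoid the status.ready: Required value patch failure above, in the direction #1842 appears to take; the types here are local stand-ins, not the provider's actual ones.

```go
package sketch

// Local stand-ins for the real CAPO API types.
type OpenStackClusterStatus struct {
	Ready bool `json:"ready"`
}

type OpenStackCluster struct {
	Status OpenStackClusterStatus
}

// reconcile sets the required status.ready field first, so a status patch
// issued from any later error path still passes CRD validation, and only
// flips it to true once everything has reconciled successfully.
func reconcile(cluster *OpenStackCluster) error {
	cluster.Status.Ready = false
	// ... reconcile network, security groups, load balancer ...
	cluster.Status.Ready = true
	return nil
}
```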

I have identified two problems with the loadbalancer and allowedCIDRs logic.

  1. The first is the check that waits for the LB to become active again. I do not know if it's too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297
  2. The list comparison seems to be broken: even if the list of IPs matches, a new update is triggered because the list is not sorted. The code responsible for this can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L282 (see the comparison sketch below)

As a workaround I tried putting a sleep just after the code linked under the first issue; 10 seconds seems to be just enough for everything to work. However, if the second problem is resolved, I think the first will resolve itself after a new reconcile, though this requires an additional reconcile per listener because the allowedCIDR patch is applied to each LB listener.
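
A minimal sketch of the order-insensitive comparison suggested for the second problem, using only the Go standard library; the function name is illustrative.

```go
package loadbalancer

import "slices"

// allowedCIDRsEqual compares the desired and observed allowed_cidrs lists
// without regard to order, so a reordering returned by Octavia does not
// trigger a spurious listener update.
func allowedCIDRsEqual(desired, observed []string) bool {
	d := slices.Clone(desired)
	o := slices.Clone(observed)
	slices.Sort(d)
	slices.Sort(o)
	return slices.Equal(d, o)
}
```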

huxcrux commented 6 months ago

/remove-lifecycle stale

mdbooth commented 6 months ago

> I have identified two problems with the loadbalancer and allowedCIDRs logic.
>
>   1. The first is the check that waits for the LB to become active again. I do not know if it's too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297

Is it possible that after updating the listener the loadbalancer becomes temporarily not ACTIVE?

cc @dulek

>   2. The list comparison seems to be broken: even if the list of IPs matches, a new update is triggered because the list is not sorted. The code responsible for this can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L282
>
> As a workaround I tried putting a sleep just after the code linked under the first issue; 10 seconds seems to be just enough for everything to work. However, if the second problem is resolved, I think the first will resolve itself after a new reconcile, though this requires an additional reconcile per listener because the allowedCIDR patch is applied to each LB listener.

dulek commented 5 months ago

>> I have identified two problems with the loadbalancer and allowedCIDRs logic.
>>
>>   1. The first is the check that waits for the LB to become active again. I do not know if it's too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297
>
> Is it possible that after updating the listener the loadbalancer becomes temporarily not ACTIVE?
>
> cc @dulek

The LB will go ACTIVE->PENDING_UPDATE->ACTIVE when AllowedCIDRs are modified. You can generally expect that any action done on the LB or its children requires you to wait for the LB to be ACTIVE again.

Also, looking at that linked code fragment: you should never wait for the listener to be ACTIVE again, but rather for the whole LB. I think this might be causing the problems. Never wait for anything other than the LB; state management for the underlying resources is unreliable, and only the LB status counts when waiting for readiness.
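
A sketch of that rule, reusing the waitForLoadBalancerActive helper from the earlier sketch; the listeners.UpdateOpts shape shown follows gophercloud and should be treated as an assumption.

```go
package loadbalancer

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/listeners"
)

// updateListenerAllowedCIDRs mutates a child resource and then waits on the
// parent LB's provisioning_status, never on the listener's own status.
func updateListenerAllowedCIDRs(client *gophercloud.ServiceClient, lbID, listenerID string, cidrs []string) error {
	// This call flips the parent LB to PENDING_UPDATE.
	if _, err := listeners.Update(client, listenerID, listeners.UpdateOpts{
		AllowedCIDRs: &cidrs,
	}).Extract(); err != nil {
		return err
	}
	// Only the LB's status is reliable; wait for ACTIVE before the next call.
	return waitForLoadBalancerActive(client, lbID)
}
```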

Yes, Octavia API is really tough for users.

cc @huxcrux

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  - Mark this issue as fresh with /remove-lifecycle stale
  - Close this issue with /close
  - Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

EmilienM commented 2 months ago

/remove-lifecycle stale