huxcrux opened 1 year ago
Thanks. Does it succeed eventually when it retries?
We're definitely leaking floating IPs every time this fails, and in a way which looks like it might prevent the cluster ever coming up. I think that's the most serious issue here.
We need to robustify this entire method against incremental failures. Unfortunately I don't have time to work on this myself right now, but if you are able to work on it or can find somebody else I can help.
My initial thoughts:
`openStackCluster.Status.APIServerLoadBalancer` is populated in one shot at the end of the method. It could/should be populated incrementally as the LB and floating IP are created.
Additionally (due to a potential failure to write the status update) we should check whether a FIP was previously created before creating a new one; a sketch of that check follows.
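To illustrate that check, here is a minimal adopt-before-create sketch, assuming the FIP can be found again by the description the provider sets on it; `getOrCreateFIP` and its parameters are hypothetical helpers, not CAPO's actual API:

```go
package loadbalancer

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/floatingips"
)

// getOrCreateFIP adopts a floating IP left over from a previous, partially
// failed reconcile instead of leaking it and allocating another one.
func getOrCreateFIP(client *gophercloud.ServiceClient, networkID, description string) (*floatingips.FloatingIP, error) {
	// Look for a FIP we may already have created, identified by its description.
	pages, err := floatingips.List(client, floatingips.ListOpts{
		FloatingNetworkID: networkID,
		Description:       description,
	}).AllPages()
	if err != nil {
		return nil, fmt.Errorf("listing floating IPs: %w", err)
	}
	existing, err := floatingips.ExtractFloatingIPs(pages)
	if err != nil {
		return nil, err
	}
	if len(existing) > 0 {
		return &existing[0], nil
	}
	// Nothing to adopt; create a new FIP carrying the same description so a
	// later retry can find it even if the status write fails.
	return floatingips.Create(client, floatingips.CreateOpts{
		FloatingNetworkID: networkID,
		Description:       description,
	}).Extract()
}
```

Writing the FIP into `openStackCluster.Status.APIServerLoadBalancer` immediately after this call, rather than at the end of the method, would close the remaining window.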
> Thanks. Does it succeed eventually when it retries?
No, after the first failure it never recovers without me manually modifying the LB.
> We need to robustify this entire method against incremental failures. Unfortunately I don't have time to work on this myself right now, but if you are able to work on it or can find somebody else I can help.
>
> My initial thoughts:
>
> `openStackCluster.Status.APIServerLoadBalancer` is populated in one shot at the end of the method. It could/should be populated incrementally as the LB and floating IP are created. Additionally (due to a potential failure to write the status update) we should check whether a FIP was previously created before creating a new one.
Sounds like a good way forward. Sadly I do not have much time during the upcoming week or so; if things change I'll post an update.
Possibly related issue that is also about leaking floating IPs: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1632
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
Is there any chance this was fixed by https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1829?
> Is there any chance this was fixed by #1829?
I think it's related. I will verify this tomorrow :)
> Is there any chance this was fixed by #1829?
Sadly this does not seem to fix the problem. I think the code change in the PR might contain a bug where it fails to patch due to the status object not existing.

EDIT: After more testing I think this is related to the same problem that https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1842 is trying to resolve. If I manually set ready to false, this code seems to work.
I0201 10:05:29.343878 1 recorder.go:104] "events: Failed to create listener k8s-clusterapi-cluster-default-hux-lab1-kubeapi-31991: Expected HTTP response code [201 202] when accessing [POST https://ops.elastx.cloud:9876/v2.0/lbaas/listeners], but got 409 instead\n{\"faultcode\": \"Client\", \"faultstring\": \"Load Balancer cb8e830a-6cc8-4a0f-b854-270905bc43d1 is immutable and cannot be updated.\", \"debuginfo\": null}" type="Warning" object={"kind":"OpenStackCluster","namespace":"default","name":"hux-lab1","uid":"f9602d51-b6a4-4a38-be29-ca7dff70cd01","apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha8","resourceVersion":"19401"} reason="Failedcreatelistener"
E0201 10:05:29.354438 1 controller.go:329] "Reconciler error" err=<
[failed to reconcile load balancer: Expected HTTP response code [201 202] when accessing [POST https://ops.elastx.cloud:9876/v2.0/lbaas/listeners], but got 409 instead
{"faultcode": "Client", "faultstring": "Load Balancer cb8e830a-6cc8-4a0f-b854-270905bc43d1 is immutable and cannot be updated.", "debuginfo": null}, error patching OpenStackCluster default/hux-lab1: OpenStackCluster.infrastructure.cluster.x-k8s.io "hux-lab1" is invalid: status.ready: Required value]
> controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="default/hux-lab1" namespace="default" name="hux-lab1" reconcileID="edb06052-e2a4-4485-9bd9-01a30caa8e47"
/remove-lifecycle stale
I have identified two problems with the loadbalancer and allowedCIDRs logic.
- The first problem is the check that waits for the LB to become active again. I do not know if it is too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297
Is it possible that after updating the listener the loadbalancer becomes temporarily not ACTIVE?
cc @dulek
- The list compare seems to be broken: even if the list of IPs matches, a new update is triggered because the lists are not sorted before comparison (see the sketch after this comment). The code responsible for this can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L282
As a workaround I tried putting a sleep just after the code linked under the first issue; 10 seconds seems to be just enough for everything to work. However, if the second problem is resolved I think the first one will resolve itself after a new reconcile. Note that this requires an additional reconcile per listener, because the allowedCIDR patch is applied to each LB listener.
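For the second problem, comparing the CIDR lists order-insensitively avoids the spurious updates. A minimal sketch (the helper name is illustrative, not the provider's actual code):

```go
package loadbalancer

import "slices"

// allowedCIDRsEqual reports whether two CIDR lists contain the same entries,
// ignoring order, so an unchanged but reordered list does not trigger an update.
func allowedCIDRsEqual(desired, actual []string) bool {
	d := slices.Clone(desired) // copy before sorting to avoid mutating the inputs
	a := slices.Clone(actual)
	slices.Sort(d)
	slices.Sort(a)
	return slices.Equal(d, a)
}
```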
> I have identified two problems with the loadbalancer and allowedCIDRs logic.
>
> - The first problem is the check that waits for the LB to become active again. I do not know if it is too quick or something, but it seems to pass before the changes are applied. The code responsible for the check can be found here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/5cc483bfc6eae8a8b8a67b32e9b7af0bafa473ca/pkg/cloud/services/loadbalancer/loadbalancer.go#L294-L297
>
> Is it possible that after updating the listener the loadbalancer becomes temporarily not ACTIVE?
>
> cc @dulek
The LB will go ACTIVE -> PENDING_UPDATE -> ACTIVE when `AllowedCIDRs` are modified. You can generally expect that any action done on the LB or its children requires you to wait for the LB to be ACTIVE again.

Also, looking at that linked code fragment: you should never wait for the listener to be ACTIVE again, but rather for the whole LB. I think this might be causing the problems. Never wait for anything other than the LB; state management for the underlying resources is unreliable, and only the LB status counts when waiting for readiness.
Yes, the Octavia API is really tough for users.
cc @huxcrux
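For illustration, a minimal sketch of what waiting on the whole LB could look like with gophercloud's Octavia v2 bindings; the helper name and polling interval are illustrative, not CAPO's actual code:

```go
package loadbalancer

import (
	"context"
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// waitForLBActive polls the load balancer itself (never a listener or other
// child resource) until Octavia reports it ACTIVE again, failing fast on ERROR.
func waitForLBActive(ctx context.Context, client *gophercloud.ServiceClient, lbID string) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		lb, err := loadbalancers.Get(client, lbID).Extract()
		if err != nil {
			return fmt.Errorf("getting load balancer %s: %w", lbID, err)
		}
		switch lb.ProvisioningStatus {
		case "ACTIVE":
			return nil
		case "ERROR":
			return fmt.Errorf("load balancer %s is in ERROR state", lbID)
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```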
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/kind bug
What steps did you take and what happened: When creating a cluster and defining both one or more additionalPorts and at least one CIDR under allowedCidrs, the LB never becomes fully functional and no cluster is created.
It seems like there is a check missing for whether the LB is out of PENDING_UPDATE, causing the API to respond with a 409 "Load Balancer is immutable and cannot be updated" error.
I have also tested both additionalPorts and allowedCidrs one by one and each works as intended; it is only when both are used at the same time that this occurs.
Also worth noticing: the allowedCidrs option is being set on the LB, and I have verified that the security groups for the LB contain the correct rules, meaning this is just a race condition during provisioning.
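For reference, a hedged sketch of the sequencing this race implies, reusing the `waitForLBActive` helper sketched earlier in the thread; setting `AllowedCIDRs` at listener creation time is an assumption on my part that would also avoid the extra PENDING_UPDATE cycle per listener:

```go
package loadbalancer

import (
	"context"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/listeners"
)

// reconcileListeners creates one listener per port, waiting for the LB (not
// the listener) to return to ACTIVE between Octavia calls; issuing the next
// create while the LB is still PENDING_UPDATE yields the 409 above.
func reconcileListeners(ctx context.Context, client *gophercloud.ServiceClient, lbID string, ports []int, allowedCIDRs []string) error {
	for _, port := range ports {
		_, err := listeners.Create(client, listeners.CreateOpts{
			LoadbalancerID: lbID,
			Protocol:       listeners.ProtocolTCP,
			ProtocolPort:   port,
			AllowedCIDRs:   allowedCIDRs, // set on create instead of a separate patch
		}).Extract()
		if err != nil {
			return err
		}
		if err := waitForLBActive(ctx, client, lbID); err != nil {
			return err
		}
	}
	return nil
}
```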
What did you expect to happen: The flow seems to be:
Anything else you would like to add: A gist with the complete output: https://gist.github.com/huxcrux/7c288f1c0b045de67eac1beaf3211e6e (I redacted IPs and OpenStack API URIs). Every time FIP attachment fails, it is a new FIP that has been allocated.
Environment:

- Cluster API Provider OpenStack version (or git rev-parse HEAD if manually built): 0.8.0
- Kubernetes version (use kubectl version): 1.28.1
- OS (e.g. from /etc/os-release): Ubuntu 20.04