kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

`DeletePoolMember` too fragile with OVN #1810

Closed EmilienM closed 3 months ago

EmilienM commented 8 months ago

/kind bug

What steps did you take and what happened: When deploying OpenStack Zed with Neutron/Octavia/OVN, we hit errors when creating the API load balancers and when the pool is updated: https://paste.opendev.org/show/bZ2NGZyifEXa9z4xfzDE/

error deleting lbmember: Expected HTTP response code [202 204] when accessing [DELETE http://10.0.3.15/load-balancer/v2.0/lbaas/pools/81bf6069-982b-459e-8039-370c6fda4b43/members/5f366006-28e1-49bb-86d3-b13540aabcf0], but got 409 instead

What did you expect to happen:

The pool can be updated safely.

Environment:

jichenjc commented 8 months ago

Is it because of https://github.com/openstack/octavia/blob/master/octavia/common/exceptions.py#L188? If so, is this something that needs to be supported in Octavia first?

E1227 03:04:26.726377       1 controller.go:329] "Reconciler error" err=<
    error deleting lbmember: Expected HTTP response code [202 204] when accessing [DELETE http://10.0.3.15/load-balancer/v2.0/lbaas/pools/81bf6069-982b-459e-8039-370c6fda4b43/members/5f366006-28e1-49bb-86d3-b13540aabcf0], but got 409 instead
    {"faultcode": "Client", "faultstring": "Load Balancer 3c3a57d2-8de1-4bd4-bd1c-3b280c36290b is immutable and cannot be updated.", "debuginfo": null}
 > controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="e2e-ftcai7/cluster-e2e-ftcai7-control-plane-kl4nh" namespace="e2e-ftcai7" name="cluster-e2e-ftcai7-control-plane-kl4nh" reconcileID="8570c8d9-ab17-4a82-aa6d-a1030bd4786e"
I1227 03:04:26.731140       1 openstackmachine_controller.go:222] "Reconciling Machine delete" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="e2e-tts4oz/cluster-e2e-tts4oz" namespace="e2e-tts4oz" name="cluster-e2e-tts4oz" reconcileID="e0d60443-7747-485c-bffb-bdca26be0ba7" cluster="cluster-e2e-tts4oz"
E1227 03:04:26.791383       1 controller.go:329] "Reconciler error" err=<
    error deleting lbmember: Expected HTTP response code [202 204] when accessing [DELETE http://10.0.3.15/load-balancer/v2.0/lbaas/pools/81bf6069-982b-459e-8039-370c6fda4b43/members/a5517fdb-f38f-441a-8605-62543359341a], but got 409 instead
    {"faultcode": "Client", "faultstring": "Pool 81bf6069-982b-459e-8039-370c6fda4b43 is immutable and cannot be updated.", "debuginfo": null}
 > controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="e2e-ftcai7/clust
dulek commented 8 months ago

This is raised when the LB is in a PENDING_* state. We do wait for the LB to become ACTIVE again, but two threads (by default we run with 10) can be waiting on the same LB at once, and the race between them lets only one actually issue its call once the LB is ACTIVE; the other one presumably finds the LB in PENDING_* again and gets the 409. On a 409 we should probably just go back to waiting for the LB to be ACTIVE again.
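The "restart waiting on 409" idea could look roughly like the sketch below. This is not the provider's actual code: `lbClient`, `errConflict`, `waitForActive` and `deleteMemberWithRetry` are hypothetical names made up for illustration, and how a 409 is detected with the real OpenStack client is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict is a hypothetical stand-in for the HTTP 409 that Octavia returns
// while the load balancer (or pool) is immutable, i.e. in a PENDING_* state.
var errConflict = errors.New("409 conflict: load balancer is immutable")

// lbClient is a hypothetical, minimal view of the two operations we need.
// It is a struct of functions so it can be faked easily; it is not a real API.
type lbClient struct {
	ProvisioningStatus func(lbID string) (string, error)
	DeleteMember       func(poolID, memberID string) error
}

// waitForActive polls the provisioning status until it is ACTIVE,
// checking at most `attempts` times and sleeping `interval` between checks.
func waitForActive(c lbClient, lbID string, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		status, err := c.ProvisioningStatus(lbID)
		if err != nil {
			return err
		}
		if status == "ACTIVE" {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("load balancer %s not ACTIVE after %d checks", lbID, attempts)
}

// deleteMemberWithRetry waits for ACTIVE, then issues the delete. If the delete
// loses the race to another worker and gets a conflict, it goes back to waiting
// instead of failing the reconcile. Any other error is returned immediately.
func deleteMemberWithRetry(c lbClient, lbID, poolID, memberID string, retries int, interval time.Duration) error {
	for i := 0; i < retries; i++ {
		if err := waitForActive(c, lbID, 10, interval); err != nil {
			return err
		}
		err := c.DeleteMember(poolID, memberID)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
		// Conflict: another caller changed the LB first; wait and try again.
	}
	return fmt.Errorf("giving up deleting member %s after %d conflicts", memberID, retries)
}
```

The retry count is bounded so that a load balancer stuck in PENDING_* still surfaces as an error rather than blocking the worker forever.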

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

mdbooth commented 4 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

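This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with Issue Triage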

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1810#issuecomment-2153031263):

>The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
>This bot triages issues according to the following rules:
>- After 90d of inactivity, `lifecycle/stale` is applied
>- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
>- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
>You can:
>- Reopen this issue with `/reopen`
>- Mark this issue as fresh with `/remove-lifecycle rotten`
>- Offer to help out with [Issue Triage][1]
>
>Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
>/close not-planned
>
>[1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.