kubernetes / cloud-provider-openstack


[occm] Support for OpenStack LBaaS providers that do not implement the "fully-populated" loadbalancer API call #2271

Closed: m-bull closed this issue 1 year ago

m-bull commented 1 year ago

/kind feature

What happened: Some Octavia plugins (specifically vmwareedge/vmware-nsx in this case) do not provide an API endpoint for creating "fully-populated" loadbalancers, so this call fails with an HTTP 503. As a result, loadbalancers aren't created. These plugins are deployed on VMware Integrated OpenStack (VIO) platforms.

While the vmwareedge Octavia provider isn't explicitly supported here (just Amphora/OVN), recycling some old code to use the non-fully-populated loadbalancer API calls and placing it behind an opt-in switch would allow it to be used. I had a go at making these changes here, and this approach works to create loadbalancers in a VMware VIO7 environment, although I haven't tested ingress resources in anger.
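For context, here is a minimal sketch of the two creation styles using gophercloud directly. Function and variable names (`createFullyPopulated`, `createSequentially`, `subnetID`) are illustrative, not OCCM's actual code:

```go
package main

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/listeners"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// Fully-populated creation (OCCM >= 1.21): a single API call carries the
// loadbalancer and all of its children. Providers without this endpoint
// (e.g. vmwareedge/vmware-nsx) reject it with an HTTP 503.
func createFullyPopulated(lbClient *gophercloud.ServiceClient, subnetID string) (*loadbalancers.LoadBalancer, error) {
	return loadbalancers.Create(lbClient, loadbalancers.CreateOpts{
		Name:        "kube_service_example",
		VipSubnetID: subnetID,
		Listeners: []listeners.CreateOpts{{
			Protocol:     listeners.ProtocolTCP,
			ProtocolPort: 80,
		}},
	}).Extract()
}

// Non-fully-populated ("legacy", pre-1.21 style) creation: create the bare
// loadbalancer first, then attach each child with its own call, waiting for
// the LB to return to ACTIVE in between (wait elided here).
func createSequentially(lbClient *gophercloud.ServiceClient, subnetID string) (*loadbalancers.LoadBalancer, error) {
	lb, err := loadbalancers.Create(lbClient, loadbalancers.CreateOpts{
		Name:        "kube_service_example",
		VipSubnetID: subnetID,
	}).Extract()
	if err != nil {
		return nil, err
	}
	// ...wait for lb.ProvisioningStatus == "ACTIVE", then:
	_, err = listeners.Create(lbClient, listeners.CreateOpts{
		LoadbalancerID: lb.ID,
		Protocol:       listeners.ProtocolTCP,
		ProtocolPort:   80,
	}).Extract()
	return lb, err
}
```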

The patch is a bit old now (and probably not implemented brilliantly), but if the project would be interested in supporting a "legacy-api-call" config option that re-enables the non-fully-populated loadbalancer creation approach from OCCM <1.21 alongside the existing approach, I'd happily contribute.

What you expected to happen: Loadbalancers are created.

How to reproduce it: On a VMware VIO7 cloud (OpenStack Train with the vmware-nsx Neutron/Octavia plugins), any version of OCCM after 1.20 fails to provision a loadbalancer.

Environment:

mdbooth commented 1 year ago

@dulek Thoughts?

dulek commented 1 year ago

@m-bull: Thanks for raising this! The code you're attempting to add doesn't feel too cluttered, so I think it's worth considering. I wonder, however, how you deal with the fact that after each call the CCM needs to wait for the LB to become ACTIVE again before making the next call. In bigger environments (60-100 nodes) we've seen this cause LB creation to take up to an hour, and I believe this is the reason CCM uses the bulk APIs.
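To make that cost concrete: each per-resource call implies a wait roughly like the simplified sketch below (not OCCM's actual implementation), and with a listener, pool, monitor, and a member per node for every Service, those waits add up quickly on large clusters:

```go
package main

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/loadbalancer/v2/loadbalancers"
)

// waitForActive polls the loadbalancer until it leaves the PENDING_* states.
// The sequential approach pays this wait on every child-resource call.
func waitForActive(lbClient *gophercloud.ServiceClient, lbID string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		lb, err := loadbalancers.Get(lbClient, lbID).Extract()
		if err != nil {
			return err
		}
		switch lb.ProvisioningStatus {
		case "ACTIVE":
			return nil
		case "ERROR":
			return fmt.Errorf("loadbalancer %s went into ERROR", lbID)
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timeout waiting for loadbalancer %s to become ACTIVE", lbID)
}
```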

m-bull commented 1 year ago

Hi @dulek - thanks for responding and apologies for the slow reply.

This was really intended as a last resort to get a loadbalancer provider that doesn't support the bulk API working in OCCM - I really wouldn't recommend using this approach on any provider that does support the bulk API, exactly because of the issues you describe!

I'm not sure of the status of the rest of the providers listed here, and whether they support the bulk API, but the most common providers (Amphora and OVN) would never need this enabled.

FWIW, in (limited) testing we see that when you move outside the "compute band" required for Amphorae, to either OVN or (in this case) vmwareedge, loadbalancers provision and update their state comparatively quickly, so we've never seen very long creation times for these types of loadbalancers.

dulek commented 1 year ago

Alright. Let's proceed with this then. While I'm not a fan of maintaining more code, it does make sense. Could you make sure all the functions related to one-by-one creation go into a separate file? loadbalancer.go is super crowded already.

mdbooth commented 1 year ago

Also we need to ensure we have CI coverage of both code paths, or the untested one will rapidly bitrot. If our Octavia CI supports bulk updates, please can we give some thought to how to test non-bulk update in the same environment?

dulek commented 1 year ago

> Also we need to ensure we have CI coverage of both code paths, or the untested one will rapidly bitrot. If our Octavia CI supports bulk updates, please can we give some thought to how to test non-bulk update in the same environment?

Non-bulk is supported by any provider, and the proposed change adds a new use-legacy-api-calls option, so it's as simple as adding a job with it enabled. And while I hate adding options, it's probably a better idea than maintaining a list of Octavia providers that need "legacy" calls ourselves.
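As an illustration, such a CI job would just carry the new option in its OCCM cloud config. The option name here is the one proposed in the PR at this point in the discussion (see the naming concerns below), not a final name:

```ini
[LoadBalancer]
# Proposed (not final) option name: enables one-by-one,
# non-bulk Octavia API calls for providers without the
# fully-populated loadbalancer endpoint.
use-legacy-api-calls = true
```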

mdbooth commented 1 year ago

FWIW I'd like to give that config option a lot more thought: the name is vague and won't make sense the next time we want to support a 'legacy' API call.

m-bull commented 1 year ago

Thanks @mdbooth and @dulek for the helpful comments.

I had a crack at addressing these in the linked PR as the patch needed to be refreshed anyway.

mdbooth commented 1 year ago

@m-bull What's the relationship between this PR and https://github.com/kubernetes/cloud-provider-openstack/pull/2314?

m-bull commented 1 year ago

This issue provides a bit more context on the PR in https://github.com/kubernetes/cloud-provider-openstack/pull/2314.

https://github.com/kubernetes/cloud-provider-openstack/pull/2314 is a refresh and slight refactor of the patch described in https://github.com/kubernetes/cloud-provider-openstack/issues/2271#issue-1749580380 to try and cover off some of the concerns raised here.

mdbooth commented 1 year ago

> This issue provides a bit more context on the PR in #2314.
>
> #2314 is a refresh and slight refactor of the patch described in #2271 (comment) to try and cover off some of the concerns raised here.

Sorry, must have been pre-coffee. Didn't notice that one's an issue and the other's the PR 🤦