kubernetes / cloud-provider-openstack

[occm] LoadBalancer created and linked but service pending #2609

Closed framctr closed 3 months ago

framctr commented 4 months ago

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: OCCM creates the OpenStack load balancer correctly and links it to the ingress Service, but the Kubernetes Service resource remains Pending.

What you expected to happen: The Kubernetes Service resource becomes Ready (i.e. it receives an external IP).

How to reproduce it:

  1. Deploy OCCM on an RKE2 cluster (with Calico) managed by Rancher, using the configuration below
  2. The OCCM pod is Running
  3. The ingress Service stays Pending (see the sketch after this list)
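
For context, a Service of type LoadBalancer stays Pending until the cloud provider populates its status.loadBalancer.ingress field. A minimal sketch of the two states (the IP is a placeholder):

# Pending: the Service status has no ingress entry yet
status:
  loadBalancer: {}

# Ready: OCCM has written the load balancer address into the status
status:
  loadBalancer:
    ingress:
      - ip: 203.0.113.10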

openstack.conf

[Global]
...

[LoadBalancer]
use-octavia=true
manage-security-groups=true
enable-health-monitor=true
floating-network-id=${floating_network_id}
subnet-id=${subnet_id}

Controller Helm values:

controllerExtraArgs: |-
   - "--use-service-account-credentials=false"
   - "--configure-cloud-routes=false"

The OCCM logs:

E0530 09:08:53.120627      11 controller.go:298] error processing service kube-system/rke2-ingress-nginx-controller (retrying with exponential backoff): failed to ensure load balancer: failed when reconciling security groups for LB service kube-system/rke2-ingress-nginx-controller: error getting server ID from the node: ProviderID "rke2://vm-name-redacted" didn't match expected format "openstack://region/InstanceID"
2024-05-30T09:08:53.121019259Z I0530 09:08:53.120718      11 event.go:389] "Event occurred" object="kube-system/rke2-ingress-nginx-controller" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed when reconciling security groups for LB service kube-system/rke2-ingress-nginx-controller: error getting server ID from the node: ProviderID \"rke2://vm-name-redacted\" didn't match expected format \"openstack://region/InstanceID\""

Anything else we need to know?: I'm using a network shared between two tenants. The load balancer and its associated floating IP are created with Terraform scripts, and the ID of the load balancer is passed to the ingress Service through an annotation (see the sketch below).
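
For reference, cloud-provider-openstack documents the loadbalancer.openstack.org/load-balancer-id annotation for pointing a Service at a pre-created Octavia load balancer; a minimal sketch of such a Service (the ID and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: rke2-ingress-nginx-controller
  namespace: kube-system
  annotations:
    # Reuse the Terraform-created Octavia load balancer instead of letting OCCM create a new one
    loadbalancer.openstack.org/load-balancer-id: "00000000-0000-0000-0000-000000000000"  # placeholder UUID
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 80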

If I open all the security groups to any IP, set manage-security-groups=false in the OCCM configuration, and remove the controllerExtraArgs from the Helm values, it works as expected.

Environment:

jichenjc commented 4 months ago

error getting server ID from the node: ProviderID \"rke2://vm-name-redacted\" didn't match expected format \"openstack://region/InstanceID\""

If I open all the security groups to any IP and set manage-security-groups=false in the OCCM configuration

The log above and the info you provided seem unrelated; I'm a bit confused here, as the error you showed is about a ProviderID mismatch, which looks related to the OpenStack instance ID...

dulek commented 3 months ago

Looks like you're running on Rancher and your K8s nodes are Rancher nodes? So it seems you're not running on OpenStack?

framctr commented 3 months ago

I'm using Rancher Manager to deploy RKE2 on OpenStack instances created by Rancher itself. Everything goes fine, but after the RKE2 cluster is created and I deploy the OpenStack CCM with security group management enabled, I get the error described above.

I will do more testing and report back here.

dulek commented 3 months ago

Okay, so it looks like something set the ProviderID on the node to "rke2://vm-name-redacted". As I believe this is the responsibility of the cloud provider, it suggests there's another cloud provider running in the environment, namely the RKE provider. You can't run both at the same time.
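
To illustrate the mismatch: OCCM resolves nodes to Nova instances via spec.providerID, so it expects the OpenStack form rather than the RKE2 one. A sketch of the two values (the UUID is a placeholder; the region segment may be empty):

# Set by the built-in RKE2 provider; OCCM cannot map this to a Nova instance
spec:
  providerID: rke2://vm-name-redacted

# Set by the OpenStack cloud provider when it initializes the node
spec:
  providerID: openstack:///0aa3b1c2-0000-4000-8000-000000000000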

jichenjc commented 3 months ago

I'm using Rancher Manager to deploy RKE2 on OpenStack instances created by Rancher itself.

I think the RKE2 cloud provider should be used instead of OCCM based on this sentence... whoever controls the cloud needs to run the corresponding cloud provider, and in your case that seems to be RKE2.

framctr commented 3 months ago

After some time I came back to this issue and found that it was caused by an incorrect kubelet configuration.

In practice, I had to create the cluster with the RKE2 default cloud provider first and then, to install OCCM, set disable-cloud-controller: true in the cluster configuration. This disabled the RKE2 default cloud provider, but the kubelet was not correctly configured to use OCCM because it was missing the cloud-provider: external argument. For this reason I added the following to the RKE2 configuration:

machine_selector_config {
    config = yamlencode({ cloud-provider-name = "external" })
}
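
For anyone hitting the same problem: with disable-cloud-controller: true in the cluster configuration and the machine selector above, the effective RKE2 settings on the nodes should come out roughly as sketched below (both keys are standard RKE2 config options; this is a sketch, not copied from a live cluster):

# /etc/rancher/rke2/config.yaml (sketch)
disable-cloud-controller: true   # server nodes: turn off the built-in RKE2 cloud controller
cloud-provider-name: external    # all nodes: kubelet starts with --cloud-provider=external so OCCM can initialize them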