kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

How to ignore MetalLB trying to provision CPEM LoadBalancer? #389

Open Lirt opened 1 year ago

Lirt commented 1 year ago

Hello,

This is a rather complicated issue, but I'll try to explain it as simply as I can.

I have the standard LoadBalancer service provisioned by CPEM:

k get svc
cloud-provider-equinix-metal-kubernetes-external       LoadBalancer

I use MetalLB to provision additional LoadBalancer services - currently just one, ingress-nginx-caas-controller, as a test case.

My problem is that MetalLB watches the cloud-provider-equinix-metal-kubernetes-external service by default and fights with CPEM over updates to it. This is easy to see: as soon as I start the MetalLB controller, the cloud-provider-equinix-metal-kubernetes-external service changes to this (see <pending>):

$ k get svc
NAME                                                   TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
cloud-provider-equinix-metal-kubernetes-external       LoadBalancer   172.26.85.165    <pending>     443:32557/TCP            49d

Here is the service description, including the latest events, which show that MetalLB is actually making changes to this service:

Name:                     cloud-provider-equinix-metal-kubernetes-external
Namespace:                kube-system
Labels:                   <none>
Annotations:              metallb.universe.tf/address-pool: disabled-metallb-do-not-use-any-address-pool
Selector:                 <none>
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.26.85.165
IPs:                      172.26.85.165
IP:                       <REDACTED>
Port:                     https  443/TCP
TargetPort:               6443/TCP
NodePort:                 https  32557/TCP
Endpoints:                10.68.53.131:6443,10.68.53.137:6443,10.68.53.139:6443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                Age                From                Message
  ----     ------                ----               ----                -------
  Normal   EnsuringLoadBalancer  44m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   44m                service-controller  Ensured load balancer
  Normal   EnsuringLoadBalancer  35m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   35m                service-controller  Ensured load balancer
  Normal   EnsuringLoadBalancer  17m                service-controller  Ensuring load balancer
  Normal   EnsuredLoadBalancer   17m                service-controller  Ensured load balancer
  Warning  AllocationFailed      84s (x3 over 84s)  metallb-controller  Failed to allocate IP for "kube-system/cloud-provider-equinix-metal-kubernetes-external": ["<REDACTED>"] is not allowed in config

EQX support told us we make 15k IP assignments per day. This is most likely caused by the situation described above.

So I wanted to use a new MetalLB feature (0.13) that sets the loadBalancerClass MetalLB watches - https://github.com/metallb/metallb/blob/77923bc823294f2f31e68193901efa3b30faea59/controller/main.go. You simply pass --lb-class my-lb-class.

MetalLB stops updating cloud-provider-equinix-metal-kubernetes-external as expected. This is good.

But then CPEM no longer sees events for services that have a loadBalancerClass. When I create or delete a service that sets loadBalancerClass, nothing happens in CPEM.

After long troubleshooting I found out that this behavior is defined in the ServiceController that CPEM uses and is expected - please see this code.
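To make the interaction easier to follow, here is a minimal Python sketch of the two selection rules described above. It is a simplified model, not the actual Go code of either project: the `Service` class and both function names are hypothetical, and the rules are assumptions based on the behavior reported in this issue (the cloud-provider service controller skips any service that sets loadBalancerClass, while MetalLB started with --lb-class only handles services of that class).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Hypothetical stand-in for the few Service fields that matter here."""
    name: str
    type: str = "LoadBalancer"
    load_balancer_class: Optional[str] = None  # models spec.loadBalancerClass


def cloud_provider_handles(svc: Service) -> bool:
    # Assumed rule: the cloud-provider service controller only reconciles
    # LoadBalancer services that do NOT set a loadBalancerClass.
    return svc.type == "LoadBalancer" and svc.load_balancer_class is None


def metallb_handles(svc: Service, lb_class: Optional[str] = None) -> bool:
    # Assumed rule: without --lb-class MetalLB takes services that have no
    # class; with --lb-class it takes only services of exactly that class.
    return svc.type == "LoadBalancer" and svc.load_balancer_class == lb_class


cpem_svc = Service("cloud-provider-equinix-metal-kubernetes-external")
ingress_svc = Service("ingress-nginx-caas-controller",
                      load_balancer_class="my-lb-class")

for lb_class in (None, "my-lb-class"):
    for svc in (cpem_svc, ingress_svc):
        print(f"--lb-class={lb_class} {svc.name}: "
              f"cloud-provider={cloud_provider_handles(svc)} "
              f"metallb={metallb_handles(svc, lb_class)}")
```

Under these assumptions, without --lb-class both controllers claim the CPEM service (the fight over updates), and with --lb-class my-lb-class the ingress service is handled by MetalLB only, so CPEM never sees it.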

Now :smile: seeing that these two controllers don't work well together, my question is: do you have a recommended way to make this setup work correctly without DoS-ing your API? Or can you point me to where I'm making a mistake, if I am?

I understand that this part of the code is very unlikely to change. If MetalLB had decided to use just an annotation to ignore a service, all would be good :smiley: but they used an attribute that causes the cloud-provider library to ignore the service.

The issue is easy to reproduce - here is an example of a service I create (this service goes unnoticed by CPEM):

---
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-caas-controller
  namespace: kube-system
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: true
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  loadBalancerClass: my-lb-class
  ports:
  - appProtocol: http
    name: http
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx

Note: Tested with the latest main (https://github.com/equinix/cloud-provider-equinix-metal/pull/386). I think this issue was also present before and is not related to recent changes.

cprivitere commented 1 year ago

@Lirt I thought the 15K IP assignments per day were due to bug #380. Are you still making that many assignments after the fix for #380 was installed?

Lirt commented 1 year ago

Hmmm, it's hard to tell which one caused the IP assignment DoS. But in our case, this one is the reason the service stays pending forever (it disappears after I stop metallb-controller). I don't think I have a way to see how many requests are being made right now...

You could check the counters again in a day (or check the current rate, if that helps).

cprivitere commented 1 year ago

Thanks @Lirt. We've done some checking and validated that the actual cause of the error was on our API's side. No fixes to CPEM resolved it, and you're not causing any additional assignments right now.

I appreciate that you're trying to leverage LoadBalancerClass to avoid accidentally triggering this again, but this particular issue can't actually be prevented this way. It was truly on the Equinix Metal API side of things.

What we CAN do is implement better rate limiting and error handling, and that's something we've targeted to do for CPEM, but I don't have a timeframe for when it would be done.
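As an illustration of what client-side rate limiting could look like, here is a minimal token-bucket sketch in Python. This is not CPEM's implementation and nothing in this thread specifies how CPEM would do it; the `TokenBucket` class, its parameters, and the numbers are all hypothetical.

```python
class TokenBucket:
    """Hypothetical limiter: allows bursts up to `burst`, refills at `rate`/s."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per permitted request.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One request per second on average, with bursts of at most two.
bucket = TokenBucket(rate=1.0, burst=2)
decisions = [bucket.allow(t) for t in (0.0, 0.1, 0.2, 1.5)]
print(decisions)  # [True, True, False, True]
```

The timestamp is passed in explicitly so the behavior is deterministic; a real caller would pass something like `time.monotonic()` and skip or delay the API call whenever `allow` returns False.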

If you're still interested in using LoadBalancerClass, we can continue to look at how to make CPEM interact with it better and not run into this issue.

Lirt commented 1 year ago

Thank you for the help.

This is not that important for us as long as it's not causing you internal trouble. My impression was that this was causing a high number of IP assignment requests, but if not, then all is good.

So right now the only thing that is "off" is a cosmetic issue - the external IP of the Service stays in the <pending> state.

cloud-provider-equinix-metal-kubernetes-external       LoadBalancer   172.26.85.165    <pending>       443:32557/TCP                49d

cprivitere commented 1 year ago

Understood. Even if it's just a cosmetic issue, knowing that you're going to continue using LoadBalancerClass helps us prioritize this versus other issues when we consider what to fix next. Thank you.

k8s-triage-robot commented 10 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

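This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage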

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

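This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage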

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

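This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage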

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 8 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/389#issuecomment-2005590506):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cprivitere commented 6 months ago

/reopen

k8s-ci-robot commented 6 months ago

@cprivitere: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/389#issuecomment-2110615300):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

cprivitere commented 6 months ago

/triage accepted