kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

MetalLB ConfigMap not updated when node IP changes #412

Open · TimJones opened this issue 1 year ago

TimJones commented 1 year ago

We recently ran into an issue using Equinix CCM v3.5.0 with MetalLB v0.12.1: the MetalLB ConfigMap did not have the correct IP address for one of the nodes in the cluster, and the CCM never updated it to correct the misconfiguration.

The node in question had an internal IP of 10.68.104.15:

❯ kubectl --context adaptmx-DC get node omni-c3-medium-x86-2 -o wide
NAME                   STATUS   ROLES          AGE     VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
omni-c3-medium-x86-2   Ready    loadbalancer   3d22h   v1.26.1   10.68.104.15   147.75.51.245   Talos (v1.3.6)   5.15.102-talos   containerd://1.6.18

But for some reason the peer entries for that node in the MetalLB ConfigMap listed a different source address:

apiVersion: v1
kind: ConfigMap
metadata:
  name: equinix-metallb
  namespace: sidero-ingress
data:
  config: |
    peers:
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.1
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.2
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""

This was causing the MetalLB speaker pod on that node to fail to connect to its BGP peers, and therefore the node was not receiving traffic for the BGP LoadBalancer addresses:

❯ kubectl --context adaptmx-DC -n sidero-ingress logs dc-metallb-speaker-hbf5d
{"caller":"level.go:63","error":"dial \"169.254.255.2:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.2:179","peerASN":65530,"ts":"2023-05-16T11:45:42.880991863Z"}
{"caller":"level.go:63","error":"dial \"169.254.255.1:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.1:179","peerASN":65530,"ts":"2023-05-16T11:45:42.881017161Z"}

When I manually deleted the peers entries for the host from the ConfigMap and restarted the CCM, it regenerated the peers with the correct configuration:

❯ kubectl --context adaptmx-DC -n kube-system logs cloud-provider-equinix-metal-vvztg
I0516 13:09:46.183404       1 serving.go:348] Generated self-signed cert in-memory
W0516 13:09:46.574009       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0516 13:09:46.574466       1 config.go:201] authToken: '<masked>'
I0516 13:09:46.574474       1 config.go:201] projectID: '83005521-48d5-4eae-bf4c-25c0f7d0fe97'
I0516 13:09:46.574477       1 config.go:201] load balancer config: 'metallb:///sidero-ingress/equinix-metallb'
I0516 13:09:46.574480       1 config.go:201] metro: ''
I0516 13:09:46.574484       1 config.go:201] facility: 'dc13'
I0516 13:09:46.574487       1 config.go:201] local ASN: '65000'
I0516 13:09:46.574490       1 config.go:201] Elastic IP Tag: ''
I0516 13:09:46.574493       1 config.go:201] API Server Port: '0'
I0516 13:09:46.574496       1 config.go:201] BGP Node Selector: ''
I0516 13:09:46.574535       1 controllermanager.go:145] Version: v3.5.0
I0516 13:09:46.575652       1 secure_serving.go:210] Serving securely on [::]:10258
I0516 13:09:46.575739       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0516 13:09:46.575864       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0516 13:10:03.423819       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0516 13:10:03.423965       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="omni-m3-small-x86-0_71489971-0cd0-474e-8316-e08565c637cd became leader"
I0516 13:10:03.831437       1 eip_controlplane_reconciliation.go:71] EIP Tag is not configured skipping control plane endpoint management.
I0516 13:10:04.182808       1 loadbalancers.go:86] loadbalancer implementation enabled: metallb
I0516 13:10:04.182841       1 cloud.go:98] Initialize of cloud provider complete
I0516 13:10:04.183323       1 controllermanager.go:301] Started "cloud-node"
I0516 13:10:04.183404       1 node_controller.go:157] Sending events to api server.
I0516 13:10:04.183562       1 node_controller.go:166] Waiting for informer caches to sync
I0516 13:10:04.183583       1 controllermanager.go:301] Started "cloud-node-lifecycle"
I0516 13:10:04.183714       1 node_lifecycle_controller.go:113] Sending events to api server
I0516 13:10:04.184011       1 controllermanager.go:301] Started "service"
I0516 13:10:04.184290       1 controller.go:241] Starting service controller
I0516 13:10:04.184345       1 shared_informer.go:255] Waiting for caches to sync for service
I0516 13:10:04.285005       1 shared_informer.go:262] Caches are synced for service
I0516 13:10:04.285429       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })
I0516 13:10:27.052320       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
I0516 13:10:32.721346       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"

We confirmed that the regenerated peer entries in the MetalLB ConfigMap were correct and that the node was then able to handle traffic again, but I would expect the CCM to detect this kind of configuration drift and update the ConfigMap on the fly.
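For reference, the manual workaround described above boils down to something like the following; the ConfigMap, namespace, and CCM pod name are the ones from this cluster, so treat it as a sketch and substitute your own:

# 1. Remove the stale peer entries for the affected node from the MetalLB ConfigMap
kubectl --context adaptmx-DC -n sidero-ingress edit configmap equinix-metallb
#    (delete the peers items whose node-selectors match kubernetes.io/hostname: omni-c3-medium-x86-2)

# 2. Restart the CCM so it regenerates the peers from the node's current addresses
kubectl --context adaptmx-DC -n kube-system delete pod cloud-provider-equinix-metal-vvztg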

cprivitere commented 1 year ago

@ctreatma @displague Did we get the fix for this for free with the changes we made in 3.6.1 to properly support generating the peers for MetalLB <= 0.12.1?

displague commented 1 year ago

Coming from 3.5.0, I'm not sure that the 3.6.0->3.6.1 fix would be a factor.

This EM API 500 error line (in the working state, after manually updating the peers) is suspicious. I wonder if this could have been preventing the config from being automatically updated:

E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })

@TimJones was this present in the logs before your manual update?
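(For anyone checking whether the same error appears in their own CCM logs before any manual changes, something like this should surface it; the pod name is the one from the log snippet above, so adjust it for your install:)

kubectl --context adaptmx-DC -n kube-system logs cloud-provider-equinix-metal-vvztg \
  | grep 'could not ensure BGP enabled'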

displague commented 1 year ago

https://github.com/equinix/cloud-provider-equinix-metal/issues/198#issuecomment-1116002323 could be related too ("Other than at startup,...")

TimJones commented 1 year ago

@TimJones was this present in the logs before your manual update?

@displague Not as far as I saw. That error was only logged after manually modifying the ConfigMap & restarting the CCM, and only the once. I've rechecked the logs and it hasn't appeared again, though the CCM hasn't logged anything at all since then.

cprivitere commented 1 year ago

Current thinking:

My main confusion point is HOW this happened. If this is just the result of a service moving to another node, which is something that can routinely happen in Kubernetes, then why isn't this happening to folks all the time?

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 7 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/412#issuecomment-2010278610):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cprivitere commented 7 months ago

/remove-lifecycle rotten

cprivitere commented 7 months ago

/reopen

k8s-ci-robot commented 7 months ago

@cprivitere: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/412#issuecomment-2010476779):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

cprivitere commented 4 months ago

/remove-lifecycle stale

cprivitere commented 4 months ago

/triage accepted