kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

MetalLB ConfigMap not updated when node IP changes #412

Open · TimJones opened this issue 1 year ago

TimJones commented 1 year ago

We recently ran into an issue using Equinix CCM v3.5.0 with MetalLB v0.12.1: the MetalLB ConfigMap did not have the correct IP address for one of the nodes in the cluster, and the CCM never updated it to correct the misconfiguration.

The node in question had an internal IP of 10.68.104.15:

❯ kubectl --context adaptmx-DC get node omni-c3-medium-x86-2 -o wide
NAME                   STATUS   ROLES          AGE     VERSION   INTERNAL-IP    EXTERNAL-IP     OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
omni-c3-medium-x86-2   Ready    loadbalancer   3d22h   v1.26.1   10.68.104.15   147.75.51.245   Talos (v1.3.6)   5.15.102-talos   containerd://1.6.18

But for some reason the peer entries for that node in the MetalLB ConfigMap listed a different source address:

apiVersion: v1
kind: ConfigMap
metadata:
  name: equinix-metallb
  namespace: sidero-ingress
data:
  config: |
    peers:
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.1
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""
    - my-asn: 65000
      peer-asn: 65530
      peer-address: 169.254.255.2
      peer-port: 0
      source-address: 10.68.104.109
      hold-time: ""
      router-id: ""
      node-selectors:
      - match-labels:
          kubernetes.io/hostname: omni-c3-medium-x86-2
        match-expressions: []
      - match-labels:
          nomatch.metal.equinix.com/service-name: envoy
          nomatch.metal.equinix.com/service-namespace: sidero-ingress
        match-expressions: []
      password: ""

This was causing the MetalLB speaker pod on that node to fail to connect to its BGP peers, and therefore the node was not receiving traffic for the BGP LoadBalancer addresses:

❯ kubectl --context adaptmx-DC -n sidero-ingress logs dc-metallb-speaker-hbf5d
{"caller":"level.go:63","error":"dial \"169.254.255.2:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.2:179","peerASN":65530,"ts":"2023-05-16T11:45:42.880991863Z"}
{"caller":"level.go:63","error":"dial \"169.254.255.1:179\": Address \"10.68.104.109\" doesn't exist on this host","level":"error","localASN":65000,"msg":"failed to connect to peer","op":"connect","peer":"169.254.255.1:179","peerASN":65530,"ts":"2023-05-16T11:45:42.881017161Z"}

When I manually deleted the peers entries for the host from the ConfigMap and restarted the CCM, it regenerated the peers with the correct configuration:

❯ kubectl --context adaptmx-DC -n kube-system logs cloud-provider-equinix-metal-vvztg
I0516 13:09:46.183404       1 serving.go:348] Generated self-signed cert in-memory
W0516 13:09:46.574009       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0516 13:09:46.574466       1 config.go:201] authToken: '<masked>'
I0516 13:09:46.574474       1 config.go:201] projectID: '83005521-48d5-4eae-bf4c-25c0f7d0fe97'
I0516 13:09:46.574477       1 config.go:201] load balancer config: 'metallb:///sidero-ingress/equinix-metallb'
I0516 13:09:46.574480       1 config.go:201] metro: ''
I0516 13:09:46.574484       1 config.go:201] facility: 'dc13'
I0516 13:09:46.574487       1 config.go:201] local ASN: '65000'
I0516 13:09:46.574490       1 config.go:201] Elastic IP Tag: ''
I0516 13:09:46.574493       1 config.go:201] API Server Port: '0'
I0516 13:09:46.574496       1 config.go:201] BGP Node Selector: ''
I0516 13:09:46.574535       1 controllermanager.go:145] Version: v3.5.0
I0516 13:09:46.575652       1 secure_serving.go:210] Serving securely on [::]:10258
I0516 13:09:46.575739       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0516 13:09:46.575864       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cloud-controller-manager...
I0516 13:10:03.423819       1 leaderelection.go:258] successfully acquired lease kube-system/cloud-controller-manager
I0516 13:10:03.423965       1 event.go:294] "Event occurred" object="kube-system/cloud-controller-manager" fieldPath="" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="omni-m3-small-x86-0_71489971-0cd0-474e-8316-e08565c637cd became leader"
I0516 13:10:03.831437       1 eip_controlplane_reconciliation.go:71] EIP Tag is not configured skipping control plane endpoint management.
I0516 13:10:04.182808       1 loadbalancers.go:86] loadbalancer implementation enabled: metallb
I0516 13:10:04.182841       1 cloud.go:98] Initialize of cloud provider complete
I0516 13:10:04.183323       1 controllermanager.go:301] Started "cloud-node"
I0516 13:10:04.183404       1 node_controller.go:157] Sending events to api server.
I0516 13:10:04.183562       1 node_controller.go:166] Waiting for informer caches to sync
I0516 13:10:04.183583       1 controllermanager.go:301] Started "cloud-node-lifecycle"
I0516 13:10:04.183714       1 node_lifecycle_controller.go:113] Sending events to api server
I0516 13:10:04.184011       1 controllermanager.go:301] Started "service"
I0516 13:10:04.184290       1 controller.go:241] Starting service controller
I0516 13:10:04.184345       1 shared_informer.go:255] Waiting for caches to sync for service
I0516 13:10:04.285005       1 shared_informer.go:262] Caches are synced for service
I0516 13:10:04.285429       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })
I0516 13:10:27.052320       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuredLoadBalancer" message="Ensured load balancer"
I0516 13:10:32.721346       1 event.go:294] "Event occurred" object="sidero-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"

We confirmed that the regenerated peer entries in the MetalLB ConfigMap were correct and that the node was then able to handle traffic again, but I would expect the CCM to detect this kind of configuration drift and update the ConfigMap on the fly.
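For reference, the manual workaround described above boils down to something like the following; the ConfigMap, namespace, and CCM pod name are the ones from this cluster, so treat it as a sketch and substitute your own:

# 1. Remove the stale peer entries for the affected node from the MetalLB ConfigMap
kubectl --context adaptmx-DC -n sidero-ingress edit configmap equinix-metallb
#    (delete the peers items whose node-selectors match kubernetes.io/hostname: omni-c3-medium-x86-2)

# 2. Restart the CCM so it regenerates the peers from the node's current addresses
kubectl --context adaptmx-DC -n kube-system delete pod cloud-provider-equinix-metal-vvztg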

cprivitere commented 1 year ago

@ctreatma @displague Did we get the fix for this for free with the changes we made in 3.6.1 to properly support generating the peers for MetalLB <= 0.12.1?

displague commented 1 year ago

Coming from 3.5.0, I'm not sure that the 3.6.0->3.6.1 fix would be a factor.

This EM API 500 error line (in the working state, after manually updating the peers) is suspicious. I wonder if this could have been preventing the config from being automatically updated:

E0516 13:10:25.485873       1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })

@TimJones was this present in the logs before your manual update?
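(For anyone checking whether the same error appears in their own CCM logs before any manual changes, something like this should surface it; the pod name is the one from the log snippet above, so adjust it for your install:)

kubectl --context adaptmx-DC -n kube-system logs cloud-provider-equinix-metal-vvztg \
  | grep 'could not ensure BGP enabled'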

displague commented 1 year ago

https://github.com/equinix/cloud-provider-equinix-metal/issues/198#issuecomment-1116002323 could be related too ("Other than at startup,...")

TimJones commented 1 year ago

@TimJones was this present in the logs before your manual update?

@displague Not as far as I saw. That error was only logged after manually modifying the ConfigMap & restarting the CCM, and only the once. I've rechecked the logs and it hasn't appeared again, though the CCM hasn't logged anything at all since then.

cprivitere commented 1 year ago

Current thinking:

My main confusion point is HOW this happened. If this is just the result of a service moving to another node, which is something that can routinely happen in Kubernetes, then why isn't this happening to folks all the time?

k8s-triage-robot commented 9 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 8 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 7 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/412#issuecomment-2010278610):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

cprivitere commented 7 months ago

/remove-lifecycle rotten

cprivitere commented 7 months ago

/reopen

k8s-ci-robot commented 7 months ago

@cprivitere: Reopened this issue.

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/412#issuecomment-2010476779):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

cprivitere commented 4 months ago

/remove-lifecycle stale

cprivitere commented 4 months ago

/triage accepted