TimJones opened this issue 1 year ago
@ctreatma @displague Did we get the fix for this for free with the changes we made in 3.6.1 to properly support generating the peers for MetalLB <= 0.12.1?
Coming from 3.5.0, I'm not sure the 3.6.0->3.6.1 fix would be a factor.
This EM API 500 error line (in the working state, after manually updating the peers) is suspicious. I wonder if this could have been preventing the config from being automatically updated:
E0516 13:10:25.485873 1 loadbalancers.go:471] could not ensure BGP enabled for node omni-c3-medium-x86-2: %!w(*packngo.ErrorResponse=&{0xc0004af830 [Oh snap, something went wrong! We've logged the error and will take a look - please reach out to us if you continue having trouble.] })
@TimJones was this present in the logs before your manual update?
https://github.com/equinix/cloud-provider-equinix-metal/issues/198#issuecomment-1116002323 could be related too ("Other than at startup,...")
@TimJones was this present in the logs before your manual update?
@displague Not as far as I saw. That error was only logged after manually modifying the ConfigMap & restarting the CCM, and only once. I've rechecked the logs and it hasn't appeared again, though the CCM hasn't logged anything at all since then either.
Current thinking:
My main confusion point is HOW this happened. If this is just a result of a service moving to another node, which is something that can happen in k8s (resources can move between nodes), then why isn't this happening to folks all the time?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/remove-lifecycle rotten
/reopen
@cprivitere: Reopened this issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
/triage accepted
We recently ran into an issue with using Equinix CCM v3.5.0 with MetalLB v0.12.1 where the ConfigMap did not have the correct IP address for a node in the cluster and was not updated to correct the misconfiguration. The node in question had an address of 10.68.104.15, but for some reason the peers entries in the MetalLB ConfigMap had another IP.
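For reference, this is roughly the shape of a peers entry in MetalLB's legacy ConfigMap format (the metallb-system/config ConfigMap that the CCM manages for MetalLB <= 0.12.x). The ASNs, peer address, and the stale address below are illustrative placeholders rather than our actual values, and I'm assuming the node's address is the one carried in source-address:

```yaml
# Sketch of a CCM-managed peers stanza (MetalLB <= 0.12.x legacy config format).
# All ASN/address values here are placeholders for illustration only.
peers:
- my-asn: 65000                 # placeholder node ASN
  peer-asn: 65530               # placeholder peer ASN
  peer-address: 169.254.255.1   # placeholder BGP neighbor address
  source-address: 10.68.104.99  # stale placeholder -- the node's real address is 10.68.104.15
  node-selectors:
  - match-labels:
      kubernetes.io/hostname: omni-c3-medium-x86-2   # hostname taken from the error log above
```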
This was causing the MetalLB speaker pod on that node to fail, and therefore not receive traffic for the BGP LoadBalancer addresses.
When I manually deleted the peers entries for the host from the ConfigMap and restarted the CCM, it regenerated the peers with the correct configuration.
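After the restart, the regenerated entry pointed at the node's real address, along these lines (same placeholder ASNs and peer address as in the sketch above):

```yaml
peers:
- my-asn: 65000
  peer-asn: 65530
  peer-address: 169.254.255.1
  source-address: 10.68.104.15  # now matches the node's actual address
  node-selectors:
  - match-labels:
      kubernetes.io/hostname: omni-c3-medium-x86-2
```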
We confirmed that the peers entries were correct in the MetalLB ConfigMap and that the node was then able to handle traffic, but I would expect this to be handled by the CCM and updated on the fly when configuration drift is detected.