schrodit opened this issue 11 months ago
Hi @schrodit! Is this causing your cluster and/or the applications running in it to malfunction in some way, or is the log message the only observable outcome? Did this issue happen after upgrading CPEM in an existing cluster, or is this happening in a new cluster that was set up from the beginning with the software versions listed in the issue description?
Hey @ctreatma ,
Did this issue happen after upgrading CPEM in an existing cluster, or is this happening in a new cluster that was set up from the beginning with the software versions listed in the issue description?
CPEM and MetalLB were updated from MetalLB v0.9.5 and Equinix CCM v3.3.0.
Is this causing your cluster and/or the applications running in it to malfunction in some way, or is the log message the only observable outcome?
One issue is that the BGP routes of some nodes are not updated correctly. Only half of the nodes serve the IP according to the Metal console, and fewer and fewer machines are serving the domain. I think all nodes should serve the domain.
All speaker pods where no routes are assigned have the same (or similar) error message:
{"caller":"level.go:63","event":"nodeLabelsChanged","level":"info","msg":"Node labels changed, resyncing BGP peers","ts":"2023-07-17T13:45:53.944909421Z"}
{"caller":"level.go:63","configmap":"metallb-system/config","error":"peer #25 already exists","event":"configStale","level":"error","msg":"config (re)load failed, config marked stale","ts":"2023-07-17T13:45:53.948508927Z"}
None of them is serving the IP.
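For context on what the speaker is rejecting here, a minimal sketch (a hypothetical model of MetalLB's behavior, not its actual source code) of the configmap parsing step that yields the "peer #N already exists" error: each peer is checked against those already parsed, and an identical entry aborts the whole (re)load, leaving the config marked stale.

```python
# Hypothetical model of MetalLB's peer-list validation (not the real code):
# an entry identical to one already seen fails the whole config reload.
def parse_peers(raw_peers):
    peers = []
    for i, peer in enumerate(raw_peers):
        if peer in peers:
            raise ValueError(f"peer #{i} already exists")
        peers.append(peer)
    return peers

# Distinct peers parse fine:
parse_peers([{"peer-address": "10.0.0.1"}, {"peer-address": "10.0.0.2"}])

# A duplicated entry (e.g. left over from an upgrade) fails the reload:
try:
    parse_peers([{"peer-address": "10.0.0.1"}, {"peer-address": "10.0.0.1"}])
except ValueError as err:
    print(err)  # peer #1 already exists
```

This would explain why one stale duplicate in the configmap is enough to keep a speaker from announcing any routes at all, not just the duplicated one.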
Could you provide the config for the other peers as well? Are there 2 identical peer configs in metallb-system/config?
I think it's best if I provide the whole configmap:
We have now had to roll back completely to Equinix CCM v3.3.0 and MetalLB v0.9.5, as all BGP routes had been deleted.
After the rollback everything was fine again.
Hey @schrodit, a few questions for you:
For @ctreatma: it looks like the configmap they ended up with had 48 entries with 12 unique hostnames, so four entries each.
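A quick way to confirm that kind of duplication in a dumped configmap is to tally peers per hostname. A sketch under stated assumptions: the hostnames and the parsed structure below are hypothetical stand-ins shaped like a MetalLB 0.12 `peers` list; the real data would come from `kubectl -n metallb-system get cm config` run through a YAML parser.

```python
from collections import Counter

# Hypothetical parsed "peers" list mirroring the reported shape:
# 48 entries covering only 12 unique hostnames (4 copies each).
peers = [
    {"node-selectors": [{"match-labels": {"kubernetes.io/hostname": f"worker-{i % 12:02d}"}}]}
    for i in range(48)
]

def hostname(peer):
    """Pull the hostname a peer entry is pinned to via its node selector."""
    return peer["node-selectors"][0]["match-labels"]["kubernetes.io/hostname"]

counts = Counter(hostname(p) for p in peers)
print(len(peers), len(counts))       # 48 entries, 12 unique hostnames
print(sorted(set(counts.values())))  # [4] -> four entries per node
```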
Hey @cprivitere ,
Do you have the config map for CPEM that you tried using?
We do not use a configmap but rather the env var approach. Our env vars are:
env:
  - name: METAL_API_KEY
    valueFrom:
      secretKeyRef:
        key: apiToken
        name: cloudprovider
  - name: METAL_PROJECT_ID
    valueFrom:
      secretKeyRef:
        key: projectID
        name: cloudprovider
  - name: METAL_METRO_NAME
    value: fr
  - name: METAL_LOAD_BALANCER
    value: metallb:///
Do you have what the metallb config map looked like before the upgrade?
Yes, like it currently does (because we had to roll back):
Have you tried letting CPEM rebuild the configmap for MetalLB? (Obviously you'd want to do this in a testing environment to see how hard it is to kick the cluster into rebuilding that configmap for you.)
I think it did that when the new CCM was started. First we used MetalLB 0.9.5 with CCM 3.6.2, which resulted in an error in MetalLB (source-address was unknown). After updating MetalLB, the config was parsed correctly, but the described error occurred.
Do you already have a planned timeframe for getting to MetalLB 0.13 or later? If so, we may be able to side-step troubleshooting this given the 0.13 versions use a CRD based setup and the work of getting to 0.12.1 won't help with that.
We are not tied to version 0.12.1, but I guess the CRD approach needs more testing, and the question is whether it will work better. I tried to reproduce the issue in a fresh environment, but at least there it worked.
Maybe another side note on our setup: we have two services of type LoadBalancer configured, one of which has externalTrafficPolicy: Local. Not sure if that makes a difference.
Do you have an idea how we ended up here, and how we can upgrade? If you say the CRD approach is the safest way, we could do that (but currently, as I understand it, the configmap approach is the default, so the stable one?). Are there any things we need to consider while updating, or is it mostly updating MetalLB and then configuring the CCM as described here?
Because I tried to reproduce the issue in a fresh environment, but at least there it worked.
Are you saying that this works properly in a fresh environment, so the issue here has to do with upgrading? We don't have a lot of testing or feedback about the upgrade process. I'm wondering if you need to upgrade in steps: MetalLB minor versions until you reach an issue, then CPEM minor versions until you hit an issue, then back to MetalLB.
Is it possible to instead plan to build a fresh cluster and migrate the applications? I feel like a fresh cluster with 3.6.2 and MetalLB 0.13.X would be an easier route.
Are you saying that this works properly in a fresh environment so the issue here has to do with upgrading?
It works in my simple test, but I'm not sure how to reproduce the situation from our prod cluster.
Is it possible to instead plan to build a fresh cluster and migrate the applications?
This is unfortunately not possible. We have customers running on that cluster and cannot simply migrate them.
@cprivitere do you have an idea why the CCM generates an invalid MetalLB config? Your suggested way forward is to upgrade MetalLB, then upgrade the Equinix CCM. What if there are still issues?
@schrodit Still working on reproducing this and figuring out how best to guide you. One thing that did come up in my testing: between CPEM 3.3.0 and 3.6.2 we changed it from running as a Deployment to running as a DaemonSet. Did you possibly have both the old and new version running at the same time? How did you go about stopping the old Deployment and starting the new DaemonSet?
Ok, we made some attempts to reproduce this, and while we can't get the exact same issue you ran into, we did see several issues come up depending on how one tries to upgrade. Here's what we've found to be the best way to upgrade from MetalLB 0.9.5 and CPEM 3.3.0 to MetalLB 0.12.1 and CPEM 3.6.2. The biggest finding was that the MetalLB configmap needs to be cleaned up to get rid of older entries that aren't formatted properly; CPEM won't overwrite these, so the only option is to delete them.
This assumes your metallb lives in the metallb-system namespace and the metallb configmap is named "config".
# Remove current Metal LB
kubectl delete -f https://raw.githubusercontent.com/metallb/metallb/v0.9.5/manifests/metallb.yaml
# Remove current CPEM
kubectl delete -f https://github.com/equinix/cloud-provider-equinix-metal/releases/download/v3.3.0/deployment.yaml
# Remove current MetalLB configmap (CPEM will regenerate it)
kubectl -n metallb-system delete cm config
# Install current Metal LB
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.12.1/manifests/metallb.yaml
# Install current CPEM
kubectl apply -f https://github.com/equinix/cloud-provider-equinix-metal/releases/download/v3.6.2/deployment.yaml
Are you able to test this out and see if the load balancers come up following your upgrade?
Hi @schrodit, complementing @cprivitere's comment, and for other people seeking this information: during the tests we found that MetalLB 0.9.5 doesn't work with CPEM 3.6.2, so upgrading both is necessary to ensure it works well. We also noticed that it is possible to upgrade them without deleting the configmap. However, we still recommend deleting it when possible: when we performed multiple upgrades/rollbacks between versions where changes were made to some data types and fields, the entries were interpreted as different, and we ended up with duplicated fields like those in the config you shared.
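One way to picture how those duplicates accumulate, as a hedged illustration (the field names and all values below are made up, apart from source-address, which came up earlier in this thread): if a newer CPEM emits a peer with an extra field that the older generator omitted, the two entries compare unequal even though they describe the same BGP session, so both can survive an upgrade/rollback cycle.

```python
# Peer entry as an older CPEM might have written it (hypothetical values):
old_peer = {"peer-address": "169.254.255.1", "peer-asn": 65530, "my-asn": 65000}

# The same session as a newer CPEM might write it, with an added field:
new_peer = {**old_peer, "source-address": "10.0.0.2"}

# Field-by-field comparison treats them as two different peers, so a
# merge without cleanup keeps both entries in the configmap:
print(old_peer == new_peer)  # False
merged = [old_peer, new_peer]
print(len(merged))           # 2
```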
@cprivitere @ocobles thanks for the input. Cleaning up the configmap seems to be the best option for us. We will try that out and let you know.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
/triage accepted
@cprivitere: Reopened this issue.
MetalLB in our Kubernetes cluster throws this error on startup:
the respective peer config is:
In the Equinix Metal console I also see that we have 2 nodes without a learned BGP route (might be related).
Environment:
- K8s: 1.22.17
- Equinix CCM: v3.6.2
- MetalLB: 0.12.1