kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

CPEM requires Kubernetes node name to match Equinix Metal device name #533

Open hh opened 2 months ago

hh commented 2 months ago

I'm not sure where to set providerID. I don't remember setting it in the past. Any suggestions?

CPEM daemonset

kubectl -n kube-system describe ds cloud-provider-equinix-metal
Name:           cloud-provider-equinix-metal
Selector:       app=cloud-provider-equinix-metal
Node-Selector:  <none>
Labels:         app=cloud-provider-equinix-metal
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 3
Number of Nodes Scheduled with Available Pods: 3
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=cloud-provider-equinix-metal
  Service Account:  cloud-provider-equinix-metal
  Containers:
   cloud-provider-equinix-metal:
    Image:      quay.io/equinix-oss/cloud-provider-equinix-metal:v3.8.0
    Port:       <none>
    Host Port:  <none>
    Command:
      ./cloud-provider-equinix-metal
      --cloud-provider=equinixmetal
      --leader-elect=true
      --authentication-skip-lookup=true
      --cloud-config=/etc/cloud-sa/cloud-sa.json
    Requests:
      cpu:        100m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /etc/cloud-sa from cloud-sa-volume (ro)
  Volumes:
   cloud-sa-volume:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         metal-cloud-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
Events:                 <none>

cloud-sa.json

kubectl -n kube-system get secret metal-cloud-config -o json | jq '.data["cloud-sa.json"]' -r | base64 -d | jq .
{
  "apiKey": "XXXXXXXXX",
  "projectID": "82b5c425-8dd4-429e-ae0d-d32f265c63e4",
  "metro": "sv",
  "eipTag": "eip-apiserver-sharingio",
  "eipHealthCheckUseHostIP": true,
  "loadBalancer": "metallb:///metallb-system?crdConfiguration=true"
}

CPEM logs

kubectl -n kube-system logs ds/cloud-provider-equinix-metal | tail -10
Found 3 pods, using pod/cloud-provider-equinix-metal-bl7nh
I0417 16:37:29.152076       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.94.175:6443/healthz
E0417 16:37:29.157164       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
I0417 16:37:29.157191       1 eip_controlplane_reconciliation.go:125] handling update, node: shining-ant
I0417 16:37:29.389548       1 eip_controlplane_reconciliation.go:529] doHealthCheck(): no control plane IP assignment found, trying to assign to an available controlplane node
I0417 16:37:29.399453       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.94.167:6443/healthz
E0417 16:37:29.405800       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
I0417 16:37:29.405833       1 eip_controlplane_reconciliation.go:125] handling update, node: trusty-marmot
I0417 16:37:29.675037       1 eip_controlplane_reconciliation.go:529] doHealthCheck(): no control plane IP assignment found, trying to assign to an available controlplane node
I0417 16:37:29.683583       1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.82.49:6443/healthz
E0417 16:37:29.689076       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string

cprivitere commented 2 months ago

You shouldn't be setting providerID, that's something CPEM sets for you. Why it's not setting it here though, that's the real question. Hmm.

We had this part working in the work we did before kubecon, do you still have access to that config? Probably something we had to disable on the talos side.
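
To see which nodes are affected by the "providerID cannot be empty string" error, one option is to inspect `spec.providerID` on each node. The following is a minimal sketch, not part of CPEM: it assumes the output of `kubectl get nodes -o json` is piped in on stdin, and the helper name `nodesMissingProviderID` is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// nodeList mirrors only the fields of a Kubernetes NodeList that this check needs.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			ProviderID string `json:"providerID"`
		} `json:"spec"`
	} `json:"items"`
}

// nodesMissingProviderID (hypothetical helper) returns the names of all nodes
// in the given NodeList JSON whose spec.providerID is empty or absent.
func nodesMissingProviderID(data []byte) ([]string, error) {
	var list nodeList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var missing []string
	for _, n := range list.Items {
		if n.Spec.ProviderID == "" {
			missing = append(missing, n.Metadata.Name)
		}
	}
	return missing, nil
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	if len(data) == 0 {
		fmt.Fprintln(os.Stderr, "usage: kubectl get nodes -o json | go run main.go")
		return
	}
	missing, err := nodesMissingProviderID(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, "decode error:", err)
		os.Exit(1)
	}
	// Print one node name per line; no output means every node has a providerID.
	for _, name := range missing {
		fmt.Println(name)
	}
}
```

Run it as `kubectl get nodes -o json | go run main.go`. Any node name it prints is one that the cloud provider has not yet initialized with a providerID.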

hh commented 2 months ago

It should be noted that it's also not clearing a taint I suspect it's responsible for: https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/531

hh commented 2 months ago

I have another open issue related to the /healthz check: https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/519

hh commented 2 months ago

Lively conversation happening in the #support channel on the Talos / Sidero Slack: https://taloscommunity.slack.com/archives/CMARMBC4E/p1712793108556169

Seems it might be related to the deviceByName function fallback wanting the Kubernetes node names to match the Equinix Metal device names exactly.

Possibly? https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/main/metal/devices.go#L165-L167
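
To illustrate why an exact name match would matter here, below is a simplified sketch of a by-name lookup. It is not the CPEM implementation: the `Device` type, the `deviceByExactName` function, and the device hostname are hypothetical stand-ins. The node names are the ones from the logs above.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a hypothetical, trimmed-down stand-in for an Equinix Metal device record.
type Device struct {
	ID       string
	Hostname string
}

var errNoMatch = errors.New("no device with matching hostname")

// deviceByExactName (illustrative only) returns the device whose hostname is
// byte-for-byte equal to the Kubernetes node name. There is no fuzzy matching,
// so a node registered under a different name than its device is never found.
func deviceByExactName(devices []Device, nodeName string) (Device, error) {
	for _, d := range devices {
		if d.Hostname == nodeName {
			return d, nil
		}
	}
	return Device{}, errNoMatch
}

func main() {
	// Assumed inventory: the device hostname differs from the name the node registered with.
	devices := []Device{{ID: "device-1", Hostname: "cp-1.example.com"}}

	for _, node := range []string{"shining-ant", "cp-1.example.com"} {
		d, err := deviceByExactName(devices, node)
		if err != nil {
			fmt.Printf("%s: %v\n", node, err)
			continue
		}
		fmt.Printf("%s: matched device %s\n", node, d.ID)
	}
}
```

Under this assumption, a node that registers as `shining-ant` while its device carries a different hostname never resolves to a device, which would be consistent with the providerID staying empty.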

hh commented 2 months ago

Going to try setting machine.kubelet.registerWithFQDN: true in the Talos configuration.

hh commented 2 months ago

I found a workaround, but it was a bit difficult to find.

https://github.com/sharingio/infra/commit/96bff1f14010b050670e8760b538e706ef3da336

This might be a one-off, but it might make sense to take some steps to raise visibility so others don't get stuck on this in the future: