CPEM requires Kubernetes node name to match Equinix Metal device name

hh commented 7 months ago

I'm not sure where to set providerID. I don't remember setting it in the past. Any suggestions?

CPEM daemonset

kubectl -n kube-system describe ds cloud-provider-equinix-metal
Name:           cloud-provider-equinix-metal
Selector:       app=cloud-provider-equinix-metal
Node-Selector:  <none>
Labels:         app=cloud-provider-equinix-metal
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 3
Number of Nodes Scheduled with Available Pods: 3
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=cloud-provider-equinix-metal
  Service Account:  cloud-provider-equinix-metal
  Containers:
   cloud-provider-equinix-metal:
    Image:      quay.io/equinix-oss/cloud-provider-equinix-metal:v3.8.0
    Port:       <none>
    Host Port:  <none>
    Command:
      ./cloud-provider-equinix-metal
      --cloud-provider=equinixmetal
      --leader-elect=true
      --authentication-skip-lookup=true
      --cloud-config=/etc/cloud-sa/cloud-sa.json
    Requests:
      cpu:        100m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /etc/cloud-sa from cloud-sa-volume (ro)
  Volumes:
   cloud-sa-volume:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         metal-cloud-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
Events:                 <none>

cloud-sa.json

kubectl -n kube-system get secret metal-cloud-config -o json | jq '.data["cloud-sa.json"]' -r | base64 -d | jq .
{
  "apiKey": "XXXXXXXXX",
  "projectID": "82b5c425-8dd4-429e-ae0d-d32f265c63e4",
  "metro": "sv",
  "eipTag": "eip-apiserver-sharingio",
  "eipHealthCheckUseHostIP": true,
  "loadBalancer": "metallb:///metallb-system?crdConfiguration=true"
}
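
For sharing a config like this in an issue, the decode step can also redact the key automatically. A minimal sketch, assuming the Secret has been saved as JSON (as `kubectl get secret ... -o json` produces); the function name and the example Secret are hypothetical, not part of CPEM:

```python
import base64
import json


def decode_cloud_config(secret: dict, key: str = "cloud-sa.json") -> dict:
    """Decode one JSON document stored in a Kubernetes Secret and hide the API key."""
    # Secret values under .data are base64-encoded.
    raw = base64.b64decode(secret["data"][key])
    config = json.loads(raw)
    if "apiKey" in config:
        config["apiKey"] = "REDACTED"
    return config


# Hypothetical Secret object for illustration only.
example_secret = {
    "data": {
        "cloud-sa.json": base64.b64encode(
            json.dumps({"apiKey": "not-a-real-key", "metro": "sv"}).encode()
        ).decode()
    }
}
print(decode_cloud_config(example_secret))
```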

CPEM logs

kubectl -n kube-system logs ds/cloud-provider-equinix-metal | tail -10
Found 3 pods, using pod/cloud-provider-equinix-metal-bl7nh
I0417 16:37:29.152076       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.94.175:6443/healthz
E0417 16:37:29.157164       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
I0417 16:37:29.157191       1 eip_controlplane_reconciliation.go:125] handling update, node: shining-ant
I0417 16:37:29.389548       1 eip_controlplane_reconciliation.go:529] doHealthCheck(): no control plane IP assignment found, trying to assign to an available controlplane node
I0417 16:37:29.399453       1 eip_controlplane_reconciliation.go:249] healthcheck node https://139.178.94.167:6443/healthz
E0417 16:37:29.405800       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
I0417 16:37:29.405833       1 eip_controlplane_reconciliation.go:125] handling update, node: trusty-marmot
I0417 16:37:29.675037       1 eip_controlplane_reconciliation.go:529] doHealthCheck(): no control plane IP assignment found, trying to assign to an available controlplane node
I0417 16:37:29.683583       1 eip_controlplane_reconciliation.go:249] healthcheck node https://145.40.82.49:6443/healthz
E0417 16:37:29.689076       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string
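
The same error repeats for every node, which is easier to see when the log is summarized. A rough sketch that parses klog-style lines (severity letter, date, time, thread ID, `file:line]`, message) and counts the error messages; the helper names are made up for illustration:

```python
import re
from collections import Counter

# klog header: Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg
KLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+(\d+) ([^:\s]+):(\d+)\] (.*)$"
)


def parse_klog_line(line: str):
    """Return the parsed fields of a klog line, or None if it is not one."""
    match = KLOG_LINE.match(line)
    if match is None:
        return None
    severity, date, time, thread, source, lineno, message = match.groups()
    return {
        "severity": severity,
        "date": date,
        "time": time,
        "thread": int(thread),
        "source": source,
        "line": int(lineno),
        "message": message,
    }


def count_errors(lines):
    """Count how often each error-severity message appears."""
    parsed = (parse_klog_line(line) for line in lines)
    return Counter(p["message"] for p in parsed if p and p["severity"] == "E")


sample = [
    "Found 3 pods, using pod/cloud-provider-equinix-metal-bl7nh",
    "I0417 16:37:29.157191       1 eip_controlplane_reconciliation.go:125] handling update, node: shining-ant",
    "E0417 16:37:29.157164       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string",
    "E0417 16:37:29.405800       1 eip_controlplane_reconciliation.go:135] failed to handle node health check: failed to assign the control plane endpoint: providerID cannot be empty string",
]
for message, count in count_errors(sample).items():
    print(count, message)
```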

cprivitere commented 7 months ago

You shouldn't be setting providerID; that's something CPEM sets for you. Why it's not setting it here, though, is the real question. Hmm.
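
To confirm which nodes are affected, the `spec.providerID` field of each Node object can be checked. A minimal sketch, assuming the node list has been saved as JSON (for example from `kubectl get nodes -o json`); the function name is hypothetical, the second node name is made up, and the `equinixmetal://` value with a zero UUID is only a placeholder:

```python
def nodes_missing_provider_id(node_list: dict) -> list:
    """Return the names of nodes whose spec.providerID is absent or empty."""
    missing = []
    for node in node_list.get("items", []):
        provider_id = node.get("spec", {}).get("providerID", "")
        if not provider_id:
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical node list for illustration only.
example_nodes = {
    "items": [
        {"metadata": {"name": "shining-ant"}, "spec": {}},
        {
            "metadata": {"name": "example-node"},
            "spec": {"providerID": "equinixmetal://00000000-0000-0000-0000-000000000000"},
        },
    ]
}
print(nodes_missing_provider_id(example_nodes))
```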

We had this part working in the work we did before KubeCon. Do you still have access to that config? It was probably something we had to disable on the Talos side.

hh commented 7 months ago

It should be noted that it's also not clearing a taint I suspect it's responsible for: https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/531

hh commented 7 months ago

I have another open issue related to the /healthz check: https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/519

hh commented 7 months ago

Lively conversation happening in the #support channel on the Talos / Sidero Slack: https://taloscommunity.slack.com/archives/CMARMBC4E/p1712793108556169

It seems this might be related to the deviceByName fallback, which wants the Kubernetes node names to match the Equinix Metal device names exactly.

Possibly? https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/blob/main/metal/devices.go#L165-L167
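
To illustrate why an exact-match fallback is fragile, here is a simplified sketch of the idea. This is not the CPEM code (which is Go); the function and the host names are hypothetical, and it only assumes the lookup compares the node name to the device hostname with no normalization:

```python
from typing import Optional


def match_device_by_name(node_name: str, device_hostnames: list) -> Optional[str]:
    """Return the device hostname that equals the node name exactly, if any."""
    for hostname in device_hostnames:
        if hostname == node_name:
            return hostname
    return None


# Hypothetical names: the device carries a fully qualified hostname, the node a short one.
devices = ["shining-ant.example.internal", "trusty-marmot.example.internal"]
print(match_device_by_name("shining-ant", devices))
print(match_device_by_name("shining-ant.example.internal", devices))
```

With a strict comparison like this, a short node name never matches a fully qualified device name, so no device is found and no providerID can be derived from it.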

hh commented 7 months ago

Going to try setting machine.kubelet.registerWithFQDN: true in the Talos configuration.
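
For reference, that setting can be written as a small machine config patch. A sketch that only renders the patch document (JSON is also valid YAML); the function name is hypothetical, how the patch is applied is left out, and nothing in this thread confirms that this setting alone resolves the mismatch:

```python
import json


def render_fqdn_patch() -> str:
    """Render a patch that sets machine.kubelet.registerWithFQDN to true."""
    patch = {"machine": {"kubelet": {"registerWithFQDN": True}}}
    return json.dumps(patch, indent=2)


print(render_fqdn_patch())
```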

hh commented 7 months ago

I found a workaround, but it was a bit difficult to find.

https://github.com/sharingio/infra/commit/96bff1f14010b050670e8760b538e706ef3da336

This might be a one-off, but it might make sense to take some steps to raise visibility so others don't get stuck on this in the future:

The CPEM error message should clearly state why the match could not occur, and possibly link to documentation.
The CPEM documentation should clearly state that Kubernetes node names must match Equinix Metal device names.
The Talos documentation should probably state something similar in an updated Equinix Metal integration page.

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 months ago

/lifecycle rotten

k8s-triage-robot commented 2 months ago

/close not-planned

k8s-ci-robot commented 2 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

