kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0

CPEM should clear node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule ? #531

Closed: hh closed this issue 5 months ago

hh commented 5 months ago

I'm having some trouble with CPEM initializing and clearing the node.cloudprovider taint:

$ kubectl describe nodes  | grep Taints
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
Taints:             node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule

So I thought I'd run a pod to verify the Equinix metadata endpoint:

kubectl run -n kube-system -i --restart=Never --rm ubuntu --image=ubuntu --overrides='{"kind":"Pod", "apiVersion":"v1", "spec": {"hostNetwork": true}}'

But the pod can't be scheduled on any of the nodes:

kubectl describe pods -n kube-system ubuntu | grep -A5 Events:
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  3s    default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

It seems this should probably be handled by CPEM, but for debugging I went ahead and removed the taints:

kubectl taint nodes --all node.cloudprovider.kubernetes.io/uninitialized:NoSchedule-
node/allowing-collie untainted
node/evident-jackal untainted
node/loved-raccoon untainted
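A less invasive option for this kind of debugging, which was not tried in this thread, would be to leave the taint in place and give the debug pod a matching toleration. The sketch below reuses the `kubectl run` command from above and only extends the `--overrides` JSON; the `OVERRIDES` variable and the `debug_pod` function name are made up for illustration.

```shell
# Assumption: tolerating the uninitialized taint is enough for the pod to schedule.
# "operator": "Exists" matches the taint key regardless of its value ("true" here).
OVERRIDES='{"kind":"Pod","apiVersion":"v1","spec":{"hostNetwork":true,"tolerations":[{"key":"node.cloudprovider.kubernetes.io/uninitialized","operator":"Exists","effect":"NoSchedule"}]}}'

# Hypothetical wrapper: same kubectl run invocation as above, with the toleration added.
debug_pod() {
  kubectl run -n kube-system -i --restart=Never --rm ubuntu --image=ubuntu \
    --overrides="$OVERRIDES"
}
```

Calling `debug_pod` against the cluster should then start the pod without touching the node taints, so whatever CPEM does with them later is not masked by the manual removal.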

I was able to untaint the nodes, and grab all the bond / network interface information here:

https://gist.github.com/hh/58254edaf9836c2f8433fc9ba102e75a#linkspec
https://gist.github.com/hh/58254edaf9836c2f8433fc9ba102e75a#curl-via-k8s

cprivitere commented 5 months ago

Like I said in the other issue, this is all stuff we worked through in the pre-KubeCon work, which I believe was related to Talos. Do you still have access to that config?

hh commented 5 months ago

I'll try again soon. We managed to get past what turned out to be a "Metal device name must match the Kubernetes node name" issue:

https://github.com/kubernetes-sigs/cloud-provider-equinix-metal/issues/533#issuecomment-2062104366
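A name mismatch like this can be spotted by comparing the two lists of names. The following is a minimal sketch, not part of CPEM: `node_names` and `unmatched_nodes` are hypothetical helpers, and the device hostnames are assumed to have been collected separately (for example from the Equinix Metal console) as a newline-separated list.

```shell
# Hypothetical helper: list Kubernetes node names, one per line.
node_names() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Print every node name that has no device with exactly the same hostname.
#   $1: newline-separated Kubernetes node names
#   $2: newline-separated Equinix Metal device hostnames (assumed input)
unmatched_nodes() {
  printf '%s\n' "$1" | while IFS= read -r node; do
    # Skip blank lines in the node list.
    [ -n "$node" ] || continue
    # -F fixed string, -x whole-line match, -q quiet: exact name comparison.
    printf '%s\n' "$2" | grep -Fxq -- "$node" || printf '%s\n' "$node"
  done
}
```

For example, `unmatched_nodes "$(node_names)" "$(cat devices.txt)"` would print the nodes for which no device with the same name exists, where `devices.txt` is an assumed file holding the device hostnames.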

cprivitere commented 5 months ago

So this issue is really the other issue?

hh commented 5 months ago

Yes, I'll close this one. Once the node is recognized, CPEM clears the taint.
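To confirm that the taint has been cleared on every node, the taint keys can be listed per node and filtered. This is a sketch under assumptions: `list_node_taints` and `uninitialized_nodes` are hypothetical helper names, and the filter expects one line per node in the form `<node-name> <taint-key> <taint-key> ...`.

```shell
# Hypothetical helper: print "<node-name> <taint-key>..." for each node.
list_node_taints() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.taints[*].key}{"\n"}{end}'
}

# Read those lines on stdin and print the nodes that still carry the
# cloud provider "uninitialized" taint.
uninitialized_nodes() {
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i == "node.cloudprovider.kubernetes.io/uninitialized") { print $1; break }
  }'
}
```

Running `list_node_taints | uninitialized_nodes` should print nothing once every node has been recognized and initialized.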