Open jeefy opened 4 months ago
Which CCM version?
/area cluster-autoscaler
@jeefy The URLs being fetched suggest that the Device ID was not known and not included in the URL. The URLs we would expect to see autoscaler calling:
- `POST /metal/v1/projects/{project_id}/devices`: create a device with a predefined hostname and tags
- `GET /metal/v1/projects/{project_id}/devices`: list devices to find the host with the predefined hostname and tags
- `GET /metal/v1/devices/{device_id}`: get the precise device by id
- `DELETE /metal/v1/devices/{device_id}`: delete the unneeded device by id

Some problems I see in this implementation:
- `getEquinixMetalDevice` should error early when no `id` was given
- `listMetalDevices` should handle paginated results, the servers expected may not be on the first page (https://deploy.equinix.com/developers/api/metal/#tag/Devices/operation/findProjectDevices)
- `devices` struct should include `errors` and `error` (either may be returned for non 2xx requests) and `meta` for pagination

While the above would improve handling, I don't see any obvious ways that an empty `ID` could have snuck into the fetched device list.
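To illustrate the pagination and error-field points above, here is a minimal Go sketch, not the autoscaler's actual code. The type names, the `PageFetcher` hook, and the `meta` field names (`current_page`, `last_page`, `total`) are assumptions made for illustration; check them against the Equinix Metal API documentation before relying on them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Device holds only the fields this sketch needs.
type Device struct {
	ID       string   `json:"id"`
	Hostname string   `json:"hostname"`
	Tags     []string `json:"tags"`
}

// Meta carries pagination info. Field names are assumed, not confirmed.
type Meta struct {
	CurrentPage int `json:"current_page"`
	LastPage    int `json:"last_page"`
	Total       int `json:"total"`
}

// DevicesPage is one page of a device listing. A non-2xx response body
// may populate Errors or Error instead of Devices.
type DevicesPage struct {
	Devices []Device `json:"devices"`
	Meta    Meta     `json:"meta"`
	Errors  []string `json:"errors"`
	Error   string   `json:"error"`
}

// apiError returns a non-nil error if the body reported a failure.
func (p *DevicesPage) apiError() error {
	if len(p.Errors) > 0 {
		return fmt.Errorf("API errors: %s", strings.Join(p.Errors, "; "))
	}
	if p.Error != "" {
		return fmt.Errorf("API error: %s", p.Error)
	}
	return nil
}

// PageFetcher is a hypothetical hook that returns the raw JSON body
// for a 1-based page number (for example, by issuing the list GET).
type PageFetcher func(page int) ([]byte, error)

// listAllDevices walks every page and returns the combined device list.
func listAllDevices(fetch PageFetcher) ([]Device, error) {
	var all []Device
	for page := 1; ; page++ {
		body, err := fetch(page)
		if err != nil {
			return nil, fmt.Errorf("fetching page %d: %w", page, err)
		}
		var p DevicesPage
		if err := json.Unmarshal(body, &p); err != nil {
			return nil, fmt.Errorf("decoding page %d: %w", page, err)
		}
		if err := p.apiError(); err != nil {
			return nil, err
		}
		all = append(all, p.Devices...)
		// Stop on the last page; a missing meta block (LastPage == 0)
		// is treated as a single-page result.
		if page >= p.Meta.LastPage {
			return all, nil
		}
	}
}
```

Taking the fetcher as a parameter keeps the paging loop testable without a network call; a real implementation would pass a closure that performs the HTTP request with the page number as a query parameter.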
Where it appears possible for `getEquinixMetalDevice` to be called with an empty `id` is if `equinixMetalManager.NodeGroupForNode` was called without an id. This only seems possible if the `providerID` (`equinixmetal://{device_uuid}`) on the `Node` was not present: https://github.com/kubernetes/autoscaler/blob/c8e47217692c1fe70f53f4841a3b83b70cc0e878/cluster-autoscaler/cloudprovider/equinixmetal/cloud_provider.go#L111-L126
This function could also be improved. When `node.Spec.ProviderID` is empty, it should return an error.
This is typically set by `cloud-provider-equinix-metal` (CPEM), a deployment requirement for this autoscaler.
Note, older versions of CPEM used a `packet://{device_uuid}` `providerID` which is no longer supported by autoscaler.
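As a rough sketch of the validation suggested above (not the code in `cloud_provider.go`), the provider ID could be checked before any device lookup. The function names `deviceIDFromProviderID` and `devicePath` and the error messages are hypothetical; only the `equinixmetal://` and `packet://` prefixes and the device URL path come from this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	providerIDPrefix = "equinixmetal://"
	legacyPrefix     = "packet://" // older CPEM format, no longer supported
)

// errEmptyProviderID signals that the Node has not been initialized by CPEM.
var errEmptyProviderID = errors.New("node has no providerID set")

// deviceIDFromProviderID extracts the device UUID from a providerID of the
// form equinixmetal://{device_uuid}, returning an error for anything else.
func deviceIDFromProviderID(providerID string) (string, error) {
	if providerID == "" {
		return "", errEmptyProviderID
	}
	if strings.HasPrefix(providerID, legacyPrefix) {
		return "", fmt.Errorf("legacy providerID %q is not supported, expected prefix %q",
			providerID, providerIDPrefix)
	}
	if !strings.HasPrefix(providerID, providerIDPrefix) {
		return "", fmt.Errorf("unrecognized providerID %q", providerID)
	}
	id := strings.TrimPrefix(providerID, providerIDPrefix)
	if id == "" {
		return "", fmt.Errorf("providerID %q has no device id", providerID)
	}
	return id, nil
}

// devicePath builds the path used to GET or DELETE a device by id and
// refuses an empty id, so a request to /metal/v1/devices/ is never sent.
func devicePath(deviceID string) (string, error) {
	if deviceID == "" {
		return "", errors.New("device id must not be empty")
	}
	return "/metal/v1/devices/" + deviceID, nil
}
```

With checks like these, a Node lacking a provider ID would surface as an explicit error rather than as a request for a device URL with no ID in it.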
@cprivitere
> Which CCM version?

quay.io/equinix-oss/cloud-provider-equinix-metal:v3.8.1
@displague
> Where it appears possible for `getEquinixMetalDevice` to be called with an empty id is if `equinixMetalManager.NodeGroupForNode` was called without an id.

Is it possible this is because there is a control-plane and single worker node without a providerId set? IIRC the CCM documentation says that an uninitialized node would have the providerId set,
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- `lifecycle/stale` is applied
- `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- `lifecycle/rotten` was applied, the issue is closed

You can:
- `/remove-lifecycle stale`
- `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
Which component are you using?: Cluster Autoscaler (Equinix)
What version of the component are you using?: 1.30.2 (but have the same behavior going back to 1.29.x)
Component version:
What k8s version are you using (`kubectl version`)?:
`kubectl version` Output
What environment is this in?:
Equinix
What did you expect to happen?:
Cluster Autoscaler would auto-scale a defined node pool.
What happened instead?:
Repeats infinitely
How to reproduce it (as minimally and precisely as possible):
Spin up a k3s control plane in Equinix (ubuntu_22_04), then install the CCM and autoscaler per the directions in the Equinix folders/repo.
Anything else we need to know?:
Happy to try anything or output other logs if needed. :) For the record, I ran into this with a `kubeadm`-managed cluster as well, so I don't believe this is a k3s-specific issue.