Equinix Cluster Autoscaler repeats 404 message, does not scale up/down

jeefy commented 4 months ago

Which component are you using?: Cluster Autoscaler (Equinix)

What version of the component are you using?: 1.30.2 (but have the same behavior going back to 1.29.x)

Component version:

What k8s version are you using (kubectl version)?:

➜  ~ kubectl version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5+k3s1

What environment is this in?:

Equinix

What did you expect to happen?:

Cluster Autoscaler would auto-scale a defined node pool.

What happened instead?:

I0719 18:33:40.945752       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 741.677162ms
E0719 18:33:47.460265       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:33:57.668251       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:07.876043       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:18.074806       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:28.375637       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:38.560910       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:48.797565       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:59.136517       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:35:09.330305       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:35:19.545328       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found

This repeats indefinitely.

How to reproduce it (as minimally and precisely as possible):

Spin up a k3s control plane on Equinix (ubuntu_22_04), then install the CCM and the autoscaler following the directions in the Equinix folders/repo.

Anything else we need to know?:

Happy to try anything or output other logs if needed. :) For the record, I ran into this with a kubeadm-managed cluster as well, so I don't believe this is a k3s-specific issue.

cprivitere commented 4 months ago

Which CCM version?

adrianmoisey commented 4 months ago

/area cluster-autoscaler

displague commented 4 months ago

@jeefy The URL being fetched suggests that the device ID was not known and was left out of the URL. These are the calls we would expect to see the autoscaler make:

POST /metal/v1/projects/{project_id}/devices: create a device with a predefined hostname and tags
GET /metal/v1/projects/{project_id}/devices: list devices to find the host with the predefined hostname and tags
GET /metal/v1/devices/{device_id}: get the precise device by id
DELETE /metal/v1/devices/{device_id}: delete the unneeded device by id
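To illustrate the diagnosis, here is a minimal Go sketch of how an empty device ID can collapse the per-device path into the bare /metal/v1/devices path seen in the logs. The helper names devicePath and devicePathChecked are hypothetical, and the use of path.Join is an assumption; the autoscaler may build its URLs differently.

```go
package main

import (
	"errors"
	"fmt"
	"path"
)

// devicesBase is the per-device endpoint prefix from the list above.
const devicesBase = "/metal/v1/devices"

// devicePath (hypothetical) joins the ID onto the base without checking it.
// path.Join drops empty elements, so an empty ID silently yields the base path.
func devicePath(id string) string {
	return path.Join(devicesBase, id)
}

// devicePathChecked (hypothetical) refuses to build a URL for an empty ID.
func devicePathChecked(id string) (string, error) {
	if id == "" {
		return "", errors.New("device id is empty; refusing to build a device URL")
	}
	return path.Join(devicesBase, id), nil
}

func main() {
	fmt.Println(devicePath(""))    // /metal/v1/devices
	fmt.Println(devicePath("abc")) // /metal/v1/devices/abc

	if _, err := devicePathChecked(""); err != nil {
		fmt.Println("error:", err)
	}
}
```

With the checked variant, the failure would surface as a local error naming the missing ID rather than as a 404 from the API.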

Some problems I see in this implementation:

getEquinixMetalDevice should error early when no id is given.
listMetalDevices should handle paginated results; the expected servers may not be on the first page (https://deploy.equinix.com/developers/api/metal/#tag/Devices/operation/findProjectDevices).
The devices struct should include errors and error fields (either may be returned for non-2xx responses) and a meta field for pagination.

While the above would improve handling, I don't see any obvious ways that an empty ID could have snuck into the fetched device list.

Where it appears possible for getEquinixMetalDevice to be called with an empty id is if equinixMetalManager.NodeGroupForNode was called without an id. This only seems possible if the providerID (equinixmetal://{device_uuid}) on the Node was not present: https://github.com/kubernetes/autoscaler/blob/c8e47217692c1fe70f53f4841a3b83b70cc0e878/cluster-autoscaler/cloudprovider/equinixmetal/cloud_provider.go#L111-L126

This function could also be improved. When node.Spec.ProviderID is empty, it should return an error.

This is typically set by cloud-provider-equinix-metal (CPEM), a deployment requirement for this autoscaler.

Note that older versions of CPEM used a packet://{device_uuid} providerID, which is no longer supported by the autoscaler.

jeefy commented 4 months ago

@cprivitere

> Which CCM version?

quay.io/equinix-oss/cloud-provider-equinix-metal:v3.8.1

@displague

> Where it appears possible for getEquinixMetalDevice to be called with an empty id is if equinixMetalManager.NodeGroupForNode was called without an id.

Could this be because there is a control-plane node and a single worker node without a providerId set? IIRC, the CCM documentation says that an uninitialized node would have the providerId set.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 week ago

/lifecycle rotten

kubernetes/autoscaler, issue #7073