kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Equinix Cluster Autoscaler repeats 404 message, does not scale up/down #7073

Open jeefy opened 1 month ago

jeefy commented 1 month ago

Which component are you using?: Cluster Autoscaler (Equinix)

What version of the component are you using?: 1.30.2 (but have the same behavior going back to 1.29.x)

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
➜  ~ kubectl version
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.5+k3s1

What environment is this in?:

Equinix

What did you expect to happen?:

Cluster Autoscaler would auto-scale a defined node pool.

What happened instead?:

I0719 18:33:40.945752       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 741.677162ms
E0719 18:33:47.460265       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:33:57.668251       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:07.876043       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:18.074806       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:28.375637       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:38.560910       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:48.797565       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:34:59.136517       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:35:09.330305       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 
E0719 18:35:19.545328       1 static_autoscaler.go:380] Failed to get node infos for groups: could not find group for node:  GET https://api.equinix.com/metal/v1/devices: 404 Not found 

This repeats indefinitely.

How to reproduce it (as minimally and precisely as possible):

Spin up a k3s control plane in Equinix (ubuntu_22_04), then install the CCM and the autoscaler per the directions in the Equinix folders/repo.

Anything else we need to know?:

Happy to try anything or output other logs if needed. :) For the record, I ran into this with a kubeadm-managed cluster as well, so I don't believe this is a k3s-specific issue.

cprivitere commented 1 month ago

Which CCM version?

adrianmoisey commented 1 month ago

/area cluster-autoscaler

displague commented 1 month ago

@jeefy The URLs being fetched suggest that the Device ID was not known and not included in the URL. The URLs we would expect to see autoscaler calling:

Some problems I see in this implementation:

While the above would improve handling, I don't see any obvious ways that an empty ID could have snuck into the fetched device list.
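
To illustrate how an empty device ID could produce the URL seen in the logs, here is a minimal Go sketch. It assumes the request path is assembled with path.Join, which drops empty elements; deviceURL, baseURL, and the placeholder ID are hypothetical and are not taken from the autoscaler source.

```go
package main

import (
	"fmt"
	"path"
)

// baseURL matches the host seen in the log output above.
const baseURL = "https://api.equinix.com"

// deviceURL is a hypothetical helper (not the autoscaler's real code).
// It assumes the path is built with path.Join, which ignores empty elements,
// so an empty id collapses the URL to the device collection endpoint.
func deviceURL(id string) string {
	return baseURL + "/" + path.Join("metal/v1", "devices", id)
}

func main() {
	// Placeholder ID for illustration only.
	fmt.Println(deviceURL("example-device-id"))
	// With an empty ID, the trailing segment disappears entirely.
	fmt.Println(deviceURL(""))
}
```

Under that assumption, an empty ID yields https://api.equinix.com/metal/v1/devices, which is the same URL that returns 404 in the log lines above.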

Where it appears possible for getEquinixMetalDevice to be called with an empty id is if equinixMetalManager.NodeGroupForNode was called without an id. This only seems possible if the providerID (equinixmetal://{device_uuid}) on the Node was not present: https://github.com/kubernetes/autoscaler/blob/c8e47217692c1fe70f53f4841a3b83b70cc0e878/cluster-autoscaler/cloudprovider/equinixmetal/cloud_provider.go#L111-L126

This function could also be improved. When node.Spec.ProviderID is empty, it should return an error.

The providerID is typically set by cloud-provider-equinix-metal (CPEM), which is a deployment requirement for this autoscaler.


Note: older versions of CPEM used a packet://{device_uuid} providerID, which is no longer supported by the autoscaler.
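
A rough sketch of the stricter providerID handling suggested above, in Go. This is not the current autoscaler code: deviceIDFromProviderID, the constants, and the error messages are hypothetical, and only the two providerID schemes come from this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	// Scheme currently expected on Node.Spec.ProviderID, per this thread.
	providerIDPrefix = "equinixmetal://"
	// Scheme used by older CPEM releases, no longer supported by the autoscaler.
	legacyProviderIDPrefix = "packet://"
)

// errEmptyProviderID is a hypothetical sentinel error for nodes with no providerID.
var errEmptyProviderID = errors.New("node has no providerID set")

// deviceIDFromProviderID is a hypothetical helper. It returns the device UUID
// portion of a providerID, or an error instead of silently returning "".
func deviceIDFromProviderID(providerID string) (string, error) {
	if providerID == "" {
		return "", errEmptyProviderID
	}
	if strings.HasPrefix(providerID, legacyProviderIDPrefix) {
		return "", fmt.Errorf("providerID %q uses the unsupported %s scheme", providerID, legacyProviderIDPrefix)
	}
	if !strings.HasPrefix(providerID, providerIDPrefix) {
		return "", fmt.Errorf("providerID %q does not start with %s", providerID, providerIDPrefix)
	}
	id := strings.TrimPrefix(providerID, providerIDPrefix)
	if id == "" {
		return "", fmt.Errorf("providerID %q has no device ID after the prefix", providerID)
	}
	return id, nil
}

func main() {
	// Placeholder providerIDs for illustration only.
	for _, p := range []string{"equinixmetal://example-device-id", "packet://example-device-id", ""} {
		id, err := deviceIDFromProviderID(p)
		fmt.Printf("providerID=%q id=%q err=%v\n", p, id, err)
	}
}
```

Returning an error here would let the caller report the node that lacks a providerID, instead of issuing a device lookup with an empty ID.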

jeefy commented 1 month ago

@cprivitere

Which CCM version?

quay.io/equinix-oss/cloud-provider-equinix-metal:v3.8.1

@displague

Where it appears possible for getEquinixMetalDevice to be called with an empty id is if equinixMetalManager.NodeGroupForNode was called without an id.

Is it possible this is because there is a control-plane and single worker node without a providerId set? IIRC the CCM documentation says that an uninitialized node would have the providerId set,