kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Autoscaler auto discover fails with sagemaker hyperpod #7540

Open Arthurhussey opened 4 days ago

Arthurhussey commented 4 days ago

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.30

What k8s version are you using (kubectl version)?:

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.6-eks-7f9249a

What environment is this in?:

AWS/EKS

What did you expect to happen?:

The autoscaler pod should scale the ASGs as required, even when a SageMaker HyperPod cluster is attached.

ASG scaling

What happened instead?:

No autoscaling takes place, and I can see these errors in the logs:

E1128 22:19:38.487185       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed
I1128 22:19:48.487306       1 static_autoscaler.go:306] Starting main loop
E1128 22:19:48.488169       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-00f55e8b5be774bd7
I1128 22:19:58.489198       1 static_autoscaler.go:306] Starting main loop
E1128 22:19:58.490443       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0a42a231e263cac0c
I1128 22:20:08.491624       1 static_autoscaler.go:306] Starting main loop
E1128 22:20:08.492629       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0059feb421f5b03ed
I1128 22:20:18.493453       1 static_autoscaler.go:306] Starting main loop
E1128 22:20:18.494944       1 static_autoscaler.go:387] Failed to get node infos for groups: wrong id: expected format aws:///<zone>/<name>, got aws:///usw1-az3/sagemaker/cluster/hyperpod-4lluwz86unnw-i-0cfbeeb3654698d80

How to reproduce it (as minimally and precisely as possible):

1. Set up cluster-autoscaler with auto-discovery. Discovery and autoscaling work well at this point.
2. Add an AWS SageMaker HyperPod EKS cluster.
3. The error logs shown above start to appear and no autoscaling takes place.

Anything else we need to know?:

adrianmoisey commented 3 days ago

/area cluster-autoscaler