Open jackfrancis opened 9 months ago
/assign
cc @elmiko
i think we need to make sure that if the call to look up the infrastructure template fails due to permissions, we log the error but continue to run.
My initial investigations suggest that things are short-circuiting during this call:
I'll keep digging in, but perhaps it's nothing we're doing ourselves and rather how we're using the k8s standard libs.
i wonder if we can detect from the error type that it's a permission thing?
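apimachinery does let callers test for that: apierrors.IsForbidden matches the 403 the API server returns when RBAC denies the request. A minimal sketch of what "log and keep going" could look like around the template lookup (the lookupTemplate wrapper and the nil fallback here are illustrative, not the actual provider code):

```go
package clusterapi

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/klog/v2"
)

// lookupTemplate is a hypothetical wrapper around the infrastructure
// template lookup. On a 403 it logs the problem and returns nil so the
// caller can keep running (just without the template-derived capacity
// info) instead of treating the node group as broken.
func lookupTemplate(get func() (*unstructured.Unstructured, error)) (*unstructured.Unstructured, error) {
	obj, err := get()
	if err != nil {
		// IsForbidden is true only for 403 responses, i.e. the service
		// account lacks RBAC on the infrastructure machine template.
		if apierrors.IsForbidden(err) {
			klog.Warningf("not permitted to read infrastructure machine template, continuing without it: %v", err)
			return nil, nil
		}
		return nil, err
	}
	return obj, nil
}
```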
Here's what's happening:
I think I hit this same issue without scale from zero. We hit a timeout when scaling up, and after that the autoscaler got stuck in a loop with the same permission error.
I0423 12:32:12.823109 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 12:32:12.826263 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 3.127004ms
I0423 12:34:12.827213 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 12:34:13.054351 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 227.107322ms
I0423 12:36:13.055436 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 12:36:13.058399 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 2.918912ms
I0423 12:38:13.059011 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 12:38:13.062092 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 3.002037ms
E0423 12:38:32.277862 1 orchestrator.go:446] Couldn't get autoscaling options for ng: MachineDeployment/default/prow-md-0
E0423 12:38:32.278382 1 orchestrator.go:503] Failed to get autoscaling options for node group MachineDeployment/default/prow-md-0: Not implemented
I0423 12:38:32.278519 1 executor.go:147] Scale-up: setting group MachineDeployment/default/prow-md-0 size to 2
W0423 12:38:43.572435 1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:38:54.524316 1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:39:05.581206 1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:39:16.611779 1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
...
W0423 12:53:30.883821 1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:53:42.099475 1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:53:42.105887 1 clusterstate.go:273] Scale-up timed out for node group MachineDeployment/default/prow-md-0 after 15m9.146461594s
W0423 12:53:42.109963 1 reflector.go:539] k8s.io/client-go/dynamic/dynamicinformer/informer.go:108: failed to list infrastructure.cluster.x-k8s.io/v1alpha6, Resource=openstackmachinetemplates: openstackmachinetemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "openstackmachinetemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
E0423 12:53:42.112312 1 reflector.go:147] k8s.io/client-go/dynamic/dynamicinformer/informer.go:108: Failed to watch infrastructure.cluster.x-k8s.io/v1alpha6, Resource=openstackmachinetemplates: failed to list infrastructure.cluster.x-k8s.io/v1alpha6, Resource=openstackmachinetemplates: openstackmachinetemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "openstackmachinetemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
Deleting the stuck pod got it back on track.
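For what it's worth, the forbidden errors in that log come from the dynamic informer the scale-from-zero path uses to read the infrastructure machine templates (openstackmachinetemplates here). One way the provider could surface the missing RBAC up front, instead of letting the reflector retry indefinitely, would be a SelfSubjectAccessReview probe at startup; the canListTemplates helper below is just a sketch of that idea, not existing autoscaler code:

```go
package clusterapi

import (
	"context"
	"fmt"

	authv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// canListTemplates asks the API server whether the autoscaler's own service
// account is allowed to list the given infrastructure template resource,
// e.g. group "infrastructure.cluster.x-k8s.io", resource
// "openstackmachinetemplates".
func canListTemplates(ctx context.Context, cfg *rest.Config, group, resource string) (bool, error) {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	review := &authv1.SelfSubjectAccessReview{
		Spec: authv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authv1.ResourceAttributes{
				Group:    group,
				Resource: resource,
				Verb:     "list",
			},
		},
	}
	resp, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, review, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	if !resp.Status.Allowed {
		return false, fmt.Errorf("cannot list %s in group %s: %s", resource, group, resp.Status.Reason)
	}
	return true, nil
}
```

If the probe reports not-allowed, the provider could log a pointer to the ClusterRole rule that needs to cover the infrastructure group and skip scale-from-zero for that node group rather than wedging.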
thanks for the update @lentzi90 !
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
cluster-autoscaler
Component version:
1.29.0
What k8s version are you using (kubectl version)?:
What environment is this in?:
Azure + Cluster API
What did you expect to happen?:
I expected the cluster to scale out configured node pools from zero.
What happened instead?:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: