kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler CAPI provider fails to scale from zero if infra RBAC doesn't exist #6490

Open jackfrancis opened 9 months ago

jackfrancis commented 9 months ago

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

cluster-autoscaler

Component version:

1.29.0

What k8s version are you using (kubectl version)?:

$ k get nodes -o wide
NAME                                                          STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
capz-e2e-ultbq8-cluster-autoscaler-capi-control-plane-646wf   Ready    control-plane   18h   v1.27.9   10.0.0.4      <none>        Ubuntu 22.04.3 LTS   6.2.0-1018-azure   containerd://1.7.10
capz-e2e-ultbq8-cluster-autoscaler-capi-md-0-csm4h            Ready    <none>          68m   v1.27.9   10.1.0.6      <none>        Ubuntu 22.04.3 LTS   6.2.0-1018-azure   containerd://1.7.10
capz-e2e-ultbq8-cluster-autoscaler-capi-md-0-lvxjc            Ready    <none>          68m   v1.27.9   10.1.0.7      <none>        Ubuntu 22.04.3 LTS   6.2.0-1018-azure   containerd://1.7.10
capz-e2e-ultbq8-cluster-autoscaler-capi-md-0-xbnbt            Ready    <none>          68m   v1.27.9   10.1.0.8      <none>        Ubuntu 22.04.3 LTS   6.2.0-1018-azure   containerd://1.7.10
kubectl version Output
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.9", GitCommit:"d15213f69952c79b317e635abff6ff4ec81475f8", GitTreeState:"clean", BuildDate:"2023-12-19T13:41:13Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.9", GitCommit:"d15213f69952c79b317e635abff6ff4ec81475f8", GitTreeState:"clean", BuildDate:"2023-12-19T13:32:15Z", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Azure + Cluster API

What did you expect to happen?:

I expected the cluster to scale out the configured node pools from zero.

What happened instead?:

W0131 18:54:54.242627       1 reflector.go:539] k8s.io/client-go/dynamic/dynamicinformer/informer.go:108: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Resource=azuremachinetemplates: azuremachinetemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:capz-e2e-ultbq8:capz-e2e-ultbq8-cluster-autoscaler-capi-clusterapi-cluster-auto" cannot list resource "azuremachinetemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
E0131 18:54:54.242736       1 reflector.go:147] k8s.io/client-go/dynamic/dynamicinformer/informer.go:108: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Resource=azuremachinetemplates: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Resource=azuremachinetemplates: azuremachinetemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:capz-e2e-ultbq8:capz-e2e-ultbq8-cluster-autoscaler-capi-clusterapi-cluster-auto" cannot list resource "azuremachinetemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
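
The forbidden response above means the autoscaler's service account has no RBAC rule covering the provider's infrastructure templates (here, azuremachinetemplates). As a rough illustration, one way to confirm that from inside the cluster is a client-go SelfSubjectAccessReview; the helper below is a sketch with assumed names, not part of the autoscaler itself:

package main

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// canListInfraTemplates reports whether the pod's own service account is
// allowed to list the given infrastructure template resource.
func canListInfraTemplates(ctx context.Context, resource string) (bool, error) {
	cfg, err := rest.InClusterConfig() // assumes this runs inside a pod
	if err != nil {
		return false, fmt.Errorf("loading in-cluster config: %w", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	review := &authorizationv1.SelfSubjectAccessReview{
		Spec: authorizationv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Group:    "infrastructure.cluster.x-k8s.io",
				Resource: resource, // e.g. "azuremachinetemplates"
				Verb:     "list",
			},
		},
	}
	resp, err := client.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, review, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	return resp.Status.Allowed, nil
}

func main() {
	allowed, err := canListInfraTemplates(context.Background(), "azuremachinetemplates")
	fmt.Printf("can list azuremachinetemplates: %v (err: %v)\n", allowed, err)
}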

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

jackfrancis commented 9 months ago

/assign

jackfrancis commented 9 months ago

cc @elmiko

elmiko commented 9 months ago

i think we need to make sure that if the call to look up the infrastructure template fails due to permissions, we log the error but continue to run.

jackfrancis commented 9 months ago

My initial investigations suggest that things are short-circuiting during this call:

I'll keep digging, but perhaps it's nothing we're doing ourselves but rather how we're using the standard k8s client libraries.

elmiko commented 9 months ago

i wonder if we can detect from the error type that it's a permission thing?
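
client-go surfaces these RBAC failures as a StatusError with reason Forbidden, so the error type alone can tell us it is a permission problem (apierrors.IsForbidden). Below is a rough sketch of that check combined with the log-and-continue behavior discussed above; the lookup function and node-group naming are placeholders, not the actual clusterapi provider code:

package clusterapi

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/klog/v2"
)

// lookupInfraTemplate stands in for the provider's dynamic-client read of
// e.g. an AzureMachineTemplate or OpenStackMachineTemplate (placeholder only).
var lookupInfraTemplate = func(nodeGroup string) (*unstructured.Unstructured, error) {
	return nil, fmt.Errorf("not wired up in this sketch")
}

// infraTemplateOrNil returns the template when it is readable, (nil, nil) when
// RBAC forbids reading it (warned and tolerated), and an error otherwise.
func infraTemplateOrNil(nodeGroup string) (*unstructured.Unstructured, error) {
	tmpl, err := lookupInfraTemplate(nodeGroup)
	switch {
	case err == nil:
		return tmpl, nil
	case apierrors.IsForbidden(err):
		// The error type tells us it's a permission problem: warn and continue
		// so the rest of the autoscaling loop keeps running.
		klog.Warningf("cannot read infrastructure template for %q (missing RBAC?): %v", nodeGroup, err)
		return nil, nil
	default:
		return nil, fmt.Errorf("reading infrastructure template for %q: %w", nodeGroup, err)
	}
}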

jackfrancis commented 9 months ago

Here's what's happening:

lentzi90 commented 6 months ago

I think I hit this same issue, though without scale from zero. We hit a timeout when scaling up, and after that the autoscaler got stuck in a loop with the same permission issue.

I0423 12:32:12.823109       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache                                                                                             
I0423 12:32:12.826263       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 3.127004ms                                                                    
I0423 12:34:12.827213       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache                                                                                             
I0423 12:34:13.054351       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 227.107322ms
I0423 12:36:13.055436       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 12:36:13.058399       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 2.918912ms
I0423 12:38:13.059011       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0423 12:38:13.062092       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 3.002037ms
E0423 12:38:32.277862       1 orchestrator.go:446] Couldn't get autoscaling options for ng: MachineDeployment/default/prow-md-0
E0423 12:38:32.278382       1 orchestrator.go:503] Failed to get autoscaling options for node group MachineDeployment/default/prow-md-0: Not implemented
I0423 12:38:32.278519       1 executor.go:147] Scale-up: setting group MachineDeployment/default/prow-md-0 size to 2
W0423 12:38:43.572435       1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:38:54.524316       1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:39:05.581206       1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:39:16.611779       1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
...
W0423 12:53:30.883821       1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:53:42.099475       1 clusterapi_controller.go:613] Machine "prow-md-0-5bf469475xqkcdw-bh2zg" has no providerID
W0423 12:53:42.105887       1 clusterstate.go:273] Scale-up timed out for node group MachineDeployment/default/prow-md-0 after 15m9.146461594s
W0423 12:53:42.109963       1 reflector.go:539] k8s.io/client-go/dynamic/dynamicinformer/informer.go:108: failed to list infrastructure.cluster.x-k8s.io/v1alpha6, Resource=openstackmachinetemplates: openstackmachinetemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "openstackmachinetemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
E0423 12:53:42.112312       1 reflector.go:147] k8s.io/client-go/dynamic/dynamicinformer/informer.go:108: Failed to watch infrastructure.cluster.x-k8s.io/v1alpha6, Resource=openstackmachinetemplates: failed to list infrastructure.cluster.x-k8s.io/v1alpha6, Resource=openstackmachinetemplates: openstackmachinetemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:kube-system:cluster-autoscaler" cannot list resource "openstackmachinetemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope

Deleting the stuck pod got it back on track.

elmiko commented 6 months ago

thanks for the update @lentzi90!

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Shubham82 commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lentzi90 commented 2 weeks ago

/remove-lifecycle stale

Shubham82 commented 2 weeks ago

/lifecycle frozen