kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

cluster-autoscaler has a panic in NodeDeletionBatcher.AddNode #5891

Closed: com6056 closed this issue 9 months ago

com6056 commented 1 year ago

Which component are you using?: cluster-autoscaler

What version of the component are you using?:

Component version: v1.27.1

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:32:02Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS via cluster-api, cluster-api-provider-aws (MachinePool), and the aws cluster-autoscaler provider

What did you expect to happen?: No panic

What happened instead?: Panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x46d3b08]

goroutine 3346 [running]:
k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).AddNode(0xc0074eb900, 0xc0047c7340, 0xaf?)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:77 +0xa8
k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*Actuator).scheduleDeletion(0xc0061a4320, 0xc000a247d0?, {0xc007493a70, 0xf}, 0x88?)
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation/actuator.go:363 +0x85
created by k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*Actuator).deleteNodesAsync
    /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation/actuator.go:296 +0xc95

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: We have 2 different cluster-autoscaler instances running: one to manage the MachineDeployment resources using the cluster-api provider and one to manage the MachinePool resources using the aws provider. This is a workaround until https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/4184 is complete and the cluster-api provider can be used for MachinePool resources.

It looks like it might be due to calling nodeGroup.Id() in https://github.com/kubernetes/autoscaler/blob/62d9c945786ff0a0b6942af269e8d5d05bee29fa/cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go#L77 when nodeGroup might be nil, but I haven't confirmed.
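To make the failure mode concrete, here is a minimal toy sketch of the pattern I suspect (simplified names, not the actual autoscaler code): the helper returns a nil node group together with the error, and the error branch still calls a method on it.

package main

import "fmt"

// nodeGroup stands in for cloudprovider.NodeGroup; only Id() matters here.
type nodeGroup struct{ id string }

// Id dereferences its receiver, so calling it on a nil *nodeGroup panics with
// the same "invalid memory address or nil pointer dereference" as above.
func (g *nodeGroup) Id() string { return g.id }

// deleteFromCloudProvider mimics the shape of deleteNodesFromCloudProvider:
// on failure it returns (nil, err).
func deleteFromCloudProvider(node string) (*nodeGroup, error) {
    return nil, fmt.Errorf("no node group found for node %q", node)
}

func main() {
    ng, err := deleteFromCloudProvider("ip-10-0-0-1.ec2.internal")
    if err != nil {
        // Suspected bug pattern: the error branch still uses ng to label the
        // failed scale-down, even though ng is nil here.
        fmt.Println("failed to delete node in group", ng.Id()) // panics
    }
}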

com6056 commented 1 year ago

I made a pass at fixing it. I'm still not 100% sure, but I think it should hopefully be as simple as this: https://github.com/kubernetes/autoscaler/pull/5892

vadasambar commented 1 year ago

Thanks for the issue and the PR.

Can you reproduce the issue consistently?

I don't think CA supports running multiple instances in the same cluster. Maybe things are working because you are using two different cloud providers? I haven't seen a case where someone runs multiple CA instances. If the issue is hard to reproduce, I wonder if it's caused by a race condition (though I would expect some degree of separation in execution between the 2 instances).

com6056 commented 1 year ago

Yep, our CA container is restarting pretty consistently with this panic. I'm not sure that running 2 CA instances has much to do with it. If you look at my PR with the proposed fix, it looks like we just don't have a nil check before calling nodeGroup.Id(), so I opted to pass through the known nodeGroupId that we already had earlier instead.
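The actual change is in the PR, but roughly the shape of the fix (a simplified sketch reusing the toy deleteFromCloudProvider helper from the issue description, not the real autoscaler signatures) is to thread the already-known node group id through so the error path never touches the possibly-nil node group value:

// Sketch only: the caller passes the id it already resolved, so the error
// branch never needs to dereference the (possibly nil) node group.
func addNode(node string, nodeGroupId string) error {
    ng, err := deleteFromCloudProvider(node)
    if err != nil {
        // Safe: nodeGroupId is a plain string, so nothing dereferences the nil ng.
        return fmt.Errorf("failed to delete node %s in group %s: %w", node, nodeGroupId, err)
    }
    fmt.Println("deleted", node, "from group", ng.Id())
    return nil
}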

com6056 commented 1 year ago

deleteNodesFromCloudProvider calls NodeGroupForNode, which in the AWS case is https://github.com/kubernetes/autoscaler/blob/8f83f7ec434eda955f176fcb8ad7153d90000477/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L108C1-L128 and can sometimes return nil along with an error, so we shouldn't be calling a method on nodeGroup in the error path since it could be nil.
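For illustration, the defensive pattern callers need looks roughly like this (toy types standing in for the real cloudprovider interfaces, not the actual provider code):

package main

import (
    "errors"
    "fmt"
)

// NodeGroup mirrors just the Id method of cloudprovider.NodeGroup.
type NodeGroup interface{ Id() string }

// nodeGroupForNode mimics the failure path described above: (nil, err).
func nodeGroupForNode(nodeName string) (NodeGroup, error) {
    return nil, errors.New("could not find matching ASG for node")
}

func main() {
    ng, err := nodeGroupForNode("ip-10-0-0-1.ec2.internal")
    // Defensive pattern: only touch ng once we know err is nil and ng is non-nil.
    if err != nil || ng == nil {
        fmt.Println("skipping node group lookup failure:", err)
        return
    }
    fmt.Println("node group:", ng.Id())
}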