kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Unable to scale AWS autoscaling array up to satisfy pod EBS requirement in particular zone (NoVolumeZoneConflict) #1431

Closed · garo closed 5 years ago

garo commented 5 years ago

Running Kubernetes v1.10.3-eks in Amazon EKS. The cluster has one AWS autoscaling array with three different availability zones/subnets defined. At the time of the problem the cluster has two nodes, one in us-east-1a and one in us-east-1c.

There is a pod with a PVC attached, which is backed by an EBS PV in us-east-1d. Because there isn't any node running in us-east-1d, the pod cannot start.

The problem is that cluster-autoscaler isn't able to scale up the autoscaling array so that a new worker would appear in us-east-1d to satisfy the zone requirement. Manually increasing the autoscaling array's size does give a new node in the correct zone.
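
(For reference, the zone an EBS-backed PV is bound to can be read from its zone label; on 1.10 the label is failure-domain.beta.kubernetes.io/zone:

kubectl get pv -L failure-domain.beta.kubernetes.io/zone

This prints each PV with a ZONE column, which is how I confirmed the volume sits in us-east-1d.)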

Cluster-autoscaler is installed with helm: chart version cluster-autoscaler-0.7.0, App version: 1.2.2. Installation command:

# Chart 0.7.0 is meant for Kubernetes 1.10.x, see https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
helm install --name cluster-autoscaler --namespace kube-system \
   --version 0.7.0 \
   --set autoDiscovery.clusterName=$CLUSTER_NAME \
   --set awsRegion=$AWS_REGION \
   --set sslCertPath=/etc/kubernetes/pki/ca.crt \
   --set rbac.create=true \
   --set podAnnotations."iam\.amazonaws\.com/role"=$CLUSTER_NAME-eks-worker-node \
   stable/cluster-autoscaler

Cluster autoscaler error log:

static_autoscaler.go:114] Starting main loop
utils.go:456] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
static_autoscaler.go:263] Filtering out schedulables
static_autoscaler.go:273] No schedulable pods
scale_up.go:59] Pod monitoring/prometheus-mon-prometheus-operator-prometheus-0 is unschedulable
scale_up.go:92] Upcoming 0 nodes
scale_up.go:152] Scale-up predicate failed: NoVolumeZoneConflict predicate mismatch, cannot put monitoring/prometheus-mon-prometheus-operator-prometheus-0 on template-node-for-cluster-generic-nodes-4423088653825289861, reason: node(s) had no available volume zone
scale_up.go:181] No pod can fit to cluster-generic-nodes
scale_up.go:186] No expansion options
static_autoscaler.go:322] Calculating unneeded nodes
factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"monitoring", Name:"prometheus-mon-prometheus-operator-prometheus-0", UID:"9473cd20-ee27-11e8-83f9-0e13418086b6", APIVersion:"v1", ResourceVersion:"5808447", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
scale_down.go:175] Scale-down calculation: ignoring 2 nodes, that were unremovable in the last 5m0s
static_autoscaler.go:352] Scale down status: unneededOnly=true lastScaleUpTime=2018-11-22 07:26:02.736718705 +0000 UTC lastScaleDownDeleteTime=2018-11-22 07:26:02.736719107 +0000 UTC lastScaleDownFailTime=2018-11-22 07:26:02.736719507 +0000 UTC schedulablePodsPresent=false isDeleteInProgress=false
aleksandra-malinowska commented 5 years ago

scale_up.go:152] Scale-up predicate failed: NoVolumeZoneConflict predicate mismatch, cannot put monitoring/prometheus-mon-prometheus-operator-prometheus-0 on template-node-for-cluster-generic-nodes-4423088653825289861, reason: node(s) had no available volume zone

It's a bit of a guessing game since I've no idea what you mean by 'autoscaling array', but this sounds as if you have only 1 regional node group, with nodes in 3 different zones.

This isn't supported, as Cluster Autoscaler assumes all nodes in a node group are identical with respect to all scheduling properties (resources, labels, taints, zone). If you want to use scheduling features related to zones (like topology-aware volume scheduling), go for 3 zonal node groups. More details are in the FAQ.
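
A minimal sketch of what three zonal node groups can look like when running Cluster Autoscaler with explicitly listed ASGs (the ASG names and min/max sizes here are hypothetical; with the helm chart, the equivalent is one ASG per AZ picked up via auto-discovery):

./cluster-autoscaler \
   --cloud-provider=aws \
   --nodes=1:10:k8s-workers-us-east-1a \
   --nodes=1:10:k8s-workers-us-east-1c \
   --nodes=1:10:k8s-workers-us-east-1d

Each --nodes entry is min:max:asg-name. Because each ASG spans exactly one zone, CA knows which zone any node it adds will land in, so the NoVolumeZoneConflict predicate can be evaluated against the template node.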

garo commented 5 years ago

Thank you for the response.

Yes, I have a single autoscaling array which creates nodes in all three availability zones.

Apparently topology-aware volume scheduling was introduced in 1.11, and as I'm running on EKS I'm stuck on 1.10. Am I right that on 1.10 there isn't any way to make this work?

aleksandra-malinowska commented 5 years ago

Even with this feature, you'll still need to have separate node groups in each zone.

Cluster Autoscaler must be able to predict accurately what kind of node it will create. In your case, you want a node in the same zone as the PV. If you use a regional node group that creates nodes in a random zone, you'll get random behavior.

garo commented 5 years ago

Thank you for all the feedback.

frederiksf commented 5 years ago

Even with this feature, you'll still need to have separate node groups in each zone.

Cluster Autoscaler must be able to predict accurately what kind of node it will create. In your case, you want a node in the same zone as the PV. If you use a regional node group that creates nodes in a random zone, you'll get random behavior.

@aleksandra-malinowska Question: how will CA find out which zone a given node group's ASG belongs to? Will it read the ASG's zones from the AWS API ("DescribeAutoScalingGroups")?

jfoy commented 5 years ago

@frederiksf Not sure about the scale-from-zero case, but per https://github.com/kubernetes/contrib/pull/1552#discussion_r75532949 , CA asks the cloudprovider for a sample Node from the NodePool, and asserts that any capacity it adds in that NodePool will have exactly the same characteristics as the sample Node.
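
For scale-from-zero on AWS, the template can also be built from tags on the ASG itself, using the node-template tag format documented in the AWS cloudprovider README. A sketch (the ASG name is hypothetical, and for a single-zone ASG the zone can usually be inferred from the group's own AZ list, so the tag mainly illustrates the mechanism):

aws autoscaling create-or-update-tags --tags \
   ResourceId=k8s-workers-us-east-1d,ResourceType=auto-scaling-group,PropagateAtLaunch=true,Key=k8s.io/cluster-autoscaler/node-template/label/failure-domain.beta.kubernetes.io/zone,Value=us-east-1d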

abdennour commented 4 years ago

I have EKS 1.17 and cluster-autoscaler chart v7.0.0, and the issue still persists. Any ETA for a fix?

dprateek1991 commented 3 years ago

I have a setup with multiple ASGs, with both the ap-southeast-1a and ap-southeast-1b AZs attached to all the ASGs.

In my case, we have an EC2 instance running in the ap-southeast-1a zone, so the Persistent Volume gets attached to the node perfectly fine since the EBS volume is in ap-southeast-1a itself.

However, I have another EBS volume in the ap-southeast-1b zone, and in this case Cluster Autoscaler is not scaling up to add a node in ap-southeast-1b. What can be wrong here? Ideally it should scale up, add a node in the 1b zone, and attach the volume to it.

I get this error: pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 Insufficient nvidia.com/gpu, 1 node(s) had volume node affinity conflict

EKS Cluster Version - 1.17

edijsdrezovs commented 3 years ago

@dprateek1991

If you're using Persistent Volumes, your deployment needs to run in the same AZ as the EBS volume; otherwise scheduling can fail if the pod lands in a different AZ and cannot reach the EBS volume. To overcome this, either use a single-AZ ASG for this use case, or an ASG per AZ with --balance-similar-node-groups enabled, as sketched below.
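
A sketch of enabling that flag through the helm chart (extraArgs is the chart's pass-through for CA flags; the release name, chart repo, and namespace below are assumptions about your install):

helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
   --namespace kube-system \
   --reuse-values \
   --set extraArgs.balance-similar-node-groups=true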

At creation time, the ASG will have the AZRebalance process enabled, which means it will actively work to balance the number of instances between AZs, possibly terminating instances in the process. If your applications could be impacted by sudden termination, you can either suspend the AZRebalance process (see below) or use a tool for automatic draining upon ASG scale-in such as the k8s-node-drainer. The AWS Node Termination Handler will also support this use case in the future.
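
Suspending AZRebalance is a one-liner with the AWS CLI (the group name is a placeholder):

aws autoscaling suspend-processes \
   --auto-scaling-group-name k8s-workers-asg \
   --scaling-processes AZRebalance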