Closed · garo closed this issue 5 years ago
scale_up.go:152] Scale-up predicate failed: NoVolumeZoneConflict predicate mismatch, cannot put monitoring/prometheus-mon-prometheus-operator-prometheus-0 on template-node-for-cluster-generic-nodes-4423088653825289861, reason: node(s) had no available volume zone
It's a bit of a guessing game since I have no idea what you mean by 'autoscaling array', but this sounds as if you have only one regional node group, with nodes in three different zones.
This isn't supported: Cluster Autoscaler assumes all nodes in a node group are identical with respect to all scheduling properties (resources, labels, taints, zone). If you want to use zone-related scheduling features (like volume topological scheduling), use three zonal node groups instead. More details are in the FAQ.
Thank you for the response.
Yes, I have a single autoscaling array which creates nodes in all three availability zones.
Apparently Volume Topological Scheduling was introduced in 1.11, and since I'm running on EKS I'm stuck on 1.10. Am I right that on 1.10 there isn't any way to make this work?
Even with this feature, you'll still need to have separate node groups in each zone.
Cluster Autoscaler must be able to predict accurately what kind of node it will create. In your case, you want a node in the same zone as the PV. If you use a regional node group that creates the node in a random zone, you'll get random behavior.
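To see why the prediction matters, here is a hypothetical Python sketch (not Cluster Autoscaler's actual code) of a zone-conflict check in the spirit of the scheduler's NoVolumeZoneConflict predicate: a pod whose PV lives in one zone only fits on a node whose zone label matches, so a template node drawn from a multi-zone group can fail the check arbitrarily.

```python
# Hypothetical sketch of a volume-zone fit check, loosely modeled on the
# scheduler's NoVolumeZoneConflict predicate; not Cluster Autoscaler code.

def volume_zone_fits(node_labels, pv_zone):
    """A pod whose PV lives in pv_zone can only run on a node in that zone."""
    # Real clusters carry the zone in one of these well-known node labels.
    node_zone = (node_labels.get("topology.kubernetes.io/zone")
                 or node_labels.get("failure-domain.beta.kubernetes.io/zone"))
    return node_zone == pv_zone

# A template node built from a regional node group could come out in any of
# its zones, so the result of the check is effectively random:
template_node = {"failure-domain.beta.kubernetes.io/zone": "us-east-1a"}
print(volume_zone_fits(template_node, "us-east-1d"))  # False -> no scale-up
```

With one zonal node group per AZ, the template node's zone is fixed, so this check becomes deterministic and CA can pick the group in the PV's zone.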
Thank you for all the feedback.
> Even with this feature, you'll still need to have separate node groups in each zone.
> Cluster Autoscaler must be able to predict accurately what kind of node it will create. In your case, you want a node in the same zone as PV. If you use a regional node group which will create the node in a random zone, you'll get random behavior.
@aleksandra-malinowska Question: how will CA find out which zones a specific ASG node group covers? Will it read the ASG's zones via the AWS API ("DescribeAutoScalingGroups")?
@frederiksf Not sure about the scale-from-zero case, but per https://github.com/kubernetes/contrib/pull/1552#discussion_r75532949 , CA asks the cloudprovider for a sample Node from the NodePool, and asserts that any capacity it adds in that NodePool will have exactly the same characteristics as the sample Node.
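For the zone question specifically, the AWS side does expose each ASG's Availability Zones through DescribeAutoScalingGroups. Below is a hedged sketch: the boto3 call is the real AWS Auto Scaling API, but the helper function and the way it's used here are illustrative, not how CA itself is implemented.

```python
# Hypothetical sketch: list the Availability Zones attached to each ASG from
# a DescribeAutoScalingGroups response. An ASG that spans multiple zones is
# the problematic "regional" case discussed above.

def zones_by_asg(response):
    """Map ASG name -> list of AZs from a DescribeAutoScalingGroups payload."""
    return {g["AutoScalingGroupName"]: g["AvailabilityZones"]
            for g in response["AutoScalingGroups"]}

if __name__ == "__main__":
    import boto3  # assumption: boto3 installed, AWS credentials configured
    client = boto3.client("autoscaling")
    for name, zones in zones_by_asg(client.describe_auto_scaling_groups()).items():
        print(name, zones)
```

Knowing the zones is not enough on its own, though: as noted above, CA characterizes a node group by a single sample node, so a group listing two zones still yields one template with one zone label.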
I have EKS 1.17 and cluster-autoscaler chart v7.0.0, and the issue still persists. Is there a timeline for a fix?
I have a setup with multiple ASGs, each with the ap-southeast-1a and ap-southeast-1b AZs attached.
In my case, we have an EC2 instance running in the ap-southeast-1a zone, so the Persistent Volume gets attached to that node perfectly fine, since the EBS volume is in ap-southeast-1a itself.
However, I have another EBS volume in the ap-southeast-1b zone, and in this case Cluster Autoscaler does not scale up to add a node in ap-southeast-1b. What could be wrong here? Ideally it should scale up, add a node in the 1b zone, and attach the volume to it.
I get this error: pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 Insufficient nvidia.com/gpu, 1 node(s) had volume node affinity conflict
EKS Cluster Version - 1.17
@dprateek1991
If you’re using Persistent Volumes, your deployment needs to run in the same AZ as where the EBS volume is, otherwise the pod scheduling could fail if it is scheduled in a different AZ and cannot find the EBS volume. To overcome this, either use a single AZ ASG for this use case, or an ASG-per-AZ while enabling --balance-similar-node-groups.
At creation time, the ASG will have the AZRebalance process enabled, which means it will actively work to balance the number of instances between AZs, and may terminate instances to do so. If your applications could be impacted by sudden termination, you can either suspend the AZRebalance process, or use a tool for automatic draining upon ASG scale-in such as the k8s-node-drainer. The AWS Node Termination Handler will also support this use case in the future.
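Suspending AZRebalance can be sketched as follows. suspend_processes is a real boto3 Auto Scaling operation; the helper function and the group name are illustrative assumptions, not part of any tool mentioned above.

```python
# Hypothetical sketch: suspend the AZRebalance process on an ASG so it stops
# terminating instances to rebalance across zones.

def azrebalance_suspend_request(asg_name):
    """Build the parameters for the autoscaling suspend_processes() call."""
    return {"AutoScalingGroupName": asg_name,
            "ScalingProcesses": ["AZRebalance"]}

if __name__ == "__main__":
    import boto3  # assumption: boto3 installed, AWS credentials configured
    client = boto3.client("autoscaling")
    # "my-node-group" is a placeholder ASG name.
    client.suspend_processes(**azrebalance_suspend_request("my-node-group"))
```

The equivalent one-liner with the AWS CLI is `aws autoscaling suspend-processes --auto-scaling-group-name my-node-group --scaling-processes AZRebalance`.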
Running Kubernetes v1.10.3-eks in Amazon EKS. The cluster has one AWS autoscaling array with three different availability zones / subnets defined. At the time of the problem the cluster has two nodes, one in us-east-1a and one in us-east-1c.
There is a pod with a PVC attached, which is backed by an EBS PV in us-east-1d. Because there isn't any node running in us-east-1d, the pod cannot start.
The problem is that cluster-autoscaler isn't able to scale up the autoscaling array so that a new worker appears in us-east-1d to satisfy the zone requirement. Manually increasing the autoscaling group size does produce a new node in the correct zone.
Cluster-autoscaler is installed with helm: chart version cluster-autoscaler-0.7.0, App version: 1.2.2. Installation command:
Cluster autoscaler error log: