Xyaren closed this issue 1 year ago.
Same problem here. (Edit: after realizing there is no relevant difference between my previous post and what you wrote.)
After doing some spelunking, I believe you are correct: it has something to do with scaling from 0, the use of the topology.ebs.csi.aws.com/zone
label, and the ability of the autoscaler to recognize it. Some experimentation corroborates this.
Tagging the ASG with k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone
is the approach I am taking, and it works like a charm.
I can do some footwork in Terraform to get the tags set up; not sure what you're using to provision your cluster.
Still, it would be nice to have the labels generated from the list of AZs assigned to an ASG.
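If it helps anyone doing this by hand instead of via Terraform, a minimal sketch using the plain AWS CLI might look like the following; the ASG name and zone are placeholders, not values from this thread:

# Hypothetical single-AZ ASG; substitute your own ASG name and zone.
ASG_NAME="my-eks-nodegroup-asg"
ZONE="us-east-2a"
aws autoscaling create-or-update-tags --tags \
  "ResourceId=${ASG_NAME},ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone,Value=${ZONE},PropagateAtLaunch=false" \
  "ResourceId=${ASG_NAME},ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone,Value=${ZONE},PropagateAtLaunch=false"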
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
How do we resolve this issue for StatefulSet deployments with custom storage classes attached, on EKS?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
So just set
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: "us-east-2a"
for example? Like the OP mentions, how are we supposed to do this for multiple AZs?
I have this exact problem too. To add further info, the error I get on the pod that is unable to scale from zero is:
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict
@FarhanSajid1 you should have one node group (and thus one ASG) for each AZ. The above tag needs to be applied to the ASG.
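A rough sketch of what that looks like with one ASG per AZ, assuming hypothetical ASG names, is a small loop over the zone/ASG pairs:

# Hypothetical per-AZ ASG names; one node group (and thus one ASG) per zone.
for PAIR in "workers-2a:us-east-2a" "workers-2b:us-east-2b" "workers-2c:us-east-2c"; do
  ASG="${PAIR%%:*}"; ZONE="${PAIR##*:}"
  aws autoscaling create-or-update-tags --tags \
    "ResourceId=${ASG},ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone,Value=${ZONE},PropagateAtLaunch=false" \
    "ResourceId=${ASG},ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone,Value=${ZONE},PropagateAtLaunch=false"
done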
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Hi folks! Facing the same issue. CA version: v1.21.1, aws-ebs-csi-driver version: v1.10.0-eksbuild.1.
Cluster-autoscaler logs:
I0920 17:30:00.585954 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-173-251.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-173-251.ap-south-1.compute.internal" not found
I0920 17:30:00.586008 1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-173-251.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586074 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-68-79.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-68-79.ap-south-1.compute.internal" not found
I0920 17:30:00.586107 1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "ip-10-121-68-79.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586149 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-162-179.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-162-179.ap-south-1.compute.internal" not found
I0920 17:30:00.586172 1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-162-179.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586247 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-241-242.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-241-242.ap-south-1.compute.internal" not found
I0920 17:30:00.586275 1 scheduler_binder.go:823] PersistentVolume "pvc-50c002d3-a5cc-4143-adf2-1362d18fc40e", Node "ip-10-121-241-242.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586328 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-5-204.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-5-204.ap-south-1.compute.internal" not found
I0920 17:30:00.586350 1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "ip-10-121-5-204.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.586533 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-173-251.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-173-251.ap-south-1.compute.internal" not found
I0920 17:30:00.586572 1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-173-251.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586622 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-68-79.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-68-79.ap-south-1.compute.internal" not found
I0920 17:30:00.586663 1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "ip-10-121-68-79.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586711 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-162-179.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-162-179.ap-south-1.compute.internal" not found
I0920 17:30:00.586737 1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-162-179.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586802 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-241-242.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-241-242.ap-south-1.compute.internal" not found
I0920 17:30:00.586827 1 scheduler_binder.go:823] PersistentVolume "pvc-0c9887c2-eea3-4ef7-baae-c4c0aca78699", Node "ip-10-121-241-242.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586869 1 scheduler_binder.go:803] Could not get a CSINode object for the node "ip-10-121-5-204.ap-south-1.compute.internal": csinode.storage.k8s.io "ip-10-121-5-204.ap-south-1.compute.internal" not found
I0920 17:30:00.586907 1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "ip-10-121-5-204.ap-south-1.compute.internal" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.586929 1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0920 17:30:00.586938 1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0920 17:30:00.586952 1 filter_out_schedulable.go:82] No schedulable pods
I0920 17:30:00.586966 1 klogx.go:86] Pod kafka/kafka-0 is unschedulable
I0920 17:30:00.586972 1 klogx.go:86] Pod kafka/kafka-1 is unschedulable
I0920 17:30:00.587014 1 scale_up.go:376] Upcoming 0 nodes
I0920 17:30:00.587153 1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083": csinode.storage.k8s.io "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" not found
I0920 17:30:00.587188 1 scheduler_binder.go:823] PersistentVolume "pvc-31af46c4-0d27-4eea-8ef6-148bbb2b4f0b", Node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" mismatch for Pod "kafka/kafka-0": no matching NodeSelectorTerms
I0920 17:30:00.587210 1 scale_up.go:300] Pod kafka-0 can't be scheduled on eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0920 17:30:00.587316 1 scheduler_binder.go:803] Could not get a CSINode object for the node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083": csinode.storage.k8s.io "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" not found
I0920 17:30:00.587361 1 scheduler_binder.go:823] PersistentVolume "pvc-df590cf4-a584-4842-9842-9629312c0e45", Node "template-node-for-eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09-6789034556239763083" mismatch for Pod "kafka/kafka-1": no matching NodeSelectorTerms
I0920 17:30:00.587386 1 scale_up.go:300] Pod kafka-1 can't be scheduled on eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0920 17:30:00.587417 1 scale_up.go:449] No pod can fit to eks-atlan-node-kafka-pod-spot-20220920151848482300000005-36c1adbd-7aef-51ce-830e-d848e9f27e09
Our pods are in the Pending state due to the volume node affinity conflict.
Events for the kafka pods:
LAST SEEN TYPE REASON OBJECT MESSAGE
6m52s Warning FailedScheduling pod/kafka-0 0/5 nodes are available: 5 node(s) had volume node affinity conflict.
6m52s Warning FailedScheduling pod/kafka-1 0/5 nodes are available: 5 node(s) had volume node affinity conflict.
73s Normal NotTriggerScaleUp pod/kafka-0 pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
73s Normal NotTriggerScaleUp pod/kafka-1 pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
Hi @decipher27,
could you show us the labels on your AWS ASG (aws autoscaling describe-auto-scaling-groups)?
My understanding of this issue is that you need the topology tags:
{
"ResourceId": "eks-spot-2-XXXX",
"ResourceType": "auto-scaling-group",
"Key": "k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone",
"Value": "us-east-1c",
"PropagateAtLaunch": false
},
I've also added
{
"ResourceId": "eks-spot-2-5xxxx",
"ResourceType": "auto-scaling-group",
"Key": "k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone",
"Value": "us-east-1c",
"PropagateAtLaunch": false
},
When your ASG is at 0, there is no node to retrieve the topology from. You must have the topology labels on the ASG itself to allow CA and the CSI driver to retrieve the topology.
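To double-check that the tags actually landed on the group, something along these lines should show them (the group name is the placeholder from above):

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names eks-spot-2-XXXX \
  --query 'AutoScalingGroups[].Tags[?contains(Key, `node-template`)]' \
  --output table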
We don't have the tags mentioned above, and it was working earlier. Though, we found the issue was with the scheduler; we are using a custom scheduler. Our vendor made some tweaks and it's fixed. Thank you @JBOClara
Also, from your comment, what do you mean by "When your ASG is at 0"? Do you mean if I set the desired count to '0'?
@decipher27 Exactly: when an ASG's desired value is set to 0 (for instance, after a downscale of all replicas with kube-downscaler, except those from CA itself), CA will not be able to read the node labels, because there is no node.
Got the same issue: if a PVC and pod are created and the ASG is then suspended and scaled down to 0 to save cost over the weekend, on Monday that pod is not able to start from 0; other stateless pods are okay.
@debu99 Look at my earlier comment above: when your ASG is at 0 there is no node to retrieve the topology from, so you must have the topology labels (node-template tags) on the ASG itself.
My PV requires:
Node Affinity:
Required Terms:
Term 0: topology.ebs.csi.aws.com/zone in [ap-southeast-1a]
But I believe this label is added automatically to all nodes? I didn't add it to the ASG tags, but all my nodes have it:
ip-10-40-44-63.ap-southeast-1.compute.internal Ready <none> 5h3m v1.21.14-eks-ba74326 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3a.large,beta.kubernetes.io/os=linux,dedicated=redis,failure-domain.beta.kubernetes.io/region=ap-southeast-1,failure-domain.beta.kubernetes.io/zone=ap-southeast-1b,k8s-node-lifecycle=on-demand,k8s-node-role/on-demand-worker=true,k8s-node-role/type=none,k8s-node/instance-level=large,k8s-node/worker-type=t-type,k8s.io/cloud-provider-aws=be298adc77b66eafc3745cf0a9c131e0,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-40-44-63.ap-southeast-1.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3a.large,sb-subnet/type=primary,sb-subnet/zone-id=1,topology.ebs.csi.aws.com/zone=ap-southeast-1b,topology.kubernetes.io/region=ap-southeast-1,topology.kubernetes.io/zone=ap-southeast-1b
ip-10-40-7-219.ap-southeast-1.compute.internal Ready <none> 25m v1.21.14-eks-ba74326 beta.kubernetes.io/arch=arm64,beta.kubernetes.io/instance-type=r6g.large,beta.kubernetes.io/os=linux,dedicated=prometheus-operator,failure-domain.beta.kubernetes.io/region=ap-southeast-1,failure-domain.beta.kubernetes.io/zone=ap-southeast-1a,k8s-node-lifecycle=on-demand,k8s-node-role/on-demand-worker=true,k8s-node-role/type=none,k8s-node/instance-level=large,k8s-node/worker-type=r-type,k8s.io/cloud-provider-aws=be298adc77b66eafc3745cf0a9c131e0,kubernetes.io/arch=arm64,kubernetes.io/hostname=ip-10-40-7-219.ap-southeast-1.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=r6g.large,sb-subnet/type=primary,sb-subnet/zone-id=0,topology.ebs.csi.aws.com/zone=ap-southeast-1a,topology.kubernetes.io/region=ap-southeast-1,topology.kubernetes.io/zone=ap-southeast-1a
Yes, but when the ASG is at 0, there are no nodes. cluster-autoscaler needs the labels tagged on the ASG to know what labels a node would have if it were to scale up the ASG from 0.
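A quick way to see which values the ASG tags need to mirror is to read the labels off a running node while the group is scaled up, for example (a sketch, assuming the EBS CSI node plugin is installed):

# Zone labels set by kubelet and the EBS CSI node plugin on live nodes;
# these are the values to copy into the ASG's node-template tags.
kubectl get nodes -L topology.kubernetes.io/zone -L topology.ebs.csi.aws.com/zone

# Tags cluster-autoscaler falls back to when templating a node for a scale-up from 0:
aws autoscaling describe-auto-scaling-groups --query 'AutoScalingGroups[].[AutoScalingGroupName,Tags]'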
We are facing the same issue with the volume node affinity error, and our ASG has nodes spun up across AZs. What is the best way for CA to spin up nodes in the right AZ? We use the priority expander. Also, CA throws the error:
I0103 17:43:29.663090 1 scale_up.go:449] No pod can fit to eks-atlan-node-spot-c2c299ee-8af5-1b60-2ce3-2e4dc50b5484
I0103 17:43:29.663106 1 scale_up.go:453] No expansion options
The above error occurs even when there is enough room for CA to spin up new nodes in the node group, and there is another node group where CA could launch, but CA is not functioning as expected. CA version: 1.21.
@KiranReddy230 if you read the comments above yours, the question has been answered three times already. You need to add the tags mentioned above to your ASG. In order for this to work properly, each node group (and thus each ASG) should have only one zone (this is the recommended architecture anyway).
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I have this issue despite (I believe) having everything set up correctly.
EKS - 1.25
CA - 1.25.2:
- command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=kube-system
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/XXX
- --balance-similar-node-groups=true
- --emit-per-nodegroup-metrics=true
- --expander=most-pods,least-waste
- --ignore-taint=node.cilium.io/agent-not-ready
- --logtostderr=true
- --namespace=kube-system
- --regional=true
- --scan-interval=1m
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --stderrthreshold=error
- --v=0
env:
- name: AWS_REGION
value: eu-west-1
My 3 ASGs are tagged as follows (each of them covers a single AZ, a/b/c):
k8s.io/cluster-autoscaler/node-template/label/failure-domain.beta.kubernetes.io/zone = eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type = m5.2xlarge
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone = eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/region = eu-west-1
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone = eu-west-1a / eu-west-1b / eu-west-1c
k8s.io/cluster-autoscaler/node-template/taint/node.cilium.io/agent-not-ready = true:NO_EXECUTE (tag new instances: Yes)
I'm running Prometheus as STS with PVC (affinity rules set to ensure replicas are spread across AZ and hosts):
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
polaris.fairwinds.com/automountServiceAccountToken-exempt: "true"
prometheus-operator-input-hash: "4772490143308579296"
creationTimestamp: "2023-03-03T20:52:48Z"
generation: 56
labels:
app: kube-prometheus-stack-prometheus
app.kubernetes.io/instance: prometheus
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/part-of: kube-prometheus-stack
app.kubernetes.io/version: 47.0.0
argocd.argoproj.io/instance: xxx-prometheus
chart: kube-prometheus-stack-47.0.0
heritage: Helm
operator.prometheus.io/mode: server
operator.prometheus.io/name: prometheus-prometheus
operator.prometheus.io/shard: "0"
release: prometheus
name: prometheus-prometheus-prometheus
namespace: prometheus
ownerReferences:
- apiVersion: monitoring.coreos.com/v1
blockOwnerDeletion: true
controller: true
kind: Prometheus
name: prometheus-prometheus
uid: ce818fdf-02b4-4718-a430-f4ff4c5acbc5
resourceVersion: "342440131"
uid: 662e082a-af26-40e4-b39e-d354a023fe0a
spec:
podManagementPolicy: Parallel
replicas: 2
revisionHistoryLimit: 10
selector:
matchLabels:
app.kubernetes.io/instance: prometheus-prometheus
app.kubernetes.io/managed-by: prometheus-operator
app.kubernetes.io/name: prometheus
operator.prometheus.io/name: prometheus-prometheus
operator.prometheus.io/shard: "0"
prometheus: prometheus-prometheus
serviceName: prometheus-operated
template:
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
kubectl.kubernetes.io/default-container: prometheus
linkerd.io/inject: enabled
creationTimestamp: null
labels:
app.kubernetes.io/instance: prometheus-prometheus
app.kubernetes.io/managed-by: prometheus-operator
app.kubernetes.io/name: prometheus
app.kubernetes.io/version: 2.44.0
operator.prometheus.io/name: prometheus-prometheus
operator.prometheus.io/shard: "0"
prometheus: prometheus-prometheus
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: prometheus-prometheus
app.kubernetes.io/name: prometheus
prometheus: prometheus-prometheus
topologyKey: topology.kubernetes.io/zone
- labelSelector:
matchLabels:
app.kubernetes.io/instance: prometheus-prometheus
app.kubernetes.io/name: prometheus
prometheus: prometheus-prometheus
topologyKey: kubernetes.io/hostname
automountServiceAccountToken: true
containers:
- args:
- --web.console.templates=/etc/prometheus/consoles
- --web.console.libraries=/etc/prometheus/console_libraries
- --config.file=/etc/prometheus/config_out/prometheus.env.yaml
- --web.enable-lifecycle
- --web.external-url=https://prometheus.xxx.xxx
- --web.route-prefix=/
- --log.level=error
- --log.format=json
- --storage.tsdb.retention.time=3h
- --storage.tsdb.path=/prometheus
- --storage.tsdb.wal-compression
- --web.config.file=/etc/prometheus/web_config/web-config.yaml
- --storage.tsdb.max-block-duration=2h
- --storage.tsdb.min-block-duration=2h
image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus/prometheus:v2.44.0
imagePullPolicy: Always
livenessProbe:
failureThreshold: 6
httpGet:
path: /-/healthy
port: http-web
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 3
name: prometheus
ports:
- containerPort: 9090
name: http-web
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /-/ready
port: http-web
scheme: HTTP
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 3
resources:
limits:
memory: 20Gi
requests:
cpu: 300m
memory: 20Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
startupProbe:
failureThreshold: 60
httpGet:
path: /-/ready
port: http-web
scheme: HTTP
periodSeconds: 15
successThreshold: 1
timeoutSeconds: 3
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /etc/prometheus/config_out
name: config-out
readOnly: true
- mountPath: /etc/prometheus/certs
name: tls-assets
readOnly: true
- mountPath: /prometheus
name: prometheus-prometheus-prometheus-db
subPath: prometheus-db
- mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
name: prometheus-prometheus-prometheus-rulefiles-0
- mountPath: /etc/prometheus/web_config/web-config.yaml
name: web-config
readOnly: true
subPath: web-config.yaml
- args:
- --listen-address=:8080
- --reload-url=http://127.0.0.1:9090/-/reload
- --config-file=/etc/prometheus/config/prometheus.yaml.gz
- --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
- --watched-dir=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
- --log-level=error
- --log-format=json
command:
- /bin/prometheus-config-reloader
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: SHARD
value: "0"
image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus-operator/prometheus-config-reloader:v0.66.0
imagePullPolicy: Always
name: config-reloader
ports:
- containerPort: 8080
name: reloader-web
protocol: TCP
resources:
limits:
cpu: 200m
memory: 50Mi
requests:
cpu: 50m
memory: 50Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /etc/prometheus/config
name: config
- mountPath: /etc/prometheus/config_out
name: config-out
- mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
name: prometheus-prometheus-prometheus-rulefiles-0
- args:
- sidecar
- --prometheus.url=http://127.0.0.1:9090/
- '--prometheus.http-client={"tls_config": {"insecure_skip_verify":true}}'
- --grpc-address=:10901
- --http-address=:10902
- --objstore.config=$(OBJSTORE_CONFIG)
- --tsdb.path=/prometheus
- --log.level=error
- --log.format=json
env:
- name: OBJSTORE_CONFIG
valueFrom:
secretKeyRef:
key: config
name: thanos-config
image: XXX.dkr.ecr.eu-west-1.amazonaws.com/bitnami/thanos:0.31.0
imagePullPolicy: Always
name: thanos-sidecar
ports:
- containerPort: 10902
name: http
protocol: TCP
- containerPort: 10901
name: grpc
protocol: TCP
resources:
limits:
memory: 256Mi
requests:
cpu: 10m
memory: 256Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /prometheus
name: prometheus-prometheus-prometheus-db
subPath: prometheus-db
dnsPolicy: ClusterFirst
initContainers:
- args:
- --watch-interval=0
- --listen-address=:8080
- --config-file=/etc/prometheus/config/prometheus.yaml.gz
- --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
- --watched-dir=/etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
- --log-level=error
- --log-format=json
command:
- /bin/prometheus-config-reloader
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: SHARD
value: "0"
image: XXX.dkr.ecr.eu-west-1.amazonaws.com/quay.io/prometheus-operator/prometheus-config-reloader:v0.66.0
imagePullPolicy: Always
name: init-config-reloader
ports:
- containerPort: 8080
name: reloader-web
protocol: TCP
resources:
limits:
cpu: 200m
memory: 50Mi
requests:
cpu: 50m
memory: 50Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /etc/prometheus/config
name: config
- mountPath: /etc/prometheus/config_out
name: config-out
- mountPath: /etc/prometheus/rules/prometheus-prometheus-prometheus-rulefiles-0
name: prometheus-prometheus-prometheus-rulefiles-0
nodeSelector:
node.kubernetes.io/instance-type: m5.2xlarge
priorityClassName: prometheus
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 2000
runAsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
seccompProfile:
type: RuntimeDefault
serviceAccount: prometheus-prometheus
serviceAccountName: prometheus-prometheus
terminationGracePeriodSeconds: 600
topologySpreadConstraints:
- labelSelector:
matchLabels:
app.kubernetes.io/instance: prometheus-prometheus
app.kubernetes.io/name: prometheus
prometheus: prometheus-prometheus
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
volumes:
- name: config
secret:
defaultMode: 420
secretName: prometheus-prometheus-prometheus
- name: tls-assets
projected:
defaultMode: 420
sources:
- secret:
name: prometheus-prometheus-prometheus-tls-assets-0
- emptyDir:
medium: Memory
name: config-out
- configMap:
defaultMode: 420
name: prometheus-prometheus-prometheus-rulefiles-0
name: prometheus-prometheus-prometheus-rulefiles-0
- name: web-config
secret:
defaultMode: 420
secretName: prometheus-prometheus-prometheus-web-config
updateStrategy:
type: RollingUpdate
volumeClaimTemplates:
- apiVersion: v1
kind: PersistentVolumeClaim
metadata:
creationTimestamp: null
name: prometheus-prometheus-prometheus-db
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: ebs-sc-preserve
volumeMode: Filesystem
status:
phase: Pending
~ k get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f 10Gi RWO Retain Bound prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0 ebs-sc-preserve 61d
pvc-f40a6589-6fcf-4419-9486-70e5efa43575 10Gi RWO Retain Bound prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-1 ebs-sc-preserve 9d
~ k describe pv pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f pvc-f40a6589-6fcf-4419-9486-70e5efa43575
Name: pvc-e6df1f14-4f62-41ce-8f21-97b73b0c055f
Labels: <none>
Annotations: pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
volume.kubernetes.io/provisioner-deletion-secret-name:
volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers: [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass: ebs-sc-preserve
Status: Bound
Claim: prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-0
Reclaim Policy: Retain
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 10Gi
Node Affinity:
Required Terms:
Term 0: topology.ebs.csi.aws.com/zone in [eu-west-1c]
Message:
Source:
Type: CSI (a Container Storage Interface (CSI) volume source)
Driver: ebs.csi.aws.com
FSType: ext4
VolumeHandle: vol-08b0f4a31f192dad7
ReadOnly: false
VolumeAttributes: storage.kubernetes.io/csiProvisionerIdentity=1683859406228-8081-ebs.csi.aws.com
Events: <none>
Name: pvc-f40a6589-6fcf-4419-9486-70e5efa43575
Labels: <none>
Annotations: pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
volume.kubernetes.io/provisioner-deletion-secret-name:
volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers: [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass: ebs-sc-preserve
Status: Bound
Claim: prometheus/prometheus-prometheus-prometheus-db-prometheus-prometheus-prometheus-1
Reclaim Policy: Retain
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 10Gi
Node Affinity:
Required Terms:
Term 0: topology.ebs.csi.aws.com/zone in [eu-west-1b]
Message:
Source:
Type: CSI (a Container Storage Interface (CSI) volume source)
Driver: ebs.csi.aws.com
FSType: ext4
VolumeHandle: vol-07d31d533b2e01a4b
ReadOnly: false
VolumeAttributes: storage.kubernetes.io/csiProvisionerIdentity=1687797020030-8081-ebs.csi.aws.com
Events: <none>
Every night between 00:00 and 06:00 (I believe this is when AWS rebalancing happens), at least one of the Prometheus replicas gets stuck in the Pending state. Once cluster-autoscaler is restarted (k -n kube-system rollout restart deploy cluster-autoscaler), the ASG is properly scaled up.
For now I had to set minCapacity = 1 for these ASGs to prevent such situations.
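For completeness, the same stop-gap can be set straight from the CLI; the group name here is a placeholder:

# Keep at least one node around so CA never has to template a node from an untagged ASG.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name prometheus-workers-1c \
  --min-size 1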
This is closely related to issue #4739, which was fixed in cluster autoscaler version 1.22 onward. If you look at the function that generates a hypothetical new node to satisfy the pending pod, the new label that is needed to satisfy volumes created by the EBS CSI driver is not part of that function. It will not scale up unless you add the tag to the ASG manually.
Current function: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L409
The next function is why adding the labels to the ASG makes this work.
Since the label is widely used now, maybe we should update the buildGenericLabels function to also set topology.ebs.csi.aws.com/zone on the new node when it is hypothetically being built.
I can take a stab at providing a PR with a fix.
Which component are you using?: cluster-autoscaler
What version of the component are you using?:
Component version: (Cluster-Autoscaler Deployment YAML)
What k8s version are you using (kubectl version)?: (kubectl version output)
What environment is this in?:
What did you expect to happen?: I have an ASG dedicated to a single CronJob that gets triggered 6 times a day. That ASG is pinned to a specific AWS AZ by its assigned subnet. The CronJob is pinned to that specific ASG by affinity + toleration. The job uses a PV that is provisioned (AWS EBS) on the first ever run and then reused on each subsequent run. I expect the ASG to be scaled up to 1 after the Pod gets created, and scaled back down shortly after the Pod/Job has finished.
What happened instead?:
The ASG will not be scaled up by the cluster-autoscaler.
cluster-autoscaler log output after the Job is created and the Pod is pending
Anything else we need to know?: Basically, this works fine without the volume. With the volume it works when the volume is not provisioned yet, but fails once it has already been provisioned. The job also gets scheduled right away when I manually upscale the ASG.
I noticed the volume affinity on the PVC:
That label is probably set on the node by the "ebs-csi-node" DaemonSet and is therefore unknown to the cluster-autoscaler.
Am I expected to tag the ASG with
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone?
If so, how am I supposed to set it for multi-AZ ASGs? Possibly related: https://github.com/kubernetes/autoscaler/issues/3230