It's not so much an expected behavior as it is a longstanding known issue. CA has problems with multi-zonal clusters and PVs. More generally, there are problems with any feature that relies on using informers inside scheduling predicates (e.g. anything with zones and storage, pod affinity/anti-affinity).
Technical details: CA works by predicting 'what would the scheduler do if I added a node of a given type'. To do that, CA imports the part of the scheduler called 'predicates'. Those are functions that tell whether a given pod would fit on a given node. For scale-up, CA uses an in-memory "template" node object to see if any of the unschedulable pods could be scheduled if a new node was added. A predicate function takes a pod and a node as parameters (technically a NodeInfo object, but that's irrelevant - it's a wrapper for a Node). Those parameters are controlled by CA, so we can inject a non-existing, in-memory object to simulate scheduler behavior.
Unfortunately, predicate functions also have access to a Kubernetes client (technically: a set of informers) that they can use to query the state of the cluster. In other words, they can access parts of cluster state that are not explicit parameters CA can fake. This is how the storage predicates work - they just ask the apiserver directly for the list of existing volumes. CA can't create in-memory volumes simulating the ones that would be created by the PV controller, so the predicate fails and CA concludes the scale-up wouldn't help the pod.
We know of this problem and we have had some initial talks with sig-storage members, but it's not an easy thing to fix. It will require a large effort on both sides and, frankly, I don't expect it to be fixed in 2019. Certainly not in the first half of 2019.
Sorry I don't have a better answer :(
Hi @MaciekPytel,
thanks for that! Always good to know the technical details :-)
If the scheduler is using the same predicates that CA is using, wouldn't that imply that the scheduler itself won't be able to "see" the PV backed by a storageclass with WaitForFirstConsumer enabled, and thus fail to schedule? They must have injected some magic somewhere to get this to work - can't the same magic be imported into CA as well?
Not trying to push anything/anybody here, just merely trying to understand the very complex k8s world :-)
Just looked into the scheduler and it seems the PVC should be created even with WaitForFirstConsumer before scheduling, but only after the pod is created. Is this accurate? If yes, perhaps allowed topologies can be set to ensure it is provisioned in one of the target zones: https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies
Yes, that is my understanding as well.
Just tried again with our modified storage class, and a PVC gets created but stays in a Pending state:
$ k get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
dario2-workspace Pending gp2-topology 1m
katib-mysql Bound pvc-8def8765-2b8a-11e9-89ed-020c9ed9946e 10Gi RWO gp2 3h
minio-pv-claim Bound pvc-8749a987-2b8a-11e9-89ed-020c9ed9946e 10Gi RWO gp2 3h
mysql-pv-claim Bound pvc-87bf3176-2b8a-11e9-89ed-020c9ed9946e 10Gi RWO gp2 3h
$ k describe pvc dario2-workspace
Name: dario2-workspace
Namespace: kubeflow
StorageClass: gp2-topology
Status: Pending
Volume:
Labels: app=jupyterhub
component=singleuser-storage
heritage=jupyterhub
Annotations: hub.jupyter.org/username: dario2
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 7s (x13 over 2m46s) persistentvolume-controller waiting for first consumer to be created before binding
Mounted By: jupyter-dario2
Thing is, the pod will never be created because it requires a scale-up that the autoscaler isn't going to execute since it's waiting for a PV. And the loop goes on. Checkmate :-)
Help 🆘!
So it seems volume controller will wait for scheduler to give it a hint (put node name in annotation) before provisioning anything: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/pv_controller.go#L289
And it kind of makes sense - in a fixed size cluster, there's no point provisioning a volume if there's no suitable node, and if there's a node, we want to provision it in that node's zone... but it all breaks when you want to create a node for the pod.
I just ran a quick experiment and it seems possible to use allowedTopologies together with volumeBindingMode: Immediate. If you haven't tried that yet by any chance, it may be worth checking.
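For reference, a minimal sketch of what such a class could look like on AWS (the class name and zone values here are illustrative, not taken from this thread):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2-pinned-zones            # illustrative name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: Immediate
allowedTopologies:
- matchLabelExpressions:
  - key: failure-domain.beta.kubernetes.io/zone   # topology.kubernetes.io/zone on newer clusters
    values:
    - eu-west-1b
    - eu-west-1c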
And it kind of makes sense - in a fixed size cluster, there's no point provisioning a volume if there's no suitable node, and if there's a node, we want to provision it in that node's zone... but it all breaks when you want to create a node for the pod.
Is this something CA plans to support at some point? It would solve the multi-zone ASG issue, wouldn't it?
I just ran a quick experiment and it seems possible to use allowedTopologies together with volumeBindingMode: Immediate. If you haven't tried that yet by any chance, it may be worth checking.
Uhm, you mean limiting the zones where the storageclass can create PVs? Yes, can do. We were trying to avoid having special storageclasses and only fly with a default "smart" one, so that if we don't need the special nodes we can take advantage of all the AZs.
Is this something CA plans to support at some point? It would solve the multi-zone ASG issue, wouldn't it?
Ideally yes, but as I wrote above it's not likely to happen in the next 6 months.
I just ran into this issue too; what's the current workaround? Is it to pre-provision volumes in each AZ so that when an instance is started, it will have a pool to select from regardless of its placement?
@aleksandra-malinowska I didn't quite understand your comment about allowedTopologies. Would that mean one would need to have multiple storage classes (one per AZ) and then specify that class in the statefulset? That would prevent distribution over multiple fault domains...
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
This is not fixed yet.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
I ran into this chicken & egg problem too.
Is this something CA plans to support at some point? It would solve the multi-zone ASG issue, wouldn't it?
Ideally yes, but as I wrote above it's not likely to happen in the next 6 months.
@MaciekPytel Does it require some design changes or is it too big a code change? Can someone new to this codebase do this in a week or so?
I've just gotten the same error. Only manual resizing helps in this case.
Is there a known workaround or solution to this? It seems like it isn't possible to scale to/from 0 using CA and AWS EBS Volumes.
I don't think there is a good workaround. I'm actively looking into how this can be fixed, but it will require at least some changes in volume scheduling in Kubernetes and a huge change in autoscaler (comparable to scheduler framework migration which consisted of >100 commits and took us months to complete). At this point I can't give any guarantees regarding timeline. It certainly won't be ready in time for 1.19.
Correct me if I am wrong, but using EBS with CA seems rather counterproductive. I use CA with EFS because the latter is multi-AZ, meaning workloads can come and go in any AZ and resume with their persistent storage. There are some workloads that have issues with EFS (and NFS in general), but I think it is better to move them onto alternative multi-AZ storage system such as GlusterFS rather than tie workloads to a specific AZ, which is a requirement when using EBS.
I don't think there is a good workaround. I'm actively looking into how this can be fixed, but it will require at least some changes in volume scheduling in Kubernetes and a huge change in autoscaler (comparable to scheduler framework migration which consisted of >100 commits and took us months to complete). At this point I can't give any guarantees regarding timeline. It certainly won't be ready in time for 1.19.
Thanks @MaciekPytel, I can appreciate it's a very complex problem at the moment. I was mostly trying to understand whether I was missing some obvious workaround and to get clarity on whether I can ever expect a solution. Sounds like a possible solution is in the "thinking about it" phase, but won't be ready (if ever) for a long time, possibly 1.20+. Thanks for the update!
Correct me if I am wrong, but using EBS with CA seems rather counterproductive. I use CA with EFS because the latter is multi-AZ, meaning workloads can come and go in any AZ and resume with their persistent storage. There are some workloads that have issues with EFS (and NFS in general), but I think it is better to move them onto alternative multi-AZ storage system such as GlusterFS rather than tie workloads to a specific AZ, which is a requirement when using EBS.
You're not wrong @drewhemm. It's just that for some workloads, particularly ephemeral workloads, it is desirable to have cheap, dynamic storage that is released when the pod terminates. EFS is 3x more expensive than EBS and does not provide block storage. Admittedly, for the use-cases I have in mind, I don't think block storage is strictly necessary. What has your experience been with EFS on k8s? Are you using ClusterAutoscaler in the scale to/from 0 case successfully? Are you using dynamic provisioning with EFS per pod or mounting it to the node?
@groodt I am indeed using CA and scaling to and from 0. I am using the EFS Dynamic Provisioner, so one EFS for the whole cluster:
https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs
EFS may be more expensive per GB than EBS, but you only pay for the capacity you use, unlike EBS where a majority of what is paid for is often not utilised (no-one wants to run out of space on an EBS). There is also the option for lifecycle management that moves old files to a cheaper storage class:
https://docs.aws.amazon.com/efs/latest/ug/lifecycle-management-efs.html
I am using Provisioned Throughput, and that increases the cost somewhat, but overall I am pretty sure we are spending less than if we had lots of EBS volumes:
https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
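For anyone setting this up, the StorageClass for that provisioner is roughly this (a sketch based on the external-storage README; the provisioner name must match whatever the efs-provisioner deployment was configured with):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-efs                      # illustrative name
provisioner: example.com/aws-efs     # must match the provisioner name configured for the efs-provisioner
reclaimPolicy: Delete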
@drewhemm I'm glad it is working for your use-cases, but I think there is still a caveat with the EFS approach that means it won't be ideal for many ephemeral use-cases.
Please correct me if I'm wrong. :)
The nodes might scale up and down to 0, but the storage does not. What I mean by this, is that an EFS file-system needs to be created ahead of time. Creating the filesystem itself ahead of time isn't so much a problem, but for ephemeral use-cases, the storage used by individual PVCs on this file-system will not be released when the pods terminate?
An example of a use-case where ephemeral disk is useful is for things like Machine Learning workflows or Spark jobs, where processing of large data happens on pods, but when done, the final output is stored somewhere permanent such as S3 or a database etc. These workflows often need large amounts of storage, but only temporarily.
Another example would be elastic Jupyter Notebook services, where some exploratory analysis can be performed with ephemeral storage that is released once the user is done. The Notebook example could probably get away with host storage, but they do often work with pretty large data as well and having a one-size-fits-all EBS volume attached to the nodes themselves isn't always suitable.
The "dream" for me (maybe others too) would be if it was possible to treat storage with Kubernetes entirely elastically in the same as compute. In such a way, a PVC of any size could be requested and mounted onto a Cluster Autoscaled compute node.
@groodt You're right that unless the PVC is deleted, then the space in EFS is not released; however, I believe this is also the case for dynamically provisioned EBS volumes?
For truly ephemeral workloads, where what you want is effectively a scratch disk, there is a third option: instances with local SSDs, e.g. m5ad, c5d, r5dn etc. These offer higher performance than EBS or EFS and, combined with CA, once the workload has finished and the node is terminated, the data is automatically blown away.
I experimented with this and found it worked well, although as I had a need for persistent storage rather than ephemeral, I moved to EFS instead. I still have the relevant code snippet in our Jinja2 template for the userdata:
{% if item.local_pvs is defined %}
# Handle local PVs
yum install -y parted
{% for pv in item.local_pvs %}
# Wait for the device to be visible
while [ ! -b {{ pv.device }} ]; do sleep 2; done
parted -s {{ pv.device }} mklabel msdos
{% for partition in pv.partitions %}
parted -s -a optimal {{ pv.device }} mkpart primary {{ partition.start }} {{ partition.end }}
# Wait for partitioning to complete
while [ ! -b {{ pv.device }}p{{ loop.index }} ]; do sleep 1; done
mkfs -t xfs {{ pv.device }}p{{ loop.index }}
mkdir -p /mnt/pv{{ loop.index }}
echo "UUID=$(blkid -s UUID -o value {{ pv.device }}p{{ loop.index }}) \
/mnt/pv{{ loop.index }} xfs defaults,noatime 1 1" >> /etc/fstab
{% endfor %}
{% endfor %}
# Mount all volumes defined in /etc/fstab
mount -a
MOUNT_RC=$?
# Exit if there were any failures mounting the volumes
if [[ $MOUNT_RC != 0 ]]; then exit $MOUNT_RC; fi
{% endif %}
Corresponding PV manifest template:
{% for i in range(1, item.max_instances + 1) %}
{% for pv in item.local_pvs %}
{% for partition in pv.partitions %}
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: {{ item.instance_type }}-{{ i }}-pv{{ loop.index }}
  labels:
    capacity: {{ partition.size }}
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: {{ pv.storage_class_name }}
  # -{{ partition.size | lower() }}
  capacity:
    storage: {{ partition.size }}
  local:
    path: /mnt/pv{{ loop.index }}
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: beta.kubernetes.io/instance-type
          operator: In
          values:
          - {{ item.instance_type }}
{% endfor %}
{% endfor %}
{% endfor %}
I imagine that until or unless AWS provides multi-AZ EBS volumes, it is quite tricky for CA to handle them in a predictable way.
I believe this is also the case for dynamically provisioned EBS volumes?
Not with reclaimPolicy: "Delete", which is what we use at the moment. It works really well, but not for the scale-to-zero case that this original GH issue pertains to.
there is a third option: instances with local SSDs, e.g. m5ad, c5d, r5dn etc
Yes, using this with emptyDir would work too, but I'm talking about even larger data than this in some cases.
it is quite tricky for CA to handle them in a predictable way.
Yes, absolutely. I think this is what @MaciekPytel is working on. I don't think it will come soon, but I do hope it comes some day! :)
I believe this is also the case for dynamically provisioned EBS volumes?
Not with reclaimPolicy: "Delete", which is what we use at the moment. It works really well, but not for the scale-to-zero case that this original GH issue pertains to.
EFS also supports reclaimPolicy: "Delete", so in that sense they are the same. See the example just below here:
https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs#parameters
EFS also supports reclaimPolicy: "Delete", so in that sense they are the same.
So are you saying if I have a single EFS filesystem that is referenced by a single StorageClass (reclaimPolicy: "Delete"), I would be able to create numerous different PVC (different names) that reference this StorageClass and when the various PVCs are deleted, their data will be cleaned from the single EFS filesystem?
That's correct. I mounted the root of one of our EFS to verify and there are only directories for currently-existing PVCs. You can test this by creating a PVC and then deleting it and verifying that it is no longer present in EFS:
https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html#mounting-fs-install-amazon-efs-utils
sudo yum install amazon-efs-utils
sudo mkdir /mnt/efs
sudo mount -t efs fs-ID#####:/ /mnt/efs
ls /mnt/efs
sudo umount /mnt/efs
$ ls /mnt/efs
jenkins-jenkins-master-0-pvc-71b1fcb1-12ca-11ea-afcc-0295425a16da storage-elasticsearch-data-0-pvc-45378da8-233c-11ea-afcc-0295425a16da
persistent-remote-desktop-0-pvc-f70fba2c-588c-11ea-86f0-024095786998 storage-elasticsearch-data-1-pvc-67646794-233c-11ea-afcc-0295425a16da
redis-data-redis-master-0-pvc-547d2c9f-2658-11ea-97ac-06174f247eaa storage-elasticsearch-data-2-pvc-860c0b29-233c-11ea-afcc-0295425a16da
redis-data-redis-slave-0-pvc-845e0404-265b-11ea-97ac-06174f247eaa storage-elasticsearch-master-0-pvc-48755588-233c-11ea-afcc-0295425a16da
redis-data-redis-slave-1-pvc-a44093d0-265b-11ea-97ac-06174f247eaa storage-elasticsearch-master-1-pvc-64c6b15b-233c-11ea-afcc-0295425a16da
redis-data-redis-slave-2-pvc-b458966a-265b-11ea-97ac-06174f247eaa storage-elasticsearch-master-2-pvc-6f3f7986-233c-11ea-afcc-0295425a16da
storage-elasticsearch-coord-0-pvc-4789d0eb-233c-11ea-afcc-0295425a16da storage-prometheus-0-pvc-1f138dfb-1107-11ea-afcc-0295425a16da
storage-elasticsearch-coord-1-pvc-478c163a-233c-11ea-afcc-0295425a16da storage-prometheus-1-pvc-8acf9999-1c2c-11ea-afcc-0295425a16da
That's correct
That's really interesting. TIL. Thank you! I'll give it a try.
I imagine for some use-cases EFS won't be suitable, and hopefully it eventually does become possible to use EBS, but for now EFS may be a good substitute.
Yeah, EFS has become my default for AWS, and I fall back to EBS only for workloads that are not suitable, such as those that need to create large numbers of file locks:
https://docs.amazonaws.cn/en_us/efs/latest/ug/limits.html#limits-fs-specific
For everything else, there are sufficient tuning options in EFS to ensure the right level of performance at reasonable price points. The elimination of wasted storage costs is a big plus.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
(assuming I can do this...)
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
I'm not sure closing this is the right thing to do :-) @MaciekPytel ?
I am running into this issue as well, when using the https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner project to create PVs from AWS ephemeral instance NVMe disks.
The pods can't get scheduled because there are no nodes, and the CA won't scale up because there are no disks there for the pods to attach.
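For context, the StorageClass used with the local static provisioner is the standard no-provisioner one, roughly (the class name here is illustrative):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme                    # illustrative name
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer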
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@fejta-bot: Closing this issue.
/remove-lifecycle rotten
I presume this is still considered an issue, even if we're not all going to drop everything and fix it right away!
/reopen
@leosunmo: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
/remove-lifecycle rotten
@dvianello: Reopened this issue.
I am running in to this issue as well, when using the https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner project to create PVCs from AWS ephemeral instance NVMe disks.
The pods can't get scheduled because there's no nodes, and the CA won't scale up because there's no disks there for the pods to attach.
@leosunmo We ran into the same problem today. Have you found any workarounds? Have you switched to other storage? We are using the storage-local-static-provisioner to create PVs on NVMe disks to be consumed by Elastic pods controlled via the ECK operator.
I am running in to this issue as well, when using the https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner project to create PVCs from AWS ephemeral instance NVMe disks. The pods can't get scheduled because there's no nodes, and the CA won't scale up because there's no disks there for the pods to attach.
@leosunmo We ran today into the same problem. Have you found any workarounds? Have you switched to other storage? We are using the storage-local-static-provisioner to create pv on NVMe disks to be consumed by Elastic pods controlled via the ECK operator.
@recollir I think we ended up simply manually scaling up the node group (AWS Auto Scaling group: set desired nodes == Elasticsearch cluster size) and then applying the statefulset manifest/ECK. Since the nodes are tainted to only accept Elasticsearch workloads anyway, it works. As long as you don't wait too long, or the Cluster Autoscaler will scale the extra nodes down before you deploy on them.
This is not ideal, and I imagine I'd have to manually scale the ASG again if I wanted to expand the Elasticsearch cluster in the future, but it's kind of OK since I rarely scale it.
Sorry, I think there is some misunderstanding of how the VolumeBinding predicate works. For dynamically provisioning an EBS volume with WaitForFirstConsumer, it should work, and if it doesn't, then there is a bug somewhere in the predicate. How the predicate works is:
1. If the pod's PVCs are already bound, check that each bound PV's node affinity matches the node.
2. If the pod has unbound PVCs (e.g. WaitForFirstConsumer), check:
2a. Is there an existing, available PV that can satisfy the PVC and whose node affinity matches the node.
2b. If not, can the PVC be dynamically provisioned for this node, i.e. does the StorageClass topology allow the node's zone.
2c. Does the node have enough capacity. This is an alpha feature and disabled by default, i.e. all volume plugins today return true.
So it is expected that this predicate returns true for EBS volumes using WaitForFirstConsumer. Looking at the "no matching volume" error, I'm suspecting that this predicate is erroneously returning early at step 1 instead of continuing to step 2. I will investigate further.
I just tested this on a GKE 1.17 cluster with gce-pd (This is the GCP equivalent of EBS). I started with a 3 node cluster, each node in a different zone. Then I created a Statefulset with 4 replicas and pod anti-affinity on hostname. The autoscaler was able to properly scale up one more node to handle the 4th replica. Here's my StorageClass and StatefulSet specs if anyone wants to compare: https://gist.github.com/msau42/d6e4c12e13eca716371516a477ce94d3
I just tested what @msau42 did in AWS using 3 nodes, each in us-west-2a / us-west-2b / us-west-2c, and observed the cluster autoscaler provision a new node to schedule the 4th replica. Similarly, here's my output should you want to check it out. My server version is 1.18.12. This is using the same manifest, except I changed the StorageClass.Provisioner to use my EBS CSI driver.
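For anyone reproducing this, the shape of the test is roughly the following (a sketch, not the exact gist; names, image and sizes are illustrative):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-wffc                          # illustrative name
provisioner: ebs.csi.aws.com              # or kubernetes.io/aws-ebs for the in-tree plugin
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: wffc-test
spec:
  serviceName: wffc-test
  replicas: 4
  selector:
    matchLabels:
      app: wffc-test
  template:
    metadata:
      labels:
        app: wffc-test
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: wffc-test
            topologyKey: kubernetes.io/hostname   # force one replica per node
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ebs-wffc
      resources:
        requests:
          storage: 1Gi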
@msau42 does this statement mean that Cluster Autoscaler does not check that the maximum number of volumes are attached to a node (because the drivers never check this)?
2c. Does the node have enough capacity. This is an alpha feature and disabled by default, i.e. all volume plugins today return true.
We are observing that, when there are no topology or affinity constraints and the maximum number of volumes is already attached to the existing nodes, scale-up does not occur. Despite sufficient CPU/memory on the nodes, no more volume attachments are allowed: https://kubernetes.io/docs/concepts/storage/storage-limits/#dynamic-volume-limits
For maximum number of attachments to a node, my understanding is that the autoscaler will copy the Node object from a similar existing Node, so it will use the same limits.
For CSI drivers, the volume limit is stored in the CSINode object and we consider a missing CSINode object to have an infinite limit
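For illustration, that per-driver limit lives in the CSINode object, along these lines (the node name, driver and count below are examples, not taken from this thread):
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: ip-10-0-1-23.eu-west-1.compute.internal   # example node name
spec:
  drivers:
  - name: ebs.csi.aws.com
    nodeID: i-0123456789abcdef0                   # example instance ID
    allocatable:
      count: 25                                   # max volumes attachable through this driver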
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Hello all,
we're running a multi-AZ k8s cluster in AWS, (kops-)configured with the recommended 1 ASG per AZ setup to avoid imbalances and issues with PVs.
Unfortunately, some of the instance types are not available in all AZs, so while the rest of the nodes are spread across 3 AZs (eu-west-1{a,b,c}), the "special" nodes are only configured with ASGs in two of them (eu-west-1{b,c}). However, this brings about another issue: if the PV for a pod requiring the special nodes gets created in eu-west-1a, then cluster autoscaler will refuse to spin them up as they won't be able to be bound to the PV. Fair enough!
We had hoped that volumeBindingMode: WaitForFirstConsumer would help by delaying the PV creation until after scheduling of the pod happened, so we replaced the standard gp2 storageclass that kops creates with one that has WaitForFirstConsumer enabled. However, in this case cluster autoscaler seems to refuse to spin up any instance in any AZ as it's waiting for a PV to be created.
Is this expected behaviour, or am I missing a flag/config somewhere?
Thanks!
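For reference, the WaitForFirstConsumer class described above is roughly of this shape (a sketch; this is not the exact manifest used):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2                          # replacing the kops-created default
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer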