Cluster-autoscaler and WaitForFirstConsumer binding mode

dvianello commented 5 years ago

Hello all,

we're running a multi-AZ k8s cluster in AWS (kops-)configured with the recommended 1 ags/per AZ setup to avoid imbalances and issues with PVs.

Unfortunately, some of the instance types are not available in all AZs, so while the rest of the nodes are across 3 AZs (eu-west-1{a,b,c}), the "special" nodes are only configured with ASGs in two of them (eu-west-1{b,c}). However, this brings about another issue: if the pv for a pod requiring the special nodes gets created in eu-west-1a, then cluster autoscaler will refuse to spin them up as they won't be able to be bound to the PV. Fair enough!

We had hoped that volumeBindingMode: WaitForFirstConsumer would have helped delaying the PV creation to after scheduling of the pod happended, so we replaced the standard gp2 storageclass that kops creates with one with WaitForFirstConsumer enabled. However, in this case cluster autoscaler seems to refuse to spin up any instance in any AZs as it's waiting for a PV to be created:

I0208 12:41:51.882717       1 scheduler_binder.go:438] No matching volumes for Pod "kubeflow/jupyter-dario", PVC "kubeflow/dario-workspace" on node "template-node-for-cpu-thick-a.k8s-upgrade.k8s.local-3780623283797772889"
I0208 12:41:51.882748       1 utils.go:196] Pod jupyter-dario can't be scheduled on cpu-thick-a.k8s-upgrade.k8s.local, predicate failed: CheckVolumeBinding predicate mismatch, cannot put kubeflow/jupyter-dario on template-node-for-cpu-thick-a.k8s-upgrade.k8s.local-3780623283797772889, reason: node(s) didn't find available persistent volumes to bind
I0208 12:41:51.882778       1 scale_up.go:371] No pod can fit to cpu-thick-a.k8s-upgrade.k8s.local
I0208 12:41:51.882805       1 scheduler_binder.go:438] No matching volumes for Pod "kubeflow/jupyter-dario", PVC "kubeflow/dario-workspace" on node "template-node-for-cpu-thick-b.k8s-upgrade.k8s.local-6949585462030478254"
I0208 12:41:51.882818       1 utils.go:196] Pod jupyter-dario can't be scheduled on cpu-thick-b.k8s-upgrade.k8s.local, predicate failed: CheckVolumeBinding predicate mismatch, cannot put kubeflow/jupyter-dario on template-node-for-cpu-thick-b.k8s-upgrade.k8s.local-6949585462030478254, reason: node(s) didn't find available persistent volumes to bind
I0208 12:41:51.882826       1 scale_up.go:371] No pod can fit to cpu-thick-b.k8s-upgrade.k8s.local
I0208 12:41:51.882865       1 scheduler_binder.go:438] No matching volumes for Pod "kubeflow/jupyter-dario", PVC "kubeflow/dario-workspace" on node "template-node-for-cpu-thick-c.k8s-upgrade.k8s.local-4646905604613009770"
I0208 12:41:51.882881       1 utils.go:196] Pod jupyter-dario can't be scheduled on cpu-thick-c.k8s-upgrade.k8s.local, predicate failed: CheckVolumeBinding predicate mismatch, cannot put kubeflow/jupyter-dario on template-node-for-cpu-thick-c.k8s-upgrade.k8s.local-4646905604613009770, reason: node(s) didn't find available persistent volumes to bind
I0208 12:41:51.882889       1 scale_up.go:371] No pod can fit to cpu-thick-c.k8s-upgrade.k8s.local

Is this expected behaviour, or am I missing a flag/config somewhere?

Thanks!

MaciekPytel commented 5 years ago

It's not so much an expected behavior as it is a longstanding known issue. CA has problem with multi-zonal clusters and PVs. More generally there are problems with any feature that relies on using informers inside scheduling predicates (ex. anything with zones and storage, pod affinity/antiaffinity)

Technical details: CA works by predicting 'what would scheduler do if I added a node of given type'. To do it CA imports part of scheduler that is called 'predicates'. Those are functions that tell if a given pod will be able to fit on the node. To scale-up CA uses an in-memory "template" node object to see if any of unschedulable pods could be scheduled if a new node was added. Predicate function takes a pod and a node as parameters (technically a NodeInfo object, but that's irrelevant - it's a wrapper for a Node). Those parameters are controlled by CA and we can inject non-existing in-memory object to simulate scheduler behavior. Unfortunately predicate functions also have access to Kubernetes client (technically: set of informers) that they can use to query the state of the cluster. Basically they can access parts of cluster state that are not explicit parameters that CA can fake. This is how storage predicates work - they just ask apiserver directly for list of existing volumes. CA can't create in-memory volumes simulating the ones that would be created by PV controller. And so the predicate fail and CA figures the scale-up wouldn't help the pod.

We know of this problem and we had some initial talks with sig-storage members, but it's not an easy thing to fix. It will require some large effort on both sides and, frankly, I don't expect it to be fixed in 2019. Certainly not in first half of 2019.

Sorry I don't have a better answer :(

dvianello commented 5 years ago

Hi @MaciekPytel,

thanks for that! Always good to know the technical details :-)

If the scheduler is using the same predicates that CA is using, wouldn't that imply that the scheduler itself won't be able to "see" the PV backed by a storageclass with WaitForFirstConsumer enabled and thus fail to schedule? They must have injected some magic somewhere to get this to work - can't the same magic be imported in CA as well?

Not trying to push anything/anybody here, just merely trying to understand the very complex k8s world :-)

aleksandra-malinowska commented 5 years ago

Just looked into scheduler and it seems the PVC should be created even with WaitForFirstCustomer before scheduling, but only after the pod is created. Is this accurate? If yes, perhaps allowed topologies can be set to ensure it is provisioned in one of the target zones: https://kubernetes.io/docs/concepts/storage/storage-classes/#allowed-topologies

dvianello commented 5 years ago

Yes, that is my understanding as well.

Just tried again with our modified storage class, and a PVC gets creates but it stays in a Pending state:

$ k get pvc                                                                                                                                                 
NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
dario2-workspace   Pending                                                                        gp2-topology   1m
katib-mysql        Bound     pvc-8def8765-2b8a-11e9-89ed-020c9ed9946e   10Gi       RWO            gp2            3h
minio-pv-claim     Bound     pvc-8749a987-2b8a-11e9-89ed-020c9ed9946e   10Gi       RWO            gp2            3h
mysql-pv-claim     Bound     pvc-87bf3176-2b8a-11e9-89ed-020c9ed9946e   10Gi       RWO            gp2            3h

$k describe pvc dario2-workspace
Name:          dario2-workspace
Namespace:     kubeflow
StorageClass:  gp2-topology
Status:        Pending
Volume:
Labels:        app=jupyterhub
               component=singleuser-storage
               heritage=jupyterhub
Annotations:   hub.jupyter.org/username: dario2
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type       Reason                Age                  From                         Message
  ----       ------                ----                 ----                         -------
  Normal     WaitForFirstConsumer  7s (x13 over 2m46s)  persistentvolume-controller  waiting for first consumer to be created before binding
Mounted By:  jupyter-dario2

Thing is, the pod will never be created because it requires a scale-up that the autoscaler isn't going to execute since it's waiting for a PV. And the loop goes on. Checkmate :-)

Help 🆘!

aleksandra-malinowska commented 5 years ago

So it seems volume controller will wait for scheduler to give it a hint (put node name in annotation) before provisioning anything: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/persistentvolume/pv_controller.go#L289

And it kind of makes sense - in a fixed size cluster, there's no point provisioning a volume if there's no suitable node, and if there's a node, we want to provision it in that node's zone... but it all breaks when you want to create a node for the pod.

I just ran a quick experiment and it seems possible to use allowedTopologies together with volumeBindingMode: Immediate. If you haven't tried that yet by any chance, it may be worth checking.

dvianello commented 5 years ago

And it kind of makes sense - in a fixed size cluster, there's no point provisioning a volume if there's no suitable node, and if there's a node, we want to provision it in that node's zone... but it all breaks when you want to create a node for the pod.

Is this something CA plans to support at some point? It would solve the multi-zone ASG issue, doesn't it?

I just ran a quick experiment and it seems possible to use allowedTopologies together with volumeBindingMode: Immediate. If you haven't tried that yet by any chance, it may be worth checking.

Uhm, you mean limiting the zones where the storageclass can create PVs? Yes, can do. We were trying to avoid having special storageclasses and only fly with a default "smart" one, so that if we don't need the special nodes we can take advantage of all the AZs.

MaciekPytel commented 5 years ago

Is this something CA plans to support at some point? It would solve the multi-zone ASG issue, doesn't it?

Ideally yes, but as I wrote above it's not likely to happen in the next 6 months.

drewhemm commented 5 years ago

I just ran into this issue too; what's the current workaround? Is it to pre-provision volumes in each AZ so that when an instance is started, it will have a pool to select from regardless of its placement?

@aleksandra-malinowska I didn't quite understand your comment about allowedTopologies. Would that mean one would need to have multiple storage classes (one per AZ) and then specify that class in the statefulset? That would prevent distribution over multiple fault domains...

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

selslack commented 5 years ago

/remove-lifecycle stale

This is not fixed yet.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

jhohertz commented 5 years ago

/remove-lifecycle stale

bjethwan commented 5 years ago

I ran into this chicken & egg problem too.

Is this something CA plans to support at some point? It would solve the multi-zone ASG issue, doesn't it?

Ideally yes, but as I wrote above it's not likely to happen in the next 6 months.

@MaciekPytel Does it require some design changes or is it too big a code change? Can someone new to this codebase do this in a week or so?

okgolove commented 4 years ago

I've just gotten the same error. Only manually resizing can help in this case.

groodt commented 4 years ago

Is there a known workaround or solution to this? It seems like it isn't possible to scale to/from 0 using CA and AWS EBS Volumes.

MaciekPytel commented 4 years ago

I don't think there is a good workaround. I'm actively looking into how this can be fixed, but it will require at least some changes in volume scheduling in Kubernetes and a huge change in autoscaler (comparable to scheduler framework migration which consisted of >100 commits and took us months to complete). At this point I can't give any guarantees regarding timeline. It certainly won't be ready in time for 1.19.

drewhemm commented 4 years ago

Correct me if I am wrong, but using EBS with CA seems rather counterproductive. I use CA with EFS because the latter is multi-AZ, meaning workloads can come and go in any AZ and resume with their persistent storage. There are some workloads that have issues with EFS (and NFS in general), but I think it is better to move them onto alternative multi-AZ storage system such as GlusterFS rather than tie workloads to a specific AZ, which is a requirement when using EBS.

groodt commented 4 years ago

I don't think there is a good workaround. I'm actively looking into how this can be fixed, but it will require at least some changes in volume scheduling in Kubernetes and a huge change in autoscaler (comparable to scheduler framework migration which consisted of >100 commits and took us months to complete). At this point I can't give any guarantees regarding timeline. It certainly won't be ready in time for 1.19.

Thanks @MaciekPytel I can appreciate it's a very complex problem at the moment. I was mostly trying to understand if I was missing some obvious workaround and getting clarity around if I can ever expect a solution. Sounds like a "possible solution" is in the "thinking about it" phase, but won't be ready (if ever) for a long time, possible 1.20+. Thanks for the update!

Correct me if I am wrong, but using EBS with CA seems rather counterproductive. I use CA with EFS because the latter is multi-AZ, meaning workloads can come and go in any AZ and resume with their persistent storage. There are some workloads that have issues with EFS (and NFS in general), but I think it is better to move them onto alternative multi-AZ storage system such as GlusterFS rather than tie workloads to a specific AZ, which is a requirement when using EBS.

You're not wrong @drewhemm It's just that for some workloads, particularly ephemeral workloads, it is desirable to have cheap, dynamic storage that is released when the pod terminates. EFS is 3x more expensive than EBS and does provide block-storage. Admittedly, for the use-cases I have in mind, I don't think block-storage is strictly necessary. What has your experience been with EFS on k8s? Are you using ClusterAutoscaler in the scale to/from 0 case successfully? Are you using dynamic provisioning with EFS per pod or mounting it to the node?

drewhemm commented 4 years ago

@groodt I am indeed using CA and scaling to and from 0. I am using the EFS Dynamic Provisioner, so one EFS for the whole cluster:

https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs

EFS may be more expensive per GB than EBS, but you only pay for the capacity you use, unlike EBS where a majority of what is paid for is often not utilised (no-one wants to run out of space on an EBS). There is also the option for lifecycle management that moves old files to a cheaper storage class:

https://docs.aws.amazon.com/efs/latest/ug/lifecycle-management-efs.html

I am using Provisioned Throughput, and that increases the cost somewhat, but overall I am pretty sure we are spending less than if we had lots of EBS volumes:

https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes

groodt commented 4 years ago

@drewhemm I'm glad it is working for your use-cases, but I think there is still a caveat with the EFS approach that means it won't be ideal for many ephemeral use-cases.

Please correct me if I'm wrong. :)

The nodes might scale up and down to 0, but the storage does not. What I mean by this, is that an EFS file-system needs to be created ahead of time. Creating the filesystem itself ahead of time isn't so much a problem, but for ephemeral use-cases, the storage used by individual PVCs on this file-system will not be released when the pods terminate?

An example of a use-case where ephemeral disk is useful is for things like Machine Learning workflows or Spark jobs, where processing of large data happens on pods, but when done, the final output is stored somewhere permanent such as S3 or a database etc. These workflows often need large amounts of storage, but only temporarily.

Another example would be elastic Jupyter Notebook services, where some exploratory analysis can be performed with ephemeral storage that is released once the user is done. The Notebook example could probably get away with host storage, but they do often work with pretty large data as well and having a one-size-fits-all EBS volume attached to the nodes themselves isn't always suitable.

The "dream" for me (maybe others too) would be if it was possible to treat storage with Kubernetes entirely elastically in the same as compute. In such a way, a PVC of any size could be requested and mounted onto a Cluster Autoscaled compute node.

drewhemm commented 4 years ago

@groodt You're right that unless the PVC is deleted, then the space in EFS is not released; however, I believe this is also the case for dynamically provisioned EBS volumes?

For truly ephemeral workloads, where what you want is effectively a scratch disk, there is a third option: instances with local SSDs, i.e. m5ad, c5d, r5dn etc. These offer higher performance than EBS or EFS and combined with CA, once the workload has finished and the node is terminated, the data is automatically blown away.

I experimented with this and found it worked well, although as I had a need for persistent storage rather than ephemeral, I moved to EFS instead. I still have the relevant code snippet in our Jinja2 template for the userdata:

{% if item.local_pvs is defined %}
# Handle local PVs
yum install -y parted
{% for pv in item.local_pvs %}
# Wait for the device to be visible
while [ ! -b {{ pv.device }} ]; do sleep 2; done
parted -s {{ pv.device }} mklabel msdos
{% for partition in pv.partitions %}
parted -s -a optimal {{ pv.device }} mkpart primary {{ partition.start }} {{ partition.end }}
# Wait for partitioning to complete
while [ ! -b {{ pv.device }}p{{ loop.index }} ]; do sleep 1; done
mkfs -t xfs {{ pv.device }}p{{ loop.index }}
mkdir -p /mnt/pv{{ loop.index }}
echo "UUID=$(blkid -s UUID -o value {{ pv.device }}p{{ loop.index }}) \
 /mnt/pv{{ loop.index }} xfs defaults,noatime 1 1" >> /etc/fstab
{% endfor %}
{% endfor %}
# Mount all volumes defined in /etc/fstab
mount -a
MOUNT_RC=$?
# Exit if there were any failures mounting the volumes
if [[ $MOUNT_RC != 0 ]]; then exit $MOUNT_RC; fi
{% endif %}

Corresponding PV manifest template:

{% for i in range(1, item.max_instances + 1) %}
{% for pv in item.local_pvs %}
{% for partition in pv.partitions %}
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: {{ item.instance_type }}-{{ i }}-pv{{ loop.index }}
  labels:
    capacity: {{ partition.size }}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: {{ pv.storage_class_name }}
  # -{{ partition.size | lower() }}
  capacity:
      storage: {{ partition.size }}
  local:
    path: /mnt/pv{{ loop.index }}
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: beta.kubernetes.io/instance-type
          operator: In
          values:
          - {{ item.instance_type }}

{% endfor %}
{% endfor %}
{% endfor %}

I imagine that until or unless AWS provides multi-AZ EBS volumes, it is quite tricky for CA to handle them in a predictable way.

groodt commented 4 years ago

I believe this is also the case for dynamically provisioned EBS volumes?

Not with reclaimPolicy: "Delete", which is what we use at the moment. It works really well, but not from the scale-to-zero case that this original GH issue pertains to.

there is a third option: instances with local SSDs, i.e. m5ad, c5d, r5dn etc

Yes, using this with emptyDir would work too, but I'm talking even larger data than this in some cases.

it is quite tricky for CA to handle them in a predictable way.

Yes, absolutely. I think this is what @MaciekPytel is working on. I don't think it will come soon, but I do hope it comes some day! :)

drewhemm commented 4 years ago

I believe this is also the case for dynamically provisioned EBS volumes?

Not with reclaimPolicy: "Delete", which is what we use at the moment. It works really well, but not from the scale-to-zero case that this original GH issue pertains to.

EFS also supports reclaimPolicy: "Delete", so in that sense thay are the same. See the example just below here:

https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs#parameters

groodt commented 4 years ago

EFS also supports reclaimPolicy: "Delete", so in that sense thay are the same.

So are you saying if I have a single EFS filesystem that is referenced by a single StorageClass (reclaimPolicy: "Delete"), I would be able to create numerous different PVC (different names) that reference this StorageClass and when the various PVCs are deleted, their data will be cleaned from the single EFS filesystem?

drewhemm commented 4 years ago

That's correct. I mounted the root of one of our EFS to verify and there are only directories for currently-existing PVCs. You can test this by creating a PVC and then deleting it and verifying that it is no longer present in EFS:

https://docs.aws.amazon.com/efs/latest/ug/mounting-fs.html#mounting-fs-install-amazon-efs-utils

SSH onto an EC2 instance
Install the efs package sudo yum install amazon-efs-utils
Create a directory and mount the root of the EFS volume
- sudo mkdir /mnt/efs
- sudo mount -t efs fs-ID#####:/ /mnt/efs
- ls /mnt/efs
- sudo umount /mnt/efs

$ ls /mnt/efs
jenkins-jenkins-master-0-pvc-71b1fcb1-12ca-11ea-afcc-0295425a16da       storage-elasticsearch-data-0-pvc-45378da8-233c-11ea-afcc-0295425a16da
persistent-remote-desktop-0-pvc-f70fba2c-588c-11ea-86f0-024095786998    storage-elasticsearch-data-1-pvc-67646794-233c-11ea-afcc-0295425a16da
redis-data-redis-master-0-pvc-547d2c9f-2658-11ea-97ac-06174f247eaa      storage-elasticsearch-data-2-pvc-860c0b29-233c-11ea-afcc-0295425a16da
redis-data-redis-slave-0-pvc-845e0404-265b-11ea-97ac-06174f247eaa       storage-elasticsearch-master-0-pvc-48755588-233c-11ea-afcc-0295425a16da
redis-data-redis-slave-1-pvc-a44093d0-265b-11ea-97ac-06174f247eaa       storage-elasticsearch-master-1-pvc-64c6b15b-233c-11ea-afcc-0295425a16da
redis-data-redis-slave-2-pvc-b458966a-265b-11ea-97ac-06174f247eaa       storage-elasticsearch-master-2-pvc-6f3f7986-233c-11ea-afcc-0295425a16da
storage-elasticsearch-coord-0-pvc-4789d0eb-233c-11ea-afcc-0295425a16da  storage-prometheus-0-pvc-1f138dfb-1107-11ea-afcc-0295425a16da
storage-elasticsearch-coord-1-pvc-478c163a-233c-11ea-afcc-0295425a16da  storage-prometheus-1-pvc-8acf9999-1c2c-11ea-afcc-0295425a16da

groodt commented 4 years ago

That's correct

That's really interesting. TIL. Thank you! I'll give it a try.

I imagine for some EFS won't be suitable and hopefully it eventually does become possible to use EBS, but for now, EFS may be a good substitute.

drewhemm commented 4 years ago

Yeah, EFS has become my default for AWS, and I fall back to EBS only for workloads that are not suitable, such as those that need to create large numbers of file locks:

https://docs.amazonaws.cn/en_us/efs/latest/ug/limits.html#limits-fs-specific

For everything else, there are sufficient tuning options in EFS to ensure the right level of performance at reasonable price points. The elimination of wasted storage costs is a big plus.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

dvianello commented 4 years ago

/remove-lifecycle stale

(assuming I can do this...)

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1658#issuecomment-712152981): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta). >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dvianello commented 4 years ago

I'm not sure closing this is the right thing to do :-) @MaciekPytel ?

leosunmo commented 3 years ago

I am running in to this issue as well, when using the https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner project to create PVCs from AWS ephemeral instance NVMe disks.

The pods can't get scheduled because there's no nodes, and the CA won't scale up because there's no disks there for the pods to attach.

fejta-bot commented 3 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 3 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1658#issuecomment-743780401): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta). >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

leosunmo commented 3 years ago

/remove-lifecycle rotten

I presume this is still considered an issue, even if we're not all going to drop everything and fix it right away!

leosunmo commented 3 years ago

/reopen

k8s-ci-robot commented 3 years ago

@leosunmo: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1658#issuecomment-744072386): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

dvianello commented 3 years ago

/reopen

/remove-lifecycle rotten

k8s-ci-robot commented 3 years ago

@dvianello: Reopened this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/1658#issuecomment-744294418): >/reopen > >/remove-lifecycle rotten Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

recollir commented 3 years ago

I am running in to this issue as well, when using the https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner project to create PVCs from AWS ephemeral instance NVMe disks.

The pods can't get scheduled because there's no nodes, and the CA won't scale up because there's no disks there for the pods to attach.

@leosunmo We ran today into the same problem. Have you found any workarounds? Have you switched to other storage? We are using the storage-local-static-provisioner to create pv on NVMe disks to be consumed by Elastic pods controlled via the ECK operator.

leosunmo commented 3 years ago

I am running in to this issue as well, when using the https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner project to create PVCs from AWS ephemeral instance NVMe disks. The pods can't get scheduled because there's no nodes, and the CA won't scale up because there's no disks there for the pods to attach.

@leosunmo We ran today into the same problem. Have you found any workarounds? Have you switched to other storage? We are using the storage-local-static-provisioner to create pv on NVMe disks to be consumed by Elastic pods controlled via the ECK operator.

@recollir I think we ended up simply manually scaling up the node group (AWS Autoscaling group. Set desired nodes == Elasticsearch cluster size) and then applying the statefulset manifest/ECK. Since they're tainted nodes to only accept Elasticsearch workloads anyway, it works. As long as you don't wait too long, or the Cluster Autoscaler will scale the extra nodes down before you deploy on them.

This is not ideal, and I imagine I'd have to manually scale the ASG again if I wanted to expand the Elasticsearch cluster in the future, but it's kind of OK since I rarely scale it.

msau42 commented 3 years ago

Sorry I think there is some misunderstanding of how the VolumeBinding predicate works. For dynamic provisioning an EBS volume with WaitForFirstConsumer, it should work, and if it doesn't, then there is a bug somewhere in the predicate. How the predicate works is:

It first checks if there are any available PVs that satisfy the PVC: https://github.com/kubernetes/kubernetes/blob/473af0b8d1951b5c362df60925bb8d9fb007fab9/pkg/controller/volume/scheduling/scheduler_binder.go#L327
If not, then it checks if we can dynamically provision a PVC: https://github.com/kubernetes/kubernetes/blob/473af0b8d1951b5c362df60925bb8d9fb007fab9/pkg/controller/volume/scheduling/scheduler_binder.go#L337 2a. This checks for a couple of things. First, is the StorageClass.provisioner set to "kubernetes.io/no-provisioner". This doesn't apply to EBS: https://github.com/kubernetes/kubernetes/blob/473af0b8d1951b5c362df60925bb8d9fb007fab9/pkg/controller/volume/scheduling/scheduler_binder.go#L868 2b. Does StorageClass.AllowedTopologies match the node. If nothing is set, it matches anything. If a specific zone is set, then it needs to match the node's zone: https://github.com/kubernetes/kubernetes/blob/473af0b8d1951b5c362df60925bb8d9fb007fab9/pkg/controller/volume/scheduling/scheduler_binder.go#L874 2c. Does the node has enough capacity. This is an alpha feature and disabled by default. ie, all volume plugins today return true: https://github.com/kubernetes/kubernetes/blob/473af0b8d1951b5c362df60925bb8d9fb007fab9/pkg/controller/volume/scheduling/scheduler_binder.go#L880

So it is expected that this predicate returns true for EBS volumes using WaitForFirstConsumer. Looking at the "no matching volume" error, I'm suspecting that this predicate is erroneously returning early at step 1 instead of continuing to step 2. I will investigate further.

msau42 commented 3 years ago

I just tested this on a GKE 1.17 cluster with gce-pd (This is the GCP equivalent of EBS). I started with a 3 node cluster, each node in a different zone. Then I created a Statefulset with 4 replicas and pod anti-affinity on hostname. The autoscaler was able to properly scale up one more node to handle the 4th replica. Here's my StorageClass and StatefulSet specs if anyone wants to compare: https://gist.github.com/msau42/d6e4c12e13eca716371516a477ce94d3

chrishenzie commented 3 years ago

I just tested what @msau42 did in AWS using 3 nodes, each in us-west-2a / us-west-2b / us-west-2c, and observed the cluster autoscaler provision a new node to schedule the 4th replica. Similarly, here's my output should you want to check it out. My server version is 1.18.12. This is using the same manifest, except I changed the StorageClass.Provisioner to use my EBS CSI driver.

hobti01 commented 3 years ago

@msau42 does this statement mean that Cluster Autoscaler does not check that the maximum number of volumes are attached to a node (because the drivers never check this)?

2c. Does the node has enough capacity. This is an alpha feature and disabled by default. ie, all volume plugins today return true

We are observing that when there are no topology or affinity constraints and the maximum number of volumes are attached to nodes that scale-up does not occur. Despite sufficient CPU/Memory on the nodes, no more volumes are allowed attachment https://kubernetes.io/docs/concepts/storage/storage-limits/#dynamic-volume-limits

msau42 commented 3 years ago

For maximum number of attachments to a node, my understanding is that the autoscaler will copy the Node object from a similar existing Node, so it will use the same limits.

For CSI drivers, the volume limit is stored in the CSINode object and we consider a missing CSINode object to have an infinite limit

k8s-triage-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

k8s-triage-robot commented 3 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kubernetes / autoscaler

Cluster-autoscaler and WaitForFirstConsumer binding mode #1658