Open maacarbo opened 1 year ago
+1
+1 I kind of expected this had already happened.
+1
My exact question, the top issue in the list!
This is likely a major use case in machine learning, where a) GPU nodes are expensive, so clusters typically scale up and down often, and b) images are large.
In this auto-scale-up case, Pods are already waiting to be scheduled, so they will probably not be able to take advantage of the kube-fledged cache refresh loading images onto the new node (which I assume at least works?). Perhaps kube-fledged could be configured to manage a taint on newly provisioned nodes that is removed once the images have been loaded from the cache. In cluster-autoscaler, taints can be prefixed with ignore-taint.cluster-autoscaler.kubernetes.io/ so they do not affect auto scaling group selection.
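To make the taint idea concrete, here is a minimal sketch of the logic, operating on plain dicts shaped like Node manifests. This is not an existing kube-fledged feature: the taint key suffix `images-pending` and both function names are hypothetical, and only the `ignore-taint.cluster-autoscaler.kubernetes.io/` prefix comes from the comment above.

```python
from copy import deepcopy

# Prefix taken from the comment above; the "images-pending" suffix is a
# hypothetical name chosen for this sketch.
CA_IGNORE_PREFIX = "ignore-taint.cluster-autoscaler.kubernetes.io/"
IMAGES_PENDING_KEY = CA_IGNORE_PREFIX + "images-pending"


def add_pending_taint(node: dict) -> dict:
    """Return a copy of a Node manifest carrying the images-pending taint."""
    node = deepcopy(node)
    taints = node.setdefault("spec", {}).setdefault("taints", [])
    # Add the taint only once so repeated reconciles stay idempotent.
    if not any(t.get("key") == IMAGES_PENDING_KEY for t in taints):
        taints.append(
            {"key": IMAGES_PENDING_KEY, "value": "true", "effect": "NoSchedule"}
        )
    return node


def remove_pending_taint_if_ready(node: dict, cached_images, required_images) -> dict:
    """Return a copy of the Node with the taint removed once all images are cached."""
    node = deepcopy(node)
    spec = node.setdefault("spec", {})
    taints = spec.get("taints", [])
    # Only lift the taint when every required image is present on the node.
    if set(required_images) <= set(cached_images):
        taints = [t for t in taints if t.get("key") != IMAGES_PENDING_KEY]
    spec["taints"] = taints
    return node
```

A real controller would apply these changes through the Kubernetes API rather than on local dicts. The taint would also probably need to be present from the moment the node registers (for example via the kubelet's `--register-with-taints` flag), otherwise waiting Pods could be scheduled before the controller reacts.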
In AWS EKS, we make heavy use of auto scaling clusters. It would be handy if the controller knew when a new node is spun up and immediately started caching the images.