Good suggestion. I will file this feature request.
@mkbhanda, please suggest the correct assignees.
Thanks
Scheduling failures are well summarized in https://www.howtouselinux.com/post/kubernetes-pod-pending. I was thinking of something quick, particularly for the case where a user brings, say, a Gaudi-specific image and wants to launch it on a cluster without Gaudi nodes, generalizing this to any specialized hardware. Something like an admission control webhook, as sketched below. That said, how quickly does Docker or Kubernetes fail when resources are inappropriate or inadequate? If it fails fast, the feature may not be worth building. Let me get Iris to comment to help de/prioritize this enhancement request.
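A minimal sketch of what such a validating admission webhook could look like, using only the Go standard library. Everything here is illustrative, not project code: `knownResources` is a hypothetical hardcoded stand-in for the extended resources the cluster's nodes actually advertise, and a real deployment would serve TLS and be registered via a `ValidatingWebhookConfiguration`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// Hypothetical stand-in for the set of extended resources the cluster's
// nodes actually advertise (a real webhook would discover these).
var knownResources = map[string]bool{
	"nvidia.com/gpu": true,
	// "habana.ai/gaudi" is absent: no Gaudi nodes in this example cluster.
}

// Minimal subset of the admission.k8s.io/v1 AdmissionReview shape.
type admissionReview struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Request    *struct {
		UID    string          `json:"uid"`
		Object json.RawMessage `json:"object"`
	} `json:"request,omitempty"`
	Response *admissionResponse `json:"response,omitempty"`
}

type admissionResponse struct {
	UID     string  `json:"uid"`
	Allowed bool    `json:"allowed"`
	Status  *status `json:"status,omitempty"`
}

type status struct {
	Message string `json:"message"`
}

func validate(w http.ResponseWriter, r *http.Request) {
	var review admissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	// Pull only the container resource requests out of the pod object.
	var pod struct {
		Spec struct {
			Containers []struct {
				Resources struct {
					Requests map[string]json.RawMessage `json:"requests"`
				} `json:"resources"`
			} `json:"containers"`
		} `json:"spec"`
	}
	_ = json.Unmarshal(review.Request.Object, &pod)

	resp := &admissionResponse{UID: review.Request.UID, Allowed: true}
	for _, c := range pod.Spec.Containers {
		for name := range c.Resources.Requests {
			// Extended resources are domain-prefixed (e.g. habana.ai/gaudi);
			// native resources like cpu and memory have no "/".
			if strings.Contains(name, "/") && !knownResources[name] {
				resp.Allowed = false
				resp.Status = &status{Message: fmt.Sprintf(
					"no node in this cluster provides extended resource %q", name)}
			}
		}
	}

	review.Response = resp
	review.Request = nil // only the response is needed in the reply
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", validate)
	// A real admission webhook must serve TLS; plain HTTP keeps the sketch short.
	_ = http.ListenAndServe(":8443", nil)
}
```

With this in place the user gets an immediate rejection at `kubectl apply` time instead of a silently Pending pod, which is the quick feedback described above.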
The broader problem of when to expand/contract a cluster involves deciding which type of node to add or drop based on existing workload needs, node utilization, and workload activity patterns: https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling. Does Cluster API monitor Gaudi usage metrics? My guess is not, or at least not adequately. Must CSPs support Cluster API? Intel Developer Cloud is considering supporting it.
Also, what should the policy be: fail to schedule the workload, or try to expand the cluster with the right type and number of nodes to meet its requirements?
Let me also ask Sasha about the scale-testing aspect. Growing/shrinking a cluster may be future work.
As the Kubernetes documentation already notes, if a workload requests an extended resource that is not present in the cluster, the pod stays in the Pending state with an explanation that the resource is not available. Deploying admission hooks is possible, but it is not a "standard" setup and therefore requires additional effort from cluster administrators; the benefit of an immediate error message over the Pending state is not worth that effort.

Publicly available cluster autoscalers support extended resources and accelerators rather poorly. They often have hardcoded logic tied to a specific CSP (to fetch metadata about available accelerators from CSP-specific tags) or to a specific accelerator (e.g. Karpenter was seen to have special code for NVIDIA and special code for Habana). In my opinion it would be better to have a per-vendor cluster autoscaler for accelerator devices, but that is probably not something the SIG-Scheduling/autoscaler community would agree on, so we might later propose Gaudi-specific patches to Karpenter or the k8s-sigs cluster autoscaler. Either way, the first step is discovering what the cluster offers, as sketched below.
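For either an admission hook or a vendor autoscaler, the prerequisite is knowing which extended resources the nodes actually advertise. A minimal discovery sketch with client-go (illustrative, not project code; assumes in-cluster credentials):

```go
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Collect every domain-prefixed (extended) resource any node advertises,
	// e.g. habana.ai/gaudi or nvidia.com/gpu; skip native cpu/memory.
	advertised := map[string]bool{}
	for _, n := range nodes.Items {
		for name := range n.Status.Allocatable {
			if strings.Contains(string(name), "/") {
				advertised[string(name)] = true
			}
		}
	}
	fmt.Println("extended resources advertised by nodes:", advertised)
}
```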
Based on all input, closing this issue.
For example, if the deployment infrastructure is a Kubernetes cluster and the user has requested GPUs or special-purpose accelerators that do not exist, promptly return a failure message. Occasionally there may be inadequate resources to meet a request, and either the cluster must grow or the request must fail to deploy.