skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] Support inferentia on EKS #3915

Open romilbhardwaj opened 1 month ago

romilbhardwaj commented 1 month ago

SkyPilot currently does not support --gpus inferentia on Kubernetes (EKS). Supporting it would require adding a label formatter that recognizes Inferentia nodes and choosing a container image that ships with the Inferentia dependencies.

For reference, the labels on an Inferentia node look like this:

Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=inf2.48xlarge
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=ON_DEMAND
                    eks.amazonaws.com/nodegroup=inf2-ng
                    eks.amazonaws.com/nodegroup-image=ami-xxx
                    eks.amazonaws.com/sourceLaunchTemplateId=lt-xxx
                    eks.amazonaws.com/sourceLaunchTemplateVersion=1
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2b
                    gpu.nvidia.com/class=Inferentia
                    k8s.io/cloud-provider-aws=xxxx
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-xxx-xxx-xxx-xxx.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=inf2.48xlarge
                    topology.k8s.aws/zone-id=use2-az2
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2b
Annotations:        alpha.kubernetes.io/provided-node-ip: xxx.xxx.xxx.xxx
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
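
As a rough illustration of what a label formatter for these nodes might do, here is a minimal Python sketch. The class name, method names, and the instance-type fallback are hypothetical and are not SkyPilot's actual interface. The label keys and values come from the listing above. Whether every EKS Inferentia node group carries the gpu.nvidia.com/class label is an assumption.

```python
from typing import Dict, Optional

# Label keys copied from the node listing above. Assumption: other EKS
# Inferentia node groups expose the same keys.
CLASS_LABEL_KEY = 'gpu.nvidia.com/class'
INSTANCE_TYPE_LABEL_KEY = 'node.kubernetes.io/instance-type'


class InferentiaLabelFormatter:
    """Hypothetical formatter; not SkyPilot's real label formatter API."""

    @classmethod
    def get_label_key(cls) -> str:
        # Key a pod's node selector would target.
        return CLASS_LABEL_KEY

    @classmethod
    def get_label_value(cls, accelerator: str) -> str:
        # 'inferentia' -> 'Inferentia', matching the value in the listing.
        return accelerator.capitalize()

    @classmethod
    def match_accelerator(cls, labels: Dict[str, str]) -> Optional[str]:
        """Return 'inferentia' if the node labels indicate an Inferentia node."""
        value = labels.get(CLASS_LABEL_KEY)
        if value is not None and value.lower() == 'inferentia':
            return 'inferentia'
        # Fallback (assumption): infer from the inf1/inf2 instance family.
        instance_type = labels.get(INSTANCE_TYPE_LABEL_KEY, '')
        if instance_type.startswith(('inf1.', 'inf2.')):
            return 'inferentia'
        return None


if __name__ == '__main__':
    node_labels = {
        'gpu.nvidia.com/class': 'Inferentia',
        'node.kubernetes.io/instance-type': 'inf2.48xlarge',
    }
    print(InferentiaLabelFormatter.match_accelerator(node_labels))
```

Matching on the class label first and falling back to the instance type keeps the sketch usable on nodes where only the standard instance-type label is present. Choosing the container image is a separate step and is not covered here.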

concretevitamin commented 3 weeks ago

A user bumped this.