skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 514 forks source link

sky local up failed. #3901

Open ZJU-lishuang opened 2 months ago

ZJU-lishuang commented 2 months ago

sky local up failed.

Version & Commit info:

this is my local_up.log

No kind clusters found.
Generating /tmp/skypilot-kind.yaml
Creating cluster "skypilot" ...
 • Ensuring node image (kindest/node:v1.31.0) 🖼  ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 • Preparing nodes 📦   ...
 ✓ Preparing nodes 📦 
 • Writing configuration 📜  ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️  ...
 ✓ Starting control-plane 🕹️
 • Installing CNI 🔌  ...
 ✓ Installing CNI 🔌
 • Installing StorageClass 💾  ...
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-skypilot"
You can now use your cluster with:

kubectl cluster-info --context kind-skypilot

Not sure what to do next? 😅  Check out https://kind.sigs.k8s.io/docs/user/quick-start/
Kind cluster created.
Enabling GPU support...
Installing NVIDIA GPU operator...
"nvidia" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
W0901 20:02:28.761141   43425 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
W0901 20:02:28.761143   43425 warnings.go:70] spec.template.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].preference.matchExpressions[0].key: node-role.kubernetes.io/master is use "node-role.kubernetes.io/control-plane" instead
Error: INSTALLATION FAILED: context deadline exceeded

what is the problem? How to display detailed error information?

romilbhardwaj commented 2 months ago

Can you confirm if Nvidia GPU operator was successfully installed? You can test it by running kubectl describe nodes and checking if your node has a nvidia.com/gpu resource.

If it was not installed, try manually running:

helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false