skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.5k stars 464 forks

[Example] Choose specific VM to run tasks #3433

Closed Biga-incorta closed 2 weeks ago

Biga-incorta commented 5 months ago

I have an OCI GPU VM and I want to launch Mixtral 8x7B without setting up any cloud configuration. The limitation is that I have a single dedicated VM for model serving, so how can I get started?

I'm thinking about launching jobs using Docker, but the documentation example is not very useful. Any thoughts?

romilbhardwaj commented 5 months ago

Hey @Biga-incorta - if you can ssh into the machine, try running sky local up. This is an experimental feature that will spin up a Kubernetes cluster on your machine, and you will be able to run sky launch on the machine to run tasks locally.

Once you have that running, you could try running the mixtral example here.

Note that sky serve up will be supported after https://github.com/skypilot-org/skypilot/pull/3377, but sky launch should allow you to set up mixtral serving. You can check the endpoint URL of the exposed service with sky status --endpoints <cluster_name>.

Biga-incorta commented 4 months ago

@romilbhardwaj I got the following error when I ran sky local up:

 sky local up
Creating local cluster...
To view detailed progress: tail -n100 -f ~/sky_logs/sky-2024-04-16-23-19-14-092785/local_up.log
I 04-16 23:19:28 log_utils.py:79] Kubernetes is running.
I 04-16 23:20:01 log_utils.py:117] SkyPilot CPU image pulled.
No cloud is enabled. SkyPilot will not be able to run any task. Run `sky check` for more info.

Full log details:

tail -n100 -f ~/sky_logs/sky-2024-04-16-23-19-14-092785/local_up.log
No kind clusters found.
Generating /tmp/skypilot-kind.yaml
Creating cluster "skypilot" ...
 • Ensuring node image (kindest/node:v1.29.2) 🖼 ...
 ✓ Ensuring node image (kindest/node:v1.29.2) 🖼
 • Preparing nodes 📦 ...
 ✓ Preparing nodes 📦
 • Writing configuration 📜 ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️ ...
 ✓ Starting control-plane 🕹️
 • Installing CNI 🔌 ...
 ✓ Installing CNI 🔌
 • Installing StorageClass 💾 ...
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-skypilot"
You can now use your cluster with:

kubectl cluster-info --context kind-skypilot

Have a nice day! 👋
Kind cluster created.
Pulling SkyPilot CPU image...
latest: Pulling from skypilot-375900/skypilotk8s/skypilot
Digest: sha256:eb8f58d9ced1a7b64269fca7e04ad0576f4b0a3f3b6f271f23399b12347dde76
Status: Image is up to date for us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest
us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest
Loading SkyPilot CPU image into kind cluster...
Image: "us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest" with ID "sha256:496456d4688b54d22bccc3a5b22b581149bac9761e199aa252bfe94025a6e58f" not yet present on node "skypilot-control-plane", loading...
SkyPilot CPU image loaded into kind cluster.
Kubernetes cluster ready! Run
Checking credentials to enable clouds for SkyPilot.
  AWS: disabled
    Reason: AWS dependencies are not installed. Run the following commands:
      $ pip install skypilot[aws]
    Credentials may also need to be set. Run the following commands:
      $ aws configure
      $ aws configure list # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
  Azure: disabled
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  Cloudflare, for R2 object store: disabled
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
  Cudo: disabled
    Reason: Cudo tools are not installed. Run the following commands:
      $ pip install cudo-compute
    [ModuleNotFoundError] No module named 'cudo_compute'
  Fluidstack: disabled
    Reason: Failed to access FluidStack Cloud with credentials. To configure credentials, go to:
      https://console.fluidstack.io
    to obtain an API key and API Token, then add save the contents to ~/.fluidstack/api_key and ~/.fluidstack/api_token
  GCP: disabled
    Reason: GCP tools are not installed. Run the following commands:
      $ pip install google-api-python-client
      $ conda install -c conda-forge google-cloud-sdk -y
    Credentials may also need to be set. Run the following commands:
      $ gcloud init
      $ gcloud auth application-default login
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp
    Details: [ModuleNotFoundError] No module named 'googleapiclient'
  IBM: disabled
    Reason: Missing credential file at /home/ubuntu/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key:
      resource_group_id:
  Kubernetes: disabled
    Reason: kubernetes package is not installed. Install it with: pip install kubernetes
  Lambda: disabled
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  OCI: disabled
    Reason: Missing credential file at ~/.oci/config. For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  RunPod: disabled
    Reason: Failed to import runpod. To install, run: pip install skypilot[runpod]
  SCP: disabled
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  vSphere: disabled
    Reason: vSphere dependencies are not installed. Run the following commands:
      $ pip install skypilot[vSphere]
    Credentials may also need to be set. For more details. See https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#vmware-vsphere
    [ModuleNotFoundError] No module named 'pyVmomi'
No cloud is enabled. SkyPilot will not be able to run any task. Run sky check for more info.
to setup Kubernetes access.
Number of CPUs available on the local cluster: 60

Biga-incorta commented 4 months ago

And when I run launch to start Mixtral, I get:

Task from YAML spec: ./serve.yaml
No cloud is enabled. SkyPilot will not be able to run any task. Run `sky check` for more info.

So which cloud should I enable when running locally?

Biga-incorta commented 4 months ago

I see in the sky check output that the kubernetes client is not installed, so I ran pip install kubernetes as requested, but I still get the same launch error message @romilbhardwaj

 Kubernetes: disabled
    Reason: `kubernetes` package is not installed. Install it with: pip install kubernetes

romilbhardwaj commented 4 months ago

> so which cloud should enable when run local

sky check should show Kubernetes as enabled to run locally.

Looks like kubernetes client library is not correctly installed in your environment. What's the output of python -c "import kubernetes;print(kubernetes.__version__)" and python -c "import sky;print(sky.__version__);print(sky.__commit__)"?
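For anyone hitting the same problem, the two one-liners above can be combined into a small script that reports on both packages without stopping at the first failed import. This is only a sketch: `describe_module` and `report` are hypothetical helper names, and it assumes the packages expose `__version__` (and, for `sky`, `__commit__`) as the commands above imply. Run it with the same Python interpreter that `sky` uses, since a mismatch between environments is one possible reason for `pip install kubernetes` appearing to have no effect.

```python
import importlib


def describe_module(name: str) -> dict:
    """Report whether a module imports and which version attributes it exposes."""
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package ends up here instead of aborting the whole report.
        return {"module": name, "importable": False, "error": str(exc)}
    return {
        "module": name,
        "importable": True,
        # Not every package defines these attributes; None means "not exposed".
        "version": getattr(module, "__version__", None),
        "commit": getattr(module, "__commit__", None),
    }


def report(names=("kubernetes", "sky")) -> list:
    """Describe every module in `names` (defaults are the two asked about above)."""
    return [describe_module(name) for name in names]


if __name__ == "__main__":
    for info in report():
        print(info)
```

If `kubernetes` shows `importable: False` here while `pip show kubernetes` lists it, the package was probably installed into a different environment than the one running this script.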

I would recommend using the latest nightly - pip uninstall skypilot skypilot-nightly; pip install -U "skypilot-nightly[kubernetes]"

Biga-incorta commented 4 months ago

@romilbhardwaj thanks for your reply. I fixed the kubernetes library issue; however, I noticed that when I launch on the local cluster I can't see the GPU accelerators. My machine is an OCI VM.GPU.A10.2, but when the cluster launches the GPUs are not listed.

I attached a screenshot (Image 17-04-2024 at 2 29 AM).

romilbhardwaj commented 4 months ago

What was the output of sky local up? What's the output of kubectl describe nodes?

Try running sky show-gpus --cloud kubernetes to see which gpus are available to you, then run sky launch -c mixtral --gpus <your_gpu>:1 serve.yaml.
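As a complement to reading `kubectl describe nodes` by eye, the sketch below counts the GPUs each node advertises as allocatable. It rests on assumptions not stated in this thread: that the cluster exposes GPUs under the `nvidia.com/gpu` resource name (the name used by the NVIDIA device plugin), and that the kubectl context is `kind-skypilot`, as printed in the log above. `gpus_per_node` and `fetch_node_list` are hypothetical helper names.

```python
import json
import subprocess

# Assumption: GPUs are advertised under the NVIDIA device plugin's resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` data."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities arrive as strings; a missing key means no GPUs are advertised.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_node_list(context: str = "kind-skypilot") -> dict:
    """Fetch the node list as JSON via kubectl (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json", "--context", context],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example usage against a live cluster:
#   print(gpus_per_node(fetch_node_list()))
```

If every node reports 0, the cluster itself is not advertising any GPUs to Kubernetes, which would be consistent with SkyPilot showing none.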

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 10 days with no activity.