skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

Scheduling multiple GPUs on GCP via skypilot versus Vertex AI #2239

Open fozziethebeat opened 1 year ago

fozziethebeat commented 1 year ago

Hi all! I found SkyPilot last week and it's been a huge improvement for deploying single-GPU workflows (training and serving).

But I've found a situation where it's not working so well when scheduling multi-GPU setups on GCP.

Right now I'm trying to run a simple trainer on GCP with a little HuggingFace-based trainer library I've got. I want an n1-highmem-16 or n1-standard-x type machine attached to two V100s. When trying to run SkyPilot, it does its best to try and find quota but consistently fails. When making a similar request via GCP's Vertex AI command-line tools, the job gets quota much faster and schedules.

Is there some kind of bias in GCP where it prefers Vertex AI-based scheduling over the tools SkyPilot has access to?

For reference, here's a copy of my two configs:

skypilot:

name: unified-pythia160m-peft
resources:
  accelerators: V100:2

workdir: .

file_mounts:
  /gcs-data:
    source: gs://my-bucket
    mode: MOUNT

setup: |
  sudo apt-get install -y git-lfs
  conda create -n cubrio-trainer python=3.9 -y
  conda activate cubrio-trainer
  pip install .

run: |
  conda activate cubrio-trainer
  python -m torch.distributed.launch --nproc_per_node=${NUM_GPUS:-2} \
    -m cubrio_ml_training.train_peft \
    --model_name_or_path=EleutherAI/pythia-160m \
    --project_name=unified_pythia160m \
    --chat_dataset_path=/gcs-data/data_path.jsonl \
    --output_dir /gcs-data/checkpoints/unified_pythia160m \
    --save_model \
    --save_merged_model
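
(Editorial aside, not part of the original report: `torch.distributed.launch` has since been deprecated in PyTorch in favor of `torchrun`, and SkyPilot exposes the per-node GPU count to tasks as the `SKYPILOT_NUM_GPUS_PER_NODE` environment variable. A sketch of an equivalent `run` section under those assumptions, with module and paths taken from the config above:)

```yaml
run: |
  conda activate cubrio-trainer
  # torchrun replaces the deprecated torch.distributed.launch entry point;
  # SKYPILOT_NUM_GPUS_PER_NODE is set by SkyPilot on each node (fallback: 2).
  torchrun --nproc_per_node=${SKYPILOT_NUM_GPUS_PER_NODE:-2} \
    -m cubrio_ml_training.train_peft \
    --model_name_or_path=EleutherAI/pythia-160m \
    --output_dir /gcs-data/checkpoints/unified_pythia160m
```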

Vertex AI:

workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    acceleratorType: NVIDIA_TESLA_V100
    acceleratorCount: 2
  replicaCount: 1
  containerSpec:
    imageUri: gcr.io/my_project_id/pytorch_gpu_train_hf_peft_creator:latest
    command:
      - python3.10
    args:
      - -m
      - torch.distributed.launch
      - --nproc_per_node=2
      - -m
      - cubrio_ml_training.train_peft
      - --model_name_or_path=EleutherAI/pythia-160m
      - --project_name=unified_pythia160m
      - --chat_dataset_path=/gcs/data_path.jsonl
      - --output_dir=checkpoints/unified_pythia160m
      - --save_model
      - --save_merged_model

The only substantial difference between the two is the machine type: SkyPilot picks an n1-highmem-16 VM, but I doubt that's causing the issue.

fozziethebeat commented 1 year ago

Ah nuts, I just answered my own question. I think Vertex AI lets you get around VM quota limits for some reason. I was assuming that since I could request 2 V100s via Vertex AI, I could do it via GCE VMs. That is in fact wrong.

concretevitamin commented 1 year ago

@fozziethebeat Thanks for the report! We’d love to make multi-GPU setups better too.

RE: “it does its best to try and find quota but consistently fails” — Is it possible to see an actual provision log where the quota errors in different regions are included? Do you mean all regions in GCP failed to provide V100:2?

If GCE VM quotas are indeed insufficient for all regions, requesting quota bumps may help: https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html#gcp
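
(Editorial aside: before filing a quota bump, one can check which regions already have headroom. `gcloud compute regions list --format=json` returns, per region, a `quotas` list of `{metric, limit, usage}` records; the V100 metric is `NVIDIA_V100_GPUS`. A minimal sketch for filtering that output — the helper name and sample data below are illustrative, not from the thread:)

```python
import json

def regions_with_gpu_headroom(regions, metric="NVIDIA_V100_GPUS", needed=2):
    """Return names of regions whose quota for `metric` has at least
    `needed` unused units. `regions` is the parsed JSON output of
    `gcloud compute regions list --format=json`."""
    ok = []
    for region in regions:
        for quota in region.get("quotas", []):
            if quota["metric"] == metric and quota["limit"] - quota["usage"] >= needed:
                ok.append(region["name"])
    return ok

# Illustrative sample mirroring the gcloud JSON shape:
sample = [
    {"name": "us-central1",
     "quotas": [{"metric": "NVIDIA_V100_GPUS", "limit": 1.0, "usage": 0.0}]},
    {"name": "us-west1",
     "quotas": [{"metric": "NVIDIA_V100_GPUS", "limit": 4.0, "usage": 1.0}]},
]
print(regions_with_gpu_headroom(sample))  # ['us-west1']
```

With a per-region limit of 1 (as reported below), no region clears the 2-GPU bar, which matches the repeated provision failures.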

fozziethebeat commented 1 year ago

Yeah, skypilot reports something like

Provision failed for 1x GCP(n1-highmem-16, {'V100': 2}) in us-central1. Trying other locations (if any).

across a wide range of regions because our quota is only 1 per region. We're increasing that now.

For reasons beyond my understanding, Vertex AI side-steps that and lets me schedule 2 GPUs.

Michaelvll commented 9 months ago

Supporting Vertex AI could be an interesting direction for SkyPilot to look at, as it may have more capacity. :) Reopening this issue to track that.