Open fozziethebeat opened 1 year ago
Ah nuts I just answered my own question. I think VertexAI lets you get around VM quota limits for some reason. I was assuming that since I could request 2 v100s via VertexAI I could do it via GCE VMs. That is in fact wrong.
@fozziethebeat Thanks for the report! We’d love to make multigpu setup better too.
RE: “it does its best to try and find quota but consistently fails” — Is it possible to see an actual provision log where the quota errors in different regions are included? Do you mean all regions in GCP failed to provide V100:2?
If GCE VM quotas are indeed insufficient for all regions, requesting quota bumps may help: https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html#gcp
Yeah, skypilot reports something like
Provision failed for 1x GCP(n1-highmem-16, {'V100': 2}) in us-central1. Trying other locations (if any).
across a wide range of regions because our quota is only 1 per region. We're increasing that now.
For reasons beyond my understanding, VertexAI side-steps that and lets me schedule 2 GPUs.
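For anyone wanting to share which regions failed, here is a minimal sketch that pulls the region names out of provision-failure lines. It assumes every failure line follows the shape of the single line quoted above; the real log format may differ, and `failed_regions` is a hypothetical helper, not part of SkyPilot.

```python
import re

# Pattern modeled on the one log line quoted in this thread.
# Assumption: other failure lines follow the same shape.
FAILURE_RE = re.compile(
    r"Provision failed for (?P<count>\d+)x (?P<cloud>\w+)"
    r"\((?P<instance>[^,]+), \{(?P<accels>[^}]*)\}\) in (?P<region>[\w-]+)\."
)


def failed_regions(log_text):
    """Return the regions named in provision-failure lines, in order of appearance."""
    return [m.group("region") for m in FAILURE_RE.finditer(log_text)]


# The only sample used here is the line quoted in this thread.
sample = (
    "Provision failed for 1x GCP(n1-highmem-16, {'V100': 2}) in us-central1. "
    "Trying other locations (if any).\n"
)
print(failed_regions(sample))
```

Running it over a saved provision log would give the list of regions that were tried and rejected, which can be compared against the per-region GPU quota.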
Supporting Vertex AI can be an interesting direction to look at for SkyPilot as it may have more capacity. : ) Reopening this issue for tracking that.
Hi all! I found SkyPilot last week and it's been a huge improvement for deploying single-GPU workflows (training and serving).
But I've found a situation where it's not working so well when scheduling multi-gpu setups on GCP.
Right now I'm trying to run a simple trainer on GCP with a little HuggingFace-based trainer library I've got. I want an n1-highmem-16 or n1-standard-x type machine attached to two V100s. When trying to run SkyPilot, it does its best to try and find quota but consistently fails. When making a similar request via GCP's VertexAI command line tools, the job gets quota much faster and schedules. Is there some kind of bias in GCP where it prefers VertexAI-based scheduling over the tools SkyPilot has access to?
For reference, here's a copy of my two configs:
skypilot:
vertex AI:
The only substantial difference between the two is the machine type: SkyPilot picks an n1-highmem-16 VM, but I doubt that's causing the issue.