skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.69k stars 494 forks

Default instance type on GCP is not suitable for A100 accelerators #212

Closed gmittal closed 2 years ago

gmittal commented 2 years ago

By default we attach accelerators to an n1-highmem-8 instance, but this is not suitable for A100s. We should find a way to add this information to our optimizer or service catalog.

sky gpunode --gpus A100 results in:

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-central1-a/instances?alt=json returned "[nvidia-tesla-a100, n1-highmem-8] features are not compatible for creating instance.". Details: "[{'message': '[nvidia-tesla-a100, n1-highmem-8] features are not compatible for creating instance.', 'domain': 'global', 'reason': 'badRequest'}]">

concretevitamin commented 2 years ago

Maybe: change the GCP default instance type to something comparable (in vCPUs, mem, cost) and compatible with A100.


gmittal commented 2 years ago

A100s are the only accelerator on GCP that has its own dedicated instance type (special "a2" instances). This is different from all other accelerators on GCP, which can normally be attached to any instance. This complicates some of the logic on our end but is doable (see #253).
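To illustrate the extra logic, here is a minimal sketch of picking a default host type per accelerator. The function name `default_gcp_instance_type` and the two constants are hypothetical and are not taken from the SkyPilot code base or from #253; the a2 machine type names are the standard GCP ones, each of which bundles a fixed number of A100s. It only covers instance type selection and does not address live migration.

```python
from typing import Optional

# Default host type mentioned in this issue for attachable accelerators.
DEFAULT_GCP_HOST = "n1-highmem-8"

# A100 count -> a2 machine type that ships with exactly that many A100s.
A100_COUNT_TO_A2 = {
    1: "a2-highgpu-1g",
    2: "a2-highgpu-2g",
    4: "a2-highgpu-4g",
    8: "a2-highgpu-8g",
    16: "a2-megagpu-16g",
}


def default_gcp_instance_type(accelerator: Optional[str], count: int = 1) -> str:
    """Return a host instance type compatible with the requested accelerator.

    Hypothetical helper: A100s must use an a2 machine type; everything else
    falls back to the generic default host.
    """
    if accelerator != "A100":
        return DEFAULT_GCP_HOST
    if count not in A100_COUNT_TO_A2:
        valid = sorted(A100_COUNT_TO_A2)
        raise ValueError(f"A100 count must be one of {valid}, got {count}")
    return A100_COUNT_TO_A2[count]
```

With this sketch, `default_gcp_instance_type("A100")` selects an a2 type instead of `n1-highmem-8`, and an unsupported count such as 3 fails early instead of reaching the GCP API.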

The bigger issue is that a2 instances do not support live migration, which means GCP will need to take down the node for maintenance every once in a while. Our system currently can't support this (perhaps once fault tolerance for spot instances is enabled, we can treat instance types that do not support live migration in the same category).

Using #253, it seems that it is not possible to provision an A100 GPU on GCP for this reason:

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-central1-a/instances?alt=json returned "Instances with guest accelerators do not support live migration.". Details: "[{'message': 'Instances with guest accelerators do not support live migration.', 'domain': 'global', 'reason': 'badRequest'}]">

Since this is not immediately actionable, I suggest closing #253 and marking this as blocked by the live migration issue.

concretevitamin commented 2 years ago

Can we make sure we update the catalog to disallow A100s on GCP, and update sky show-gpus --cloud gcp --all so that it does not show A100s?
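One way this could look, as a rough sketch only: keep a small blocklist of (cloud, accelerator) pairs and apply it when listing accelerators. The row layout, the `BLOCKED` set, and `list_gpus` are all hypothetical names for illustration and do not reflect the actual catalog schema or the implementation of sky show-gpus.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical catalog row: {"cloud": ..., "accelerator": ...}.
CatalogRow = Dict[str, str]

# (cloud, accelerator) pairs that cannot currently be provisioned.
BLOCKED: Set[Tuple[str, str]] = {("gcp", "A100")}


def list_gpus(catalog: List[CatalogRow], cloud: Optional[str] = None) -> List[str]:
    """Return sorted accelerator names, skipping blocked (cloud, accelerator) pairs.

    If `cloud` is given, only rows for that cloud are considered.
    """
    names = {
        row["accelerator"]
        for row in catalog
        if (cloud is None or row["cloud"] == cloud)
        and (row["cloud"], row["accelerator"]) not in BLOCKED
    }
    return sorted(names)
```

Keeping the blocklist separate from the catalog data would make it easy to lift the restriction later, once instance types without live migration are supported.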

Michaelvll commented 2 years ago

I encountered this problem again when trying to launch an A100 with a managed spot instance. It breaks the retry process of sky launch a bit, as it leaves an INIT cluster in sky status after the failure.