Closed gmittal closed 2 years ago
Maybe: change the GCP default instance type to something comparable (in vCPUs, mem, cost) and compatible with A100.
On Wed, Jan 19, 2022 at 1:03 AM Gautam Mittal @.***> wrote:
By default we attach accelerators to an n1-highmem-8 instance, but this is not suitable for A100s. We should find a way to add this information to our optimizer or service catalog.
sky gpunode --gpus A100 results in:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-central1-a/instances?alt=json returned "[nvidia-tesla-a100, n1-highmem-8] features are not compatible for creating instance.". Details: "[{'message': '[nvidia-tesla-a100, n1-highmem-8] features are not compatible for creating instance.', 'domain': 'global', 'reason': 'badRequest'}]">
A100s are the only accelerator on GCP that has its own dedicated instance type (the special "a2" instances). All other accelerators on GCP can normally be attached to any instance. This complicates some of the logic on our end but is doable (see #253).
The bigger issue is that a2 instances do not support live migration, which means that GCP will need to take down the node for maintenance every once in a while, and our system currently can't support that. Perhaps once fault tolerance for spot instances is enabled, we can treat instance types that do not support live migration in the same category.
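To make the a2 requirement concrete, here is a minimal sketch of how a catalog lookup could pick the host instance type. The function and table names are hypothetical and are not taken from our codebase or from #253. The a2 machine type names are the ones GCP publishes for A100 (40 GB), and `n1-highmem-8` is the default mentioned in this issue.

```python
# Hypothetical helper for illustration only; not the project's catalog code.
# On GCP, each supported A100 count corresponds to one fixed a2 machine type.
_A100_HOST_TYPES = {
    1: 'a2-highgpu-1g',
    2: 'a2-highgpu-2g',
    4: 'a2-highgpu-4g',
    8: 'a2-highgpu-8g',
    16: 'a2-megagpu-16g',
}

# Default host instance type named in this issue for other accelerators.
_DEFAULT_HOST_TYPE = 'n1-highmem-8'


def host_instance_type(accelerator: str, count: int) -> str:
    """Return the GCP machine type to request for an accelerator and count."""
    if accelerator != 'A100':
        # Other accelerators are attached to a general-purpose instance.
        return _DEFAULT_HOST_TYPE
    if count not in _A100_HOST_TYPES:
        valid = sorted(_A100_HOST_TYPES)
        raise ValueError(
            f'A100 count {count} has no a2 machine type; valid counts: {valid}')
    return _A100_HOST_TYPES[count]
```

With a lookup like this, the optimizer could reject an unsupported A100 count before calling the GCP API instead of surfacing the HttpError 400 above.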
Using #253, it seems that it is not possible to provision an A100 GPU on GCP, for this reason:
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-central1-a/instances?alt=json returned "Instances with guest accelerators do not support live migration.". Details: "[{'message': 'Instances with guest accelerators do not support live migration.', 'domain': 'global', 'reason': 'badRequest'}]">
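For context, the Compute Engine API has a `scheduling.onHostMaintenance` field on the instance resource, and GPU instances are expected to set it to `TERMINATE` rather than `MIGRATE`. The sketch below shows one way a request body could be adjusted before `instances().insert`. The helper name is hypothetical, and it is an assumption (not verified against #253) that this setting is what triggers the error above. It also does not remove the underlying problem: with `TERMINATE`, GCP still stops the node for maintenance.

```python
# Hypothetical helper for illustration only; not code from #253.
def with_gpu_scheduling(instance_body: dict) -> dict:
    """Return a copy of an instances.insert body with GPU-safe scheduling."""
    body = dict(instance_body)
    # machineType may be a full URL; only the trailing name matters here.
    machine_type = body.get('machineType', '').rsplit('/', 1)[-1]
    has_gpu = bool(body.get('guestAccelerators')) or machine_type.startswith('a2-')
    if has_gpu:
        scheduling = dict(body.get('scheduling', {}))
        # GPU instances cannot live migrate, so ask GCP to terminate instead.
        scheduling['onHostMaintenance'] = 'TERMINATE'
        body['scheduling'] = scheduling
    return body
```

The function copies the body and its `scheduling` entry rather than mutating the input, so the caller's original config is left unchanged.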
Since this is not immediately actionable, I suggest closing #253 and marking this as blocked by the live migration issue.
Can we make sure we update the catalog to disallow A100s on GCP, and update sky show-gpus --cloud gcp --all to not show A100s?
I encountered this problem again when trying to launch an A100 with a managed spot instance. It breaks the retry process of sky launch a bit, because it leaves a cluster in the INIT state in sky status after the failure.