skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

`sky launch --memory 2048+` failed to show valid AWS instance types #3047

Open concretevitamin opened 7 months ago

concretevitamin commented 7 months ago
$ sky launch --memory 2048+
I 01-29 22:09:41 optimizer.py:1222] No resource satisfying <Cloud>(mem=2048+) on [GCP, AWS, Lambda].
I 01-29 22:09:41 optimizer.py:1236] Try specifying a different memory size, or add "+" to the end of the memory size to allow for larger instances.
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: <Cloud>(mem=2048+).

But both of these commands show instance types with >= 2048 GB of memory:

» sky launch -t x2iedn.metal
» sky launch -t p5.48xlarge

cc @cblmemo @Michaelvll

cblmemo commented 7 months ago

Seems like this is because we will only select default instance types when no accelerators are specified: https://github.com/skypilot-org/skypilot/blob/ef211925c76e3e006c938edcf07678f6a9a3c45d/sky/clouds/aws.py#L438-L449

Is there a reason we chose this design?

concretevitamin commented 7 months ago

No particular reason other than for simplicity. That comment is outdated since --cpus acts as a filter already. It'd be great to support using most of the resource args as filters.
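To make the behavior discussed above concrete, here is a minimal sketch of the difference between restricting the search to default instance families and treating `--memory` as a filter over the whole catalog. This is not SkyPilot's actual code: the toy `CATALOG`, the `DEFAULT_FAMILIES` tuple, and the helpers `parse_requirement`, `satisfies`, and `candidates` are all hypothetical names, and the instance specs are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float


# Toy catalog (assumption): illustrative specs, not read from the real catalog.
CATALOG = [
    InstanceType('m6i.2xlarge', 8, 32),
    InstanceType('m6i.32xlarge', 128, 512),
    InstanceType('r6i.32xlarge', 128, 1024),
    InstanceType('p5.48xlarge', 192, 2048),
    InstanceType('x2iedn.metal', 128, 4096),
]

# Hypothetical set of "default" families considered when no accelerator is given.
DEFAULT_FAMILIES = ('m6i', 'r6i')


def parse_requirement(spec: str) -> Tuple[float, bool]:
    """Parse '2048+' into (2048.0, True) and '2048' into (2048.0, False)."""
    at_least = spec.endswith('+')
    return float(spec.rstrip('+')), at_least


def satisfies(value: float, spec: str) -> bool:
    """A trailing '+' means 'at least'; otherwise require an exact match."""
    target, at_least = parse_requirement(spec)
    return value >= target if at_least else value == target


def candidates(memory: str, default_families_only: bool) -> List[str]:
    """Return the names of instance types that satisfy the memory request."""
    result = []
    for inst in CATALOG:
        family = inst.name.split('.')[0]
        if default_families_only and family not in DEFAULT_FAMILIES:
            continue
        if satisfies(inst.memory_gib, memory):
            result.append(inst.name)
    return result


# Default families only: nothing qualifies, mirroring the error in this issue.
print(candidates('2048+', default_families_only=True))
# Memory used as a filter over the whole catalog: large-memory types appear.
print(candidates('2048+', default_families_only=False))
```

In this toy setup the first call comes back empty because no default-family instance reaches 2048 GiB, while the second call finds the large-memory types once the family restriction is lifted.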

Michaelvll commented 7 months ago

> No particular reason other than for simplicity. That comment is outdated since --cpus acts as a filter already. It'd be great to support using most of the resource args as filters.

The main reason we only use the default instance type families is to avoid odd CPU + memory combinations when only one of --cpus or --memory is specified. This makes sure that most users get a good default instance type.

Another reason is that the instances we selected have a similar CPU type, i.e., Intel Ice Lake. This avoids compatibility issues in user programs caused by different CPU series, e.g., ARM or AMD CPUs may not support some Intel-specific instruction sets. We can first add those large-memory instance types with Intel CPUs.
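One possible shape for that compromise, as a rough sketch only: prefer the default families when they can satisfy the request, and otherwise fall back to non-default instance types restricted to Intel CPUs. The `Instance` tuple, the `select` helper, and the catalog rows (including the CPU vendor labels) are hypothetical and for illustration; they are not taken from the SkyPilot catalog or code.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str
    memory_gib: int
    cpu_vendor: str          # 'intel', 'amd', or 'arm' (illustrative labels)
    is_default_family: bool  # hypothetical flag for the default families


# Toy catalog (assumption): values are illustrative only.
CATALOG = [
    Instance('r6i.32xlarge', 1024, 'intel', True),
    Instance('x2gd.metal', 1024, 'arm', False),
    Instance('p5.48xlarge', 2048, 'amd', False),
    Instance('x2iedn.metal', 4096, 'intel', False),
]


def select(min_memory_gib: int) -> List[str]:
    """Prefer default families; otherwise fall back to Intel-only instances."""
    fits = [i for i in CATALOG if i.memory_gib >= min_memory_gib]
    defaults = [i.name for i in fits if i.is_default_family]
    if defaults:
        # Keep today's behavior whenever a default family is large enough.
        return defaults
    # Fallback: widen the search, but keep the CPU vendor consistent.
    return [i.name for i in fits if i.cpu_vendor == 'intel']


print(select(1024))  # served by a default family
print(select(2048))  # falls back to Intel-only large-memory types
```

With this ordering, requests that the default families can serve behave as before, and only requests beyond their range reach the wider, Intel-only set.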

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

cblmemo commented 3 months ago

Just tested on the latest master and this persists. Removing the stale label now 🫡