skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.55k stars, 468 forks

Support AWS Graviton instances #1586

Open WoosukKwon opened 1 year ago

WoosukKwon commented 1 year ago

Now that Ray has started to provide PyPI wheels for ARM64 CPUs (https://github.com/ray-project/ray/pull/31566), we can also add official support for AWS Graviton instances. In the future, we will also be able to support ARM machines in other clouds, such as GCP T2A and Azure Dpsv5.

franklsf95 commented 1 year ago

Woohoo!!

WoosukKwon commented 1 year ago

Just a note: this is currently blocked by #1618 (because the ARM PyPI wheels are only available for Ray v2.2) and #1616 (because those are the only AMIs that support ARM instances).

romilbhardwaj commented 1 year ago

Bumping this: it was raised again by a user in #1885 and is also useful for k8s dev work on Apple silicon.

franklsf95 commented 1 year ago

I should mention that Graviton and Apple Silicon are not the same architecture. Ray has an M1 build but not for Graviton.
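
For readers following along: a wheel is tied to the operating system as well as the CPU, so a macOS arm64 build cannot be installed on a Linux aarch64 machine such as a Graviton instance. The sketch below is only an illustration of that distinction; the function name and the returned labels are hypothetical shorthand, not exact wheel platform tags.

```python
import platform


def wheel_platform_family(system: str, machine: str) -> str:
    """Return an informal label for the kind of wheel a host would need.

    The labels are illustrative shorthand, not real platform tags.
    """
    machine = machine.lower()
    if system == "Darwin" and machine == "arm64":
        return "macosx_arm64"  # Apple Silicon (M1 and later)
    if system == "Linux" and machine in ("aarch64", "arm64"):
        return "manylinux_aarch64"  # e.g. AWS Graviton
    if system == "Linux" and machine in ("x86_64", "amd64"):
        return "manylinux_x86_64"
    return "other"


# Report the label for the machine this script is running on.
print(wheel_platform_family(platform.system(), platform.machine()))
```

The two ARM cases return different labels, which is why an M1 wheel alone does not cover Graviton.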

romilbhardwaj commented 1 year ago

With #1734 in, we should be able to support Graviton now. Ray 2.4.0 works out of the box (pip install ray) on a Graviton instance:

ubuntu@ip-172-31-56-38:~$ python3
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> ray.init()
2023-05-26 19:07:15,178 INFO worker.py:1625 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.6', ray_version='2.4.0', ray_commit='4479f66d4db967d3c9dd0af2572061276ba926ba', address_info={'node_ip_address': '172.31.56.38', 'raylet_ip_address': '172.31.56.38', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-05-26_19-07-12_299295_2702/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-05-26_19-07-12_299295_2702/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2023-05-26_19-07-12_299295_2702', 'metrics_export_port': 61955, 'gcs_address': '172.31.56.38:54315', 'address': '172.31.56.38:54315', 'dashboard_agent_listen_port': 52365, 'node_id': '5b0e1d985f8a06e0eaa70ece6813c40156b40554c5ad164aec80acb7'})
>>> ray.available_resources()
{'CPU': 1.0, 'object_store_memory': 1109234073.0, 'node:172.31.56.38': 1.0, 'memory': 2218468148.0}
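
A quick way to repeat this kind of check without starting Ray is to record the CPU architecture and the installed package version. This is a minimal sketch using only the standard library; the helper name `describe_environment` is hypothetical and not part of Ray or SkyPilot.

```python
import platform
from importlib import metadata


def describe_environment(package: str = "ray") -> dict:
    """Collect the CPU architecture, Python version, and a package's version."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # the package is not installed in this environment
    return {
        "machine": platform.machine(),  # typically 'aarch64' on Linux ARM64
        "python": platform.python_version(),
        "package": package,
        "package_version": version,
    }


print(describe_environment())
```

On a Graviton instance the `machine` field would be expected to report an ARM64 value, and `package_version` stays `None` if Ray is not installed.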

romilbhardwaj commented 1 year ago

Seems like SkyPilot is trying to use an x86 AMI, which causes the launch to fail:

(base) ➜  ~ sky launch -c arm -t m7g.xlarge
...
create_instances: Attempt failed with An error occurred (InvalidParameterValue) when calling the RunInstances operation: The architecture 'arm64' of the specified instance type does not match the architecture 'x86_64' of the specified AMI. Specify an instance type and an AMI that have matching architectures, and try again. You can use 'describe-instance-types' or 'describe-images' to discover the architecture of the instance type or AMI., retrying.
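
The error message points at `describe-instance-types` and `describe-images`. As a rough sketch of the comparison it asks for, the helper below takes one entry from each response and checks that the AMI's `Architecture` is listed in the instance type's `ProcessorInfo.SupportedArchitectures`. The function names are hypothetical, the dictionaries are hand-written stand-ins for real API responses, and the AMI IDs are placeholders. This is not how SkyPilot selects AMIs.

```python
def supported_architectures(instance_type_info: dict) -> list:
    """Read the architectures from one describe-instance-types entry."""
    return instance_type_info.get("ProcessorInfo", {}).get(
        "SupportedArchitectures", []
    )


def architectures_match(instance_type_info: dict, image_info: dict) -> bool:
    """True if the AMI's architecture is supported by the instance type."""
    return image_info.get("Architecture") in supported_architectures(
        instance_type_info
    )


# Hand-written sample data (placeholder IDs, not real AMIs).
m7g = {
    "InstanceType": "m7g.xlarge",
    "ProcessorInfo": {"SupportedArchitectures": ["arm64"]},
}
x86_ami = {"ImageId": "ami-x86-placeholder", "Architecture": "x86_64"}
arm_ami = {"ImageId": "ami-arm-placeholder", "Architecture": "arm64"}

print(architectures_match(m7g, x86_ami))  # False: the mismatch from the error
print(architectures_match(m7g, arm_ami))  # True
```

Running a check like this before calling RunInstances would surface the mismatch without a failed launch attempt.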

Explicitly specifying an AMI fails at building psutil, probably because the AMI I tried (Ubuntu 22) doesn't come with gcc (full log: https://gist.github.com/romilbhardwaj/2560441fad1eaca47f48f6c430b86073). Unfortunately there's no nice DL AMI for ARM instances on AWS yet:

$ sky launch -c arm -t m7g.xlarge --image-id ami-0a0c8eebcdd6dcbd0 --region us-east-1 --cloud aws

...
      psutil could not be installed from sources because gcc is not installed. Try running:
        sudo apt-get install gcc python3-dev
      error: command 'aarch64-linux-gnu-gcc' failed: No such file or directory
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for psutil
Successfully built pendulum
Failed to build psutil
ERROR: Could not build wheels for psutil, which is required to install pyproject.toml-based projects
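
Since the build fails because the compiler is missing, a pre-flight check on the image can catch this before pip tries to compile psutil. The following is a minimal sketch, assuming the two compiler names that appear in the log above; `missing_build_tools` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_build_tools(
    tools: Iterable[str] = ("gcc", "aarch64-linux-gnu-gcc"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_build_tools()
if missing:
    # The pip output above suggests: sudo apt-get install gcc python3-dev
    print("Missing build tools:", ", ".join(missing))
else:
    print("All build tools found.")
```

This only checks for the compiler binaries; the Python headers from `python3-dev` are also needed for the build and are not covered here.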

Michaelvll commented 1 year ago

Will the Graviton DLAMI in the AWS docs work? https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-graviton.html

github-actions[bot] commented 12 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been stalled for 10 days with no activity.