[Train] Ray Train should support AWS trainium instances

ray-project / ray

[Train] Ray Train should support AWS trainium instances #33504

Open gilvikra opened 1 year ago

gilvikra commented 1 year ago

Description

I would like Ray to support AWS Trainium instances, which require the "xla" torch backend.

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/distributed_data_parallel.html#neur[…]rial

There is a strong push towards Trainium, and right now Ray does not seem to support it natively the way it supports CPUs and GPUs.

Use case

Use AWS Trainium chips for efficient, performant, cost-effective distributed training on top of Ray.

swaroopch commented 11 months ago

+1. Given the shortage of GPUs in the industry, it would be beneficial for us to have Ray tested and supported on AWS Trainium, to unblock LLM use cases.

pdames commented 10 months ago

Follow-up issue: https://github.com/ray-project/ray/issues/38473. This improves the maintainability of https://github.com/ray-project/ray/pull/37998 by removing the need to continuously update a hard-coded dictionary of EC2 instance types to neuron core counts.
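
For illustration only, here is a rough sketch of what replacing a hard-coded instance-type table with runtime detection could look like. It is not the code from either linked issue or PR. It assumes the Neuron `neuron-ls` tool is on the PATH, that it accepts `--json-output`, and that it prints a JSON array of devices each carrying an `nc_count` field. The function names `count_neuron_cores` and `detect_neuron_cores` are made up for this example.

```python
import json
import shutil
import subprocess


def count_neuron_cores(neuron_ls_json: str) -> int:
    """Sum per-device core counts from a JSON device listing.

    Assumption: the listing is a JSON array of objects, each with an
    integer "nc_count" field; devices without it contribute zero.
    """
    devices = json.loads(neuron_ls_json)
    return sum(int(device.get("nc_count", 0)) for device in devices)


def detect_neuron_cores(tool: str = "neuron-ls") -> int:
    """Return the number of Neuron cores on this host, or 0 if unknown."""
    path = shutil.which(tool)
    if path is None:
        # Tool not installed: treat this host as having no Neuron cores.
        return 0
    result = subprocess.run(
        [path, "--json-output"],  # assumed flag, see the note above
        capture_output=True,
        text=True,
    )
    if result.returncode != 0 or not result.stdout.strip():
        return 0
    return count_neuron_cores(result.stdout)


if __name__ == "__main__":
    print(f"Detected Neuron cores: {detect_neuron_cores()}")
```

Because the count comes from the host itself, a new EC2 instance type would need no table update. On a machine without the Neuron tools the sketch simply reports zero cores.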

anyscalesam commented 3 months ago

@woshiyyya can you take a look; I'm adding triage as well in case we want to punt this to the next on-call rotation.

woshiyyya commented 3 months ago

@anyscalesam OK. Will take a look at the CI issue of #39130.