gilvikra opened 1 year ago
+1. Given the shortage of GPUs in the industry, it would be beneficial for us to have Ray tested and supported on AWS Trainium, to unblock LLM use cases.
Follow-up issue: https://github.com/ray-project/ray/issues/38473. This improves the maintainability of https://github.com/ray-project/ray/pull/37998 by removing the need to continuously update a hard-coded dictionary of EC2 instance types to neuron core counts.
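The dynamic approach could look roughly like the sketch below: query the host for its NeuronCore count at runtime instead of looking the instance type up in a table. This is a hypothetical illustration, not Ray's actual implementation — it shells out to the AWS Neuron SDK's `neuron-ls` CLI, and the assumed JSON shape (a list of devices each reporting an `nc_count` field) is an assumption about that tool's `--json-output` format.

```python
import json
import shutil
import subprocess


def detect_neuron_core_count() -> int:
    """Count the NeuronCores available on this host.

    Hypothetical sketch: instead of a hard-coded mapping from EC2 instance
    type to core count, ask the Neuron SDK's `neuron-ls` tool directly.
    Returns 0 on hosts without the Neuron runtime installed, so the same
    code path is safe on CPU/GPU nodes.
    """
    if shutil.which("neuron-ls") is None:
        # Not a Neuron host (or the SDK is not installed).
        return 0
    result = subprocess.run(
        ["neuron-ls", "--json-output"],
        capture_output=True,
        text=True,
        check=True,
    )
    devices = json.loads(result.stdout)
    # Assumed output shape: a list of device records, each with "nc_count".
    return sum(device.get("nc_count", 0) for device in devices)
```

On a non-Trainium machine (no `neuron-ls` on the `PATH`) this simply returns 0, which is the property that lets autodetection replace the static dictionary without breaking heterogeneous clusters.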
@woshiyyya can you take a look; I'm adding triage as well in case we want to punt this to the next on-call rotation.
@anyscalesam OK. Will take a look at the CI issue of #39130.
Description
I would like AWS Trainium instances, which require the "xla" torch backend, to be supported by Ray.
https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/distributed_data_parallel.html#neur[…]rial
There is a strong industry push toward Trainium, and right now Ray does not seem to support it natively the way it supports CPUs and GPUs.
Use case
Use of AWS Trainium chips for efficient, performant, cost-effective distributed training on top of Ray.