Open jaanphare opened 6 months ago
Hi @jaanphare, this user guide might help: https://docs.ray.io/en/master/cluster/vms/index.html :)
Thanks so much @woshiyyya!! I couldn't find anything about Terraform there. Would you recommend skipping Terraform in favor of the Ray Cluster Launcher tool?
Yeah, I think you can try the Ray cluster launcher for AWS. I am not an expert on it; maybe @kevin85421 @rkooo567 @scv119 could help here?
I'm not familiar with Terraform, and I don't think we currently have any Ray docs about Terraform. But you can use the Ray cluster launcher on AWS without Terraform. Let me know if you run into any issues in the docs!
@kevin85421 do you have insights on Terraform?
@rynewang We have already had an offline discussion today.
@kevin85421 does this issue require any feature? Or is it just a question? Do we need a doc update? Can you follow up with action items here?
I think the user plans to use KubeRay instead, and we can close this issue. cc @jaanphare to confirm
Thank you @kevin85421! We ended up needing to use Terraform to configure virtual private clouds (not sure if this is something KubeRay or RayCluster supports?). We're hoping to give KubeRay a shot soon, as it sounds like that will be supported going forward for batch jobs 🙏
Just following up on this - the main blocker here is the lack of documentation or guides on how to run Ray in a private subnet in AWS or other cloud providers without public IPs.
Any help on that?
(This is because the sensitive health care data we train LLMs on is restricted under federal law, the Health Insurance Portability and Accountability Act.)
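For the private-subnet question above, here is a minimal sketch of what a Ray cluster launcher config without public IPs might look like, written as a Python dict that mirrors the cluster YAML passed to `ray up`. It assumes the `use_internal_ips` provider option and the `SubnetIds` / `SecurityGroupIds` keys in `node_config` behave as described in the Ray cluster YAML docs for your Ray version. The function name, cluster name, instance types, and the subnet and security group IDs are hypothetical placeholders. It also assumes the VPC itself (for example, one created with Terraform) already provides outbound access for package installs and that `ray up` is run from a machine that can reach the private IPs.

```python
import json


def private_subnet_cluster_config(subnet_id, security_group_id, region="us-east-1"):
    """Build a cluster-launcher-style config dict whose nodes use only private IPs.

    All IDs, names, and instance types passed in or hard-coded here are placeholders.
    """

    def node_config(instance_type):
        # EC2 launch parameters: pin every node to the given private subnet
        # and security group.
        return {
            "InstanceType": instance_type,
            "SubnetIds": [subnet_id],
            "SecurityGroupIds": [security_group_id],
        }

    return {
        "cluster_name": "private-ray",  # hypothetical name
        "max_workers": 2,
        "provider": {
            "type": "aws",
            "region": region,
            # Assumption: tells the launcher to connect to nodes via private IPs.
            "use_internal_ips": True,
        },
        "auth": {"ssh_user": "ubuntu"},
        "available_node_types": {
            "head": {
                "node_config": node_config("m5.large"),
                "resources": {},
            },
            "gpu_worker": {
                "node_config": node_config("g4dn.xlarge"),
                "resources": {},
                "min_workers": 0,
                "max_workers": 2,
            },
        },
        "head_node_type": "head",
    }


if __name__ == "__main__":
    config = private_subnet_cluster_config("subnet-PLACEHOLDER", "sg-PLACEHOLDER")
    # Print as JSON for inspection; convert to YAML before handing it to `ray up`.
    print(json.dumps(config, indent=2))
```

This only covers the launcher side. Whether it satisfies a given compliance requirement depends on how the VPC, routing, and security groups are set up, which is outside what this sketch shows.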
Description
Hi,
This is the only Terraform script from @sfloresk I could find that gives a complete example of running distributed training workloads with GPUs on AWS:
https://github.com/aws-ia/ecs-blueprints/tree/main/terraform/ec2-examples/distributed-ml-training
Are there other examples you are aware of that describe how to use Terraform to set up Ray with autoscaling on AWS?
I also found this related example for EKS: https://github.com/awslabs/data-on-eks/tree/main/ai-ml/ray/terraform
(We are unable to use Kubernetes for now.)
Thank you so much! Jaan
Use case
We need to run distributed GPU training for the language models I am iterating on, such as https://arxiv.org/abs/1904.05342