ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.1k stars 5.47k forks

[Ray Train / Tune on AWS] Examples needed for running Ray with private subnets - any help? #42348

Open jaanphare opened 6 months ago

jaanphare commented 6 months ago

Description

Hi,

This is the only Terraform script from @sfloresk I could find that gives a complete example of running distributed training workloads with GPUs on AWS:

https://github.com/aws-ia/ecs-blueprints/tree/main/terraform/ec2-examples/distributed-ml-training

Are there other examples you are aware of that describe how to use Terraform to set up Ray with autoscaling on AWS?

I also found this related example for EKS: https://github.com/awslabs/data-on-eks/tree/main/ai-ml/ray/terraform

(We are unable to use Kubernetes for now.)

Thank you so much! Jaan

Use case

We need to run distributed GPU training for the language models I am iterating on, such as https://arxiv.org/abs/1904.05342

woshiyyya commented 5 months ago

Hi @jaanphare , this user guide: https://docs.ray.io/en/master/cluster/vms/index.html might help:)

jaanphare commented 5 months ago

Thanks so much @woshiyyya !! I couldn't find anything about Terraform there. Would you recommend avoiding it in favor of the Ray Cluster Launcher tool?

woshiyyya commented 5 months ago

Yeah, I think you can try the Ray cluster launcher for AWS. I am not an expert on it; maybe @kevin85421 @rkooo567 @scv119 could help here?

architkulkarni commented 5 months ago

I'm not familiar with Terraform, and I don't think we currently have any Ray docs about it. But you can use the Ray cluster launcher on AWS without Terraform. Let me know if you run into any issues with the docs!

rynewang commented 5 months ago

@kevin85421 do you have insights on Terraform?

kevin85421 commented 5 months ago

@rynewang We have already had an offline discussion today.

rkooo567 commented 5 months ago

@kevin85421 does this issue require any feature? Or is it just a question? Do we need a doc update? Can you follow up with action items here?

kevin85421 commented 5 months ago

> does this issue require any feature? Or is it just a question? Do we need a doc update? Can you follow up with action items here?

I think the user plans to use KubeRay instead, and we can close this issue. cc @jaanphare to confirm

jaanphare commented 5 months ago

Thank you @kevin85421 ! We ended up needing to use terraform to configure virtual private clouds (not sure if this is something KubeRay or RayCluster supports?). We're hoping to give KubeRay a shot soon as it sounds like that will be supported going forward for batch jobs 🙏

jaanphare commented 5 months ago

Just following up on this - the main blocker here is the lack of documentation or guides on how to run Ray in a private subnet in AWS or other cloud providers without public IPs.

Any help on that?

(This is because federal law, the Health Insurance Portability and Accountability Act (HIPAA), restricts how we can handle the sensitive health care data that we train LLMs on.)
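
For reference, here is a minimal sketch of what a Ray cluster launcher config for a private subnet might look like. It is not an official recipe. It assumes the `use_internal_ips` provider option and the `SubnetIds` / `SecurityGroupIds` node config keys from the cluster launcher's AWS configuration. The helper name, cluster name, instance types, and resource IDs are hypothetical placeholders. It also assumes the subnet already exists (for example, created with Terraform) and that the machine running the launcher can reach the private IPs through a VPN or bastion host.

```python
import json


def private_subnet_cluster_config(subnet_id, security_group_id, region="us-east-1"):
    """Build a cluster launcher config dict for nodes in one private subnet.

    All names and values here are illustrative assumptions, not defaults
    taken from the Ray documentation.
    """
    # Network settings shared by head and worker nodes: pin both to the
    # same private subnet and security group.
    network = {
        "SubnetIds": [subnet_id],
        "SecurityGroupIds": [security_group_id],
    }
    return {
        "cluster_name": "private-subnet-sketch",  # hypothetical name
        "max_workers": 2,
        "provider": {
            "type": "aws",
            "region": region,
            # Ask the launcher to talk to nodes over their private IPs.
            "use_internal_ips": True,
        },
        "auth": {"ssh_user": "ubuntu"},
        "available_node_types": {
            "head": {
                # Instance type is a placeholder choice.
                "node_config": {"InstanceType": "m5.large", **network},
            },
            "gpu_worker": {
                "min_workers": 0,
                "max_workers": 2,
                # Instance type is a placeholder choice.
                "node_config": {"InstanceType": "g4dn.xlarge", **network},
            },
        },
        "head_node_type": "head",
    }


if __name__ == "__main__":
    # Placeholder IDs; substitute the ones from your own VPC.
    config = private_subnet_cluster_config("subnet-PLACEHOLDER", "sg-PLACEHOLDER")
    print(json.dumps(config, indent=2))
```

The resulting dictionary could be written out as the YAML file passed to `ray up`. Nodes in a private subnet typically also need some outbound path (such as a NAT gateway or VPC endpoints) to download packages and container images during setup.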