Closed — saryazdi closed this issue 2 years ago
I think `ray_lightning` is for running multi-node training on a Ray cluster.
Hey @saryazdi -- PTL has support for multi-node distributed training, but it assumes that you already have a cluster set up and networked together for training. If you are using a cluster manager like Slurm, this may already be the case, but on the cloud you would have to do the necessary configuration and setup yourself.
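To make "the necessary configuration and setup yourself" concrete, here is a minimal sketch (stdlib only; the addresses, port, and node counts are illustrative, not from this thread) of the environment variables PTL's native DDP expects you to export on every node of a manually networked cluster:

```python
# Sketch: the per-node environment you must wire up yourself for PTL's
# native multi-node DDP when no cluster manager (Slurm, etc.) does it for you.
# Values below are illustrative assumptions, not from the thread.
def ddp_env(master_addr: str, master_port: int, node_rank: int,
            num_nodes: int, gpus_per_node: int) -> dict:
    """Build the environment for one node of a manually configured cluster."""
    return {
        "MASTER_ADDR": master_addr,        # reachable address of the rank-0 node
        "MASTER_PORT": str(master_port),   # open port for the rendezvous
        "NODE_RANK": str(node_rank),       # unique id 0..num_nodes-1 per machine
        "WORLD_SIZE": str(num_nodes * gpus_per_node),  # total process count
    }

# Two nodes with 8 GPUs each -> a world size of 16 processes.
envs = [ddp_env("10.0.0.1", 29500, rank, num_nodes=2, gpus_per_node=8)
        for rank in range(2)]
```

Every node must agree on `MASTER_ADDR`/`MASTER_PORT` and get a distinct `NODE_RANK`; getting this wrong is exactly the manual setup burden being described.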
With Ray, setting up clusters is very easy.
This blog post goes into more detail: https://devblog.pytorchlightning.ai/introducing-ray-lightning-multi-node-pytorch-lightning-training-made-easy-30ed075209f0
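For contrast with the manual setup above, with Ray the cluster side typically comes down to a short launcher config plus `ray up`. A minimal sketch, assuming Ray's cluster-launcher YAML format and an AWS provider (all values illustrative):

```yaml
# cluster.yaml -- minimal Ray cluster launcher config (sketch, values assumed)
cluster_name: ptl-train
max_workers: 2          # number of worker nodes to launch
provider:
  type: aws             # cloud provider; region is an example
  region: us-west-2
auth:
  ssh_user: ubuntu      # ssh user for the chosen machine image
```

You would then bring the cluster up with `ray up cluster.yaml` and run your Lightning script on it, rather than configuring networking and rendezvous on each node by hand.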
In addition to the simple setup, there are other benefits that Ray Lightning provides:
Thanks @chongxiaoc @amogkam! So for a private multi-node multi-GPU cluster, it seems to come down to whether we want to manage our cluster with Ray Cluster (then use `ray_lightning`) or Slurm/custom scripts (then use `pytorch_lightning`)? And the main factor to consider when making that decision is that Ray Cluster is easier to set up?
Hi,

Sorry if it's a basic question - I was looking into solutions for multi-node training for `pytorch_lightning` and came across `ray_lightning`, which is implemented on top of `pytorch_lightning`. But `pytorch_lightning` also seems to have some out-of-the-box support for multi-node training (docs here). How does `ray_lightning`'s multi-node training compare to `pytorch_lightning`'s? What are the differences, and why would one choose one over the other? I'm trying to figure out which one I should go with. Thank you!