ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

Question: Why use ray_lightning instead of pytorch_lightning for multi-node training? #212

Closed saryazdi closed 2 years ago

saryazdi commented 2 years ago

Hi,

Sorry if it's a basic question. I was looking into solutions for multi-node training with pytorch_lightning and came across ray_lightning, which is implemented on top of pytorch_lightning. However, pytorch_lightning also seems to have some out-of-the-box support for multi-node training (docs here).

How does ray_lightning's multi-node training compare to pytorch_lightning's? What are the differences, and why would one choose one over the other?

I'm trying to figure out which one I should go with. Thank you!

chongxiaoc commented 2 years ago

I think ray_lightning is for running multi-node training on a Ray cluster.

amogkam commented 2 years ago

Hey @saryazdi-- PTL has support for multi-node distributed training, but it assumes that you already have a cluster set up and networked together for training. If you are using a cluster manager like Slurm, this may already be the case, but on the cloud you would have to do the necessary configuration and setup yourself.

With Ray, setting up clusters is very easy.
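For context, the Ray cluster launcher takes a single YAML file and brings the nodes up for you. A rough sketch of what that looks like (the cluster name, instance types, and worker counts here are illustrative; the exact schema depends on your Ray version and cloud provider):

```yaml
# cluster.yaml -- hypothetical minimal Ray cluster config for AWS
cluster_name: ptl-cluster
max_workers: 2

provider:
  type: aws
  region: us-west-2

available_node_types:
  head_node:
    node_config:
      InstanceType: m5.large
  gpu_worker:
    min_workers: 2
    node_config:
      InstanceType: p3.2xlarge

head_node_type: head_node
```

You would then launch everything with `ray up cluster.yaml`, and Ray handles provisioning and networking the nodes together.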

This blog post goes into more detail: https://devblog.pytorchlightning.ai/introducing-ray-lightning-multi-node-pytorch-lightning-training-made-easy-30ed075209f0

In addition to the simple setup, there are other benefits that Ray Lightning provides:

  1. Running multi-node training from a jupyter notebook instead of bash scripts
  2. Easily using Ray Tune for distributed hyperparameter tuning
  3. Easy scaling: the same code you write for 1 GPU will work for N GPUs
saryazdi commented 2 years ago

Thanks @chongxiaoc @amogkam! So for a private multi-node multi-GPU cluster, it seems to come down to whether we want to manage our cluster with Ray Cluster (then use ray_lightning) or Slurm/custom scripts (then use pytorch_lightning)? And the main factor to consider when making that decision is that Ray Cluster is easier to set up?

amogkam commented 2 years ago

Yes, that's right @saryazdi! Also, if you are planning to add hyperparameter tuning and want to use Ray Tune, then using Ray Lightning for training would be the best fit for that.