Closed: adamgayoso closed this issue 1 year ago.
I would also be very interested in multi-GPU training of Pyro models, specifically a full-data training mode where a large dataset is split between different GPUs.
It is actually fairly straightforward to do data parallelism in pyro using horovod (https://github.com/pyro-ppl/pyro/blob/dev/examples/svi_horovod.py). This would mainly require 1) a new training plan (to use DistributedSampler, a different optimizer) and 2) a different device-backed data loader (to split data across devices).
There are issues with using this for models with both global and local cell-specific parameters (all parameters live on all devices).
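To make those two pieces concrete, here is a minimal sketch in the spirit of that Horovod example; dataset, model, guide, num_epochs and the hyperparameters are placeholders, and this is not scvi-tools API, just an illustration of (1) a Horovod-wrapped optimizer and (2) sharding the data across devices:

```python
import horovod.torch as hvd
import pyro
import torch

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# (2) split data across devices: each rank sees a disjoint shard of cells
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=128)

# (1) a Horovod-wrapped Pyro optimizer that all-reduces gradients across ranks
optim = pyro.optim.HorovodOptimizer(pyro.optim.Adam({"lr": 1e-3}))
svi = pyro.infer.SVI(model, guide, optim, loss=pyro.infer.Trace_ELBO())

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for (x,) in loader:
        svi.step(x.cuda())
```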
It is likely less straightforward to make this work with PyTorch Lightning, and it might require substantial work to make it work generally for Pyro. In particular, we'd have to look more into how device-backed data loaders would work in this case.
I see. I also got some feedback from @fritzo that for models with local parameters this might not increase the available memory as much as one would hope, because the cell-specific parameters are already quite large for just a few 100k cells (though it could still give 4-5x more space). I don't quite understand PyTorch Lightning, so if you solve this I would be very keen to try it.
Does numpyro+jax support multi-GPU training more natively? If yes, this could be a way to go.
What I am specifically interested in is data and model parameter parallelism where the data and model parameters for different cells (denoted by a plate) are distributed to different GPU devices. Maybe this is also possible with pyro.
Also cc @fehiepsi @fritzo @martinjankowiak
[As mentioned above] Pyro can use Horovod for data parallelism across GPUs and machines in a cluster, but I believe parameters would be replicated on all nodes. NumPyro might be the way to go. @fehiepsi?
Current NumPyro SVI does not support that pattern, but it might be possible to do it with JAX directly. Something like:

```python
import jax
from jax import random
from numpyro import handlers
from numpyro.infer import Trace_ELBO

def loss_fn(batch, params):
    global_params, local_params = params
    # globals are shared by every device; locals are sharded per device
    model_g = handlers.substitute(model, data=global_params)
    guide_g = handlers.substitute(guide, data=global_params)

    def get_loss_local(data, local_params):
        model_l = handlers.substitute(model_g, data=local_params)
        guide_l = handlers.substitute(guide_g, data=local_params)
        # rng key / param map handling simplified here
        return Trace_ELBO().loss(random.PRNGKey(0), {}, model_l, guide_l, data)

    # each device gets its own shard of the batch and of the local params
    return jax.pmap(get_loss_local)(batch, local_params)

# then use jaxopt to optimize loss_fn over params:
# https://jaxopt.github.io/stable/stochastic.html#optax-solvers
```
though it still seems a bit tricky to cover many use cases (e.g., when there are both global and local variables, we need to apply a reduced sum over the local variables).
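To illustrate that reduced-sum point with a toy, self-contained example (not tied to NumPyro): inside a pmapped function, per-device contributions such as the local-variable terms of the ELBO can be combined with jax.lax.psum, while global terms should only be counted once.

```python
import jax
import jax.numpy as jnp

def shard_loss(local_loss):
    # sum the per-device (local-variable) contributions across all devices
    return jax.lax.psum(local_loss, axis_name="devices")

# one dummy local loss per available device
per_device_losses = jnp.arange(jax.device_count(), dtype=jnp.float32)
print(jax.pmap(shard_loss, axis_name="devices")(per_device_losses))
```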
Thanks for your thoughts!
My models always have both local and global variables. Do you see any way to define the device split along a pyro plate? Maybe that could be provided as an option in numpyro?
Does pure data parallelism work with the current numpyro and scvi-tools? (i.e., loading different minibatches of data onto different devices, 8 minibatches in parallel, with different cells in each training iteration)
@adamgayoso sorry, do you have a recipe if I want to enable multi-GPU training of the scVI model before it is released in scvi-tools? I haven't done multi-GPU training before, so I'm asking where to start. Can I just apply a patch from #1357?
@adamgayoso If you ignore device-backed data loaders for now, what is the main roadblock to implementing the Pyro+horovod solution? https://pyro.ai/examples/svi_horovod.html
Does this boil down to implementing an equivalent to torch.utils.data.distributed.DistributedSampler and modifying the training plan to use horovod? Or is there more to it?
Is the problem in writing a general solution that works for any model OR is the problem that this won't work for any model?
Also cc @macwiatrak for discussion
I expect that we will only have minimal issues with non-Pyro models, due to updates in lightning that automatically wrap custom dataloader samplers like the ones we have.
In the case of Pyro, we have a somewhat hacky solution that fuses it with a lightning module. I expect significantly more engineering work to get this right. A hacky solution might be quicker, but we shouldn't include that in this library.
To clarify, lightning should handle:
But this is in the default PyTorch case. For Pyro, which lazily initializes parameters, the hacky solution would involve a callback that does some of the things you see in the linked pyro tutorial.
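Roughly, such a callback could look like the sketch below (class and attribute names are hypothetical, not the actual scvi-tools callback, and it assumes a datamodule is attached to the trainer): run one pass through the model and guide before training so the lazily-created pyro.param sites exist on every rank before DDP wraps the module.

```python
import pyro
from pytorch_lightning import Callback

class PyroParamInitCallback(Callback):
    # Hypothetical sketch: materialize lazily-initialized Pyro params before
    # distributed training starts, by evaluating the model/guide on one batch.
    def on_fit_start(self, trainer, pl_module):
        batch = next(iter(trainer.datamodule.train_dataloader()))
        pl_module.module.guide(batch)
        pl_module.module.model(batch)
```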
@adamgayoso I'd love to make this easier to do in Pyro (as @vitkl has requested). What's your timeline? Could we sync the week of Jan 23 to figure out what would be needed on the Pyro side?
We don't have bandwidth to contribute much at the moment, but can review code. I think it's relatively straightforward to make this work in the nn.Module/PyroModule paradigm by altering what we call a TrainingPlan to use vanilla torch optimizers instead of Pyro optimizers. This will allow lightning to do almost all the work.
In other words, we can create a LowerLevelPyroTrainingPlan, using this lower-level pattern internally.
```python
import pyro
import torch

# trace one evaluation of the differentiable ELBO so the lazily-initialized
# Pyro params are created, then collect them for a vanilla torch optimizer
loss_fn = lambda model, guide: pyro.infer.Trace_ELBO().differentiable_loss(model, guide, X_train, y_train)
with pyro.poutine.trace(param_only=True) as param_capture:
    loss = loss_fn(model, guide)
params = set(site["value"].unconstrained()
             for site in param_capture.trace.nodes.values())
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.90, 0.999))
```
Then lightning will do everything it needs to do with handling backprop.
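A rough sketch of what such a LowerLevelPyroTrainingPlan could look like (hypothetical names, assuming the Pyro model/guide live in an nn.Module-style pyro_module; not the final scvi-tools implementation):

```python
import pyro
import torch
import pytorch_lightning as pl

class LowerLevelPyroTrainingPlan(pl.LightningModule):
    def __init__(self, pyro_module):
        super().__init__()
        self.module = pyro_module
        self.loss_fn = pyro.infer.Trace_ELBO().differentiable_loss

    def training_step(self, batch, batch_idx):
        # differentiable_loss returns a plain tensor, so Lightning handles
        # backward() and the optimizer step as usual
        return self.loss_fn(self.module.model, self.module.guide, batch)

    def configure_optimizers(self):
        # Pyro params are created lazily, so in practice the optimizer may
        # need to be (re)built after the param-initialization callback runs
        return torch.optim.Adam(self.module.parameters(), lr=1e-3)
```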
We already have a callback to run the param initialization, which can then be changed to also reset the optimizer.
@fritzo what is ELBOModule? This seems useful. I can maybe put up a draft PR with my idea
@eb8680 can tell you more about ELBOModule
It seems @vitkl has made it work via the new lower-level training plan: https://github.com/scverse/scvi-tools/pull/1845
It appears that the lightning "ddp_notebook" strategy + https://github.com/scverse/scvi-tools/pull/1845 doesn't allow saving the trained model, because lightning fails to load checkpoints saved by the worker GPU. It looks like when the state_dict is loaded, the model parameters don't exist. I tried creating a callback that would do a forward pass through the model on_fit_end, on_train_end and on_load_checkpoint - however, none of that changed anything. Maybe this means that the strategy or the training plan needs to be modified, but I don't understand what can be done next. Any tips would be appreciated @fritzo @eb8680 @adamgayoso
I have not tested the standard "ddp" strategy in a script yet.
@martinkim0 Nice work! Great to have this supported.
Is it possible to combine this approach with DeviceBackedDataSplitter? I am interested in loading data subsets once - simply distributing different cells or spatial locations across GPUs.
Probably what's needed is a DistributedSampler that also takes as input the overall set of indices to pull data from (i.e., the train, validation, or test set indices). Probably just need to add a few lines of code to __init__, call super().__init__(), and then write a custom __iter__ method.
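A rough sketch of that idea (class name hypothetical; it only illustrates subsetting plus sharding across ranks, without the padding/drop_last handling the real DistributedSampler does):

```python
import torch
from torch.utils.data.distributed import DistributedSampler

class SubsetDistributedSampler(DistributedSampler):
    """Hypothetical: a DistributedSampler restricted to a given index subset
    (e.g. the train/val/test split of cells), sharded across ranks."""

    def __init__(self, dataset, indices, **kwargs):
        super().__init__(dataset, **kwargs)
        self.indices = list(indices)
        # each rank draws from the subset, not the full dataset
        self.num_samples = len(self.indices) // self.num_replicas
        self.total_size = self.num_samples * self.num_replicas

    def __iter__(self):
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        order = torch.randperm(len(self.indices), generator=g).tolist()[: self.total_size]
        # interleave the shuffled subset across ranks
        shard = order[self.rank : self.total_size : self.num_replicas]
        return iter(self.indices[i] for i in shard)

    def __len__(self):
        return self.num_samples
```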