lucidrains / magvit2-pytorch

Implementation of MagViT2 Tokenizer in Pytorch

Running multi-gpu training #35

Open joe-sht opened 6 months ago

joe-sht commented 6 months ago

How do I run training on multiple GPUs? As far as I can see, training only runs on a single GPU.

Suhail commented 6 months ago

I am also curious. The error I get is this:

```
Traceback (most recent call last):
  File "/root/research/suhail/magvit2/train.py", line 27, in <module>
    trainer = VideoTokenizerTrainer(
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/pytorch_custom_utils/accelerate_utils.py", line 95, in __init__
    _orig_init(self, *args, **kwargs)
  File "<@beartype(magvit2_pytorch.trainer.VideoTokenizerTrainer.__init__) at 0x7f20aa90b910>", line 314, in __init__
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/magvit2_pytorch/trainer.py", line 203, in __init__
    self.has_multiscale_discrs = self.model.has_multiscale_discrs
  File "/root/research/suhail/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'has_multiscale_discrs'
```

ChatGPT: The error you're encountering indicates that an attribute has_multiscale_discrs is being accessed on an object of type DistributedDataParallel, and this object does not have such an attribute. This is a common issue when using PyTorch's DistributedDataParallel (DDP) wrapper around models for distributed training. The DDP wrapper takes your model and replicates it across multiple GPUs, managing the distribution of data and the gathering of results. However, it only forwards calls to the underlying model for methods defined in the nn.Module, not for custom attributes or methods unless they are implemented in a specific way.

madebyollin commented 6 months ago

iirc one needs to replace direct `self.model.whatever` accesses with something like `(self.model.module if isinstance(self.model, nn.DataParallel) else self.model).whatever` (potentially via a helper function) when using PyTorch DDP
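For illustration, a minimal sketch of such a helper (the `unwrap_model` name is hypothetical, not something the repo provides) that handles both `DistributedDataParallel` and `DataParallel`:

```python
import torch.nn as nn

def unwrap_model(model: nn.Module) -> nn.Module:
    # DistributedDataParallel and DataParallel both keep the wrapped network
    # on their .module attribute; plain modules pass through unchanged.
    if isinstance(model, (nn.parallel.DistributedDataParallel, nn.DataParallel)):
        return model.module
    return model
```

Trainer-side accesses such as `self.model.has_multiscale_discrs` would then become `unwrap_model(self.model).has_multiscale_discrs`.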

ziyannchen commented 4 months ago

The code uses accelerate to handle DDP automatically.
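Concretely, `Accelerator.prepare` does the DDP wrapping when the script is launched with multiple processes; a minimal sketch of that pattern (using a stand-in `nn.Linear` rather than the actual tokenizer/trainer code):

```python
import torch
from torch import nn
from accelerate import Accelerator

accelerator = Accelerator()

model = nn.Linear(8, 8)                   # stand-in for the video tokenizer
optimizer = torch.optim.Adam(model.parameters())

# when run via `accelerate launch` with num_processes > 1, prepare() wraps
# the model in DistributedDataParallel and moves it to the right device
model, optimizer = accelerator.prepare(model, optimizer)
```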

Dongzhikang commented 3 months ago

Can you please share your command to do DDP?

wanglg20 commented 2 months ago

> Can you please share your command to do DDP?

Just refer to https://github.com/huggingface/accelerate. For example, if you are using 2 GPUs:

```
accelerate launch --multi_gpu --num_processes 2 train.py --...
```

You may also find https://github.com/huggingface/accelerate/issues/1239 helpful if you are running in a Slurm environment.