keonlee9420 / Comprehensive-E2E-TTS

A non-autoregressive end-to-end text-to-speech model (text-to-wav) supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS.

Multi-GPU training doesn't seem to work #6

Open ppisljar opened 1 year ago

ppisljar commented 1 year ago

I tested with a single GPU and training works fine. I am now testing with multiple GPUs, and I noticed that the outer progress bar (counting the total number of steps) is not updating. After adding some print statements to the code, it seems that this statement in train.py:

for batchs in loader returns batchs: [], [], [], [] (i.e. empty batches).

It seems something goes wrong in the data loader?
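
From my reading of the collate_fn (simplified below; the real function also sorts by text length, so this is a minimal sketch rather than the actual code), the empty batches would be explained if the DataLoader hands the collate function fewer samples than the internal batch_size it reshapes by:

import numpy as np

def split_into_batches(samples, batch_size):
    # Simplified version of the grouping collate_fn does: chop the samples
    # fetched by the DataLoader into sub-batches of `batch_size`.
    # If fewer than `batch_size` samples come in (e.g. only the per-GPU
    # share of the total batch), the reshape below yields zero rows and
    # the training loop sees an empty list.
    idx = np.arange(len(samples))
    idx = idx[: len(idx) - (len(idx) % batch_size)]  # drop the remainder (drop_last)
    return idx.reshape((-1, batch_size)).tolist()

print(split_into_batches(list(range(8)), batch_size=16))   # [] -- empty, like what I see
print(split_into_batches(list(range(32)), batch_size=16))  # [[0, ..., 15], [16, ..., 31]]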

ppisljar commented 1 year ago

In database.py the batch size is set to the total batch size (rather than the batch size per GPU). This makes _collate_fn return an empty batch array. After fixing this (a rough sketch of the change is below, after the traceback) I get batches in train.py, but the process now fails with:

Epoch 1:   0%|1                                                                                                                                                                     | 1/894 [00:08<2:00:37,  8.10s/it]
Traceback (most recent call last):
  File "train.py", line 342, in <module>
    mp.spawn(train, nprocs=num_gpus, args=(args, configs, batch_size, num_gpus))
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/tts/Comprehensive-E2E-TTS/train.py", line 152, in train
    output = model(*(batch[2:]), step=step)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 606, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating los$
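
For reference, this is roughly the change I made to get non-empty batches (a sketch from memory, not the exact diff; the group_size argument and the DistributedSampler setup are my assumptions about how the loader is built in train.py):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_loader(dataset, total_batch_size, num_gpus, rank, group_size=4):
    # Each GPU should only load its share of the total batch, and the
    # Dataset's internal batch_size (the one collate_fn reshapes by)
    # has to be set to the same per-GPU value.
    per_gpu_batch_size = total_batch_size // num_gpus
    dataset.batch_size = per_gpu_batch_size
    sampler = DistributedSampler(dataset, num_replicas=num_gpus, rank=rank)
    return DataLoader(
        dataset,
        batch_size=per_gpu_batch_size * group_size,
        sampler=sampler,
        collate_fn=dataset.collate_fn,
    )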
ppisljar commented 1 year ago

Trying to set find_unused_parameters=True on DistributedDataParallel does NOT solve the problem.
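
For completeness, this is how I enabled it (a minimal sketch of the DDP wrapping in train.py; the real code may pass more arguments):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model(model: torch.nn.Module, rank: int) -> DDP:
    # Move the model to this process's GPU and wrap it for multi-GPU
    # training; find_unused_parameters=True is what the error message
    # suggests, but the crash stays the same.
    model = model.to(rank)
    return DDP(model, device_ids=[rank], find_unused_parameters=True)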