ppisljar opened this issue 1 year ago
In database.py the batch size is set to the total batch size (rather than the batch size per GPU), which makes _collate_fn return an empty batch array (see the small illustration at the end of this issue). After fixing this I get batches in train.py, but the process now fails with:
Epoch 1: 0%|1 | 1/894 [00:08<2:00:37, 8.10s/it]
Traceback (most recent call last):
File "train.py", line 342, in <module>
mp.spawn(train, nprocs=num_gpus, args=(args, configs, batch_size, num_gpus))
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/tts/Comprehensive-E2E-TTS/train.py", line 152, in train
output = model(*(batch[2:]), step=step)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 606, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss [...]
Setting find_unused_parameters=True on DistributedDataParallel does NOT solve the problem.
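For reference, this is roughly how I wrapped the model (a minimal sketch, not the exact train.py code; it assumes the process group is already initialized in each spawned process):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch of the DDP wrapping I tried; the reducer error above is raised anyway.
def wrap_model(model: nn.Module, rank: int) -> DDP:
    model = model.to(rank)
    return DDP(
        model,
        device_ids=[rank],
        find_unused_parameters=True,  # unused-parameter detection enabled
    )
```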
I tested with a single GPU and training works fine. I am now testing with multiple GPUs and I noticed that the outer progress bar (counting the total number of steps) is not updating. After adding some print statements, it seems that this statement in train.py:
for batchs in loader
returns batchs as [], [], [], [] (so empty batches). It seems something goes wrong in the data loader?
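To make the first part of this report concrete, here is a toy illustration (not the repo's exact code) of why grouping with the total batch size yields empty batches once each GPU only sees its shard of the data:

```python
import numpy as np

# Simplified version of what a sort-and-group _collate_fn does:
# drop the incomplete "tail" and reshape the rest into fixed-size groups.
def group_indices(num_samples, batch_size):
    idx = np.arange(num_samples)
    tail_start = len(idx) - (len(idx) % batch_size)
    idx = idx[:tail_start]                      # drop the incomplete tail
    return idx.reshape((-1, batch_size)).tolist()

num_gpus, total_batch = 4, 16                   # example numbers, not the config values
per_rank_samples = total_batch // num_gpus      # each spawned process gets a shard

print(group_indices(per_rank_samples, total_batch))              # [] -> empty batches
print(group_indices(per_rank_samples, total_batch // num_gpus))  # [[0, 1, 2, 3]]
```

With the total batch size every index is treated as an incomplete tail and dropped, which matches the empty lists I am seeing; using the per-GPU batch size (total divided by num_gpus) is the fix I applied in database.py.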