Random crashes with custom dataset, tensor size mismatch

bob80333 commented 5 years ago

I created a small test dataset that you can replicate by downloading this podcast and following these steps.

I then used ffmpeg to convert it to a mono 22050 Hz WAV file with `ffmpeg -i input.mp3 -ac 1 -ar 22050 output.wav`

I used sox to split it on silence into many smaller pieces in a split_files output folder with `sox -V3 output.wav split_files/output.wav silence -l 0 3.0 1.0 5% : newfile : restart`

There should be 240 pieces.

The last 24 pieces were used for validation.
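The hold-out step above can be sketched as follows. This is a minimal sketch, assuming the pieces are named so that sorting them lexicographically matches their order in the recording; `split_train_valid` is a hypothetical helper, not part of the melgan repo.

```python
def split_train_valid(wav_paths, n_valid=24):
    """Hold out the last n_valid pieces (in sorted order) for validation."""
    ordered = sorted(wav_paths)
    if not 0 < n_valid < len(ordered):
        raise ValueError("n_valid must be between 1 and len(wav_paths) - 1")
    # Everything before the last n_valid files is training data.
    return ordered[:-n_valid], ordered[-n_valid:]
```

With 240 pieces and the default `n_valid=24`, this would give 216 training files and 24 validation files.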

Here are two separate errors. Note that dataloader shuffling was set to False for both runs, yet they crash at different steps.

```
[eric@eric-pc melgan]$ python trainer.py -c config/default.yaml -n test4
2019-10-24 23:06:54,795 - INFO - Starting new training run.
Validation loop: 100%|████████████████████| 24/24 [00:03<00:00, 6.90it/s]
g 31.2470 d 56.5574 | step 13: 100%|████████████████████| 13/13 [00:08<00:00, 1.51it/s]
2019-10-24 23:07:10,354 - INFO - Saved checkpoint to: chkpt/test4/test4_df8b090_0000.pt
g 29.4583 d 55.8972 | step 26: 100%|████████████████████| 13/13 [00:06<00:00, 1.91it/s]
g 29.3384 d 55.7414 | step 39: 100%|████████████████████| 13/13 [00:06<00:00, 1.90it/s]
g 31.0743 d 55.8826 | step 52: 100%|████████████████████| 13/13 [00:06<00:00, 1.87it/s]
g 30.2437 d 55.5219 | step 65: 100%|████████████████████| 13/13 [00:06<00:00, 1.89it/s]
Validation loop: 100%|████████████████████| 24/24 [00:03<00:00, 6.98it/s]
g 32.9035 d 58.3628 | step 78: 100%|████████████████████| 13/13 [00:06<00:00, 1.88it/s]
g 32.2074 d 55.6909 | step 91: 100%|████████████████████| 13/13 [00:06<00:00, 1.87it/s]
g 30.4200 d 55.2120 | step 93: 15%|███▏                | 2/13 [00:01<00:09, 1.20it/s]
2019-10-24 23:07:59,489 - INFO - Exiting due to exception: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15986 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
Traceback (most recent call last):
  File "/home/eric/Documents/repos/melgan/utils/train.py", line 64, in train
    for (melG, audioG), (melD, audioD) in loader:
  File "/usr/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1060, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
    return self._process_data(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/usr/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15986 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
g 30.4200 d 55.2120 | step 93: 15%|███▏                | 2/13 [00:01<00:08, 1.28it/s]

[eric@eric-pc melgan]$ python trainer.py -c config/default.yaml -n test5
2019-10-24 23:11:19,808 - INFO - Starting new training run.
Validation loop: 100%|████████████████████| 24/24 [00:03<00:00, 6.96it/s]
g 31.1410 d 56.5434 | step 13: 100%|████████████████████| 13/13 [00:08<00:00, 1.51it/s]
2019-10-24 23:11:35,537 - INFO - Saved checkpoint to: chkpt/test5/test5_df8b090_0000.pt
g 30.1641 d 56.2416 | step 21: 62%|████████████▌       | 8/13 [00:04<00:02, 1.93it/s]
2019-10-24 23:11:39,845 - INFO - Exiting due to exception: Caught RuntimeError in DataLoader worker process 8.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15958 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
Traceback (most recent call last):
  File "/home/eric/Documents/repos/melgan/utils/train.py", line 64, in train
    for (melG, audioG), (melD, audioD) in loader:
  File "/usr/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1060, in __iter__
    for obj in iterable:
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
    return self._process_data(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/usr/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 8.
Original Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 16000 and 15958 in dimension 2 at /pytorch/aten/src/TH/generic/THTensor.cpp:689
g 30.1641 d 56.2416 | step 21: 62%|████████████▌       | 8/13 [00:04<00:02, 1.81it/s]
```
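The error is raised inside a DataLoader worker while `default_collate` stacks the batch, so one way to narrow it down is to iterate over the dataset directly and look for items whose audio length differs from the expected segment length (16000 in the error above). A minimal sketch, assuming a caller-supplied `length_of` function that extracts the number of audio samples from one dataset item; `find_odd_lengths` is a hypothetical helper, and for real tensors `length_of` could be something like `lambda item: item[0][1].shape[-1]`, depending on how the items are laid out.

```python
def find_odd_lengths(items, expected_len, length_of=len):
    """Return (index, length) for every item whose length differs from expected_len."""
    odd = []
    for idx, item in enumerate(items):
        n = length_of(item)
        if n != expected_len:
            odd.append((idx, n))
    return odd
```

Since the crashes occur at different steps even with shuffling disabled, item lengths may vary between passes, so a single pass over the dataset might not be enough to catch the offending items.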

seungwonpark commented 5 years ago

Please share your error messages here, as you mentioned in your Reddit comment.

How long is the shortest audio file in your training dataset?

kensun0 commented 5 years ago

Try adding this code to dataloader.py:

```python
def my_getitem(self, idx):
    wavpath = self.wav_list[idx]
    melpath = wavpath.replace('.wav', '.mel')
    sr, audio = read_wav_np(wavpath)
    audio = torch.from_numpy(audio).unsqueeze(0)
    mel = torch.load(melpath).squeeze(0)
    # Keep only as many frames as both the mel and the audio can cover,
    # so that audio length == mel frames * hop_length.
    frame_num = min(mel.size(1), audio.size(1)//self.hp.audio.hop_length)
    audio = audio[:, 0:frame_num * self.hp.audio.hop_length]
    mel = mel[:, 0:frame_num]
```
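The idea in the snippet above is to keep the mel-spectrogram and the waveform in step: keep only as many frames as both can cover, then trim the audio to exactly `frame_num * hop_length` samples. The arithmetic can be sketched without tensors. The hop length of 256 and the frame count of 117 below are assumed values for illustration and are not taken from this thread; `aligned_lengths` is a hypothetical helper.

```python
def aligned_lengths(num_mel_frames, num_audio_samples, hop_length):
    """Return (frame_num, audio_len) such that audio_len == frame_num * hop_length."""
    # Number of whole hops the audio can cover, capped by the mel length.
    frame_num = min(num_mel_frames, num_audio_samples // hop_length)
    return frame_num, frame_num * hop_length


# Assumed example: 117 mel frames, 29548 audio samples, hop length 256.
frames, samples = aligned_lengths(117, 29548, 256)
print(frames, samples)
```

After this trimming, any window of `k` frames taken from the mel corresponds to exactly `k * hop_length` audio samples, as long as the window stays within `frame_num` frames.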

seungwonpark commented 5 years ago

Assuming that this is caused by audio clips shorter than 16000 samples, I'm working on a fix in the padshort branch.
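The approach described here amounts to making every clip at least one training segment long. A minimal sketch of that idea in plain Python, assuming zero-padding at the end of the clip; `pad_short` is a hypothetical helper and is not taken from the padshort branch, which may handle this differently.

```python
def pad_short(samples, segment_length=16000, pad_value=0.0):
    """Return a list padded at the end so that its length is at least segment_length."""
    missing = segment_length - len(samples)
    if missing <= 0:
        # Already long enough: return an unmodified copy.
        return list(samples)
    return list(samples) + [pad_value] * missing
```

Clips that are already long enough pass through unchanged, so only the short ones are affected.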

seungwonpark commented 5 years ago

@bob80333 Could you try again with the padshort branch? You'll need to regenerate the mel-spectrograms first.

git fetch origin
git checkout padshort

bob80333 commented 5 years ago

Assuming the smallest audio file is also the shortest, the shortest audio file has 29548 samples according to sox.
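File size is only a proxy for duration, so it may be worth checking the sample counts directly. A minimal sketch using Python's standard `wave` module, assuming the pieces are uncompressed PCM WAV files; `shortest_wav` is a hypothetical helper. For mono files like the ones produced above, the frame count equals the sample count.

```python
import wave
from pathlib import Path


def shortest_wav(folder):
    """Return (path, num_frames) for the WAV file with the fewest frames, or None."""
    best = None
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            n = wav.getnframes()
        if best is None or n < best[1]:
            best = (path, n)
    return best
```

Calling `shortest_wav("split_files")` would then report the shortest piece in the folder created by the sox command in the first comment.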

bob80333 commented 5 years ago

It's managed to do more than 25 epochs on the padshort branch so far, so I think that has solved it. Thanks for the help!

seungwonpark commented 5 years ago

Thanks for letting me know! I'll merge the padshort branch into master.

deepconsc commented 4 years ago

@seungwonpark It's happening again, even though the data didn't contain any audio shorter than 16k samples. My workaround is this edit to the dataloader.py module:

https://gist.github.com/deepconsc/28d517597196e361faa2f07628ccf855

P.S. Don't change the batch size in validation, since in this phase most of the tensors have different shapes.

seungwonpark / melgan

Random crashes with custom dataset, tensor size mismatch #11