mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

Unexpected bus error encountered in worker #289

Closed. vcjob closed this issue 5 years ago.

vcjob commented 5 years ago

Hello everyone. When training Tacotron2 from the dev branch on a single V100 16 GB GPU, near the end of the second epoch I got this error:

    | > Step:598/770 GlobalStep:1370 PostnetLoss:0.03293 DecoderLoss:0.01078 StopLoss:0.17731 AlignScore:0.0933 GradNorm:0.32636 GradNormST:0.20265 AvgTextLen:140.4 AvgSpecLen:646.5 StepTime:2.75 LoaderTime:0.02 LR:0.000100
    ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    ! Run is kept in /ssd/ya/outputs/V100-September-26-2019_08+54AM-53d658f
    Traceback (most recent call last):
      File "train.py", line 682, in <module>
        main(args)
      File "train.py", line 587, in main
        ap, global_step, epoch)
      File "train.py", line 165, in train
        text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/models/tacotron2.py", line 53, in forward
        encoder_outputs, mel_specs, mask)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/layers/tacotron2.py", line 250, in forward
        mel_output, stop_token, attention_weights = self.decode(memory)
      File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/layers/tacotron2.py", line 232, in decode
        stop_token = self.stopnet(stopnet_input.detach())
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/container.py", line 92, in forward
        input = module(input)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/dropout.py", line 54, in forward
        return F.dropout(input, self.p, self.training, self.inplace)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/functional.py", line 806, in dropout
        else _VF.dropout(input, p, training))
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/_utils/signal_handling.py", line 66, in handler
        _error_if_any_worker_fails()
    RuntimeError: DataLoader worker (pid 32284) is killed by signal: Bus error.
    (tts) root@server_ip:/opt/tts/TTS#


When I tried a second time, once again near the end of the 2nd epoch I got another error:

    | > Step:638/770 GlobalStep:1410 PostnetLoss:0.04727 DecoderLoss:0.01074 StopLoss:0.15123 AlignScore:0.0929 GradNorm:0.29388 GradNormST:0.20302 AvgTextLen:148.7 AvgSpecLen:689.9 StepTime:3.27 LoaderTime:0.03 LR:0.000100
    ! Run is kept in /ssd/ya/outputs/V100-September-26-2019_10+02AM-53d658f
    Traceback (most recent call last):
      File "train.py", line 682, in <module>
        main(args)
      File "train.py", line 587, in main
        ap, global_step, epoch)
      File "train.py", line 113, in train
        for num_iter, data in enumerate(data_loader):
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/dataloader.py", line 819, in __next__
        return self._process_data(data)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/dataloader.py", line 846, in _process_data
        data.reraise()
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/_utils.py", line 369, in reraise
        raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in DataLoader worker process 1.
    Original Traceback (most recent call last):
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
        data = fetcher.fetch(index)
      File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/_utils/fetch.py", line 47, in fetch
        return self.collate_fn(data)
      File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/datasets/TTSDataset.py", line 221, in collate_fn
        linear = torch.FloatTensor(linear).contiguous()
    RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 99187200 bytes. Error code 12 (Cannot allocate memory)

What could be the reason? train.py and the other files are unchanged, except for some of the utils/text Python scripts. Batch size is 32, number of workers is 4.
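
For context on the first failure: the "insufficient shared memory (shm)" hint refers to /dev/shm, which PyTorch DataLoader workers use to hand batches back to the main process. A minimal, hedged sketch for checking whether shared memory or RAM is the tight resource; the spectrogram dimensions below are rough guesses based on the AvgSpecLen in the log, not exact values from the run:

    # Minimal sketch: check free shared memory and estimate per-batch size.
    # The batch dimensions are illustrative guesses, not values from the failing run.
    import shutil

    shm = shutil.disk_usage("/dev/shm")  # DataLoader workers pass tensors through /dev/shm
    print(f"/dev/shm free: {shm.free / 1e9:.2f} GB of {shm.total / 1e9:.2f} GB")

    # Rough size of one float32 mel batch: batch_size x frames x n_mels x 4 bytes.
    batch_size, avg_frames, n_mels = 32, 650, 80
    batch_mb = batch_size * avg_frames * n_mels * 4 / 1e6
    print(f"~{batch_mb:.0f} MB per mel batch; workers prefetch several of these at once")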

erogol commented 5 years ago

The error message tells you the problem:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 99187200 bytes. Error code 12 (Cannot allocate memory)

Try a smaller batch size or reduce num_workers, and make sure there is enough free RAM.
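
A minimal sketch of what that suggestion amounts to in PyTorch terms, using a dummy dataset as a stand-in for the project's real TTSDataset; the numbers are only examples:

    # Hedged sketch: a smaller batch and fewer workers reduce how many batch-sized
    # buffers sit in RAM/shared memory at once. DummySpectrogramDataset is a
    # stand-in for the real dataset, not part of the TTS codebase.
    import torch
    from torch.utils.data import DataLoader, Dataset

    class DummySpectrogramDataset(Dataset):
        def __init__(self, n_items=64, frames=650, n_mels=80):
            self.n_items, self.frames, self.n_mels = n_items, frames, n_mels

        def __len__(self):
            return self.n_items

        def __getitem__(self, idx):
            return torch.randn(self.frames, self.n_mels)

    loader = DataLoader(
        DummySpectrogramDataset(),
        batch_size=16,   # e.g. down from 32
        num_workers=2,   # e.g. down from 4; each worker prefetches whole batches
        pin_memory=True,
    )

    if __name__ == "__main__":
        for batch in loader:
            pass  # a training step would go here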

vcjob commented 5 years ago

The error message tells you the problem:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 99187200 bytes. Error code 12 (Cannot allocate memory)

Try a smaller batch size or reduce num_workers, and make sure there is enough free RAM.

The thing is, with a batch size of 24 it uses 31 GB of RAM and 27 GB of swap on the RTX 2080 Ti server by the end of the 2nd epoch. It seems like the dev branch does not release memory at all; it just consumes more and more. On the V100 server we have only 32 GB of RAM and no swap at all, which is why we hit that error. Now, with num_workers=0, it consumes 12.5 GB of RAM on the Tesla V100 after the 1st epoch finishes. I'm afraid that by the end of the 2nd epoch it will ask for another 12.5 GB (25 GB in total), and so on. I will keep you updated!
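
To tell steady growth from a one-time plateau, logging the resident set size once per epoch settles it. A hedged sketch; psutil is an extra dependency, not part of the repo:

    # Hedged sketch: log resident memory once per epoch to see whether it keeps
    # growing or levels off. Requires `pip install psutil`; not part of TTS.
    import os
    import psutil

    def log_memory(epoch):
        rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
        print(f" > epoch {epoch}: resident memory {rss_gb:.1f} GB")

    # e.g. call log_memory(epoch) at the end of each epoch in the training loop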

vcjob commented 5 years ago

I've checked, and it is not consuming more memory. On the RTX it went down to ~10 GB of RAM; on the V100 it stayed at the same level, 12.5 GB (number of workers = 0). It seems I need to extend my RAM. But that is strange, isn't it? Is it really supposed to consume 64 GB?

machineko commented 4 years ago

For people from the future: if you have the same problem, change gradual_training to null in the config file :)

erogol commented 4 years ago

Just checked with null, but it did not happen for me. So I'd suggest that anyone having this problem re-install the environment and PyTorch as a first step.

machineko commented 4 years ago

I was working for some time on an RTX 2080 Ti, and gradual_training was set in the config file to [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]], so the first batch size is set to 64, and that broke my memory limit. After changing gradual_training to null in the config file, the batch size was set back to the normal value (in my case 38) and everything worked fine :) [trained on Arch Linux with PyTorch 1.4, CUDA 10.1, and the newest possible requirements]
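
For anyone reading that schedule: as far as I can tell, each entry is [start_step, r, batch_size], and training uses the last entry whose start_step has already been reached, so the run starts at batch size 64 regardless of the top-level batch_size. A rough sketch of that lookup (my reading of it, not the project's actual code):

    # Rough interpretation of the gradual_training schedule: each entry is
    # [start_step, r, batch_size]; the active entry is the last one whose
    # start_step has been passed. My reading, not the project's actual code.
    def current_schedule(global_step, schedule):
        active = tuple(schedule[0])
        for start_step, r, batch_size in schedule:
            if global_step >= start_step:
                active = (start_step, r, batch_size)
        return active

    schedule = [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]]
    print(current_schedule(0, schedule))      # (0, 7, 64)   -> batch size 64 from the very first step
    print(current_schedule(60000, schedule))  # (50000, 3, 32)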

erogol commented 4 years ago

So then the problem is not gradual_training itself; it is not setting the right batch size in the gradual training schedule, which was too big in your case.

machineko commented 4 years ago

Yup, it is not gradual_training per se but its basic config :) I'm just saying it for people with the same problem in the future, so they don't need to look for a solution in CUDA/PyTorch etc.