The error itself tells you the problem:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 99187200 bytes. Error code 12 (Cannot allocate memory)
Try a smaller batch size or reduce num_workers, and make sure there is enough free space in RAM.
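As a minimal sketch (not the project's actual training code), these are the same two knobs the advice above refers to, exposed directly on torch.utils.data.DataLoader; the dataset here is a throwaway placeholder:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the real TTS dataset.
dataset = TensorDataset(torch.randn(128, 80))

loader = DataLoader(
    dataset,
    batch_size=16,   # smaller batch -> smaller per-step CPU allocations
    num_workers=0,   # 0 = load in the main process, no worker copies or shared memory use
)

for batch in loader:
    pass  # training step would go here
```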
The thing is, with batch size = 24 it uses 31 GB of RAM and 27 GB of swap on the RTX 2080 Ti server by the end of the 2nd epoch. It seems like the dev branch does not release memory at all; it just consumes more and more. The V100 server has only 32 GB of RAM and no swap at all, which is why the error occurred there. Now, with num_workers=0, it consumes 12.5 GB of RAM on the Tesla V100 after the 1st epoch finishes. I'm afraid that by the end of the 2nd epoch it will ask for another 12.5 GB (25 GB in total), and so on. I will keep you updated!
I've checked: it is not consuming more memory. On the RTX it went down to ~10 GB of RAM; on the V100 it stayed at the same level, 12.5 GB (number of workers = 0). It seems I need to extend my RAM. But that is strange, isn't it? Is it really supposed to consume 64 GB?
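To confirm whether resident memory really keeps growing from epoch to epoch, a rough sketch like the following can be wrapped around the epoch loop (this assumes psutil is installed; the epoch loop itself is only a placeholder):

```python
import psutil

def log_rss(tag: str) -> None:
    # Resident set size of the current (main) process, in GB.
    rss_gb = psutil.Process().memory_info().rss / 1024**3
    print(f"[{tag}] resident memory: {rss_gb:.1f} GB")

for epoch in range(3):
    # ... run one training epoch here ...
    log_rss(f"epoch {epoch}")
```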
For people from the future: if you have the same problem, change gradual_training to null in the config file :)
Just checked with null, but it did not happen here. So I'd suggest anyone having this problem re-install the environment and PyTorch as the first step.
I was working for some time on an RTX 2080 Ti, and gradual_training is set in the config file to [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]], so the first batch size is set to 64, which broke my memory limit. After changing gradual_training to null in the config file, the batch size was set to its normal value (in my case 38) and everything worked fine :) [trained on Arch Linux with PyTorch 1.4, CUDA 10.1, and all the newest possible requirements]
So then the problem is not gradual_training itself; it is not setting the right batch size in the gradual training schedule, which was too big in your case.
Yup, it is not gradual_training per se but its basic config :) I'm just saying it for people with the same problem in the future, so they don't need to look for a solution in CUDA/PyTorch etc.
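For reference, a hedged sketch of the fix described above, done from Python with the standard json module. The config path is an assumption; only the gradual_training key quoted in this thread is touched, and the schedule entries appear to be [start_step, r, batch_size] (the third value is the batch size, per the comment above):

```python
import json

path = "config.json"  # assumed location of the training config

with open(path) as f:
    config = json.load(f)

# The 64s in the first schedule entries are the batch sizes that overflowed RAM above.
print("old schedule:", config.get("gradual_training"))

# Disable gradual training so the plain "batch_size" value is used instead.
config["gradual_training"] = None  # serialized as null in JSON

with open(path, "w") as f:
    json.dump(config, f, indent=4)
```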
Hello everyone. When training Tacotron2 from the dev branch on a single V100 16 GB GPU, near the end of the second epoch I got this error:
| > Step:598/770 GlobalStep:1370 PostnetLoss:0.03293 DecoderLoss:0.01078 StopLoss:0.17731 AlignScore:0.0933 GradNorm:0.32636 GradNormST:0.20265 AvgTextLen:140.4 AvgSpecLen:646.5 StepTime:2.75 LoaderTime:0.02 LR:0.000100
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
! Run is kept in /ssd/ya/outputs/V100-September-26-2019_08+54AM-53d658f
Traceback (most recent call last):
File "train.py", line 682, in <module>
main(args)
File "train.py", line 587, in main
ap, global_step, epoch)
File "train.py", line 165, in train
text_input, text_lengths, mel_input, speaker_ids=speaker_ids)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, kwargs)
File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/models/tacotron2.py", line 53, in forward
encoder_outputs, mel_specs, mask)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, *kwargs)
File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/layers/tacotron2.py", line 250, in forward
mel_output, stop_token, attention_weights = self.decode(memory)
File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/layers/tacotron2.py", line 232, in decode
stop_token = self.stopnet(stopnet_input.detach())
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in call
result = self.forward(input, kwargs)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/modules/dropout.py", line 54, in forward
return F.dropout(input, self.p, self.training, self.inplace)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/nn/functional.py", line 806, in dropout
else _VF.dropout(input, p, training))
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 32284) is killed by signal: Bus error.
(tts) root@server_ip:/opt/tts/TTS#
When I tried a second time, once again near the end of the 2nd epoch I got another error:
| > Step:638/770 GlobalStep:1410 PostnetLoss:0.04727 DecoderLoss:0.01074 StopLoss:0.15123 AlignScore:0.0929 GradNorm:0.29388 GradNormST:0.20302 AvgTextLen:148.7 AvgSpecLen:689.9 StepTime:3.27 LoaderTime:0.03 LR:0.000100
! Run is kept in /ssd/ya/outputs/V100-September-26-2019_10+02AM-53d658f
Traceback (most recent call last):
File "train.py", line 682, in <module>
main(args)
File "train.py", line 587, in main
ap, global_step, epoch)
File "train.py", line 113, in train
for num_iter, data in enumerate(data_loader):
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/dataloader.py", line 819, in next
return self._process_data(data)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/dataloader.py", line 846, in _process_data
data.reraise()
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/_utils.py", line 369, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/opt/tts/lib/python3.6/site-packages/torch-1.2.0-py3.6-linux-x86_64.egg/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/tts/lib/python3.6/site-packages/TTS-0.0.1+53d658f-py3.6.egg/TTS/datasets/TTSDataset.py", line 221, in collate_fn
linear = torch.FloatTensor(linear).contiguous()
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 99187200 bytes. Error code 12 (Cannot allocate memory)
What could the reason be? The files train.py and the others are unchanged, except for some of the utils/text Python scripts. Batch size is 32, number of workers = 4.
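Since the first traceback blames insufficient shared memory (shm), which the DataLoader workers use to hand batches to the main process, a quick hedged check of /dev/shm (Linux only) may help decide between enlarging shm and falling back to num_workers=0, which earlier comments in this thread already used as a workaround:

```python
import shutil

# Free vs. total space on the shared-memory mount used by DataLoader workers.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 1024**3:.1f} GB free of {total / 1024**3:.1f} GB")

# If free shm is small, either enlarge it (e.g. remount /dev/shm with a larger
# size, or pass --shm-size to docker run when training in a container), or set
# num_workers=0 so batches are built in the main process and shm is not used.
```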