r9y9 / deepvoice3_pytorch

PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models
https://r9y9.github.io/deepvoice3_pytorch/

What is the latest known pytorch version to train for speaker adaptation on Windows10? #194

Open RaghothamRao opened 4 years ago

RaghothamRao commented 4 years ago

Hi, I wanted to give some background before raising this issue. Background:

  1. Windows 10 machine with GeForce GTX 1060 (6GB) GPU
  2. Each time, I created a fresh Python 3.6 conda environment and tried a different combination of pytorch and cudatoolkit to see whether the code at a particular git commit or on master worked.
  3. My first attempt, with pytorch 1.4 (cudatoolkit 9), failed to adapt the pre-trained LJSpeech model to a new speaker. [The code used was from git commit "abf0a21f83aeb451b918f867bc23378f1e2e608b".]
  4. Later, I learned from the issue https://github.com/r9y9/deepvoice3_pytorch/issues/173 that pytorch 1.1 with cuda 10 works. I tried pytorch 1.1 with cuda 9 instead and was able to run training for the first time on a few of my custom voice samples. On subsequent runs, I got "RuntimeError: CUDA error: unknown error", even after rebooting several times. To fix this, I put "torch.cuda.current_device()" after "import torch" in train.py, as per https://github.com/pytorch/pytorch/issues/21114. That error went away, but I then hit other errors (listed below) and have had no luck since with any of the pytorch and cudatoolkit combinations below.
  5. I could not try pytorch 1.3, as it does not seem to be listed on https://pytorch.org/ under either the current or the previous versions.
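The workaround from item 4 can be sketched roughly as follows. This is a minimal illustration, not the actual change made to train.py: the helper name `init_cuda_early` is hypothetical, and the import guard is only there so the snippet also runs on a machine without PyTorch or without a GPU.

```python
import importlib.util


def init_cuda_early():
    """Touch the CUDA runtime right after importing torch.

    Returns the current CUDA device index, or None when PyTorch is not
    installed or no CUDA device is visible. (Hypothetical helper.)
    """
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment
    import torch

    if not torch.cuda.is_available():
        return None  # CPU-only build or no usable GPU
    # The call suggested in pytorch/pytorch#21114; it triggers CUDA
    # initialization before any other CUDA work is done.
    return torch.cuda.current_device()


device_index = init_cuda_early()
print("CUDA device index:", device_index)
```

In train.py itself the equivalent would simply be the single call placed directly after `import torch`, as described above.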

A few pytorch/cudatoolkit combinations and the errors they produced:

  1. pytorch 1.4 & cuda 9.2 (using code on git commit)

     File "train.py", line 983, in <module>
       train_seq2seq=train_seq2seq, train_postnet=train_postnet)
     File "train.py", line 589, in train
       in tqdm(enumerate(data_loader)):
     File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\tqdm\std.py", line 1107, in __iter__
       for obj in iterable:
     File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
       data = self._next_data()
     File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 856, in _next_data
       return self._process_data(data)
     File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\dataloader.py", line 881, in _process_data
       data.reraise()
     File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\_utils.py", line 394, in reraise
       raise self.exc_type(msg)
     RuntimeError: Caught RuntimeError in pin memory thread for device 0.
     Original Traceback (most recent call last):
       File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 31, in _pin_memory_loop
         data = pin_memory(data)
       File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in pin_memory
         return [pin_memory(sample) for sample in data]
       File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in <listcomp>
         return [pin_memory(sample) for sample in data]
       File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in pin_memory
         return [pin_memory(sample) for sample in data]
       File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 55, in <listcomp>
         return [pin_memory(sample) for sample in data]
       File "C:\ProgramData\Anaconda3\envs\DeepVoice3\lib\site-packages\torch\utils\data\_utils\pin_memory.py", line 47, in pin_memory
         return data.pin_memory()
     RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.

  2. pytorch 1.4 & cuda 9.2 (using code on master branch)

     File "train.py", line 1017, in <module>
       train_seq2seq=train_seq2seq, train_postnet=train_postnet)
     File "train.py", line 723, in train
       priority_w=hparams.priority_freq_weight)
     File "train.py", line 557, in spec_loss
       l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
     File "C:\ProgramData\Anaconda3\envs\DV3pip\lib\site-packages\torch\nn\modules\module.py", line 532, in __call__
       result = self.forward(*input, **kwargs)
     File "train.py", line 290, in forward
       loss = self.criterion(input * mask_, target * mask_)
     RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2

  3. pytorch 1.1.0, torchvision 0.3.0 & cudatoolkit 9 (with master as well as the particular git commit)
     RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2

  4. pytorch==1.2.0, torchvision==0.4.0, cudatoolkit=10.0 (with code on git commit)
     RuntimeError: reduce failed to synchronize: device-side assert triggered

  5. pytorch==1.2.0, torchvision==0.4.0, cudatoolkit=10.0, installed with -c pytorch (with master)
     RuntimeError: Expected object of scalar type Float but got scalar type Long for argument #2 'other'

  6. conda install pytorch==1.0.0 torchvision==0.2.1 cuda80 -c pytorch (on git commit)
     RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
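A note on the recurring 513 vs 1025 sizes: if both numbers are bin counts of a one-sided STFT (an assumption; the thread does not confirm it), they correspond to FFT sizes of 1024 and 2048, which would point to a spectrogram setting that differs between the two tensors rather than to the pytorch version. The arithmetic can be checked with a small sketch; the function names here are illustrative and not taken from the repository.

```python
def stft_bins(fft_size):
    """Number of frequency bins in the one-sided spectrum of a real signal."""
    return fft_size // 2 + 1


def fft_size_from_bins(num_bins):
    """Invert stft_bins, assuming an even FFT size."""
    return (num_bins - 1) * 2


# The two sizes reported in the error message.
for bins in (513, 1025):
    print(bins, "bins -> fft_size", fft_size_from_bins(bins))
```

Under that assumption, comparing the FFT size used when the features were extracted with the one the model was configured for would be a reasonable first check.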

Could someone kindly advise which pytorch and cudatoolkit combination this code works with when using the LJSpeech pre-trained model?

RaghothamRao commented 4 years ago

An update: I also tried on Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-1095-aws x86_64) with pytorch 1.1.0, torchvision 0.3.0 and cudatoolkit 10. No luck yet with training the LJSpeech pre-trained model for speaker adaptation.

Traceback (most recent call last):
  File "train.py", line 984, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 689, in train
    priority_w=hparams.priority_freq_weight)
  File "train.py", line 523, in spec_loss
    l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
  File "/home/ubuntu/anaconda3/envs/pytorch1_1cuda10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "train.py", line 292, in forward
    loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (513) must match the size of tensor b (1025) at non-singleton dimension 2
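For readers unfamiliar with this message: it is raised when two tensors in an element-wise operation cannot be broadcast together. The sketch below mimics that check in plain Python so it runs without pytorch. The batch size 16 and frame count 200 are made-up values; only 513 and 1025 come from the traceback, and the function name is hypothetical.

```python
def first_broadcast_mismatch(shape_a, shape_b):
    """Return the index of a dimension where two shapes cannot broadcast.

    Shapes are right-aligned and scanned from the trailing dimension;
    two sizes are compatible when they are equal or one of them is 1.
    Returns None when the shapes are broadcast-compatible.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    for dim in reversed(range(ndim)):
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            return dim
    return None


# Hypothetical (batch, frames, bins) shapes for the two operands.
print(first_broadcast_mismatch((16, 200, 513), (16, 200, 1025)))
```

Dimension 2 here is the last axis of a three-dimensional tensor, so the two operands disagree on their final size while the leading dimensions match.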