homink closed this issue 6 years ago.
I was hoping nobody would hit this. The CUDA error message is really hard to interpret (you can see more informative error messages if you disable CUDA). I believe the problem is that the input or decoder target length exceeded the maximum length in the model.
I added a sanity check to give a better error message when users hit this problem.
Traceback (most recent call last):
File "train.py", line 947, in <module>
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 605, in train
""".format(max_seq_len, hparams.max_positions))
RuntimeError: max_seq_len (186) >= max_posision (64)
Input text or decoder targget length exceeded the maximum length.
Please set a larger value for ``max_position`` in hyper parameters.
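The sanity check that produces the message above amounts to comparing the longest sequence in the batch against max_positions before it reaches the positional-encoding tables. A minimal sketch (function name is illustrative, not the repository's exact code):

```python
def check_max_positions(max_seq_len, max_positions):
    """Fail fast with a readable message instead of an opaque CUDA assert."""
    if max_seq_len >= max_positions:
        raise RuntimeError(
            "max_seq_len ({}) >= max_positions ({})\n"
            "Input text or decoder target length exceeded the maximum length.\n"
            "Please set a larger value for `max_positions` in hyper parameters."
            .format(max_seq_len, max_positions))

check_max_positions(63, 64)   # fine: longest sequence fits
try:
    check_max_positions(186, 64)  # the case in the traceback above
except RuntimeError as e:
    print(e)
```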
Could you try the latest master and see if it works?
That helps! I will let you know how it goes.
@r9y9, I'm running training with a Brazilian Portuguese dataset I created myself, and I hit the same problem as @homink. I followed your instructions to run on the CPU and increased max_positions to 4096, but the problem persists. @r9y9, do you have any other tips?
The error message when running on the CPU:
CUDA_VISIBLE_DEVICES=, python train.py --preset=./presets/deepvoice3_ljspeech.json --data-root=./datasets/processed_AS+JC+LN+RG/ --checkpoint-dir=./checkpoints-22-05
Command line args:
{'--checkpoint': None,
'--checkpoint-dir': './checkpoints-22-05',
'--checkpoint-postnet': None,
'--checkpoint-seq2seq': None,
'--data-root': './datasets/processed_AS+JC+LN+RG/',
'--help': False,
'--hparams': '',
'--load-embedding': None,
'--log-event-path': None,
'--preset': './presets/deepvoice3_ljspeech.json',
'--reset-optimizer': False,
'--restore-parts': None,
'--speaker-id': None,
'--train-postnet-only': False,
'--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
adam_beta1: 0.5
adam_beta2: 0.9
adam_eps: 1e-06
allow_clipping_in_normalization: True
amsgrad: False
batch_size: 16
binary_divergence_weight: 0.1
builder: deepvoice3
checkpoint_interval: 100
clip_thresh: 0.1
converter_channels: 256
decoder_channels: 256
downsample_step: 4
dropout: 0.050000000000000044
embedding_weight_std: 0.1
encoder_channels: 512
eval_interval: 100
fft_size: 1024
fmax: 8000
fmin: 0
force_monotonic_attention: True
freeze_embedding: False
frontend: ptbr
guided_attention_sigma: 0.2
hop_size: 256
ignore_recognition_level: 0
initial_learning_rate: 0.0005
kernel_size: 3
key_position_rate: 1.385
key_projection: True
lr_schedule: noam_learning_rate_decay
lr_schedule_kwargs: {}
masked_loss_weight: 0.5
max_positions: 4096
min_level_db: -100
min_text: 20
n_speakers: 4
name: deepvoice3
nepochs: 2000
num_mels: 80
num_workers: 2
outputs_per_step: 1
padding_idx: 0
pin_memory: True
power: 1.4
preemphasis: 0.97
priority_freq: 3000
priority_freq_weight: 0.0
process_only_htk_aligned: False
query_position_rate: 1.0
ref_level_db: 20
replace_pronunciation_prob: 0.5
rescaling: False
rescaling_max: 0.999
sample_rate: 22050
save_optimizer_state: True
speaker_embed_dim: 16
speaker_embedding_weight_std: 0.01
text_embed_dim: 256
trainable_positional_encodings: False
use_decoder_state_for_postnet_input: True
use_guided_attention: True
use_memory_mask: True
value_projection: True
weight_decay: 0.0
window_ahead: 3
window_backward: 1
Los event path: log/run-test2019-05-22_19:21:05.490590
100it [05:29, 3.22s/it]Save intermediate states at step 100
Saved checkpoint: ./checkpoints-22-05/checkpoint_step000000100.pth
Traceback (most recent call last):
File "train.py", line 981, in <module>
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 715, in train
eval_model(global_step, writer, device, model, checkpoint_dir, ismultispeaker)
File "train.py", line 404, in eval_model
model_eval, text, p=0, speaker_id=speaker_id, fast=True)
File "/home/fred/Documentos/deepvoice3_pytorch/synthesis.py", line 62, in tts
sequence, text_positions=text_positions, speaker_ids=speaker_ids)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/fred/Documentos/deepvoice3_pytorch/deepvoice3_pytorch/__init__.py", line 71, in forward
speaker_embed = self.embed_speakers(speaker_ids)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 117, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:193
My dataset has 4 speakers:
cat datasets/processed_AS+JC+LN+RG/train.txt | cut -d "|" -f 5 | uniq | awk '{if(m<$1) m=$1} END{print m}'
3
and:
cat datasets/processed_AS+JC+LN+RG/train.txt | cut -d "|" -f 5 | uniq | wc -l
4
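The same speaker-id check can be done in Python (a sketch; it assumes the train.txt layout used in this thread, with the speaker id in the 5th "|"-separated field):

```python
def speaker_id_stats(train_txt_path):
    """Return (number of distinct speaker ids, max speaker id) from train.txt.

    An embedding lookup requires max_id < n_speakers, so this is the first
    thing to verify when the speaker embedding raises 'index out of range'.
    """
    ids = set()
    with open(train_txt_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) >= 5:
                ids.add(int(fields[4]))
    return len(ids), max(ids)

# num_speakers, max_id = speaker_id_stats(
#     "./datasets/processed_AS+JC+LN+RG/train.txt")
```

Unlike the `uniq`-based shell pipeline, the set-based version also works when equal speaker ids are not adjacent in the file.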
deepvoice3_ljspeech.json is not for a multi-speaker dataset. Set n_speakers and builder=deepvoice3_multispeaker for a multi-speaker model. See https://github.com/r9y9/deepvoice3_pytorch/blob/6fb72bf7b7d53414493f1daeb215f2fb178fba78/presets/deepvoice3_vctk.json#L5-L6 for an example.
max_positions doesn't seem to be relevant in your case. It only matters if your dataset contains long sentences.
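A quick way to confirm a preset actually selects the multi-speaker path is to inspect the two fields mentioned above in the preset JSON (a sketch; the field names builder and n_speakers come from the hyperparameter dumps in this thread):

```python
import json

def assert_multispeaker(preset_path, dataset_speakers):
    """Check that a preset selects the multi-speaker builder and declares
    at least as many speakers as the dataset contains."""
    with open(preset_path, encoding="utf-8") as f:
        hp = json.load(f)
    assert hp.get("builder") == "deepvoice3_multispeaker", \
        "builder is {!r}, not deepvoice3_multispeaker".format(hp.get("builder"))
    assert hp.get("n_speakers", 1) >= dataset_speakers, \
        "n_speakers={} < dataset speakers={}".format(
            hp.get("n_speakers"), dataset_speakers)

# assert_multispeaker("./presets/deepvoice3_vctk.json", 4)
```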
@r9y9, following your instructions, unfortunately the same problem happened:
CUDA_VISIBLE_DEVICES=-1, python train.py --preset=./presets/deepvoice3_vctk.json --data-root=./datasets/processed_AS+JC+LN+RG/ --checkpoint-dir=./checkpoints-23-5
Command line args:
{'--checkpoint': None,
'--checkpoint-dir': './checkpoints-23-5',
'--checkpoint-postnet': None,
'--checkpoint-seq2seq': None,
'--data-root': './datasets/processed_AS+JC+LN+RG/',
'--help': False,
'--hparams': '',
'--load-embedding': None,
'--log-event-path': None,
'--preset': './presets/deepvoice3_vctk.json',
'--reset-optimizer': False,
'--restore-parts': None,
'--speaker-id': None,
'--train-postnet-only': False,
'--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
adam_beta1: 0.5
adam_beta2: 0.9
adam_eps: 1e-06
allow_clipping_in_normalization: True
amsgrad: False
batch_size: 8
binary_divergence_weight: 0.1
builder: deepvoice3_multispeaker
checkpoint_interval: 10
clip_thresh: 0.1
converter_channels: 256
decoder_channels: 256
downsample_step: 4
dropout: 0.050000000000000044
embedding_weight_std: 0.1
encoder_channels: 512
eval_interval: 10
fft_size: 1024
fmax: 8000
fmin: 0
force_monotonic_attention: True
freeze_embedding: False
frontend: en
guided_attention_sigma: 0.4
hop_size: 256
ignore_recognition_level: 0
initial_learning_rate: 0.0005
kernel_size: 3
key_position_rate: 7.6
key_projection: True
lr_schedule: noam_learning_rate_decay
lr_schedule_kwargs: {}
masked_loss_weight: 0.5
max_positions: 1024
min_level_db: -100
min_text: 20
n_speakers: 4
name: deepvoice3
nepochs: 2000
num_mels: 80
num_workers: 2
outputs_per_step: 1
padding_idx: 0
pin_memory: True
power: 1.4
preemphasis: 0.97
priority_freq: 3000
priority_freq_weight: 0.0
process_only_htk_aligned: False
query_position_rate: 2.0
ref_level_db: 20
replace_pronunciation_prob: 0.5
rescaling: False
rescaling_max: 0.999
sample_rate: 22050
save_optimizer_state: True
speaker_embed_dim: 16
speaker_embedding_weight_std: 0.05
text_embed_dim: 256
trainable_positional_encodings: False
use_decoder_state_for_postnet_input: True
use_guided_attention: True
use_memory_mask: True
value_projection: True
weight_decay: 0.0
window_ahead: 3
window_backward: 1
Los event path: log/run-test2019-05-23_10:03:02.916023
10it [00:17, 1.69s/it]Save intermediate states at step 10
Saved checkpoint: ./checkpoints-23-5/checkpoint_step000000010.pth
Traceback (most recent call last):
File "train.py", line 981, in <module>
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 715, in train
eval_model(global_step, writer, device, model, checkpoint_dir, ismultispeaker)
File "train.py", line 404, in eval_model
model_eval, text, p=0, speaker_id=speaker_id, fast=True)
File "/home/fred/Documentos/deepvoice3_pytorch/synthesis.py", line 62, in tts
sequence, text_positions=text_positions, speaker_ids=speaker_ids)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/fred/Documentos/deepvoice3_pytorch/deepvoice3_pytorch/__init__.py", line 71, in forward
speaker_embed = self.embed_speakers(speaker_ids)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 117, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/opt/anaconda3/envs/deepvoice3_pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:193
Could it be a problem with the dataset? I created the alignment.json files inside each speaker folder using the scripts from the multi-Speaker-tacotron-tensorflow repository:
python -m datasets.AS.prepare
python -m datasets.JC.prepare
python -m datasets.LN.prepare
python -m datasets.RG.prepare
Then I ran preprocessing:
python preprocess.py json_meta "./datasets/AS/alignment.json,./datasets/JC/alignment.json,./datasets/LN/alignment.json,./datasets/RG/alignment.json" "./datasets/processed_AS+JC+LN+RG" --preset=./presets/deepvoice3_vctk.json
Is there a better way to find out what is causing the error? Thanks a lot for the help!
Try to figure out what exact value is causing the out of range error.
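For example, every index tensor can be validated against its embedding table size on the Python side before the lookup, so the offending value is reported instead of the opaque assert. A generic sketch (in this codebase the failing lookup is self.embed_speakers(speaker_ids); the helper name is illustrative):

```python
def check_index_range(ids, num_embeddings, name="speaker_ids"):
    """Report the first id that would trip PyTorch's 'index out of range'
    assert in an embedding lookup of `num_embeddings` rows."""
    for i in ids:
        if not (0 <= i < num_embeddings):
            raise ValueError(
                "{}: id {} is out of range for an embedding table of size {}"
                .format(name, i, num_embeddings))

check_index_range([0, 1, 2, 3], 4)   # 4 speakers, table of 4 rows: fine
# check_index_range([0, 4], 4)       # would raise: id 4 >= table size 4
```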
Hi again,
I am applying this repository to a Korean speech corpus (http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464) and have encountered the following error. Could you have a look at it? I will be happy to open a PR once it is working.
I formatted the Korean corpus into .npy files in the same way as LJSpeech (as a single speaker) and ran training with a single GPU or multiple GPUs, but it shows a series of error messages like
Assertion srcIndex < srcSelectDimSize failed.
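This CUDA assertion is typically the GPU-side form of the same "index out of range" embedding error seen earlier in this thread: some index tensor (text symbol ids, positions, or speaker ids) contains a value greater than or equal to the corresponding table size. One way to narrow it down, especially with a new frontend for a new language, is to scan the encoded inputs before training (a generic sketch; how sequences are produced depends on the frontend in use):

```python
def max_symbol_id(sequences):
    """Largest symbol id the text frontend produced across all sequences.
    The text embedding lookup requires this to be < the vocabulary size."""
    return max(max(seq) for seq in sequences if seq)

# Illustrative sequences, as a frontend's text_to_sequence might emit them:
seqs = [[1, 5, 9], [2, 3], [7]]
assert max_symbol_id(seqs) == 9  # so the embedding needs at least 10 rows
```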