keonlee9420 / StyleSpeech

PyTorch Implementation of Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation
MIT License

The size of tensor a (xx) must match the size of tensor b (yy) #3

Closed DiDimus closed 3 years ago

DiDimus commented 3 years ago

Hi, I'm trying to run your project. I'm using CUDA 10.1, all requirements are installed (with torch 1.8.1), and all models are downloaded. But I get an error when running:

    python3 synthesize.py --text "Hello world" --restore_step 200000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml --duration_control 0.8 --energy_control 0.8 --ref_audio ref.wav

Removing weight norm...
Raw Text Sequence: Hello world
Phoneme Sequence: {HH AH0 L OW1 W ER1 L D}
Traceback (most recent call last):
  File "synthesize.py", line 268, in <module>
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
  File "synthesize.py", line 152, in synthesize
    d_control=duration_control
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/work/model/StyleSpeech.py", line 144, in forward
    d_control,
  File "/usr/local/work/model/StyleSpeech.py", line 91, in G
    output, mel_masks = self.mel_decoder(output, style_vector, mel_masks)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/work/model/modules.py", line 307, in forward
    enc_seq = self.mel_prenet(enc_seq, mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/work/model/modules.py", line 259, in forward
    x = x.masked_fill(mask.unsqueeze(-1), 0)
RuntimeError: The size of tensor a (44) must match the size of tensor b (47) at non-singleton dimension 1
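For context, the final frame of the traceback fails because `masked_fill` requires the mask to broadcast against the tensor, and here the two disagree at dimension 1. A minimal standalone reproduction (the shapes mirror the 44-vs-47 case above; the tensor contents are dummy values):

```python
import torch

x = torch.randn(1, 47, 256)                  # decoder input: 47 mel frames
mask = torch.zeros(1, 44, dtype=torch.bool)  # padding mask built for only 44 frames

try:
    x.masked_fill(mask.unsqueeze(-1), 0)     # broadcasting needs matching sizes at dim 1
except RuntimeError as e:
    print(e)                                 # same "size of tensor a must match tensor b" error
```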
keonlee9420 commented 3 years ago

Hi @DiDimus, if you have several GPUs in your machine, please try specifying a GPU index, e.g. CUDA_VISIBLE_DEVICES=0 for the first GPU.

DiDimus commented 3 years ago

Thanks, but the result is exactly the same. I think the problem is in the software environment. Do you have a Docker image for this project? Which OS do you use?

keonlee9420 commented 3 years ago

Gotcha, you can refer to this: https://github.com/keonlee9420/Daft-Exprt/blob/main/Dockerfile

I think the Dockerfile should also work for this project. Please try it out and let me know the result.

Vadim2S commented 3 years ago

This is clearly a bug in the project code involving the predicted tensor size:

With duration_control = 0.3: RuntimeError: The size of tensor a (25) must match the size of tensor b (31). x shape is torch.Size([1, 31, 256]); mask shape is torch.Size([1, 25]). The right value is 31 (104 * 0.3).

With duration_control = 0.5: RuntimeError: The size of tensor a (47) must match the size of tensor b (52). x shape is torch.Size([1, 52, 256]); mask shape is torch.Size([1, 47]). The right value is 52 (104 * 0.5).

With duration_control = 1.0, everything is OK: x shape is torch.Size([1, 104, 256]); mask shape is torch.Size([1, 104]).

With duration_control = 2.0, everything is OK: x shape is torch.Size([1, 208, 256]); mask shape is torch.Size([1, 208]).
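The shortfall is consistent with per-phoneme truncation: expanding each phoneme by int(duration * d_control) yields fewer total frames than scaling the total length, while integer controls lose nothing. A small arithmetic sketch (the per-phoneme durations are hypothetical, chosen only to sum to 104 frames like the example above):

```python
# Hypothetical per-phoneme durations (frames); total is 104, as in the report above.
durations = [5, 3, 7, 4, 6, 5, 8, 4, 6, 7, 5, 4, 6, 5, 8, 4, 6, 7, 4]
d_control = 0.5

# Expanding each phoneme with int(d * d_control), as LengthRegulator.expand does:
frames_per_phoneme = [max(int(d * d_control), 0) for d in durations]
actual_len = sum(frames_per_phoneme)            # sum of truncated values: 48

# Length expected from scaling the total instead:
expected_len = int(sum(durations) * d_control)  # 52

print(actual_len, expected_len)  # the two lengths disagree, hence the size error
```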

DiDimus commented 3 years ago

Yes, @Vadim2S, problem found, thanks. The Dockerfile from Daft-Exprt didn't help :(

keonlee9420 commented 3 years ago

Hey guys, I just found that you had an issue with control values lower than 1. Sorry for the late correction, and thanks to @Vadim2S, I can confirm that there is an error in the current code. I'll fix it and push it soon. Thank you all for the report!

Vadim2S commented 3 years ago

Temporary workaround:

In /model/modules.py, around line 177, class LengthRegulator(nn.Module):

change

    if max_len is not None:
        output = pad(output, max_len)
    else:
        output = pad(output)

to:

    if max_len is not None:
        output = pad(output, max_len)
        #VVS
        mel_len.clear()
        mel_len.append(output.shape[1])
    else:
        output = pad(output)

P.S. The duration prediction itself is correct, and in LengthRegulator.expand you do

    for i, vec in enumerate(batch):
        expand_size = predicted[i].item()
        out.append(vec.expand(max(int(expand_size), 0), -1))
    out = torch.cat(out, 0)

so of course out ends up smaller than max_len due to rounding. I presume you must extend out to max_len later.

keonlee9420 commented 3 years ago

I fixed the code and it's working now. The problem originated from the value of max_len at inference time in VarianceAdaptor: it should be None, but the max_len of the reference audio was wrongly passed instead.
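A hedged sketch of what that fix amounts to (the pad helper and shapes below are illustrative, not the repo's exact code): with max_len=None the regulator output is padded only to the longest predicted sequence, so it stays aligned with the mask built from the predicted lengths, instead of being stretched to the reference audio's mel length.

```python
import torch
import torch.nn.functional as F

def pad(batch, max_len=None):
    # Pad a list of [T_i, H] tensors to a common length along dim 0.
    if max_len is None:
        max_len = max(x.shape[0] for x in batch)
    return torch.stack([F.pad(x, (0, 0, 0, max_len - x.shape[0])) for x in batch])

predicted = [torch.randn(48, 256)]  # expanded decoder input: 48 predicted frames
ref_mel_len = 52                    # reference audio mel length (wrongly used before)

wrong = pad(predicted, ref_mel_len)  # [1, 52, 256] -> mismatches a 48-frame mask
right = pad(predicted)               # [1, 48, 256] -> matches the predicted length
print(wrong.shape, right.shape)
```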

Vadim2S commented 3 years ago

Thanks! Tested. Low duration_control values work OK!