tuong-olli opened this issue 3 years ago
Got 8192 and 8119 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612
I think this is more likely to be an incorrectly padded audio tensor. There shouldn't ever be a spectrogram that long being collated.
That said, I'm looking at the code and I can't figure out how this issue can occur. If you're able to provide more details, that'd help a lot.
1. Which tacotron2 repo are you using, and what hop_length are you using for both tacotron2 and hifigan?
I used the parameters below to train successfully with my audio data.
"segment_size": 8192,
"num_mels": 80,
"num_freq": 513,
"n_fft": 1024,
"hop_size": 256,
"win_size": 1024,
After that, I synthesized mel spectrograms from the Tacotron 2 model on my dataset to train with --fine_tuning True
and the same parameters, but it errored:
Using a target size (torch.Size([1, 80, 759])) that is different to the input size (torch.Size([1, 80, 815])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
To fix this error I added padding in train.py:

if y_mel.size(2) > y_g_hat_mel.size(2):
    # zero-pad the generated mel along the time axis
    y_g_hat_mel = torch.nn.functional.pad(y_g_hat_mel, (0, y_mel.size(2) - y_g_hat_mel.size(2)), 'constant')
elif y_mel.size(2) < y_g_hat_mel.size(2):
    # zero-pad the ground-truth mel instead
    y_mel = torch.nn.functional.pad(y_mel, (0, y_g_hat_mel.size(2) - y_mel.size(2)), 'constant')

But it still errors as above.
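Before patching train.py, it may be worth verifying that each fine-tuning mel actually matches its audio, since a mismatched pair is what yields segments shorter than segment_size at collation time. A minimal sketch, assuming hypothetical wavs/ and mels/ directories of paired .wav/.npy files and hop_size = 256:

import os
import numpy as np
from scipy.io.wavfile import read

hop_size = 256
for name in os.listdir('wavs'):                       # hypothetical layout
    sr, audio = read(os.path.join('wavs', name))
    mel = np.load(os.path.join('mels', name.replace('.wav', '.npy')))
    expected = len(audio) // hop_size                 # frames implied by the audio
    if abs(mel.shape[-1] - expected) > 1:
        print(name, mel.shape[-1], 'frames vs', expected, 'expected')

Any file this flags is a candidate for the 8192-vs-8119 collation error.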
I have a very similar issue. I generated the mel files using tacotron2.
train.py:199: UserWarning: Using a target size (torch.Size([1, 80, 240])) that is different to the input size (torch.Size([1, 80, 235])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 267, in main
    train(0, a, h)
  File "train.py", line 199, in train
    val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
  File "C:\Anaconda3\lib\site-packages\torch\nn\functional.py", line 2633, in l1_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "C:\Anaconda3\lib\site-packages\torch\functional.py", line 71, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)  # type: ignore
RuntimeError: The size of tensor a (235) must match the size of tensor b (240) at non-singleton dimension 2
You can try the fix from this fork (line 245).
Replace line 199:
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel).item()
with:
val_err_tot += F.l1_loss(y_mel, y_g_hat_mel[:,:,:y_mel.size(2)]).item()
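Note that this slice only handles the case where y_g_hat_mel comes out longer than y_mel. A symmetric variant (a sketch, not taken from the fork) trims both tensors to their common length:

# Trim both spectrograms to the shared number of frames before the loss.
n = min(y_mel.size(2), y_g_hat_mel.size(2))
val_err_tot += F.l1_loss(y_mel[:, :, :n], y_g_hat_mel[:, :, :n]).item()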
This seems to solve the size mismatch between the two mel spectrograms, but when I compare the difference between y_mel and y_g_hat_mel, it is larger than when using mels computed from the audio (rather than mels from Tacotron 2 teacher forcing). Can you share whether your results were okay?
I synthesized mel spectrograms from the Tacotron 2 model on my dataset to train with --fine_tuning True and the same parameters, but it errored.
Is the mel you get from Tacotron2 from the training step or the inference step? To my knowledge, it comes from teacher forcing during training.
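If it was produced with teacher forcing, the Tacotron2 mel should have exactly as many frames as the ground-truth mel it was conditioned on, whereas at inference the decoder stops on its own gate and the lengths drift. A one-off check (hypothetical paths) makes the distinction obvious:

import numpy as np

gt = np.load('mels_gt/LJ001-0001.npy')    # mel computed from the audio (hypothetical path)
tt2 = np.load('mels_tt2/LJ001-0001.npy')  # mel exported from Tacotron 2 (hypothetical path)
print(gt.shape[-1], tt2.shape[-1])        # equal under teacher forcing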
Did you resolve this issue?
The shape of the mel output from Tacotron2 is larger than the mel extracted from the audio, and the model still has this issue.
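One workaround, assuming the extra frames are trailing silence or padding (a sketch with hop_size = 256 as in the config above, not a confirmed fix), is to trim or pad each Tacotron2 mel to the frame count implied by its audio before training:

import numpy as np

def match_length(tt2_mel, n_audio_samples, hop_size=256):
    # Frame count the vocoder expects for this utterance.
    target = n_audio_samples // hop_size
    if tt2_mel.shape[-1] > target:    # drop trailing frames
        return tt2_mel[..., :target]
    if tt2_mel.shape[-1] < target:    # repeat-pad the last frame
        pad = target - tt2_mel.shape[-1]
        return np.pad(tt2_mel, [(0, 0)] * (tt2_mel.ndim - 1) + [(0, pad)], mode='edge')
    return tt2_mel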