keonlee9420 / Parallel-Tacotron2

PyTorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
MIT License

Have you solved the bugs in soft-DTW? #14

Open Charlottecuc opened 3 years ago

Charlottecuc commented 3 years ago

Hi. The work is amazing. I noticed you mentioned in "Updates" that there were some bugs in soft-DTW. Have you solved these problems yet?

keonlee9420 commented 3 years ago

Hi @Charlottecuc. The soft-DTW issue I mentioned in the "Updates" section is already fixed, so you can use the current implementation of soft-DTW for the loss. However, I'm still searching for the proper hyperparameters for LJSpeech (especially the learning rate and KL annealing). It may take more time than I expected since I'm currently engaged in several projects. I'll do my best to have the pre-trained model ready soon!

elch10 commented 3 years ago

Hi, I have problems training this. The model always predicts a "mean" spectrogram.

I think the problem is soft-DTW. Do you have any idea how I can fix it, or how I can track down the problem?

[image]

hhguo commented 3 years ago

I also have this problem when using the original configuration in your code.

elch10 commented 3 years ago

@hhguo do you have any solution for that?

hhguo commented 3 years ago

No... For now, I can still only train it with guided durations.

taras-sereda commented 3 years ago

@keonlee9420 overall great work, thanks for sharing your implementation! However, the reference implementation of soft-DTW has a limitation: it supports only batches of sequences of the same length. This is mentioned here: https://github.com/Maghoumi/pytorch-softdtw-cuda. This limitation is still present in the soft-DTW implementation in this repo, hence the computed gradients will be wrong for sequences of various lengths.

Adding support for sequences of various lengths shouldn't be super hard, though. Perhaps this limitation is one of the reasons why the model is not converging.

keonlee9420 commented 3 years ago

Hi @taras-sereda, thank you for your interest. I think the current implementation already takes this into account. Could you elaborate a bit more?

taras-sereda commented 3 years ago

hey @keonlee9420, the current implementation does not handle variable-length sequences inside the S-DTW part.

To be more specific:

# targets shape - 2 x 200 x 80
target_lengths = [200, 180]

# preds shape - 2 x 180 x 80
pred_lengths = [180, 160]

Samples that are shorter than the max length of the spectrogram are padded with zeros, and when S-DTW is computed, it includes the cost of aligning the trailing zeros in the target and predicted spectrograms, which is wrong.

Here is a simple test to prove my point on a single example where the only difference is padding. The S-DTW loss should then be the same for both the padded and un-padded sequences. TL;DR: it's not.

import torch

sdtw_loss = SDTWLoss()  # the repo's soft-DTW loss (name per the original snippet)
target = torch.rand(1, 100, 80)
pred = torch.rand(1, 90, 80)
loss_1 = sdtw_loss(pred, target)

# pad both sequences with 5 extra frames of zeros
target_padded = torch.cat([target, torch.zeros(1, 5, 80)], dim=1)
pred_padded = torch.cat([pred, torch.zeros(1, 5, 80)], dim=1)
loss_2 = sdtw_loss(pred_padded, target_padded)

assert loss_1 == loss_2  # fails: padding changes the loss

I'm working on adding support for variable-length sequences and already have some work done, but it's still too early to share it.
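In the meantime, a naive per-utterance workaround is possible. The sketch below assumes the batched SoftDTW class from the Maghoumi repo linked above is available as soft_dtw_cuda.py (it already allows pred and target to have different lengths within a single pair); it is slower than a proper batched variable-length kernel, but trailing padding never enters the alignment cost:

import torch
from soft_dtw_cuda import SoftDTW  # from the Maghoumi repo linked above

sdtw = SoftDTW(use_cuda=False, gamma=0.1)

def sdtw_loss_varlen(pred, target, pred_lengths, target_lengths):
    # slice each (pred, target) pair down to its true length before
    # computing soft-DTW, one call per utterance
    losses = []
    for b in range(pred.size(0)):
        p = pred[b : b + 1, : pred_lengths[b]]      # (1, T_pred_b, 80)
        t = target[b : b + 1, : target_lengths[b]]  # (1, T_tgt_b, 80)
        losses.append(sdtw(p, t))
    return torch.cat(losses).mean()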

So the lack of support for variable-length sequences in S-DTW might be one of the reasons why the model is not converging. I've also observed that S-DTW's gradients depend heavily on the value of gamma and on the difference in total duration between the target and predicted sequences. If this difference is high (30-40 frames), gamma should be set to 0.5; if the difference gets lower, around 5 frames, then it's possible to make the softmin smoother by setting gamma to 0.05. But too-soft gamma values can lead to vanishing gradients when the duration predictor is doing a poor job.
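For reference, the knob being discussed here is the soft minimum inside soft-DTW, softmin_gamma(a) = -gamma * log(sum_i exp(-a_i / gamma)). A standalone sketch (not tied to this repo's code) of how gamma controls its sharpness:

import torch

def soft_min(costs, gamma):
    # soft-DTW's differentiable minimum; gamma -> 0 recovers the hard min,
    # larger gamma blends the costs of all alignment paths
    return -gamma * torch.logsumexp(-costs / gamma, dim=-1)

costs = torch.tensor([1.0, 1.2, 3.0])
print(soft_min(costs, 0.5))   # ~0.74: smooth, well below the hard min
print(soft_min(costs, 0.05))  # ~1.00: nearly the hard min of 1.0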

Curious to hear your observations on the training dynamics of this model.

elch10 commented 2 years ago

I also have problems with the duration predictor and the residual encoder. The duration predictor mostly predicts one value for every input symbol, with small variation (all values in the range 4-6). Besides that, the residual encoder can't converge at all, even with the VAE disabled. Has anyone obtained a proper residual alignment?

I also ran into the vanishing-gradient problem and made a modification to the soft-DTW algorithm so that it works with different lengths. But if the lengths are very different, the result is not good (I think that's expected). Now my bottleneck is the duration predictor and the residual encoder. Does anyone have ideas on how to fix them?

The pictures below show the result of convergence using soft-DTW with pred = torch.rand(batch_size, target_len + 45, 80); the first image is the ground truth. I've tested this with different batches and it works. With equal-length spectrograms it works better. But, as I said earlier, the model predicts the mean spectrogram even with such a modified S-DTW.

[images: 1 — ground-truth spectrogram, 2 — spectrogram converged with the modified soft-DTW]

I don't agree with @taras-sereda, because we use S-DTW, not DTW, so the loss doesn't have to be the same.

elch10 commented 2 years ago

I pasted my piece of code here: https://pastebin.com/GFyguNP2. Maybe it can help someone, or you might find some kind of bug in it.

P.S. You can also plot D_xy.grad to see the alignment path.
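For anyone who wants to try this without the pastebin code, here is a self-contained toy sketch: the naive O(N*M) recursion below is purely illustrative (not the repo's implementation), and D_xy just stands for the pairwise cost matrix, as in the pasted code. After backward(), the gradient of the loss w.r.t. D_xy is the expected (soft) alignment matrix, so its bright cells trace the path.

import torch
import matplotlib.pyplot as plt

def soft_dtw(D, gamma=0.1):
    # naive soft-DTW recursion over a pairwise cost matrix D (N x M);
    # returns the scalar soft alignment cost
    N, M = D.shape
    inf = torch.tensor(float("inf"))
    R = [[inf] * (M + 1) for _ in range(N + 1)]
    R[0][0] = torch.tensor(0.0)
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            prev = torch.stack([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
            R[i][j] = D[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, 0)
    return R[N][M]

pred, target = torch.rand(30, 80), torch.rand(40, 80)
D_xy = torch.cdist(pred, target)  # (30, 40) pairwise distances
D_xy.requires_grad_(True)
soft_dtw(D_xy).backward()

# bright cells mark how predicted frames are matched to target frames
plt.imshow(D_xy.grad.numpy(), origin="lower", aspect="auto")
plt.xlabel("target frames")
plt.ylabel("predicted frames")
plt.savefig("alignment.png")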

clementruhm commented 2 years ago

Any updates on the mean (or just very blurry) spectrogram at prediction?

As for @taras-sereda's point, I think it's valid, but it's not that problematic. It's the same as with a non-masked RNN: the loss is different with padding, but it still converges. The fix would be to pick a different R_ij per utterance within the batch, depending on its length in frames and in phonemes. Also, if that were the only problem, training with batch_size=1 wouldn't have it, but it still converges poorly.
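A sketch of that per-utterance readout (names are illustrative; it assumes the soft-DTW forward pass keeps the full accumulated cost matrix R, where R[b, i, j] is the cost of aligning the first i predicted frames of utterance b to its first j target frames):

import torch

def gather_sdtw_losses(R, pred_lengths, target_lengths):
    # instead of always reading the padded corner R[b, -1, -1], read each
    # utterance's loss at its true (pred_len, target_len) cell so that
    # trailing padding never contributes to the loss
    return torch.stack([
        R[b, pred_lengths[b], target_lengths[b]]
        for b in range(R.size(0))
    ])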