Where to add an attention prior (betabinom prior)

imdanboy / jets

JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

Apache License 2.0


Where to add an attention prior (betabinom prior) #4

Open changjinhan opened 1 year ago

changjinhan commented 1 year ago

Hello! I have a question about where the attention prior is added. You add the attention prior just before calculating the forward-sum loss, like this: https://github.com/imdanboy/jets/blob/44e3dbcb9e7e5368158917748fa2c6b45039b4d0/espnet2/gan_tts/jets/loss.py#L147

This can decrease the forward-sum loss while pushing the alignment (log_p_attn) toward monotonicity. However, the prior is not added anywhere in the forward pass of the JETS generator, so the durations are obtained by running the Viterbi algorithm on the raw log_p_attn: https://github.com/imdanboy/jets/blob/44e3dbcb9e7e5368158917748fa2c6b45039b4d0/espnet2/gan_tts/jets/generator.py#L593

For the attention prior to be effective, I think it should also be added before the durations are computed.

What do you think of this?
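For readers unfamiliar with the term, the following is a minimal sketch of a beta-binomial attention prior and of what "adding" it to the log alignment scores means. It is not the repository's code: the function names, the scaling parameter `w`, and the choice `n = num_tokens - 1` (made so that each frame's distribution sums to one) are assumptions for illustration, and the actual parameterization in JETS or FastPitch may differ.

```python
import math


def log_beta(x, y):
    """Log of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def betabinom_logpmf(k, n, a, b):
    """Log pmf of a beta-binomial distribution with parameters (n, a, b) at k."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + log_beta(k + a, n - k + b) - log_beta(a, b)


def log_attention_prior(num_frames, num_tokens, w=1.0):
    """Return a [num_frames][num_tokens] table of log prior probabilities.

    Early frames favour early tokens and late frames favour late tokens,
    which encourages a roughly diagonal alignment.
    """
    prior = []
    for t in range(1, num_frames + 1):
        a = w * t
        b = w * (num_frames - t + 1)
        prior.append(
            [betabinom_logpmf(k, num_tokens - 1, a, b) for k in range(num_tokens)]
        )
    return prior


def add_prior(log_p_attn, log_prior):
    """Add the log prior elementwise to log alignment scores of the same shape."""
    return [
        [score + prior for score, prior in zip(score_row, prior_row)]
        for score_row, prior_row in zip(log_p_attn, log_prior)
    ]
```

Under these assumptions, the question in this issue amounts to whether the output of something like `add_prior` should feed only the forward-sum loss or also the Viterbi step that produces the durations.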

imdanboy commented 1 year ago

Sorry for the late reply. I recently realized that the current implementation of alignment learning differs from the official NVIDIA FastPitch code, as discussed at https://github.com/espnet/espnet/issues/5179#issuecomment-1565241556 Thanks a lot. I will first check whether the change brings any improvement.

changjinhan commented 1 year ago

Oh, I'm glad to hear there is a similar discussion, and thank you for your reply. We look forward to hearing the results of your further experiments!

imdanboy commented 1 year ago

Hi, I recently ran an experiment on the alignment algorithm and found that, after the fix, the diagonal in the alignment plot is clearer from the very early stages of training. The fix normalizes the input to ctc_loss and adds the attention prior before Viterbi decoding.

Although I didn't find a clear improvement in speech quality on the datasets I tried (ljspeech, kss, and an internal dataset that is quite clean), the fix to the alignment algorithm might help on somewhat noisy, multi-speaker datasets.

You can check the fix at https://github.com/espnet/espnet/pull/5288 Thanks for the report 😄
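The two changes described in this comment (normalizing the scores and adding the prior before Viterbi decoding) can be illustrated with a small pure-Python sketch. It is only an approximation of the idea, not the code in the linked pull request: the function names are hypothetical, the order of "add prior, then log-softmax over tokens" is an assumption, and the real implementation works on batched tensors.

```python
import math


def log_softmax(row):
    """Normalize one frame's scores so that exp(scores) sums to one."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def viterbi_durations(log_p):
    """Find the best monotonic path through log_p[frame][token].

    The path starts at token 0, ends at the last token, and at each frame
    either stays on the current token or advances by one. Returns the number
    of frames assigned to each token.
    """
    num_frames, num_tokens = len(log_p), len(log_p[0])
    if num_frames < num_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    score = [[neg_inf] * num_tokens for _ in range(num_frames)]
    back = [[0] * num_tokens for _ in range(num_frames)]
    score[0][0] = log_p[0][0]
    for t in range(1, num_frames):
        for j in range(min(num_tokens, t + 1)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                score[t][j], back[t][j] = move + log_p[t][j], j - 1
            else:
                score[t][j], back[t][j] = stay + log_p[t][j], j
    # Backtrack from the last frame and last token, counting frames per token.
    durations = [0] * num_tokens
    j = num_tokens - 1
    for t in range(num_frames - 1, -1, -1):
        durations[j] += 1
        j = back[t][j]
    return durations


def decode_durations(scores, log_prior):
    """Add the prior, normalize each frame, then decode durations."""
    with_prior = [
        [s + p for s, p in zip(s_row, p_row)]
        for s_row, p_row in zip(scores, log_prior)
    ]
    return viterbi_durations([log_softmax(row) for row in with_prior])
```

In this sketch the same prior-adjusted, normalized scores could be passed both to the alignment loss and to the decoder, so the durations reflect the prior rather than the raw attention scores.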