changjinhan opened 1 year ago
Sorry for the late reply. I recently noticed that the current implementation of alignment learning differs from NVIDIA's official FastPitch code, as discussed at https://github.com/espnet/espnet/issues/5179#issuecomment-1565241556. Thanks a lot; I will first check whether it brings any improvement.
Oh, I'm happy to learn there is a similar discussion, and thank you for your reply. We look forward to hearing the results of your further experiments!
Hi, I recently ran an experiment on the alignment algorithm and found that the diagonal alignment plot becomes much clearer from a very early training stage after two fixes: normalizing the input to ctc_loss, and adding the attention prior before Viterbi decoding.
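For reference, the normalization fix for the CTC-based forward-sum loss can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not the exact espnet code: the function name and argument layout are hypothetical, and the key line is the `log_softmax` applied before `F.ctc_loss`.

```python
import torch
import torch.nn.functional as F


def forwardsum_loss(log_p_attn, text_lens, feat_lens, blank_logprob=-1.0):
    """CTC-based forward-sum alignment loss with normalized inputs (sketch).

    log_p_attn: (B, T_feats, T_text) log attention map.
    text_lens, feat_lens: per-utterance lengths, shape (B,).
    """
    B = log_p_attn.size(0)
    # Pad a "blank" column at text index 0; real tokens shift to 1..T_text.
    log_p_attn_pd = F.pad(log_p_attn, (1, 0), value=blank_logprob)
    loss = 0.0
    for b in range(B):
        t_text = int(text_lens[b])
        t_feat = int(feat_lens[b])
        # The CTC target is simply the monotonic token sequence 1..T_text.
        target = torch.arange(1, t_text + 1).unsqueeze(0)
        cur = log_p_attn_pd[b, :t_feat, : t_text + 1].unsqueeze(1)  # (T, 1, C)
        # The fix: renormalize so ctc_loss receives proper log-probabilities.
        cur = F.log_softmax(cur, dim=-1)
        loss = loss + F.ctc_loss(
            log_probs=cur,
            targets=target,
            input_lengths=torch.tensor([t_feat]),
            target_lengths=torch.tensor([t_text]),
        )
    return loss / B
```

Without the `log_softmax`, the rows of the attention map passed to `F.ctc_loss` are not guaranteed to be valid log-distributions over the padded text axis, which is the discrepancy with the official FastPitch aligner.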
Although I didn't find a clear improvement in speech quality on my datasets (LJSpeech, KSS, and an internal dataset, which is quite clean), the fix to the alignment algorithm might be helpful on somewhat noisy, multi-speaker datasets.
You can check the fix at https://github.com/espnet/espnet/pull/5288. Thanks for the report 😄
Hello! I have a question about where the attention prior is added. You add the attention prior before calculating the forward-sum loss, like this: https://github.com/imdanboy/jets/blob/44e3dbcb9e7e5368158917748fa2c6b45039b4d0/espnet2/gan_tts/jets/loss.py#L147. That can decrease the forward-sum loss while encouraging a monotonic alignment (`log_p_attn`). However, you don't add it anywhere in the forward pass of the JETS generator, and the durations are obtained from the raw `log_p_attn` by the Viterbi algorithm: https://github.com/imdanboy/jets/blob/44e3dbcb9e7e5368158917748fa2c6b45039b4d0/espnet2/gan_tts/jets/generator.py#L593. To make the attention prior effective, I think it should also be added before extracting the durations. What do you think of this?