First off, this project is amazing! I'm getting great results compared to Tacotron2 with much shorter training times and it's unbelievably stable even for long sentences. Congratulations. :)
The only thing I've found that Tacotron2 did better was capturing the manner in which people speak, specifically the speed at which words are spoken and how long speakers tend to pause between words. Is this something that can be adjusted in the loss function to fine-tune the model to pay more attention to these aspects?