ubisoft / ubisoft-laforge-daft-exprt

PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis
Apache License 2.0

Automatic aligner like in FastPitch? #9

Closed: juliakorovsky closed this issue 2 years ago

juliakorovsky commented 2 years ago

Hello! Do you think it's possible to incorporate an automatic aligner like the one in FastPitch (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch), as described in the paper "One TTS Alignment To Rule Them All"? This aligner essentially requires only graphemes or phonemes and is trained jointly with the rest of the network. It would make it possible to skip Montreal Forced Aligner preprocessing and reduce preprocessing time. If it is possible, what would need to change to support such an aligner?

juliakorovsky commented 2 years ago

To elaborate: the latest version of FastPitch includes an aligner that produces target durations for the spectrograms. With some code modifications, these durations can be saved directly during training, so it's possible, for example, to quickly preprocess the data (which can also be reused in Daft-Exprt) and save the target durations during the first epoch. It differs from Montreal Forced Aligner in three ways: 1. it's much faster; 2. it works with both phonemes and graphemes, so Daft-Exprt could also be trained on graphemes; 3. it doesn't require the acoustic and G2P MFA models, which are not available for some languages. Maybe this aligner could be incorporated directly into Daft-Exprt preprocessing.
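
To make the "save target durations" step concrete, here is a minimal sketch of how per-token durations could be read off a hard (binarized) alignment, assuming the aligner yields one one-hot row per mel frame that marks the token the frame is assigned to. The function name and the list-of-lists layout are hypothetical and do not come from FastPitch or Daft-Exprt; in a real model the alignment would be a tensor and the column sum would be a single tensor operation.

```python
def durations_from_hard_alignment(hard_alignment):
    """Count how many mel frames are assigned to each text token.

    hard_alignment: list of rows, one per mel frame; each row is one-hot
    over the text tokens (hypothetical layout, assumed for illustration).
    Returns a list with one integer duration per token.
    """
    if not hard_alignment:
        return []
    n_tokens = len(hard_alignment[0])
    durations = [0] * n_tokens
    for frame in hard_alignment:
        # Each frame must belong to exactly one token.
        if len(frame) != n_tokens or sum(frame) != 1:
            raise ValueError("each frame must be one-hot over the tokens")
        durations[frame.index(1)] += 1
    return durations


# Five mel frames aligned to three tokens: frames 0-1 -> token 0,
# frame 2 -> token 1, frames 3-4 -> token 2.
example_alignment = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
example_durations = durations_from_hard_alignment(example_alignment)
```

The durations always sum to the number of mel frames, which is a useful sanity check before writing them to disk for reuse as duration targets.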

macarbonneau commented 2 years ago

Hello Julia, yes you should be able to do it without too much effort.