Open ErfolgreichCharismatisch opened 3 years ago
@ErfolgreichCharismatisch modern TTS models consist of two parts: a feature generator and a vocoder. The feature generator produces low-dimensional time-frequency acoustic features from text, while the vocoder reconstructs the raw waveform from these features. The two models are trained separately. WaveGrad corresponds to the second part, the vocoder. It takes acoustic features (mel-spectrograms) as input, not text, so it can be trained on an arbitrary audio dataset.
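To make "acoustic features" concrete, here is a minimal NumPy sketch of computing a log-mel-spectrogram directly from a waveform, with no text involved. This is not WaveGrad's own preprocessing code: the function names and the values `n_fft=1024`, `hop=256`, `n_mels=80` are illustrative assumptions, and a real setup should match whatever the vocoder's config specifies.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one common convention, assumed here)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, center, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr, n_fft=1024, hop=256, n_mels=80):
    # Slice the signal into overlapping Hann-windowed frames
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[t * hop : t * hop + n_fft] * window
                       for t in range(n_frames)])
    # Power spectrum per frame, then projection onto the mel filters
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Floor before the log to avoid log(0); result is (n_mels, n_frames)
    return np.log(np.maximum(mel, 1e-5)).T

# One second of a 440 Hz tone stands in for a real recording
sr = 22050
wave = np.sin(2.0 * np.pi * 440.0 * np.arange(sr) / sr)
mel = log_mel_spectrogram(wave, sr)  # shape (80, 83)
```

A vocoder is trained on pairs of such features and the waveform they came from, which is why audio alone is enough. At synthesis time the features come from the feature generator instead, so it has to emit mel-spectrograms computed with the same parameters.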
Interesting. So which feature generator(s) does it work out of the box with?
As I understand it, this TTS algorithm works with your audio files alone, without any text assigned to them.