rendchevi / nix-tts

🐤 Nix-TTS: Lightweight and End-to-end Text-to-Speech via Module-wise Distillation

Comparison of distilled model vs end-to-end training from scratch #7

Open nmfisher opened 2 years ago

nmfisher commented 2 years ago

Just curious - did you try training the same model architecture end-to-end from scratch (i.e. not distilling from VITS), and if so, are there any audio comparison samples available?

tugstugi commented 2 years ago

@nmfisher I reimplemented the encoder to output mel spectrograms and trained it from scratch for 150k steps on a custom dataset. It sounds OK with the universal WaveGlow vocoder. It should be even faster than the original encoder because you don't need to intersperse 0s between the phonemes.
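
For context on the interspersing tugstugi mentions: VITS-style front ends insert a blank token (ID 0) between, and around, every phoneme ID, which roughly doubles the encoder's input length; dropping that step is where the extra speed would come from. A minimal sketch of that operation (function name and example IDs are illustrative, not taken from the repo):

```python
def intersperse(seq, blank_id=0):
    """Insert a blank token between and around every phoneme ID,
    as VITS-style inputs do; the output has length 2 * len(seq) + 1."""
    result = [blank_id] * (len(seq) * 2 + 1)
    result[1::2] = seq  # place the original IDs at the odd positions
    return result

# Example: 4 phoneme IDs become a sequence of length 9.
phonemes = [12, 5, 33, 7]
print(intersperse(phonemes))  # [0, 12, 0, 5, 0, 33, 0, 7, 0]
```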