nmfisher opened this issue 2 years ago
@nmfisher I have reimplemented the encoder to output mel spectrograms and trained it from scratch for 150k steps on a custom dataset. It sounds OK with the universal WaveGlow vocoder. It should be even faster than the original encoder, because you don't need to intersperse 0s between the phonemes.
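For context, the "intersperse 0s" step refers to the blank-token insertion the original VITS text encoder applies to its phoneme input, which roughly doubles the sequence length. A minimal sketch of that operation (the helper name and phoneme IDs here are illustrative, not taken from any specific codebase):

```python
def intersperse(seq, item=0):
    """Return seq with `item` (a blank token) inserted between and around every element."""
    result = [item] * (len(seq) * 2 + 1)
    result[1::2] = seq
    return result

phonemes = [17, 42, 9]        # example phoneme IDs (made up)
print(intersperse(phonemes))  # [0, 17, 0, 42, 0, 9, 0]
```

Skipping this step means the encoder processes a sequence about half as long, which is where the speed advantage would come from.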
Just curious - did you try training the same model architecture end-to-end from scratch (i.e. not distilling from VITS), and if so, are there any audio comparison samples available?