jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

Sharing my results. Glow-tts is incredibly impressive! #28

Open echelon opened 3 years ago

echelon commented 3 years ago

Thank you so much for developing such a high-quality, sparse, and performant network, @jaywalnut310. I thought I'd share the results I've obtained so that others can see how promising your network is and can more easily decide to adopt glow-tts for their use cases.

My website Vocodes chiefly employs glow-tts as a core component of speech synthesis: https://vo.codes

All of the voices now use glow-tts for mel inference and MelGAN for mel inversion. I briefly tried building multi-speaker embedding models, but the speakers never gained clarity or achieved natural prosody. I only ran a limited number of experiments, but it was enough for me to consider my own investigation of the area unfruitful.

I haven't assigned an MOS to the speakers on Vocodes, but intuitively several of them seem quite good. The training data for each speaker ranges from 45 minutes to four hours. One improvement I made was to remove the dual phoneme/grapheme embeddings and force ARPABET-only phoneme training.
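For reference, a minimal sketch of what that looks like with the repo's bundled text utilities; the dictionary path is the repo's default, and note that out-of-dictionary words would still fall back to graphemes, so a strictly ARPABET-only setup needs extra handling beyond this:

from text import text_to_sequence, cmudict

# Passing a CMUdict makes text_to_sequence emit ARPABET symbols for
# every in-dictionary word instead of grapheme symbols.
cmu_dict = cmudict.CMUDict("data/cmu_dictionary")
sequence = text_to_sequence("Hello world", ["english_cleaners"], cmu_dict)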

Another series of tweaks had to be made to adapt your network to run on TorchScript JIT (the backend is in Rust), but this was relatively straightforward.
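For anyone attempting the same, a hedged sketch of the export side only, assuming the model's forward pass has already been made script-friendly (the actual glow-tts changes aren't shown here):

import torch

# model: a trained glow-tts instance whose data-dependent Python
# branching has been rewritten to satisfy the TorchScript compiler.
model.eval()
scripted = torch.jit.script(model)
scripted.save("glow_tts_scripted.pt")
# The saved module can then be loaded from Rust without a Python
# runtime, e.g. via the tch crate's CModule::load.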

There's more work to be done here to achieve an even more natural fit, but I wanted to share my results and congratulate you on your incredible work.

seantempesta commented 3 years ago

Great work, @echelon! Would you mind sharing details about how you trained MelGAN and got it working with glow-tts? All of my experiments with MelGAN failed, and I'm having a horrible time dealing with WaveGlow.

echelon commented 3 years ago

@seantempesta here's my fork: https://github.com/ml-applications/melgan-seungwonpark

It doesn't add much. The only valuable thing there is probably the requirements.txt. I've added some automation scripts, but I'm moving toward Dockerizing everything, since it's much more hermetic than Python requirements (the image captures CUDA versions and more).

I recently finished containerizing glow-tts and will probably publish that soon. I'm just starting to do the same for MelGAN.
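As an illustration of why a container pins more than pip can, here's a hypothetical Dockerfile sketch (the base image tag and entry point are my assumptions, not the actual setup):

# The CUDA/cuDNN versions are pinned by the base image, which a
# requirements.txt alone cannot capture.
FROM nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
RUN apt-get update && apt-get install -y python3 python3-pip
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .
CMD ["python3", "inference.py"]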

MB-MelGAN is also worth a look.

Here's a Discord channel with a bunch of ML folks that you might benefit from: https://discord.gg/Er4Sjq6

The folks there are incredibly helpful and can assist in diagnosing any errors you run into.

Zarbuvit commented 3 years ago

@seantempesta I took MelGAN from https://github.com/seungwonpark/melgan, and in the glow-tts inference file I just replaced the WaveGlow bits with what seungwonpark uses in his code to build the model:

import torch

from melgan.model.generator import Generator
from melgan.utils.hparams import HParam

# Pretrained checkpoint and config from seungwonpark's repo.
checkpoint = torch.load("./melgan/nvidia_tacotron2_LJ11_epoch6400.pt")
hp = HParam("./melgan/config/default.yaml")

melgan = Generator(hp.audio.n_mel_channels).cuda()
melgan.load_state_dict(checkpoint["model_g"])
melgan.eval(inference=False)  # the repo's eval() override; inference=True would strip weight norm

and for the inference:

audio = melgan.inference(y_gen_tst).cpu().numpy()  # melgan, not waveglow; the generator is fp32, so no .half()

This seemed to work for me.
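As a follow-up usage note, writing the result to disk can look like this, assuming seungwonpark's generator, whose inference method returns 16-bit PCM samples, and the default 22050 Hz sample rate:

from scipy.io.wavfile import write

# audio is the int16 numpy array produced above; scipy writes it as a
# standard PCM WAV.
write("out.wav", 22050, audio)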

ihshareef commented 3 years ago

Hi @echelon, would you mind sharing details on how you managed to get GlowTTS running on TorchScript? Thanks.

michaellin99999 commented 2 years ago

@echelon, can you share some details regarding the audio parameters and config you used to train glow-tts?