jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License
652 stars 150 forks

Is there any method to make the results sound more natural? #22

Open wotulong opened 4 years ago

wotulong commented 4 years ago

In my experiments, the output of glow-tts sounds more like a robot than a real person. Do you have any method that could make it more natural, closer to the results of an autoregressive model such as Tacotron? Thanks.

echelon commented 4 years ago

Could you share your results? My results are a bit robotic too, but I've gone off the supported path and modified the network to train on multiple speakers via n_speakers and gin_channels, which might be pushing the network to do things it doesn't support.

I am incredibly pleased with how fast the model inference time is. It blows everything else I've tried out of the water. I'm still trying to achieve naturalness for multiple speakers. I'm in a memory-constrained, on-demand inference environment, and glow-tts is perfect for those requirements.

I'll share some samples from my work soon.

jaywalnut310 commented 4 years ago

If your concern is the prosody of the synthesized samples, such as intonation, techniques such as prosody embeddings and style tokens could be useful. In my ongoing experiments, such techniques help to improve sample quality.

For more information, please see https://ai.googleblog.com/2018/03/expressive-speech-synthesis-with.html
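
As a rough illustration of the style-token idea mentioned above, here is a minimal pure-Python sketch. It is not code from glow-tts or from the experiments described here, and every name in it (`style_embedding`, `condition`, the token bank, the reference vector) is hypothetical. It also simplifies the published approach to single-head dot-product attention: a reference (prosody) vector acts as the query over a small bank of style tokens, and the resulting weighted sum is added to every text encoder frame.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(reference, tokens):
    """Attend over the style-token bank using the reference vector as query.

    Returns the weighted sum of tokens and the attention weights.
    """
    dim = len(reference)
    scores = [
        sum(r * t for r, t in zip(reference, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    style = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return style, weights


def condition(encoder_states, style):
    """Add the same style vector to every text encoder frame."""
    return [[h + s for h, s in zip(frame, style)] for frame in encoder_states]


random.seed(0)
DIM, N_TOKENS, N_FRAMES = 4, 3, 5

# Assumption: in a real model the tokens are learned parameters and the
# reference vector comes from a reference encoder run over a mel-spectrogram.
tokens = [[random.gauss(0, 0.5) for _ in range(DIM)] for _ in range(N_TOKENS)]
reference = [random.gauss(0, 1) for _ in range(DIM)]
encoder_states = [[0.0] * DIM for _ in range(N_FRAMES)]

style, weights = style_embedding(reference, tokens)
conditioned = condition(encoder_states, style)
```

At inference time the attention weights could also be set by hand instead of computed from reference audio, which is one way such tokens are used to steer speaking style.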

echelon commented 3 years ago

@jaywalnut310, do you have any updates on your current experiments?

I'm using your model very successfully in production and am very invested in its development. I would love to fund further improvements. I'm currently sponsoring a few folks on GitHub, and if you enable sponsorship I'd be extremely happy to contribute. PayPal is also fine.

I'd also love to get in touch via email to discuss some things if you don't mind reaching out. My email address is {my username} @ gmail.com .

Thanks!