jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

Autoregressive flow instead of WaveGlow? #5

Closed: janvainer closed this issue 4 years ago

janvainer commented 4 years ago

Hi, thank you for this amazing idea. It is really nice :). I was wondering if it would be possible to replace the WaveGlow model with some more expressive flow, for example an autoregressive WaveNet or WaveFlow? In WaveGlow, the audio/spectrograms are directly encoded into samples from Gaussian distributions, and an external encoder can be used to evaluate the likelihood of a sample. WaveNet instead applies a scale + shift transformation to a standard normal distribution based on the previous audio timesteps. Using a model such as WaveFlow/WaveNet could boost the expressiveness of the system and lower the number of necessary parameters, but I have not yet been able to figure out how it could be integrated into your framework. Did you consider similar options when you wrote the paper?
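To make the scale + shift idea concrete, here is a toy sketch of what I mean (illustrative only, not code from WaveNet or this repo): noise z ~ N(0, I) is transformed one sample at a time, with the scale and shift for each timestep predicted from the samples already generated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyARFlow(nn.Module):
    """Predicts a shift and log-scale for timestep t from the samples before t."""
    def __init__(self, kernel=16, hidden=64):
        super().__init__()
        self.kernel = kernel
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel),   # consumes exactly `kernel` past samples
            nn.ReLU(),
            nn.Conv1d(hidden, 2, 1),        # -> (log_scale, shift)
        )

    @torch.no_grad()
    def sample(self, z):
        # z: (B, 1, T) drawn from N(0, I); x is generated left to right,
        # which is exactly why sampling from autoregressive flows is slow.
        x = torch.zeros_like(z)
        for t in range(z.size(-1)):
            ctx = x[..., max(0, t - self.kernel):t]            # samples strictly before t
            ctx = F.pad(ctx, (self.kernel - ctx.size(-1), 0))  # left-pad to fixed length
            log_s, b = self.net(ctx).chunk(2, dim=1)           # each (B, 1, 1)
            x[..., t] = z[..., t] * log_s[..., -1].exp() + b[..., -1]
        return x

flow = ToyARFlow()
audio = flow.sample(torch.randn(1, 1, 200))  # (1, 1, 200)
```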

jaywalnut310 commented 4 years ago

You can use autoregressive flows such as WaveFlow or WaveNet as vocoders for Glow-TTS. WaveFlow and WaveNet are both conditioned on input mel-spectrograms, so you can predict a mel-spectrogram with Glow-TTS and then feed it into those models. There was actually a discussion about how to integrate Tacotron 2 and WaveNet (https://github.com/NVIDIA/tacotron2/issues/52), which could be applied to Glow-TTS as well. The point was that the mel-spectrogram preprocessing of both models should be the same.
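To illustrate that preprocessing point, a minimal sketch (the parameter values below are just an example, not necessarily this repo's config): build the mel transform from one shared set of parameters and reuse it when preparing training data for both models.

```python
import torch
import torchaudio

# One shared config; both the TTS model and the vocoder must be
# trained on mels produced with exactly these settings.
MEL_PARAMS = dict(
    sample_rate=22050, n_fft=1024, win_length=1024, hop_length=256,
    n_mels=80, f_min=0.0, f_max=8000.0,
)
mel_fn = torchaudio.transforms.MelSpectrogram(**MEL_PARAMS)

wav = torch.randn(1, 22050)   # 1 second of dummy audio
mel = mel_fn(wav)             # (1, 80, frames)

# If Glow-TTS is trained on mel_fn(wav) but the vocoder on a transform
# with, say, a different hop_length or normalization, the mels predicted
# at inference will be out of distribution for the vocoder.
```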

My concern is that the fast sampling speed of Glow-TTS would become useless, because autoregressive vocoders are significantly slower at inference than our model. Despite the slow speed, it is a reasonable choice if high-quality audio is what you really want.

janvainer commented 4 years ago

@jaywalnut310 Thanks for the response :). It is clear to me that it is possible to use different vocoders. But what I mean is a bit different. Could the Glow model that transforms the latent variables into spectrograms be replaced by, e.g., WaveFlow or a Gaussian WaveNet? The process would still be very similar, but the latent variables from the text encoder would be transformed into spectrograms with an autoregressive flow instead. I was also wondering, did you by any chance experiment with transforming the latent space directly into waveforms instead of spectrograms? The WaveGlow model could be capable of this. It would make the model truly end-to-end (if we do not count the grapheme-to-phoneme conversion), because no vocoder would be needed.

jaywalnut310 commented 4 years ago

TL;DR: That is up to you: 1) a learned prior distribution without conditioning in the vocoder, as in Glow-TTS, or 2) a standard normal prior distribution with conditioning in the vocoder, as in Flowtron.

@LordOfLuck Oh sorry, now I understand what you meant.

Although I haven't tried text-to-wave (is that right?) experiments, I think it sounds reasonable! Conceptually, any audio representation, such as raw waveforms or mel-spectrograms, can be generated from the latent variables of the text encoder.

So my answers are: 1) Yes, I think it can be trained end-to-end with powerful vocoders, as you mentioned. 2) But you cannot use the local conditioning layers in those vocoders; you can only train the prior distribution from the text encoder with monotonic alignment search.
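As a rough sketch of option 1 (notation follows the paper, implementation details simplified): the decoder flow f maps the audio features x to z = f(x), and the loss scores z under the prior statistics predicted from text.

```python
import math
import torch

def learned_prior_nll(z, logdet, mu, log_sigma):
    # z = f(x): flow output for the audio features x, shape (B, C, T);
    # mu, log_sigma: prior statistics from the text encoder, already
    # expanded to length T by the monotonic alignment;
    # logdet: (B,) log|det df/dx| accumulated by the flow.
    log_p_z = -0.5 * (((z - mu) ** 2) * torch.exp(-2 * log_sigma)
                      + 2 * log_sigma + math.log(2 * math.pi)).sum(dim=(1, 2))
    return -(log_p_z + logdet)  # minimize the negative log-likelihood

# Dummy shapes just to show the call:
B, C, T = 2, 80, 50
nll = learned_prior_nll(torch.randn(B, C, T), torch.zeros(B),
                        torch.zeros(B, C, T), torch.zeros(B, C, T))

# Option 2 (Flowtron-style) instead fixes the prior to N(0, I) and pushes
# the text information into the flow via conditioning layers.
```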

So, I hope you make progress on Glow-TTS :). I think the Flowtron work (https://github.com/NVIDIA/flowtron) may be more familiar to you for that end-to-end learning case; it also uses a standard normal distribution as its prior.

janvainer commented 4 years ago

Thanks! Especially for the Flowtron reference. They are doing pretty much what I meant, except that they do the local conditioning on text via attention, while with Glow-TTS the text conditioning would be done, as you said, with the trainable priors. :) I like the trainable prior idea; it is similar to how the original Glow model can be conditioned on predicted classes.
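For example, a class-conditional prior in the spirit of Glow could look like this (a toy sketch of the idea, not Glow's actual code): the prior statistics are a learned function of the label, just as Glow-TTS's prior statistics are a learned function of the text.

```python
import torch
import torch.nn as nn

class ClassPrior(nn.Module):
    """Learned per-class prior statistics, analogous to Glow-TTS's
    per-token prior statistics from the text encoder."""
    def __init__(self, n_classes, dim):
        super().__init__()
        self.stats = nn.Embedding(n_classes, 2 * dim)  # (mu, log_sigma) per class

    def forward(self, y):
        mu, log_sigma = self.stats(y).chunk(2, dim=-1)
        return mu, log_sigma

prior = ClassPrior(n_classes=10, dim=16)
mu, log_sigma = prior(torch.tensor([3]))
z = mu + log_sigma.exp() * torch.randn_like(mu)  # sample from the learned prior
```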

Will see if the direct text-to-wave approach works. It would be nice because there would be no need for the external alignments required, e.g., in Parallel WaveNet, ClariNet, and other text-to-wave models.

jaywalnut310 commented 4 years ago

Yes, I love your idea, and I would also love to see how future studies tackle "real" end-to-end training! Closing the issue at this point.