shivammehta25 / Neural-HMM

Neural HMMs are all you need (for high-quality attention-free TTS)
MIT License

The speed of Neural-HMM TTS #12

Closed JohnHerry closed 2 years ago

JohnHerry commented 2 years ago

Thanks for the good work, but there is no description of this TTS model's inference speed. Is there any data about its inference RTF? Is it faster than other non-autoregressive TTS models like FastSpeech?

ghenter commented 2 years ago

there is no description of this TTS model's inference speed. Is there any data about its inference RTF?

We had several goals with this work, including quickly learning to speak when training the system. However, we have not explicitly optimised for the speed at synthesis time. In general, I expect it to be about as fast as Nvidia's implementation of Tacotron 2, upon which our code is based.

@shivammehta007, do you happen to have any numbers relevant to the speed of the synthesis?

Is it faster than other non-autoregressive TTS models like FastSpeech?

Just to be clear, the neural HMM TTS in this repository is an autoregressive model; see our paper for more information.

I hope this helps!

shivammehta25 commented 2 years ago

Hello,

Thank you for appreciating our work! As @ghenter mentioned, it is an autoregressive model, so I would also expect its speed to be similar to that of other autoregressive models, i.e. synthesis time that grows linearly with the length of the output.

Is it faster than other non-autoregressive TTS models like FastSpeech?

In FastSpeech the mel-spectrogram is synthesised in parallel; therefore, FastSpeech would have an edge when it comes to inference speed on GPUs. Neural HMMs, on the other hand, have a significant advantage during learning, in terms of the number of updates and the amount of data required to produce high-quality, intelligible speech. They also synthesise more consistent speech-sound durations, since FastSpeech's duration predictor is learned from a separate autoregressive Transformer TTS (which, in my personal opinion, though smart, is somewhat hacky) and assumes the durations to be Gaussian, which they are not.
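For anyone who wants to measure the RTF asked about above, here is a minimal sketch of how it could be computed. RTF is taken here as wall-clock synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The `synthesise` callable is hypothetical (it is not a function from this repository), and the sample rate and hop length in the usage note are assumed values typical of Tacotron 2 style setups, not numbers confirmed for this code.

```python
import time


def rtf_from_timing(synthesis_seconds, n_frames, hop_length, sample_rate):
    """Real-time factor: seconds of compute per second of generated audio."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    # Each mel frame advances the waveform by `hop_length` samples.
    audio_seconds = n_frames * hop_length / sample_rate
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesise, text, sample_rate, hop_length, n_runs=3):
    """Average the RTF of a hypothetical `synthesise(text)` over several runs.

    `synthesise` is an assumption: any callable that returns a sequence of
    mel frames (something supporting len()). Vocoder time is not included.
    """
    rtfs = []
    for _ in range(n_runs):
        start = time.perf_counter()
        mel_frames = synthesise(text)
        elapsed = time.perf_counter() - start
        rtfs.append(
            rtf_from_timing(elapsed, len(mel_frames), hop_length, sample_rate)
        )
    return sum(rtfs) / len(rtfs)
```

A call might look like `measure_rtf(my_model_fn, "Hello world.", sample_rate=22050, hop_length=256)`, where `my_model_fn` is whatever wrapper you write around the model. When timing on a GPU, the device should be synchronised before reading the clock (for example with `torch.cuda.synchronize()`), otherwise asynchronous kernel launches can make the measured time look too small.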

JohnHerry commented 2 years ago

Thanks for the help. Since it is attention-free, I had hoped it would be faster than FastSpeech2. But yes, this work is an AR model; that is the key.