jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Compare with other parallel TTS #9

Closed. chazo1994 closed this issue 4 years ago.

chazo1994 commented 4 years ago

Did you compare your proposed method with other parallel TTS models like FastSpeech? How does its latency compare with those models?

jaywalnut310 commented 4 years ago

For average-length sentences, the inference time of Glow-TTS on a V100 GPU is about 40 ms, whereas the inference time of FastSpeech was reported as ~25 ms. You can find the details in the papers (FastSpeech: Table 2; Glow-TTS: Section 5.2, Sampling Speed).

Glow-TTS is slower than FastSpeech, but I don't think the difference is significant in an end-to-end setting, because vocoders are usually much slower than parallel TTS models.
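
For anyone who wants to reproduce such numbers, here is a minimal timing sketch. It assumes a loaded Glow-TTS model and the prepared inputs (x_tst, x_tst_lengths) from the repo's inference notebook; the gen=True call is meant to mirror the notebook's sampling step, but the exact keyword arguments are an assumption and may differ from the actual signature.

```python
import time
import torch

def average_inference_time(model, x_tst, x_tst_lengths, n_runs=50):
    """Average per-sentence sampling time in seconds."""
    with torch.no_grad():
        # Warm-up run so CUDA kernel compilation and caching don't distort the measurement.
        model(x_tst, x_tst_lengths, gen=True)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x_tst, x_tst_lengths, gen=True)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_runs
```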

chazo1994 commented 4 years ago

@jaywalnut310 Did you try running inference on CPU and comparing it with GPU?

jaywalnut310 commented 4 years ago

I tried measuring inference time on CPU, but I don't remember the numbers exactly. So I can't say for certain, but I think it took about 500 ms to generate an average-length sentence on CPU.

You can try it with the pretrained model and inference notebook. Closing the issue.

lkurlandski commented 3 years ago

How would one configure the inference notebook to be CPU compatible?

I modified the script by adding the following after the import statements: device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Then I replaced every call to cuda() with to(device). The script nearly finishes, but fails on the second-to-last line of code.

When it tries to execute audio = waveglow.infer(y_gen_tst, sigma=.666), it raises RuntimeError: Found no NVIDIA driver on your system.
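
A likely cause is that the WaveGlow checkpoint is still being loaded and run on CUDA. Below is a minimal sketch of a CPU-side load, assuming the notebook follows NVIDIA's WaveGlow loading convention; the 'model' key, remove_weightnorm call, and waveglow_path variable are assumptions about the notebook's code, and y_gen_tst is the mel output from the Glow-TTS step.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the checkpoint directly onto the chosen device instead of the default CUDA mapping.
checkpoint = torch.load(waveglow_path, map_location=device)  # waveglow_path: the notebook's checkpoint path (assumption)
waveglow = checkpoint['model']  # 'model' key follows NVIDIA's WaveGlow convention (assumption)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device).eval()

# Run the vocoder on the same device as the generated mel-spectrogram.
with torch.no_grad():
    audio = waveglow.infer(y_gen_tst.to(device), sigma=.666)
```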