m-toman / tacorn

2018/2019 TTS framework integrating state-of-the-art open source methods
MIT License

Is generation on a CPU near real time? #3

Closed carlfm01 closed 5 years ago

carlfm01 commented 5 years ago

I'm new to Tacotron and really interested, because I want to build a voice 'mask' for a specific TTS (Loquendo in Spanish), and I need to know whether the inference speed is reasonable.

I don't need the conversion from text itself, so I'm testing with this: https://github.com/andabi/deep-voice-conversion.

Thanks.

m-toman commented 5 years ago

Hi, as it is, unfortunately not really. The original paper describes a few tricks to make it happen, but to my knowledge no one in the open source community has implemented and published them yet.

In the Tacotron-2 repo someone also linked me these slides: http://on-demand.gputechconf.com/gtc/2017/presentation/s7544-andrew-gibiansky-efficient-inference-for-wavenet.pdf They are really interesting as well, but likewise quite a bit of work.

I suspect one of the easier bets would be to use a vocoder like WORLD, which makes real-time performance on a CPU much easier to achieve.

carlfm01 commented 5 years ago

Thanks for the quick response, I'll look into vocoders. Maybe I can use audio generated by the TTS (Loquendo) as data to train the voice style transfer; we'll see.

m-toman commented 5 years ago

The two samples linked in the README took about 2 minutes to generate on a GTX 1080 Ti.

carlfm01 commented 5 years ago

Hi @m-toman, generating the dataset using an existing TTS worked. I don't like the robotic noise, but in general the results are interesting, and the time to generate the audio is not too bad. I'm still tweaking parameters and training for longer. I'll get into the PDF soon. Thanks.

es.zip

m-toman commented 5 years ago

@carlfm01 Thanks for sharing. This sounds like you've been using Griffin-Lim?
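For context, Griffin-Lim reconstructs a waveform from a magnitude spectrogram by alternating between the time and frequency domains, keeping the magnitudes fixed while re-estimating phase. A self-contained NumPy sketch (the FFT size, hop, and iteration count are arbitrary illustrative values, not what any repo here uses):

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Hann-windowed short-time Fourier transform.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, n_fft=512, hop=128):
    # Inverse STFT via windowed overlap-add with window-sum normalization.
    win = np.hanning(n_fft)
    n_frames = S.shape[0]
    x = np.zeros(n_fft + hop * (n_frames - 1))
    wsum = np.zeros_like(x)
    for i in range(n_frames):
        frame = np.fft.irfft(S[i], n=n_fft)
        x[i * hop:i * hop + n_fft] += frame * win
        wsum[i * hop:i * hop + n_fft] += win ** 2
    nz = wsum > 1e-8
    x[nz] /= wsum[nz]
    return x

def griffin_lim(mag, n_iter=50, n_fft=512, hop=128):
    # Start from random phase, then repeatedly project onto the set of
    # spectrograms with the target magnitude and valid STFT structure.
    phase = np.exp(2j * np.pi * np.random.RandomState(0).rand(*mag.shape))
    for _ in range(n_iter):
        x = istft(mag * phase, n_fft, hop)
        phase = np.exp(1j * np.angle(stft(x, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

This is cheap on a CPU, but because the phase is only estimated, it tends to produce exactly the metallic/robotic artifacts described above; neural vocoders trade that quality gap for much higher compute.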

carlfm01 commented 5 years ago

Yes, any tips? I'll try using a more natural TTS, and another voice to match the phoneme lengths of the TTS. Thanks.