Hi, as it is, unfortunately not really. The original paper states a few tricks to make it happen, but to my knowledge no one in the open source community has implemented and published those yet.
In the Tacotron-2 repo someone also linked me these slides: http://on-demand.gputechconf.com/gtc/2017/presentation/s7544-andrew-gibiansky-efficient-inference-for-wavenet.pdf They're also really interesting, but likewise quite a bit of work to implement.
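For reference, the core trick in those slides is caching each layer's past activations so that generating one sample costs O(layers) work instead of re-running the whole receptive field. A toy numpy sketch of the idea (hypothetical shapes and weights, gated activations simplified to a tanh, not taken from any particular repo):

```python
import numpy as np

# Each dilated layer keeps a queue of its past inputs, so producing one
# new sample only touches one cached tap per layer instead of recomputing
# the full receptive field.
class FastDilatedLayer:
    def __init__(self, channels, dilation, rng):
        self.queue = np.zeros((dilation, channels))  # inputs from the last `dilation` steps
        self.w_past = rng.standard_normal((channels, channels)) * 0.05
        self.w_now = rng.standard_normal((channels, channels)) * 0.05

    def step(self, x):
        past = self.queue[0]                  # input from t - dilation
        self.queue = np.roll(self.queue, -1, axis=0)
        self.queue[-1] = x                    # cache current input for later steps
        return np.tanh(past @ self.w_past + x @ self.w_now)

rng = np.random.default_rng(0)
layers = [FastDilatedLayer(64, 2 ** i, rng) for i in range(10)]

x = np.zeros(64)
for _ in range(16000):                        # one second at 16 kHz, O(layers) per sample
    h = x
    for layer in layers:
        h = layer.step(h)
    x = h                                     # toy feedback; a real model samples from a distribution
```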
I suspect one of the easier bets would be to use a vocoder like WORLD to achieve real-time performance on CPU.
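For example, a WORLD analysis/resynthesis round trip via the pyworld bindings looks roughly like this, and runs comfortably faster than real time on CPU (the file names are placeholders):

```python
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("sample.wav")
x = x.astype(np.float64)          # pyworld expects float64

f0, t = pw.dio(x, fs)             # raw F0 contour
f0 = pw.stonemask(x, f0, t, fs)   # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)  # spectral envelope
ap = pw.d4c(x, f0, t, fs)         # aperiodicity

y = pw.synthesize(f0, sp, ap, fs)
sf.write("resynth.wav", y, fs)
```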
Thanks for the quick response, I'll look into vocoders. Maybe I can do something with audio generated by the TTS (Loquendo) as data to train the voice style transfer; we'll see.
The two samples linked in the README took about 2 minutes to generate on a GTX 1080 Ti.
Hi @m-toman, generating the dataset using an existing TTS worked. I don't like the robotic noise, but in general the results are interesting, and the time to generate the audio is not too bad. I'm still tweaking params and training for more time. Soon I'll get into the PDF. Thanks.
@carlfm01 Thanks for sharing. This sounds like you've been using Griffin-Lim?
Yes, any tips? I'll try using a more natural TTS, and another voice to match the phoneme lengths of the TTS. Thanks.
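One thing that sometimes softens the metallic/robotic Griffin-Lim artifacts a bit is raising the iteration count. A rough sketch using librosa's built-in implementation (the file name and STFT parameters are placeholders; match them to whatever your model actually produces):

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("tts_output.wav", sr=22050)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# more iterations tend to reduce the metallic artifacts slightly
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256)
sf.write("griffinlim_recon.wav", y_rec, sr)
```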
I'm new to Taco and I'm really interested because I want to make a voice "mask" for a specific TTS (Loquendo in Spanish), and I need to know if the inference speed is reasonable.
I don't need the conversion from text, so I'm testing with this: https://github.com/andabi/deep-voice-conversion.
Thanks.