Hi @prototux ,
The multispeaker vocoder is already trained and uploaded. I'm currently working on an English model for the Encoder, but I think it will only be ready in 3-4 weeks.
I think the 4GB (if that is correct) on the 940MX should be OK for real-time speech synthesis. However, I've only managed to train models on GPUs with 8GB of RAM. I tested an NVIDIA Jetson with an image-processing CNN a while back and it seemed way slower than an i3 CPU at 1.8GHz, so I honestly don't think it is a good candidate for real-time speech synthesis. However, I do think it would be a really nice test. You could try a Romanian model (already available in this repo). I think you need to install PyTorch using something like this. To simply test the RO models, follow these instructions:
Pull this repo and then:
$ cd TTS-Cube
$ cp data/models/ro/* data/models/
$ bunzip2 data/models/rnn_encoder.network.bz2
$ echo "Acesta este un test." > test.txt
$ python3 cube/synthesis.py --input-file=test.txt --output-file=test.wav --speaker=anca
If you test TTS-Cube on your hardware, I'm curious about the performance.
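A quick way to get a rough number is to prefix the synthesis command with the standard `time` utility:
$ time python3 cube/synthesis.py --input-file=test.txt --output-file=test.wav --speaker=anca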
Best, Tibi
Looking forward to testing the English model next month then! ;)
Currently, I can synthesize "Acesta este un test" with execution time=3.8562846183776855.
(Adding the final dot makes the 940MX go OOM, interestingly.)
I'm actually looking to build some kind of self-hosted AI assistant based on the ReSpeaker Core v2.0 (which is basically a SoC board plus a mic array), and I think real-time speech synthesis will be the pain point, as it's quite resource-heavy. I could try to run it on the SoC CPU if you want (but I don't expect any good performance out of that kind of SoC). That's why I had the idea of using a more capable machine dedicated to the TTS (rough sketch of the idea below), since cloud instances with a GPU all seem to be over $500 a month, and I don't really have anywhere at home to keep a 1080/Titan-based PC running 24/7.
It's going to be a good challenge to find a way to have a self-hosted TTS (and the ASR/NLU/other parts of an AI assistant) that can run 24/7 without requiring huge GPUs :)
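The rough idea would be to expose the synthesis on the dedicated machine through a tiny HTTP endpoint that the ReSpeaker can query. Just a sketch; synthesize_to_wav is a hypothetical placeholder for whatever the real TTS entry point ends up being:
from flask import Flask, request, send_file  # pip install flask
import io

app = Flask(__name__)

def synthesize_to_wav(text):
    # Hypothetical placeholder: call the real TTS (e.g. TTS-Cube) here
    # and return the WAV file contents as bytes.
    raise NotImplementedError

@app.route("/tts", methods=["POST"])
def tts():
    text = request.get_json(force=True)["text"]
    return send_file(io.BytesIO(synthesize_to_wav(text)), mimetype="audio/wav")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
The ReSpeaker would then just POST {"text": "..."} to /tts and play back the returned WAV.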
Thanks for the feedback. I'm going to limit the requests during synthesis so it won't generate OOM errors anymore.
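Something along these lines: split long inputs into sentence-sized chunks so only one chunk is synthesized on the GPU at a time, then concatenate the audio. Just a sketch; synthesize_chunk is a placeholder for the actual synthesis call:
import re

def synthesize_in_chunks(text, synthesize_chunk, max_chars=200):
    # Split on sentence boundaries and re-pack into chunks of at most max_chars.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    # synthesize_chunk is assumed to return the samples for one chunk.
    audio = []
    for chunk in chunks:
        audio.extend(synthesize_chunk(chunk))
    return audio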
Regarding what you are trying to build: have you looked at Google's Cloud speech services? They include a free tier for a limited number of characters (TTS) or amount of audio (speech recognition): https://cloud.google.com/text-to-speech/ https://cloud.google.com/speech-to-text/
Maybe you can query these services externally and not do any processing on your local hardware.
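For the TTS part, the call is pretty simple with their Python client. Roughly like this; exact class names depend on the google-cloud-texttospeech version you install:
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

client = texttospeech.TextToSpeechClient()
synthesis_input = texttospeech.SynthesisInput(text="This is a test.")
voice = texttospeech.VoiceSelectionParams(language_code="en-US")
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config)
with open("test.wav", "wb") as out:
    out.write(response.audio_content)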
Ah, I didn't know it was possible to limit the GPU usage (I also got OOM errors with some other machine learning projects I've played with); it would be awesome to have that!
Not really, to be honest, as the goal is to have a self-hosted solution and not rely on either Google Cloud or AWS (they also have Polly and some other products for personal assistants), for privacy reasons (and, well, let's face it, because it's much more fun to run everything yourself :smile:)
https://arxiv.org/abs/1904.03976 << Maybe this can be helpful? Apparently it's an alternative way to generate waveforms from mel spectrograms (compared to WaveNet).
(Commenting here to avoid creating a useless issue.)
Wow, this is fresh out of the oven. Thanks. I'm going to read it today. Sounds really interesting. BTW, still training the English encoder. It's starting to generate good samples. I will post some samples soon, but it still needs some training.
Yep, the preprint is from 3 days ago :smile: In the paper they say it's 1800 times faster than WaveNet ("GELP generated on average 389k samples/second (approximately 24 times real time equivalent rate), on a Titan X Pascal GPU. In comparison, the reference WaveNet with sequential inference generated 217 samples/second on average, which is slower by a factor of 1800.") and they also claim it trains faster than WaveNet (but don't say how much faster), which is interesting since AFAIK vocoding is currently the slowest part (WaveNet takes much more time than Tacotron).
Thanks for the update, sounds promising!
Oh, so it's not that fresh. I looked at the date from the link you sent me and it said: Submitted on 8 Apr 2019.
Isn't April 8 three days ago? :smile:
Sorry, I read 3 years ago in your post and I assumed there was another pre-print, which was older. I'm having a slow morning :)
No problem :wink:
Just added some examples for English here: https://github.com/tiberiu44/TTS-Cube/tree/master/examples/e2e/en.
The system is still training.
Hi @tiberiu44, thanks for uploading the English examples. What dataset of English speakers are you training it on?
Hi @roodrallec ,
Sorry for the late response, but I was away from the computer for the last couple of days. For English I'm only using the LJSpeech corpus.
Do you suggest adding any other corpora?
That sounds good. There's also VoxCeleb (http://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and the Oxford-BBC LRW dataset (http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html).
Wow! I had no idea about these corpora. They seem to be an amazing resource for both speech synthesis and recognition. Thanks for pointing them out.
Hi,
I've just added the encoder for English. It is currently only trained on LJ. I'm going to start working on another model which is multispeaker.
I'm going to close this issue for now.
Thanks for the model! :tada:
I've tried it, but unfortunately I cannot run synthesis.py, as the rnn_encoder.network was removed. I've tried with both the old English one and the Romanian one, and I get errors with those (RuntimeError: Dimensions of lookup parameter /_0 looked up from file ({100,42}) do not match parameters to be populated ({100,51})). Am I missing something?
I'm running the synthesis with python3 ./cube/synthesis.py --input-file=test.txt --output-file=test.wav --speaker=lj
Did you copy everything from models/en into models/? Also, you need to bunzip2 rnn.network.bz2 after copying.
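Something like this, assuming the English model follows the same layout as the RO instructions above:
$ cd TTS-Cube
$ cp data/models/en/* data/models/
$ bunzip2 data/models/rnn.network.bz2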
The rnn.network.bz2 that's under ro/? (There's no rnn.network.bz2 in en/, so I copied models/en/* into models/ and bunzip'd models/ro/rnn.network.bz2 into models/, and with that I get the error about the lookup parameter.)
Sorry about that. There is now :)
No problem :)
There's some progress (I've renamed rnn.network to rnn_encoder.network ;) ). I now get the same kind of error as with the RO encoder, but with different parameters: RuntimeError: Dimensions of parameter /_0 looked up from file ({200,120}) do not match parameters to be populated ({1024,100})
Do you know where this can come from?
It was my fault again. I added the wrong model. I've just reuploaded it.
OK :) Well, now it seems to work (but I got a CUDA out of memory error because of my shitty 940MX; it isn't a TTS-Cube error though :p)
@prototux , If you are interested, I got this from another contributor (@roodrallec): https://colab.research.google.com/drive/1cws1INmucRJ702eV4sKNJHzMDvrkg_lh
I've seen it, seems promising :)
I just got a computer that should allow me to have a proper setup for TTS-Cube. I still need to set it up properly, but it should help quite a bit more than the laptop.
Hello Tiberiu,
I'd love to test TTS-Cube, but unfortunately I don't currently have access to a good GPU (and I don't think I could train a TTS on a laptop with a 940MX). Do you have a pretrained English model? (It seems you were working on one, but I don't know its current status.)
Also, do you have an idea of what the hardware requirements for running the synthesis might be? For example, the NVIDIA Jetson Nano seems like a nice platform for a self-hosted TTS, but I'm not sure it's powerful enough to run TTS-Cube.