Hi @prototux ,
The multispeaker vocoder is already trained and uploaded. I'm currently working on an English model for the Encoder, but I think it will only be ready in 3-4 weeks.
I think the 4GB (if that is correct) on the 940MX should be OK for real-time speech synthesis. However, I've only managed to train models on GPUs with 8GB of RAM. I tested an NVIDIA Jetson with an image-processing CNN a while back and it seemed way slower than an i3 CPU at 1.8GHz, so I honestly don't think it is a good candidate for real-time speech synthesis. However, I do think it would be a really nice test. You could try a Romanian model (already available in this repo). I think you need to install PyTorch using something like this. To simply test the RO models, follow these instructions:
Pull this repo and then:
$ cd TTS-Cube
$ cp data/models/ro/* data/models/
$ bunzip2 data/models/rnn_encoder.network.bz2
$ echo "Acesta este un test." > test.txt
$ python3 cube/synthesis.py --input-file=test.txt --output-file=test.wav --speaker=anca
If you test TTS-Cube on your hardware, I'm curious about the performance.
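A quick way to get a rough number is to prefix the synthesis command with the standard `time` utility:
$ time python3 cube/synthesis.py --input-file=test.txt --output-file=test.wav --speaker=anca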
Best, Tibi
Looking forward to testing the English model next month then! ;)
Currently, I can synthesize "Acesta este un test" with execution time=3.8562846183776855.
(Adding the final dot makes the 940MX go OOM, interestingly.)
I'm actually looking to build some kind of self-hosted AI assistant based on the ReSpeaker Core v2.0 (which is basically a SoC board plus a mic array), and I think real-time speech synthesis will be the pain point, as it's quite resource-heavy. I could try to run it on the SoC CPU if you want (but I don't expect any good performance out of that kind of SoC). That's why I had the idea of using a more capable machine dedicated to the TTS (rough sketch of the idea below), since cloud instances with a GPU all seem to be over $500 a month, and I don't really have anywhere at home to keep a 1080/Titan-based PC running 24/7.
It's going to be a good challenge to find a way to have a self-hosted TTS (and the ASR/NLU/other parts of an AI assistant) that can run 24/7 without requiring huge GPUs :)
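The rough idea would be to expose the synthesis on the dedicated machine through a tiny HTTP endpoint that the ReSpeaker can query. Just a sketch; synthesize_to_wav is a hypothetical placeholder for whatever the real TTS entry point ends up being:
from flask import Flask, request, send_file  # pip install flask
import io

app = Flask(__name__)

def synthesize_to_wav(text):
    # Hypothetical placeholder: call the real TTS (e.g. TTS-Cube) here
    # and return the WAV file contents as bytes.
    raise NotImplementedError

@app.route("/tts", methods=["POST"])
def tts():
    text = request.get_json(force=True)["text"]
    return send_file(io.BytesIO(synthesize_to_wav(text)), mimetype="audio/wav")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
The ReSpeaker would then just POST {"text": "..."} to /tts and play back the returned WAV.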
Thanks for the feedback. I'm going to limit the requests during synthesis so it won't generate OOM errors anymore.
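Something along these lines: split long inputs into sentence-sized chunks so only one chunk is synthesized on the GPU at a time, then concatenate the audio. Just a sketch; synthesize_chunk is a placeholder for the actual synthesis call:
import re

def synthesize_in_chunks(text, synthesize_chunk, max_chars=200):
    # Split on sentence boundaries and re-pack into chunks of at most max_chars.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    # synthesize_chunk is assumed to return the samples for one chunk.
    audio = []
    for chunk in chunks:
        audio.extend(synthesize_chunk(chunk))
    return audio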
Regarding what you are trying to build: have you looked at Google's Cloud speech services? They include a free tier for a limited number of characters (TTS) or amount of audio (speech recognition): https://cloud.google.com/text-to-speech/ https://cloud.google.com/speech-to-text/
Maybe you can query these services externally and not do any processing on your local hardware.
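For the TTS part, the call is pretty simple with their Python client. Roughly like this; exact class names depend on the google-cloud-texttospeech version you install:
from google.cloud import texttospeech  # pip install google-cloud-texttospeech

client = texttospeech.TextToSpeechClient()
synthesis_input = texttospeech.SynthesisInput(text="This is a test.")
voice = texttospeech.VoiceSelectionParams(language_code="en-US")
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config)
with open("test.wav", "wb") as out:
    out.write(response.audio_content)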
Ah, I didn't know it was possible to limit the GPU usage (I also got OOM errors with some other machine learning projects I've played with); it would be awesome to have that!
Not really, to be honest, as the goal is to have a self-hosted solution and not rely on either Google Cloud or AWS (they also have Polly and some other products for personal assistants), for privacy reasons (and, well, let's face it, because it's much more fun to run everything yourself :smile:)
https://arxiv.org/abs/1904.03976 << Maybe this can be helpful? Apparently it's an alternative way to generate waveforms from mel spectrograms (compared to WaveNet).
(Commenting here to avoid creating a useless issue.)
Wow, this is fresh out of the oven. Thanks. I'm going to read it today. Sounds really interesting. BTW, still training the English encoder. It's starting to generate good samples. I will post some samples soon, but it still needs some training.
Yep, the preprint is from 3 days ago :smile: In the paper they say it's 1800 times faster than WaveNet ("GELP generated on average 389k samples/second (approximately 24 times real time equivalent rate), on a Titan X Pascal GPU. In comparison, the reference WaveNet with sequential inference generated 217 samples/second on average, which is slower by a factor of 1800.") and they also claim it trains faster than WaveNet (but don't say how much faster), which is interesting since AFAIK vocoding is currently the slowest part (WaveNet takes much more time than Tacotron).
Thanks for the update, sounds promising!
Oh, so it's not that fresh. I looked at the date from the link you sent me and it said: Submitted on 8 Apr 2019.
Isn't April 8 three days ago? :smile:
Sorry, I read 3 years ago in your post and I assumed there was another pre-print, which was older. I'm having a slow morning :)
No problem :wink:
Just added some examples for English here: https://github.com/tiberiu44/TTS-Cube/tree/master/examples/e2e/en.
The system is still training.
Hi @tiberiu44, thanks for uploading the English examples. What dataset of English speakers are you training it on?
Hi @roodrallec ,
Sorry for the late response, but I was away from the computer for the last couple of days. For English I'm only using the LJSpeech corpus.
Do you suggest adding any other corpora?
That sounds good. There's also VoxCeleb (http://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and the Oxford-BBC LRW dataset (http://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrw1.html).
Wow! I had no idea about these corpora. They seem to be an amazing resource for both speech synthesis and recognition. Thanks for pointing them out.
Hi,
I've just added the encoder for English. It is currently only trained on LJ. I'm going to start working on another model which is multispeaker.
I'm going to close this issue for now.
Thanks for the model! :tada:
I've tried it, but unfortunately I cannot run synthesis.py, as the rnn_encoder.network was removed. I've tried with both the old English one and the Romanian one, and I get errors with those (RuntimeError: Dimensions of lookup parameter /_0 looked up from file ({100,42}) do not match parameters to be populated ({100,51})). Am I missing something?
I'm running the synthesis with python3 ./cube/synthesis.py --input-file=test.txt --output-file=test.wav --speaker=lj
Did you copy everything from models/en into models/? Also, you need to bunzip2 rnn.network.bz2 after copying.
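Something like this, assuming the English model follows the same layout as the RO instructions above:
$ cd TTS-Cube
$ cp data/models/en/* data/models/
$ bunzip2 data/models/rnn.network.bz2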
The rnn.network.bz2 that's under ro/? (There's no rnn.network.bz2 in en/, so I copied models/en/* into models/ and bunzip'd models/ro/rnn.network.bz2 into models/, and with that I get the error about the lookup parameter.)
Sorry about that. There is now :)
No problem :)
There's some progress (I've renamed rnn.network to rnn_encoder.network ;) ). I now get the same kind of error as with the RO encoder, but with different parameters: RuntimeError: Dimensions of parameter /_0 looked up from file ({200,120}) do not match parameters to be populated ({1024,100})
Do you know where this can come from?
It was my fault again. I added the wrong model. I've just reuploaded it.
OK :) Well, now it seems to work (but I got a CUDA out of memory error because of my shitty 940MX; it isn't a TTS-Cube error though :p)
@prototux , If you are interested, I got this from another contributor (@roodrallec): https://colab.research.google.com/drive/1cws1INmucRJ702eV4sKNJHzMDvrkg_lh
I've seen it, seems promising :)
I just got a computer that should allow me to have a proper setup for TTS-Cube. I still need to set it up properly, but it should help quite a bit more than the laptop.
Hello Tiberiu,
I'd love to test TTS-Cube, but unfortunately I don't currently have access to a good GPU (and I don't think I could train a TTS on a laptop with a 940MX). Do you have a pretrained English model? (It seems you were working on one, but I don't know its current status.)
Also, do you have an idea of what the hardware requirements for running the synthesis might be? For example, the NVIDIA Jetson Nano seems like a nice platform for a self-hosted TTS, but I'm not sure it's powerful enough to run TTS-Cube.