New: An interactive demo using Google Colaboratory can be found here.
TTS-Cube is an end-to-end speech synthesis system that provides a full processing pipeline to train and deploy TTS models.
It is entirely based on neural networks, requires no pre-aligned data and can be trained to produce audio just by using character or phoneme sequences.
Markdown does not allow embedding audio files. For a better experience, check out the project's website.
For installation please follow these instructions. Training and usage examples can be found here. A notebook demo can be found here.
Encoder outputs:
"Arată că interesul utilizatorilor de internet față de acțiuni ecologiste de genul Earth Hour este unul extrem de ridicat."
"Pentru a contracara proiectul, Rusia a demarat un proiect concurent, South Stream, în care a încercat să atragă inclusiv o parte dintre partenerii Nabucco."
Vocoder output (conditioned on gold-standard data):
Note: The mel-spectrum is computed with a frame-shift of 12.5ms. This means that Griffin-Lim reconstruction produces sloppy results at best (regardless of the number of iterations).
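To make the 12.5ms frame-shift concrete, here is a minimal sketch that converts it into a hop length in samples and counts the resulting frames. The sample rates used below are assumptions for illustration only (this README does not state the one TTS-Cube uses), and the helper names `frame_shift_to_hop` and `num_frames` are hypothetical, not part of the TTS-Cube code.

```python
def frame_shift_to_hop(sample_rate_hz: int, frame_shift_ms: float) -> int:
    """Convert a frame-shift in milliseconds to a hop length in samples."""
    return int(round(sample_rate_hz * frame_shift_ms / 1000.0))


def num_frames(num_samples: int, hop: int) -> int:
    """Number of frames when a new frame starts every `hop` samples."""
    # Ceiling division, so a trailing partial hop still counts as a frame.
    return -(-num_samples // hop)


# Assumed sample rates, for illustration only.
hop_16k = frame_shift_to_hop(16000, 12.5)
hop_24k = frame_shift_to_hop(24000, 12.5)

# One second of audio at the assumed 16 kHz rate.
frames_per_second = num_frames(16000, hop_16k)
```

Under these assumptions, each mel frame advances by a few hundred samples, so the vocoder has to generate that many waveform samples for every conditioning frame.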
The encoder model is still converging, so the examples are currently of low quality. We will update the files as soon as we have a stable encoder model.
TTS-Cube is based on concepts described in Tacotron (1 and 2), Char2Wav and WaveRNN, but its architecture does not stick to the exact recipes:
The ParallelWavenet/ClariNet code is adapted from this ClariNet repo.