This is the repository for our paper, Nix-TTS (accepted to IEEE SLT 2022). We release the pretrained models, an interactive demo, and audio samples below.
[Paper Link (Coming Soon!)] [Interactive Demo] [Audio Samples]
Abstract: Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size, or use neural architecture search but often suffer from high training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation to a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation to the encoder and decoder modules. The resulting Nix-TTS inherits the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet is significantly smaller in size, with only 5.23M parameters, or up to an 89.34% reduction from the teacher model. It also achieves over 3.04× and 8.36× inference speedup on an Intel-i7 CPU and a Raspberry Pi 3B, respectively, while still retaining fair voice naturalness and intelligibility compared to the teacher model.
Clone the nix-tts repository and move to its directory
git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts
Install the dependencies
python >= 3.8
pip install -r requirements.txt
sudo apt-get install espeak
If that does not work, follow the official espeak installation instructions.
Download your chosen pre-trained model here.
Model | Num. of Params | Faster than real-time* (CPU Intel-i7) | Faster than real-time* (RasPi Model 3B) |
---|---|---|---|
Nix-TTS (ONNX) | 5.23 M | 11.9x | 0.50x |
Nix-TTS w/ Stochastic Duration (ONNX) | 6.03 M | 10.8x | 0.50x |
\* Here we compute how much faster than real-time the model runs as the inverse of the Real Time Factor (RTF). The complete table of speedups for all models is detailed in the paper.
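The footnote's metric can be illustrated with a small calculation. This is a minimal sketch, assuming RTF is defined in the usual way as synthesis time divided by the duration of the generated audio. The function name and the timing values are hypothetical and are not taken from the paper's benchmark code.

```python
def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return 1 / RTF, where RTF = synthesis time / duration of generated audio."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    rtf = synthesis_seconds / audio_seconds
    return 1.0 / rtf


# Hypothetical measurement: 44100 samples at 22050 Hz is 2.0 s of audio,
# and we pretend synthesis took 0.5 s of wall-clock time.
audio_seconds = 44100 / 22050
print(speedup_over_real_time(0.5, audio_seconds))  # 4.0
```

Under this definition, a value above 1 means speech is generated faster than it plays back. The 11.9x entry in the table corresponds to an RTF of about 0.084, and the 0.50x entry to an RTF of 2.0, which is slower than real-time.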
And running Nix-TTS is as easy as:
from nix.models.TTS import NixTTSInference
from IPython.display import Audio
# Initiate Nix-TTS
nix = NixTTSInference(model_dir = "<path_to_the_downloaded_model>")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)
# Listen to the generated speech
Audio(xw[0,0], rate = 22050)
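Outside a notebook, the generated speech can be written to disk instead of played inline. The following is a minimal sketch using only the Python standard library. The helper `write_wav` is hypothetical and not part of Nix-TTS, and it assumes the waveform is a mono sequence of floats in the range [-1.0, 1.0] sampled at 22050 Hz. A synthetic tone stands in for the model output so that the block runs on its own.

```python
import array
import math
import sys
import wave


def write_wav(path, samples, sample_rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip each sample, then scale it to the signed 16-bit integer range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV files store samples in little-endian order
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes = 16 bits per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())


# Stand-in for the generated waveform: one second of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
write_wav("sample.wav", tone)
```

If the assumption about the value range holds for the model output, the same helper could be called as `write_wav("nix_tts.wav", xw[0, 0])` after the `vocalize` step.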