The phoneme dictionary was extended. A VITS model trained on speaker data of "Hokuspokus Clean" was added.
A multispeaker model by NVIDIA was added (
Text-to-speech inference web service based on Tacotron 2 and Multi-Band MelGAN, trained on the HUI-Audio-Corpus-German and evaluated in Neural Speech Synthesis in German. Try it out at
Requirements:
PyTorch may need to be installed separately (see
Preparation:
Create a virtual environment and install the requirements.
Open a Python interpreter session in the previously created virtual environment and run:
Before the TTS models can be used, download them from and extract them to tts_inferencer/speakers
Before the STT models can be used, download them from and extract them to asr_inferencer/models
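The extraction step for both model sets can be sketched in Python as follows. This is a minimal sketch, not part of the repository: the helper name `extract_models` and the archive file names in the usage comment are hypothetical, and it assumes the downloads are zip or tar archives, which `shutil.unpack_archive` recognizes by file extension.

```python
import shutil
from pathlib import Path


def extract_models(archive_path, target_dir):
    """Unpack a downloaded model archive into target_dir.

    Hypothetical helper: the archive format is assumed to be one that
    shutil.unpack_archive supports (zip, tar, tar.gz, ...).
    """
    target = Path(target_dir)
    # Create the target directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), str(target))
    # Return the top-level entries so the result can be checked quickly.
    return sorted(p.name for p in target.iterdir())


# Usage (archive names are placeholders for the files you downloaded):
# extract_models("tts_models.zip", "tts_inferencer/speakers")
# extract_models("stt_models.zip", "asr_inferencer/models")
```

The returned list of top-level entries is a quick way to confirm that the speaker or model folders ended up directly inside the target directory rather than one level deeper.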
To start the server in debug settings, run "python3". Access it at
Further Notes:
If symbolic links for tacotron2 models are broken, recreate them using "ln -s
Keep in mind that this service does not include number normalization yet, so do not enter digits; write numbers out as words instead (for example "zwei" rather than "2").
The incorporated ASR model was taken from, check out their work:
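Because of the missing number normalization noted above, input text can be pre-processed before it is sent to the service. The following is a minimal sketch under stated assumptions: `spell_out_digits` and `GERMAN_DIGITS` are hypothetical names that do not exist in this repository, and the function spells numbers digit by digit, so "42" becomes "vier zwei" rather than "zweiundvierzig". It is a workaround, not real number normalization.

```python
import re

# Hypothetical lookup table: German words for the digits 0-9.
GERMAN_DIGITS = {
    "0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}


def spell_out_digits(text: str) -> str:
    """Replace each run of ASCII digits with German words, digit by digit."""
    def replace_run(match):
        # Spell every digit of the run separately, joined by spaces.
        return " ".join(GERMAN_DIGITS[d] for d in match.group(0))

    # [0-9] is used instead of \d so only ASCII digits are matched.
    return re.sub(r"[0-9]+", replace_run, text)


def contains_digits(text: str) -> bool:
    """Return True if the text still contains any ASCII digit."""
    return re.search(r"[0-9]", text) is not None
```

For example, `spell_out_digits("Ich habe 2 Katzen")` yields `"Ich habe zwei Katzen"`. `contains_digits` can be used to reject or flag input that still needs manual rewriting.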