The phoneme dictionary was extended. A VITS model trained on speaker data of "Hokuspokus Clean" was added.
A multispeaker model by NVIDIA was added (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_de_fastpitch_multispeaker_5)
Text-to-speech inferencing web service based on Tacotron 2 and Multi-Band MelGAN, trained using the HUI-Audio-Corpus-German and evaluated in "Neural Speech Synthesis in German".
Try it out at http://narvi.sysint.iisys.de/projects/tts.
Requirements:
PyTorch may need to be installed separately (see https://pytorch.org/get-started/locally/)
Preparation: Create a virtual environment and install the requirements. Then open a Python interpreter session in that virtual environment and run:
Before the TTS models can be used, download them from https://opendata.iisys.de/systemintegration/Models/speakers.tar.gz and extract them to tts_inferencer/speakers
Before the STT models can be used, download them from https://opendata.iisys.de/systemintegration/Models/asr_models.zip and extract them to asr_inferencer/models
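The two download steps above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the helper names `unpack_into` and `fetch_models` and the `downloads` directory are hypothetical, and the internal layout of the archives is not known here, so check after extraction that the files ended up directly in the target directories.

```python
import shutil
import urllib.request
from pathlib import Path

# Archive URLs and target directories as given in the text above.
MODEL_ARCHIVES = {
    "https://opendata.iisys.de/systemintegration/Models/speakers.tar.gz":
        "tts_inferencer/speakers",
    "https://opendata.iisys.de/systemintegration/Models/asr_models.zip":
        "asr_inferencer/models",
}


def unpack_into(archive_path, target_dir):
    """Extract a .tar.gz or .zip archive into target_dir, creating it if needed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # shutil picks the archive format from the file extension.
    shutil.unpack_archive(str(archive_path), str(target))
    return target


def fetch_models(download_dir="downloads"):
    """Hypothetical helper: download each archive once, then unpack it."""
    Path(download_dir).mkdir(parents=True, exist_ok=True)
    for url, target_dir in MODEL_ARCHIVES.items():
        archive_path = Path(download_dir) / url.rsplit("/", 1)[-1]
        if not archive_path.exists():
            urllib.request.urlretrieve(url, archive_path)
        unpack_into(archive_path, target_dir)
```

Calling `fetch_models()` from the repository root would perform both downloads. Nothing is downloaded when the module is merely imported.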
To start the server with debug settings, run "python3 app.py". Then access it at http://127.0.0.1:5000.
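To confirm that the server is reachable after starting it, a small check like the one below may help. It is only a sketch: `server_is_up` is a hypothetical helper, and it requests nothing beyond the root URL given above, since the service's endpoints are not documented here.

```python
import urllib.error
import urllib.request


def server_is_up(url="http://127.0.0.1:5000", timeout=3.0):
    """Return True if any HTTP response comes back from url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: nothing is listening.
        return False
```

With the server running, `server_is_up()` should return `True`; if it returns `False`, check that "python3 app.py" is still running and that port 5000 is not blocked.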
Further Notes:
If symbolic links for tacotron2 models are broken, recreate them using "ln -s
Keep in mind that this service does not include number normalization yet, so do not input any digits. Spell numbers out instead (for example, write "zwei" rather than "2").
The incorporated ASR model was taken from https://github.com/AASHISHAG/deepspeech-german. Check out their work at https://www.researchgate.net/publication/336532830_German_End-to-end_Speech_Recognition_based_on_DeepSpeech.
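Regarding the note on number normalization above: until the service handles digits itself, input text could be pre-processed on the client side. The sketch below is not part of the service. The function names `german_number` and `spell_out_digits` are hypothetical, only whole numbers from 0 to 99 are covered, and grammatical context (such as "ein" versus "eins" before a noun), decimals, dates and ordinals are ignored.

```python
import re

# German words for 0..19.
UNITS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn",
]
# German words for the multiples of ten from 20 to 90.
TENS = {
    2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
    6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig",
}


def german_number(n):
    """Spell out an integer between 0 and 99 in German."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 are supported in this sketch")
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    if unit == 0:
        return TENS[tens]
    # German puts the unit first: 21 -> "ein" + "und" + "zwanzig".
    unit_word = "ein" if unit == 1 else UNITS[unit]
    return unit_word + "und" + TENS[tens]


def spell_out_digits(text):
    """Replace every run of digits in text with its German spelling."""
    return re.sub(r"\d+", lambda m: german_number(int(m.group())), text)
```

For example, `spell_out_digits("Ich habe 2 Katzen")` yields `"Ich habe zwei Katzen"`. Numbers above 99 raise a `ValueError`, so unsupported input is noticed rather than silently passed to the synthesizer.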