kaiidams / NeMoOnnxSharp

Text-to-speech and speech recognition, VAD with NVIDIA NeMo and ONNX Runtime for .NET Core.
Apache License 2.0

Possible to improve English and German pronunciation? #26

Open GeorgeS2019 opened 9 months ago

GeorgeS2019 commented 9 months ago

NVIDIA NeMo (ByT5 G2P and G2P-Conformer):

NVIDIA NeMo provides grapheme-to-phoneme models for various languages, including German.

The ByT5 G2P model is based on a neural network and can handle out-of-vocabulary (OOV) words and heteronyms (words with the same spelling but different pronunciations).

The G2P-Conformer model is a non-autoregressive CTC model that is faster during inference.

These models allow you to enforce desired pronunciations by providing a phonetic transcript of the input. You can train and evaluate these models using manifest files containing grapheme and phoneme pairs.
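To make the manifest idea concrete, here is a minimal sketch of writing and reading grapheme/phoneme pairs as JSON Lines. The field names `text_graphemes` and `text` are an assumption based on my reading of the NeMo G2P documentation and should be checked against the NeMo version in use. The German words, their IPA strings, and the helper names `write_manifest` and `read_manifest` are hypothetical and only for illustration.

```python
import json

# Hypothetical grapheme/phoneme pairs; the IPA strings are illustrative only.
EXAMPLE_PAIRS = [
    ("Haus", "haʊs"),
    ("Straße", "ˈʃtʁaːsə"),
]


def write_manifest(path, pairs):
    """Write one JSON object per line (JSON Lines).

    The keys "text_graphemes" and "text" are assumed field names.
    """
    with open(path, "w", encoding="utf-8") as f:
        for graphemes, phonemes in pairs:
            entry = {"text_graphemes": graphemes, "text": phonemes}
            # ensure_ascii=False keeps IPA symbols and umlauts readable.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Read the manifest back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `write_manifest("train_manifest.json", EXAMPLE_PAIRS)` would produce a file with one pair per line, which is the general shape such training and evaluation files take.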

GeorgeS2019 commented 4 months ago

image

Is it possible to do this using NeMoOnnxSharp for German?

kaiidams commented 4 months ago

It supports both TTS and ASR for German. See this: https://github.com/kaiidams/NeMoOnnxSharp/blob/ad2ffe375e525bb63c59c9b1cd5154afe70351a0/NeMoOnnxSharp.Example/Program.cs#L39

GeorgeS2019 commented 4 months ago

I have used the code for German.

Here is the feedback

GeorgeS2019 commented 4 months ago

Second,

I have seen the Mel and MFCC code. I wonder whether this code can be repurposed for German audio and eventually used to extract German phonemes from German audio.

There is hardly anything like this anywhere on the internet. Even for Wav2Vec2, examples showing how to work with the German language are rare.

Can you do something about this?

GeorgeS2019 commented 4 months ago

It supports both German TTS/ASR. See this

I have tried TTS/ASR for German. My interest is in extracting German phonemes from German audio.

kaiidams commented 4 months ago

In the case of German, pronunciation is not ambiguous. Why do you need a phonemizer? In the case of English, NeMo FastPitch was trained with a phonemizer that converts all but the ambiguous words to phonemes, and FastPitch can handle the ambiguous words in many cases.
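The idea of a phonemizer that converts "all but ambiguous words" can be illustrated with a toy sketch. This is not the implementation in NeMo or NeMoOnnxSharp. The dictionary entries, the ARPABET-like symbols, and the function name `phonemize` are hypothetical. The sketch assumes a word is "ambiguous" when the dictionary lists more than one pronunciation for it, and that ambiguous or unknown words are passed through as plain characters for the TTS model to resolve.

```python
import re

# Toy pronunciation dictionary (hypothetical entries, ARPABET-like symbols).
# A word with more than one pronunciation is treated as ambiguous.
PRON_DICT = {
    "hello": [["HH", "AH0", "L", "OW1"]],
    "world": [["W", "ER1", "L", "D"]],
    "read": [["R", "IY1", "D"], ["R", "EH1", "D"]],  # heteronym
}


def phonemize(text):
    """Return a token list.

    Unambiguous dictionary words become phoneme tokens. Ambiguous words,
    unknown words, spaces, and punctuation are kept as single characters.
    """
    tokens = []
    # Split into runs of letters/apostrophes and single other characters.
    for piece in re.findall(r"[a-z']+|[^a-z']", text.lower()):
        prons = PRON_DICT.get(piece)
        if prons is not None and len(prons) == 1:
            tokens.extend(prons[0])
        else:
            tokens.extend(piece)  # fall back to graphemes
    return tokens
```

With this sketch, `phonemize("hello read")` yields phonemes for "hello" and the characters `r`, `e`, `a`, `d` for "read". A language whose spelling maps consistently to sounds could skip the dictionary step and feed characters directly, which matches the point made about German above.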

GeorgeS2019 commented 4 months ago

https://github.com/kaiidams/NeMoOnnxSharp/blob/main/NeMoOnnxSharp/TTSTokenizers/EnglishG2p.cs

Is there GermanG2P.cs in NeMoOnnxSharp?

their pronunciation is not ambiguous.

Please explain. I am not sure I understand how this affects how to proceed.

GeorgeS2019 commented 4 months ago

FastPitch is a text-to-speech (TTS) model developed by NVIDIA. It's a fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration¹. Here are some key features:

FastPitch is used for generating mel spectrograms from text, which can then be converted to audio using a vocoder¹. It's trained on the LJSpeech dataset sampled at 22050Hz and has been tested on generating female English voices with an American accent¹. Please note that this model works well with vocoders that were trained on 22050Hz data¹.

Source: Conversation with Bing, 3/30/2024
(1) TTS En FastPitch | NVIDIA NGC. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch.
(2) GitHub - NVIDIA/NeMo: NeMo: a framework for generative AI. https://github.com/NVIDIA/NeMo.
(3) Google Colab. https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/FastPitch_MixerTTS_Training.ipynb.
(4) https://arxiv.org/abs/2006.06873.
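Since the note above ties the model to 22050 Hz data, a quick check of a WAV file's sample rate can catch mismatches early. The following is a small sketch using only Python's standard `wave` module. The function names and the constant `EXPECTED_RATE` are hypothetical, and the sketch only handles uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 22050  # sample rate mentioned for the FastPitch model above


def wav_sample_rate(path):
    """Return the sample rate (frames per second) of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_expected_rate(path, expected=EXPECTED_RATE):
    """True if the file's sample rate equals the expected rate."""
    return wav_sample_rate(path) == expected
```

If `matches_expected_rate` returns `False`, the audio would need resampling before it is compared with or fed to components trained at 22050 Hz.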

kaiidams commented 4 months ago

Is there GermanG2P.cs in NeMoOnnxSharp?

NeMo's FastPitch uses a phonemizer for English but not for German. NeMoOnnxSharp doesn't contain a German phonemizer.