huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0
2.7k stars, 276 forks

Custom pronunciation for words - any thoughts / recommendations about how best to handle them? #47

Open nmstoker opened 2 weeks ago

nmstoker commented 2 weeks ago

Hello! This is a really interesting looking project.

Currently there doesn't seem to be any way for users to help the model pronounce custom words correctly - for instance, JPEG is something that speakers just need to know is said as "Jay-Peg" rather than "Jay-Pea-Ee-Gee".

I appreciate this project is at an early stage but for practical uses, especially with brands and product names often having quirky ways of saying words or inventing completely new words, it's essential to be able to handle their correct pronunciation on some sort of override basis. It's not just brands - plenty of people's names need custom handling and quite a few novel computer words are non-obvious too.

Examples that cause problems in the current models: Cillian, Joaquin, Deirdre, Versace, Tag Heuer, Givenchy, gigabytes, RAM, MPEG etc.

Are there any suggestions on how best to tackle this?

I saw there was #33 which uses a normaliser specifically for numbers. Is there something similar for custom words? I suppose perhaps one could drop in a list of custom words and some sort of mapping to the desired pronunciation, applying that as a stage similar to how it handles abbreviations.
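To make the idea concrete, here is a rough sketch of what such an override stage could look like, run on the text before it reaches the tokenizer. Everything in it is an assumption for illustration: the `CUSTOM_PRONUNCIATIONS` table, the `apply_overrides` helper and the respellings themselves are hypothetical, are not part of parler-tts, and have not been checked against what the model actually produces.

```python
import re

# Hypothetical override table: written form -> respelling.
# The respellings are illustrative guesses, not tested against the model.
CUSTOM_PRONUNCIATIONS = {
    "JPEG": "jay-peg",
    "MPEG": "em-peg",
    "Tag Heuer": "tag hoy-er",
    "Joaquin": "wah-keen",
}


def build_pattern(overrides):
    """Compile one regex matching any override key as a whole word or phrase."""
    # Longest keys first so multi-word entries win over shorter ones.
    keys = sorted(overrides, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in keys)
    # The lookarounds stop matches inside longer words (e.g. "JPEGs").
    return re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)


def apply_overrides(text, overrides=CUSTOM_PRONUNCIATIONS):
    """Replace each known custom word in `text` with its respelling."""
    lowered = {k.lower(): v for k, v in overrides.items()}
    pattern = build_pattern(overrides)
    return pattern.sub(lambda m: lowered[m.group(0).lower()], text)


prompt = apply_overrides("Save the photo of Joaquin as a JPEG.")
print(prompt)
```

The returned string would then be used as the prompt in place of the raw text. Whether plain respellings like these are enough to steer the model is exactly the open question here.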

In espeak-backed tools, it's sometimes possible to override the default generated IPA for a word with custom IPA, but I believe this model doesn't use IPA to control pronunciation.

Given the frequently varying pronunciations, I doubt that simply finetuning to include the words would be a viable approach.

Anyway, would be great to hear what others have to recommend.

Incidentally, certain mainstream terms also get completely garbled: it seems impossible to get Instagram, Linux or Wikipedia spoken properly. That's more of a training data issue, though, and those are mainstream enough that you shouldn't need to cover them via custom overrides.

nmstoker commented 2 weeks ago

Also, maybe best as a separate issue, but heteronyms are worth consideration too for practical uses.

These can't be handled by a trivial lookup since they can vary even within the same sentence depending on context:

"They had a row about exactly whose turn it was to row the boat."
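A quick sketch of why a flat lookup falls short on that sentence. The `FLAT_LOOKUP` table and the respelling "rao" (meant to rhyme with "cow") are hypothetical placeholders, used only to show the behaviour:

```python
import re

# Hypothetical flat lookup: one respelling per written form.
# "rao" is an illustrative respelling for the "argument" sense of "row".
FLAT_LOOKUP = {"row": "rao"}


def apply_flat_lookup(text, lookup=FLAT_LOOKUP):
    """Replace every whole-word occurrence of a key, ignoring context."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, lookup)) + r")\b")
    return pattern.sub(lambda m: lookup[m.group(0)], text)


sentence = "They had a row about exactly whose turn it was to row the boat."
result = apply_flat_lookup(sentence)
# Both occurrences are rewritten, including the verb that should keep
# its default reading (rhyming with "go").
print(result)
```

Since the lookup only sees the spelling, it cannot tell the two senses apart, so handling these would need some notion of context (part of speech at the very least) rather than a word list.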