rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Feature: Support abbreviations #54

Open disconn3ct opened 1 year ago

disconn3ct commented 1 year ago

(I'm using the released standalone binary, so this may not be an issue when used with HA.)

Sensors in HA often have some form of unit text available (e.g. kph or °C), and the cloud TTS services usually expand it to the spoken version ("kilometers per hour"). Piper just says the letters themselves ("k p h").

Intent guessing would be great, but realistically it would be nice to be able to provide a list of string replacements. That could also help with pronunciation of people or place names, by "invisibly" replacing them with a better string.

It could be handled by updating the text earlier in the pipeline, but it seems pretty specific to the speech output (and even per-voice in some cases).
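In the meantime, a small preprocessing pass over the input text can cover the global-replacement case. This is only a sketch, not anything in Piper; the table contents and function name are made up:

```python
import re

# Hypothetical replacement table, applied to the text before it is
# sent to the TTS engine. Plain substring matching, case-sensitive.
REPLACEMENTS = {
    "kph": "kilometers per hour",
    "°C": "degrees Celsius",
}

def expand_abbreviations(text: str) -> str:
    """Replace each abbreviation with its spoken form."""
    # Sort keys longest-first so a longer key wins over a shorter prefix.
    pattern = re.compile(
        "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
    )
    return pattern.sub(lambda m: REPLACEMENTS[m.group(0)], text)

print(expand_abbreviations("Wind speed is 12 kph at -3 °C"))
# → Wind speed is 12 kilometers per hour at -3 degrees Celsius
```

A real implementation would probably want word-boundary matching so that e.g. "kph" inside a longer token is left alone.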

trunglebka commented 1 year ago

Preprocessing text before feeding it to the TTS engine is a practical choice.

synesthesiam commented 1 year ago

This is an artifact of espeak-ng. Any chance someone knows how to patch words in espeak-ng at runtime?
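Not at runtime as far as I know, but espeak-ng does allow adding or overriding word entries by editing the dictionary source files and recompiling. If I'm reading the dictionary docs right, the `$text` flag makes an entry expand to replacement text rather than phonemes. A rough, unverified example (assumes an espeak-ng source checkout):

```
# dictsource/en_list — added entry; "$text" tells espeak-ng to speak
# the replacement text instead of treating the word as phonemes:
kph	kilometers per hour	$text

# then recompile the English dictionary from the dictsource/ directory:
#   espeak-ng --compile=en
```

That still means shipping a recompiled dictionary per language, not patching words on the fly.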

disconn3ct commented 1 year ago

> Preprocessing text before feeding it to the TTS engine is a practical choice.

As mentioned, that works for global replacements like abbreviations, but it isn't sufficient for anything else. Anything else requires knowing the configuration of the TTS engine. The text input that causes Ryan to pronounce "Shaena" correctly is not the same as for the Southern-UK voice, or for Google TTS. Preprocessing requires cloning that configuration into the text pipeline, dependent on hidden configuration in the engine.

Voice specific tuning should happen where the voice is generated.
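The per-voice coupling described above can be sketched as a replacement table keyed by voice, which is exactly the configuration duplication a text-side preprocessor would be stuck with. Voice names match Piper's naming scheme; the respellings are invented for illustration:

```python
# Hypothetical per-voice pronunciation overrides: the same name needs a
# different spelled-out form per voice/engine, so the preprocessor must
# know which voice the engine is configured to use.
VOICE_OVERRIDES = {
    "en_US-ryan": {"Shaena": "SHAY-na"},
    "en_GB-southern_english_female": {"Shaena": "Shay nah"},
}

def apply_overrides(text: str, voice: str) -> str:
    """Apply the override table for the given voice, if any."""
    for word, replacement in VOICE_OVERRIDES.get(voice, {}).items():
        text = text.replace(word, replacement)
    return text

print(apply_overrides("Call Shaena", "en_US-ryan"))
# → Call SHAY-na
```

Doing this inside the voice configuration instead would keep the text pipeline ignorant of which engine or voice is active.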

(Architecturally, I dislike the add-on locking down the config so tightly. The API/CLI trivially changes voice config on the fly, and if I could pass that config in the TTS call it would make this whole thing simple.)

ETA: related upstream issue: https://github.com/espeak-ng/espeak-ng/issues/115

synesthesiam commented 1 year ago

> (Architecturally, I dislike the add-on locking down the config so tightly. The API/CLI trivially changes voice config on the fly, and if I could pass that config in the TTS call it would make this whole thing simple.)

This was purely for speed and memory usage. Loading the voice often takes more time than synthesis on a Raspberry Pi with an SD card, so the voice needs to be pre-loaded. But these devices usually only have 2 GB of RAM, so having more than one voice loaded into memory can cause problems elsewhere :/

My plan for the add-on is to allow specifying how many voices can be loaded at once, and then keep the most used voices pre-loaded.
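Keeping the N most-used voices resident could be done with a simple LRU cache. A minimal sketch, with hypothetical names, not the add-on's actual code:

```python
from collections import OrderedDict

class VoiceCache:
    """Keep at most max_loaded voices in memory, evicting the least
    recently used voice when the limit is reached."""

    def __init__(self, max_loaded: int, load_voice):
        self.max_loaded = max_loaded
        self.load_voice = load_voice  # expensive: reads the model from disk
        self._cache = OrderedDict()   # insertion order tracks recency

    def get(self, name: str):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
        else:
            if len(self._cache) >= self.max_loaded:
                self._cache.popitem(last=False)  # evict least recently used
            self._cache[name] = self.load_voice(name)
        return self._cache[name]

# Usage: with max_loaded=2, requesting a third voice evicts the
# least recently used of the first two.
cache = VoiceCache(2, load_voice=lambda name: f"model:{name}")
cache.get("en_US-ryan")
cache.get("en_GB-alan")
cache.get("en_US-ryan")   # ryan is now most recently used
cache.get("en_US-amy")    # evicts en_GB-alan
```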