rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License
5.7k stars 406 forks

English language issues in spoken output. #363

Open haydonryan opened 7 months ago

haydonryan commented 7 months ago

Thank you for this software - it's great and I use it a ton.

For the English language, there seem to be some changes needed to better support natural speech.

E.g.: World War II reads as "World War roman 2". $500 reads as "Dollar five hundred".

Jr. reads as "Jay arr" instead of "junior". Mr reads as "m r", not "Mister". Mrs reads as "mrs", not "missus".

Some numbers have special context indicating that they are not pronounced like a regular number. 90210 (a US zip code) reads as "ninety thousand two hundred and ten" instead of "9 oh 2 1 oh". 1920 (a date) reads as "one thousand nine hundred and twenty", not "19 20". 1908 (a date) should read as "19 oh 8".
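To make the intended readings concrete, here is a minimal Python sketch of how a year or a zip code could be spelled out before the text reaches Piper. The function names `speak_year` and `speak_digits` are hypothetical, not part of Piper or espeak-ng, and the sketch assumes the caller already knows which numbers are years or zip codes; detecting that from context is the hard part and is not attempted here.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def speak_year(year: str) -> str:
    """Read a four-digit year as two pairs, e.g. 1908 as 'nineteen oh eight'.

    Assumption: years such as 2000-2009, which are usually read as
    'two thousand ...', are not special-cased in this sketch.
    """
    high, low = int(year[:2]), int(year[2:])
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def speak_digits(number: str) -> str:
    """Read a digit string one digit at a time, with 0 spoken as 'oh'."""
    return " ".join("oh" if d == "0" else ONES[int(d)] for d in number)


print(speak_year("1920"))     # nineteen twenty
print(speak_year("1908"))     # nineteen oh eight
print(speak_digits("90210"))  # nine oh two one oh
```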

Some words are spelled the same but pronounced differently depending on sentence context. E.g. "live": in "the 1920s were a great time to live" it is pronounced closer to "liv", whereas in "the website went live last week" it is pronounced more like "laive".
(Speechify/OpenAI handles these use cases correctly: "1920s were a great time to live. We went live with the website last week." Tested here: https://speechify.com/text-to-speech-online/)

At the moment I'm working around some of these (the ones that don't require understanding the sentence structure) by doing sed replacements based on regular expressions. I'm happy to provide more instances of misspoken text as I find them.
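For anyone wanting a similar workaround, here is a rough sketch of the same idea in Python, using `re.sub` in place of sed. The replacement table and the function name `normalize` are illustrative assumptions, not the actual rules used above and not part of Piper; it only covers the context-free cases listed in this issue.

```python
import re

# Hypothetical replacement table; extend it as more misreadings turn up.
ABBREVIATIONS = {
    "Mrs": "Missus",
    "Mr": "Mister",
    "Jr": "Junior",
}

# Only the two World Wars are handled in this sketch.
ROMAN = {"I": "One", "II": "Two"}


def normalize(text: str) -> str:
    """Rewrite text so that a TTS front end reads it more naturally."""
    # Expand abbreviations matched as whole words, dropping an optional
    # trailing period (a sentence-final "Jr." loses its period too).
    for abbr, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"\b{abbr}\b\.?", spoken, text)

    # "$500" -> "500 dollars"; thousands separators are removed.
    text = re.sub(
        r"\$(\d+(?:,\d{3})*)",
        lambda m: m.group(1).replace(",", "") + " dollars",
        text,
    )

    # "World War II" -> "World War Two".
    text = re.sub(
        r"\bWorld War (II|I)\b",
        lambda m: "World War " + ROMAN[m.group(1)],
        text,
    )
    return text


print(normalize("Mr Smith Jr. paid $500 during World War II"))
# Mister Smith Junior paid 500 dollars during World War Two
```

The normalized string can then be passed to Piper in place of the raw text.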

colbec commented 7 months ago

Good comment, I am running into the same thing. In particular for me it is the rendering of decimal fractions, which the current model reads as a sentence termination followed by a number. The model has not yet figured out that a period ends a sentence only when it is followed by a space, a newline, or EOF; otherwise it means something else.
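As a stopgap, the decimal point can be rewritten before synthesis so that it is never mistaken for a sentence end. A minimal Python sketch, assuming a period with digits on both sides is always a decimal separator (the function name `speak_decimals` is hypothetical, and dotted strings such as version numbers or IP addresses would need separate handling):

```python
import re


def speak_decimals(text: str) -> str:
    """Replace '3.14' with '3 point 1 4' so the period is not read as a stop."""
    return re.sub(
        r"(\d+)\.(\d+)",
        # Fractional digits are read one at a time, separated by spaces.
        lambda m: f"{m.group(1)} point {' '.join(m.group(2))}",
        text,
    )


print(speak_decimals("Pi is roughly 3.14."))
# Pi is roughly 3 point 1 4.
```

The sentence-final period is left alone because it is not followed by a digit.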

The issue, I think, lies in the balance of training examples, which will be a bit beyond the reach of the Piper guys. Perhaps a new machine-intelligible character for the decimal separator would also help.

Many years ago Sebastian Thrun expressed the idea that machine processing of language would be a matter of machine learning and rules; perhaps we are seeing that here in action. It does mean that someone will need to check the output of machine processing to ensure that it meets current standards. Stumbles like this interfere with communication.

vortex1024 commented 7 months ago

Piper uses espeak-ng for phonemization, and this is where the issue lies. espeak-ng has both rule-based and list-based dictionaries; try fixing it in there.