mush42 / sonata-nvda

This add-on implements a speech synthesizer driver for NVDA using neural TTS models. It supports Piper
GNU General Public License v2.0

Piper TTS voices make a "puffing" sound instead of pronouncing the letter "L" as well as some punctuation marks. #27

Closed mikebayus closed 8 months ago

mikebayus commented 8 months ago

Hi,

I did not report this earlier while using the Piper TTS Voices add-on for NVDA because I thought others might already have reported it. Since I found no other issue, I am reporting it now.

When I use the left and right arrows to scroll character by character, the voice makes a "puffing" sound rather than pronouncing the letter "l".

Try the word: "alleluia".

Some punctuation marks do this as well.

As I write this, I just had my Piper voice read the lines that contain the letter "l", and the voice handles the letter "l" correctly when reading a sentence. The problem occurs only when using the left and right arrows to scroll one letter at a time.

rmcpantoja commented 8 months ago

Hi. Unfortunately, unclear pronunciation is a defect of the VITS model. In this case, you could add somewhat shorter audio clips to the dataset and train the model a little more. This can improve results even for extremely short texts. Note that the improvement is more noticeable in medium-quality models.

mikebayus commented 8 months ago

I find pronunciation to be very clear and easily understood, and the reading styles of each of the different voices to be conversational.

My issue is the sound that the voices make when scrolling letter by letter and saying the letter "l".

Try reading this using the NVDA screen reader, and use the right and left arrows to spell the word "scrolling".

Every letter is pronounced correctly except the letter "L".

As I said, some punctuation marks make the same puffing sound instead of saying their names.

mush42 commented 8 months ago

Hi @mikebayus. Unfortunately, we can do nothing about this. This is not an issue with the add-on per se; it is an issue with the underlying model and the dataset used to train it.

mush42 commented 8 months ago

@mikebayus Most of these voices are not designed with screen reader use in mind. The add-on itself can drive any Piper-compatible model. I see no reason why our community could not come together, create a dataset for screen readers, and train a voice on it.