rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License
4.37k stars 297 forks

How to store "more" data in the model? #442

Open bzp83 opened 1 month ago

bzp83 commented 1 month ago

I'm not sure if my question makes sense... but I'll try to explain.

I trained a voice with ~50 hours of audio split into 57,500 WAV files (44,100 Hz, 16-bit mono), in both medium and high quality. I didn't notice any difference in quality between the two versions, except in which words were mispronounced; both versions had the mispronunciation issue.

Then I ran a tool over all these WAV files to convert them to 16,000 Hz and trained again in medium quality. This time, apart from the audio quality being degraded by the lower sample rate, there were no mispronounced words.

I noticed that regardless of the size of the dataset you use for training, the medium-quality ONNX is always ~60 MB and the high-quality one is always ~108 MB.

So I'm assuming that because the 16,000 Hz files are smaller than the 44,100 Hz files, the final ~60 MB medium-quality model could hold "more" of the data, and that's why it was able to figure out how to pronounce all the words correctly.

Is this assumption correct?

So what parameters should I change in the training script to make it store more data in the final model? For example, I don't mind having a ~200MB file for a "medium quality" model.
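For what it's worth, the fixed .onnx sizes suggest model capacity is set by the architecture (a VITS-style network in piper's case), not by how much or how heavy the training audio is: the exported file is essentially the weights. A rough back-of-the-envelope sketch (the 4-bytes-per-weight float32 storage is my assumption):

```python
# Rough sanity check: the exported .onnx is essentially the network weights,
# stored (assumed here) as 32-bit floats, 4 bytes each. Dataset size never
# enters this calculation; only the architecture's parameter count does.
def approx_params(file_size_mb: float, bytes_per_weight: int = 4) -> float:
    """Estimate the weight count implied by the exported model size."""
    return file_size_mb * 1024 * 1024 / bytes_per_weight

print(f"medium (~60 MB): ~{approx_params(60) / 1e6:.0f}M parameters")
print(f"high (~108 MB): ~{approx_params(108) / 1e6:.0f}M parameters")
```

If that reading is right, a bigger file would require changing architecture hyperparameters (layer widths/depths), not feeding the same architecture smaller or larger audio files.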

Thanks!

rmcpantoja commented 1 month ago


Hi @bzp83,

Mispronunciation errors are related to the transcription of each audio clip, not to the model itself. You need to review whether any audio-text pairs are incorrect or need fixing, for example when the speaker pauses and the pause isn't reflected in the text, or when a word in the text doesn't match the audio.

After doing this, the 44.1 kHz model should work better.
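Part of that review can be automated with a crude heuristic: a characters-per-second rate far outside the speaker's normal pace often means the transcript and the audio don't match. A minimal sketch, assuming an LJSpeech-style `id|text` CSV, `<id>.wav` files, and guessed 5-30 cps thresholds (tune all of these for your dataset):

```python
import wave

def seconds(path: str) -> float:
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def suspicious_pairs(csv_path: str, wav_dir: str,
                     min_cps: float = 5.0, max_cps: float = 30.0):
    """Yield (id, text, cps) rows whose speaking rate looks implausible."""
    with open(csv_path, encoding="utf-8") as f:
        for line in f:
            utt_id, _, text = line.rstrip("\n").partition("|")
            dur = seconds(f"{wav_dir}/{utt_id}.wav")
            cps = len(text) / dur if dur else float("inf")
            if not (min_cps <= cps <= max_cps):
                yield utt_id, text, round(cps, 1)
```

This won't catch every bad pair, but it narrows 57,500 files down to a short list worth listening to by hand.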

bzp83 commented 1 month ago

Hi @rmcpantoja, thanks for your answer.

The dataset I have is from a project for an interactive audiobook that teaches our language. The relevant part is that it was already stored with a transcription of each audio file; I just had to write a script to create the CSV in the right format.

The script kept all kinds of punctuation (such as commas, semicolons, periods, dashes, parentheses, quotes, etc.) as well as numbers. Some examples: They said it was "unlikely" to happen. They didn't see them... so they left. There are 4 people in the room; not many.

These are just a few examples; the dataset literally covers pretty much everything in our language.

Do you think this is a problem? I mean, will the audio have the "same" pauses for a semicolon, a comma, or words inside parentheses...

I'm happy to write a script to remove any problematic items from the dataset, but I can't find any good documentation on what a good dataset should look like.

I found posts where people say having numbers will cause issues, and others saying there should be numbers... So I don't know what to do.
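A review script along those lines could be as simple as flagging lines that contain characters a phonemizer might drop or mispronounce. A minimal sketch, again assuming an LJSpeech-style `id|text` line format; the character set below is just an illustrative guess:

```python
import re

# Digits plus bracket/quote symbols that a phonemizer might skip or read
# inconsistently; extend this set for your language as needed.
SUSPECT = re.compile(r'[0-9()\[\]{}"“”]')

def flag_suspect_rows(path: str) -> list:
    """Return (id, text) pairs worth reviewing (or normalizing) by hand."""
    flagged = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            utt_id, _, text = line.rstrip("\n").partition("|")
            if SUSPECT.search(text):
                flagged.append((utt_id, text))
    return flagged
```

Rather than deleting flagged rows, it is often better to normalize them, e.g. spell numbers out as the speaker actually reads them, so the text stays aligned with the audio.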

Any help is really appreciated!

Thanks!