rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/

Emotions / expressions? #150

Open gab-luz opened 1 year ago

gab-luz commented 1 year ago

Hi, guys.

I'm collecting speech samples in order to create a dataset to train a new pt_BR model. My question is: does piper tts support emotions / expressions in generated speech?

synesthesiam commented 1 year ago

There are two ways of doing this that I know about, but only one that I've tried.

The first way is training a "multi-speaker" model where each "speaker" is an emotion. I did this with a cool dataset provided by @thorstenMueller for Mimic 3 and I'm training a new voice like it for Piper now. The downside of this approach, of course, is that each sentence can only have one emotion.
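For illustration, here is a minimal sketch of how the emotion would then be picked at synthesis time, via the speaker id of the multi-speaker model. The `--model`, `--speaker`, and `--output_file` options exist in the piper CLI; the model path and the speaker-to-emotion mapping below are assumptions:

```python
# Sketch: selecting the "emotion" of a multi-speaker Piper voice by speaker id.
# The model path and EMOTION_TO_SPEAKER mapping are hypothetical.
import subprocess

EMOTION_TO_SPEAKER = {"neutral": 0, "happy": 1, "angry": 2}  # assumed layout

def say(text: str, emotion: str, out_path: str = "out.wav") -> None:
    """Synthesize text with one emotion by choosing the matching speaker id."""
    subprocess.run(
        [
            "piper",
            "--model", "voice.onnx",  # hypothetical multi-speaker model
            "--speaker", str(EMOTION_TO_SPEAKER[emotion]),
            "--output_file", out_path,
        ],
        input=text.encode("utf-8"),  # piper reads the text from stdin
        check=True,
    )

say("Kwaheri", "happy")
```

As noted, the whole utterance gets a single speaker id, hence a single emotion.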

The second way is to create new "phonemes" that represent the emotion. In Piper, this could be any UTF-8 codepoint that you add to the voice's phoneme_id_map. You'd need a dataset with emotion markings, and somehow translate those into phonemes (maybe a begin/end for each emotion?). I haven't tried this yet, since I don't know of any dataset that has emotions tagged in such a way.
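To make that concrete: `phoneme_id_map` lives in the voice's JSON config and maps each phoneme string to a list of ids. Below is a minimal sketch of registering emotion markers there; the file name, marker codepoints, and id assignment are assumptions, and the model would still need to be trained with these ids for them to have any effect:

```python
# Sketch: adding emotion-marker "phonemes" to a Piper voice config.
# Hypothetical file name and markers; ids are just the next unused ones.
import json

with open("voice.onnx.json", encoding="utf-8") as f:  # hypothetical path
    config = json.load(f)

id_map = config["phoneme_id_map"]
next_id = max(i for ids in id_map.values() for i in ids) + 1

for marker in ("\u2764", "\u2620"):  # any UTF-8 codepoints not already used
    if marker not in id_map:
        id_map[marker] = [next_id]
        next_id += 1

with open("voice.onnx.json", "w", encoding="utf-8") as f:
    json.dump(config, f, ensure_ascii=False, indent=2)
```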

mbonea-ewallet commented 11 months ago

But can't you use SSML for emotion, so that each word can have its own emotion, like in your OpenTTS project?

```xml
<speak>
  <voice name="glow-speak:en-us_mary_ann_angry">
    <s>
      Kwaheri
    </s>
  </voice>

  <voice name="glow-speak:en-us_mary_ann_happy">
    <s>
      Kwaheri
    </s>
  </voice>

  <voice name="glow-speak:en-us_mary_ann_happy">
    <s>
      Good bye
    </s>
  </voice>
</speak>
```

synesthesiam commented 11 months ago

I don't have a dataset where the audio from Mary Ann is split out by emotion.

Aws-killer commented 11 months ago

What I'm saying is that, with the right data, you can do it.

gab-luz commented 10 months ago

> There are two ways of doing this that I know about, but only one that I've tried.
>
> The first way is training a "multi-speaker" model where each "speaker" is an emotion. I did this with a cool dataset provided by @thorstenMueller for Mimic 3 and I'm training a new voice like it for Piper now. The downside of this approach, of course, is that each sentence can only have one emotion.
>
> The second way is to create new "phonemes" that represent the emotion. In Piper, this could be any UTF-8 codepoint that you add to the voice's phoneme_id_map. You'd need a dataset with emotion markings, and somehow translate those into phonemes (maybe a begin/end for each emotion?). I haven't tried this yet, since I don't know of any dataset that has emotions tagged in such a way.

Oh, OK. So basically I have to record it like a multi-speaker setup... But how will Piper "know" which speaker to play?

Haurrus commented 2 months ago

Hello @synesthesiam !

I have a huge dataset of 243,700 voices with emotion markers like these for 8 different emotions:

"😲"
"😠"
"😕"
"😐"
"😊"
"😒"
"😨"
"😢"

I'm trying to use emojis as markers for my dataset, but when training/converting, Piper translates the emojis into phonemes and says out loud "angry face It's locked for a reason. angry face". For example, with this input:

😠 It's locked for a reason. 😠

```
[2024-06-02 00:57:24.706] [piper] [debug] Phonemizing text: 😠 It's locked for a reason. 😠
[2024-06-02 00:57:24.710] [piper] [debug] Converting 37 phoneme(s) to ids: ˈæŋɡɹi fˈeɪs ɪts lˈɑːkt fɚɹɚ ɹˈiːzən.
[2024-06-02 00:57:24.711] [piper] [debug] Converted 37 phoneme(s) to 77 phoneme id(s): 1, 0, 120, 0, 39, 0, 44, 0, 66, 0, 88, 0, 21, 0, 3, 0, 19, 0, 120, 0, 18, 0, 74, 0, 31, 0, 3, 0, 74, 0, 32, 0, 31, 0, 3, 0, 24, 0, 120, 0, 51, 0, 122, 0, 23, 0, 32, 0, 3, 0, 19, 0, 60, 0, 88, 0, 60, 0, 3, 0, 88, 0, 120, 0, 21, 0, 122, 0, 38, 0, 59, 0, 26, 0, 10, 0, 2,
[2024-06-02 00:57:24.711] [piper] [debug] Synthesizing audio for 77 phoneme id(s)
[2024-06-02 00:57:24.876] [piper] [debug] Synthesized 2.449705215419501 second(s) of audio in 0.165027265 second(s)
[2024-06-02 00:57:24.877] [piper] [debug] Converting 12 phoneme(s) to ids: ˈæŋɡɹi fˈeɪs
[2024-06-02 00:57:24.878] [piper] [debug] Converted 12 phoneme(s) to 27 phoneme id(s): 1, 0, 120, 0, 39, 0, 44, 0, 66, 0, 88, 0, 21, 0, 3, 0, 19, 0, 120, 0, 18, 0, 74, 0, 31, 0, 2,
[2024-06-02 00:57:24.878] [piper] [debug] Synthesizing audio for 27 phoneme id(s)
```

Should I modify how the phonemization process works inside piper-phonemize itself? Do I need to make this type of UTF-8 character generate silence, and how can I achieve that? Or should it be handled in the training process by adding these phonemes to config.json? But even then, they won't be used as-is, because they will first be phonemized into phonemes like "æŋɡɹi fˈeɪs".
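
One untested direction (an assumption on my part, not a confirmed Piper mechanism): split the text on the emoji markers before phonemization, send only the plain-text chunks through espeak-ng, and splice each marker's id from `phoneme_id_map` straight into the id sequence, so the phonemizer never sees the emoji. A sketch of the splitting step, with assumed marker ids:

```python
# Untested sketch: route emotion emojis around the phonemizer so espeak-ng
# never spells them out as "angry face". The marker ids are assumed; the
# text chunks would go through the normal phonemize->ids path.
MARKER_IDS = {"\U0001F620": [256], "\U0001F60A": [257]}  # 😠, 😊 (assumed ids)

def split_on_markers(text: str):
    """Yield ("text", chunk) and ("marker", emoji) pieces in original order."""
    buf = []
    for ch in text:
        if ch in MARKER_IDS:
            if buf:
                yield "text", "".join(buf).strip()
                buf = []
            yield "marker", ch
        else:
            buf.append(ch)
    if buf:
        yield "text", "".join(buf).strip()

for kind, piece in split_on_markers("😠 It's locked for a reason. 😠"):
    print(kind, repr(piece))
# -> marker '😠' / text "It's locked for a reason." / marker '😠'
```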