mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Global Style Tokens #167

Closed twerkmeister closed 4 years ago

twerkmeister commented 5 years ago

Global Style Tokens are embeddings that capture prosodic styles across the training set. They make it possible to explicitly specify the desired prosody of a generated sequence, i.e. essentially how the sentence is spoken, e.g. with a certain emotion, whispering, etc. They should also help training, because the text of an example gives no hints about prosody; currently the TTS system has to guess the prosody or factor it into the character/phoneme embeddings.

The main papers for this line of work are

  1. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron for the reference encoder capturing the sequence prosody
  2. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis for the style tokens and matching them with the prosody embedding

To implement this in Mozilla TTS I think the following steps are necessary:

  1. implement the reference encoder as an additional layer that takes in the spectrogram representation and returns the prosody embedding
  2. implement the style embeddings and their configuration, plus the attention mechanism matching the prosody embedding and the style embeddings (steps 1 and 2 are sketched right after this list)
  3. replicate and concatenate the aggregated style embedding with the encoder outputs, then pass the result into the decoder
  4. during inference we would need some form of default style. I'm not sure how to go about this yet. The most robust option might be to ship a calmly spoken text sequence that can be used as a reference. This approach should work even if the style token embeddings and their ordering vary across training runs.
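To make steps 1 and 2 a bit more concrete, here is a rough, untested PyTorch sketch of what I have in mind. Class and argument names are mine, and I use a single attention head over the token bank for brevity rather than the multi-head attention from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceEncoder(nn.Module):
    """Encodes a reference mel spectrogram [B, T, n_mels] into a single prosody embedding [B, emb_dim]."""
    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.ModuleList([
            nn.Conv2d(channels[i], channels[i + 1], kernel_size=3, stride=2, padding=1)
            for i in range(6)
        ])
        self.bns = nn.ModuleList([nn.BatchNorm2d(c) for c in channels[1:]])
        # each stride-2 conv roughly halves the mel axis
        reduced_mels = n_mels
        for _ in range(6):
            reduced_mels = (reduced_mels + 1) // 2
        self.gru = nn.GRU(128 * reduced_mels, emb_dim, batch_first=True)

    def forward(self, mels):
        x = mels.unsqueeze(1)                              # [B, 1, T, n_mels]
        for conv, bn in zip(self.convs, self.bns):
            x = F.relu(bn(conv(x)))
        B, C, T, M = x.shape
        x = x.transpose(1, 2).reshape(B, T, C * M)         # [B, T', C * M']
        _, h = self.gru(x)                                 # h: [1, B, emb_dim]
        return torch.tanh(h.squeeze(0))                    # prosody embedding [B, emb_dim]

class StyleTokenLayer(nn.Module):
    """Attends over a bank of learned style tokens, using the prosody embedding as the query."""
    def __init__(self, num_tokens=10, token_dim=128, emb_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        self.query_proj = nn.Linear(emb_dim, token_dim)

    def forward(self, prosody_embedding):
        q = self.query_proj(prosody_embedding)             # [B, token_dim]
        keys = torch.tanh(self.tokens)                      # [num_tokens, token_dim]
        scores = torch.softmax(q @ keys.t() / keys.shape[1] ** 0.5, dim=-1)
        style_embedding = scores @ keys                     # [B, token_dim]
        return style_embedding, scores
```

The resulting style_embedding would then be replicated over time and concatenated with the encoder outputs before the decoder (step 3).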

Thoughts?

OswaldoBornemann commented 5 years ago

It seems that this one has implemented a PyTorch version of GST, but I haven't tried it yet.

twerkmeister commented 5 years ago

@tsungruihon thanks for the pointer! Will have a look at it :+1:

erogol commented 5 years ago

I think we can test the GST module independently before merging it into the whole model architecture. So my idea is to train the GST encoder on computed spectrograms and see if it learns any speech traits in its token vectors. This can be checked visually by projecting the final GST vectors with something like the t-SNE algorithm.
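Something along these lines is what I have in mind for the projection step. This is only a sketch; the .npy files and labels are hypothetical placeholders for whatever GST vectors and per-utterance tags one collects from a trained encoder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# gst_vectors: [num_utterances, gst_dim] outputs of the trained GST encoder over
# precomputed spectrograms; labels: one known trait per utterance (speaker, emotion, ...)
gst_vectors = np.load("gst_vectors.npy")
labels = np.load("labels.npy")

points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(gst_vectors)
for lab in np.unique(labels):
    mask = labels == lab
    plt.scatter(points[mask, 0], points[mask, 1], s=8, label=str(lab))
plt.legend()
plt.title("t-SNE of GST vectors")
plt.savefig("gst_tsne.png")
```

If the clusters line up with speakers or speaking styles, the token vectors are capturing something useful.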

Before going deeper, I need to read the paper once again.

twerkmeister commented 5 years ago

I am getting first results with GST and Tacotron. Gonna play around with it a little more and then post some audio.

Another thought I had was that it's kind of crazy to use the linear spectrograms for the loss in Tacotron 1. First, half of the linear spectrogram is just fairly random noise in the high frequencies. Second, linear spectrograms are a real memory drain: batch_size (32) x spec length (can easily be 500) x 1024 -> ~16 million floats just for the linear specs... I just started another experiment with taco 1, style tokens, and mel specs with a downsized taco 2 postnet. Let's see what comes out of this.
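A quick back-of-the-envelope in Python (the numbers are illustrative, not from any particular config):

```python
batch_size, max_frames = 32, 500
linear_bins, mel_bins = 1024, 80

linear_floats = batch_size * max_frames * linear_bins   # ~16.4M floats per batch
mel_floats = batch_size * max_frames * mel_bins         # ~1.3M floats per batch
print(f"linear: {linear_floats * 4 / 1e6:.0f} MB, mel: {mel_floats * 4 / 1e6:.0f} MB (float32)")
# -> linear: 66 MB, mel: 5 MB, before gradients and activations
```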

erogol commented 5 years ago

Cool, thanks for the update. It's interesting to see you run these experiments.

twerkmeister commented 5 years ago

So far I haven't gotten really robust results with GST and Common Voice: training and eval work reasonably well, but inference is very unstable. I think it's really the wrong dataset to use... I switched gears a bit and am now using GST and speaker embeddings with the entire German mailabs dataset with 5 different speakers. I just started the first training for that. I have also implemented support for multiple datasets now; dataset type, language and speaker id can be specified per dataset (rough sketch below). It would be easier to contribute these things if smaller refactorings also got pulled in :D -> #192. Then I wouldn't have to work on some old state that I have already changed on my fork.
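Roughly, the per-dataset configuration on my fork looks like the following. The field names and paths here are only illustrative, not a final schema:

```python
# each dataset entry carries its own type, language and speaker id;
# the loader concatenates them and tags every sample with its speaker_id
datasets = [
    {"type": "mailabs", "path": "/data/mailabs/de_DE/speaker_a", "language": "de", "speaker_id": 0},
    {"type": "mailabs", "path": "/data/mailabs/de_DE/speaker_b", "language": "de", "speaker_id": 1},
    # ... one entry per speaker/corpus
]
```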

erogol commented 5 years ago

I tried to train klarsson from mailabs, but so far no success. Its clips are noisy and of poor quality. Let me know if you get anything better.

erogol commented 5 years ago

Here are some of my initial results:

I trained GST Tacotron with a single dataset, and below I share spectrograms from scaling individual tokens at inference time.

Normal spec: (spectrogram image)

Token 0: changes the length of pauses between phonemes. (spectrogram image)

Token 5: changes the tone of the speech. (spectrogram image)

Other tokens also correspond to deepness, speed and some other not quite obvious traits.

twerkmeister commented 5 years ago

Good stuff! Which type of attention did you use for GST? I went back to the summation of tokens instead of multi-head attention since it seemed easier to control during inference. But things are really a bit fickle: sometimes the alignment just breaks off for certain token combinations or certain speakers. I have seen some effects similar to yours, where the tokens influence the length of pauses, or even which commas are attended to and which aren't (e.g. first or last). And once a certain style made one speaker sound like another speaker. I've tried so many things recently, I'm sure there's something I can contribute back.
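For what it's worth, this is what I mean by summation of tokens at inference time; a hedged sketch with made-up tensor names, not the exact code on my branch:

```python
import torch

num_tokens, token_dim = 10, 128
tokens = torch.randn(num_tokens, token_dim)        # stands in for the trained token bank

# skip the reference attention entirely and mix the tokens with hand-picked weights
weights = torch.zeros(num_tokens)
weights[0], weights[5] = 0.3, 0.7                   # e.g. emphasize tokens 0 and 5
style_embedding = weights @ torch.tanh(tokens)      # [token_dim], fed onwards as usual
```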

erogol commented 5 years ago

I used multi-head attention as in the paper. I think that to target speaker IDs, we can add an embedding layer to the style encoder.
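A rough sketch of what I mean, not a final design; the class and dimensions are hypothetical:

```python
import torch
import torch.nn as nn

class SpeakerConditionedStyle(nn.Module):
    """Concatenates a learned speaker embedding with the GST style embedding."""
    def __init__(self, num_speakers, speaker_dim=64):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)

    def forward(self, style_embedding, speaker_ids):
        spk = self.speaker_embedding(speaker_ids)           # [B, speaker_dim]
        return torch.cat([style_embedding, spk], dim=-1)    # [B, style_dim + speaker_dim]
```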

erogol commented 4 years ago

So it is implemented. I'm closing this; we can open a new issue if there are more experiments.

donand commented 4 years ago

> I trained GST Tacotron with a single dataset, and below I share spectrograms from scaling individual tokens at inference time. [...]

Hi @erogol, I'm trying to understand how you implemented inference with single tokens. I cannot find it in the code and I don't know how to integrate it with multi-head attention. Did you just take the token (which has size embed_dim / num_heads) and replicate it num_heads times to get up to embed_dim?

Thanks

erogol commented 4 years ago

I took the model into a notebook and ran it manually. There I rewrote the inference to use only the token I chose. Unfortunately, that code is not in the library, but it is easy to replicate.
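Roughly along the lines you describe; a hedged sketch of one way to do it, not the original notebook code, and the names are hypothetical:

```python
import torch

def single_token_style(gst_tokens, token_idx, num_heads, scale=1.0):
    """gst_tokens: [num_tokens, embed_dim // num_heads] learned token bank."""
    token = torch.tanh(gst_tokens[token_idx]) * scale    # pick one token, optionally scale its strength
    return token.repeat(num_heads).unsqueeze(0)           # [1, embed_dim], used in place of the attended style
```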

943274923 commented 4 years ago

> I took the model into a notebook and ran it manually. There I rewrote the inference to use only the token I chose. Unfortunately, that code is not in the library, but it is easy to replicate.

Can you share your notebook?

erogol commented 4 years ago

I don't keep it anymore unfortunately.