twerkmeister closed this issue 4 years ago
It seems that this one has implemented a PyTorch version of GST, but I haven't tried it yet.
@tsungruihon thanks for the pointer! Will have a look at it :+1:
I think we can test the GST module independently before merging it into the whole model architecture. So my idea is to train the GST encoder with computed spectrograms and see if it learns any speech traits in its token vectors. This can be inspected visually by projecting the final GST vectors with something like the t-SNE algorithm.
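The projection step described above could be sketched like this. This is only an illustration: `style_vectors` stands in for the real GST embeddings collected from the encoder, and the variable names are not from the repo.

```python
# Sketch: project GST style vectors to 2-D with t-SNE for visual inspection.
# `style_vectors` is a placeholder for an (N, D) array of final GST embeddings
# collected over N utterances; real vectors would come from the trained encoder.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
style_vectors = rng.normal(size=(200, 128))  # placeholder for real GST outputs

# perplexity must be smaller than the number of samples; 30 is a common default
projected = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(style_vectors)
print(projected.shape)  # (200, 2)
```

Plotting `projected` colored by speaker or utterance style would then show whether the tokens cluster by any speech trait.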
Before going deeper, I need to read the paper once again.
I am getting first results with GST and Tacotron. Gonna play around with it a little more and then post some audio.
Another thought I had was that it's kind of crazy to use the linear spectrograms for the loss in Tacotron 1. First, half the linear spectrogram is just fairly random noise in the high frequencies. Second, linear spectrograms are a real memory drain: batch_size (32) × spec length (can easily be 500) × 1024 → over 16 million floats just for the linear specs... Just started another experiment with Taco 1, style tokens, and mel specs with a downsized Taco 2 postnet. Let's see what comes out of this.
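The back-of-envelope numbers above work out as follows (mel bin count of 80 is an assumed typical value, not from the thread):

```python
# Memory cost of linear vs. mel spectrograms for one batch, using the
# numbers from the comment above (float32 = 4 bytes per value).
batch_size = 32
spec_length = 500      # frames; long clips can easily reach this
linear_bins = 1024     # linear spectrogram frequency bins
mel_bins = 80          # assumed typical mel filterbank size

linear_floats = batch_size * spec_length * linear_bins
mel_floats = batch_size * spec_length * mel_bins

print(linear_floats)              # 16_384_000 floats per batch
print(linear_floats * 4 / 2**20)  # 62.5 MiB per batch at float32
print(linear_floats // mel_floats)  # linear specs are ~12x larger than 80-bin mels
```

So switching the loss target to mel spectrograms cuts that tensor by more than an order of magnitude.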
Cool, thanks for the update. It's interesting that you're running these experiments.
So far I haven't gotten really robust results with GST and Common Voice: training and eval work reasonably well, but inference is very unstable. I think it's really the wrong dataset to use... I switched gears a bit and am now using GST and speaker embeddings with the entire German M-AILABS dataset with 5 different speakers. Just started the first training for that. I have also implemented support for multiple datasets: dataset type, language, and speaker ID can be specified per dataset. It would be easier to contribute these things if smaller refactorings also got pulled in :D -> #192. Then I wouldn't have to work on some old state that I have already changed on my fork.
I tried to train klarsson from M-AILABS, but so far no success. Its clips are noisy and of poor quality. Let me know if you get anything better.
Here are some of my initial results:
I trained GST Tacotron with a single dataset, and below I share spectrograms of scaling tokens at inference time.
Normal spec:
Token 0: length of pauses between phonemes.
Token 5: changes the tone of the speech.
Other tokens also correspond to deepness, speed, and some other not quite obvious traits.
Good stuff! Which type of attention did you use for GST? I went back to the summation of tokens instead of multi-head attention, since it seemed easier to control during inference. But things are really a bit fickle: sometimes the alignment just breaks off for certain token combinations or certain speakers. I have seen effects similar to yours, where the tokens influence the length of pauses or commas, or even which commas are attended to and which aren't (e.g. first or last). And once a certain style made one speaker sound like another speaker. I have tried so many things recently, I am sure there's something I can contribute back.
I used multi-head attention as in the paper. To target speaker IDs, I think we can add an embedding layer to the style encoder.
So it is implemented. I'll close this, and we can open a new issue if there are more experiments.
Hi @erogol, I'm trying to understand how you implemented inference with single tokens. I can't find it in the code, and I don't know how to integrate it with multi-head attention. Did you just take the token (which has size embed_dim / num_heads) and replicate it num_heads times to get to embed_dim?
Thanks
I took the model into a notebook and ran it manually. There I rewrote the inference to use only the token I chose. Unfortunately, it is not here in the library, but it is easy to replicate.
Can you share your notebook?
I don't keep it anymore unfortunately.
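The single-token inference discussed above could be sketched roughly like this. This is a guess at the approach, not the actual notebook or Mozilla TTS code: one token from the learned bank is replicated `num_heads` times (as the question suggests) and scaled to control its strength.

```python
# Sketch of single-token inference: bypass the reference-encoder attention
# and hand-pick one style token, tiling it num_heads times so it matches
# the full style-embedding width. All names and shapes are illustrative.
import torch

num_tokens, num_heads, embed_dim = 10, 4, 256
# learned token bank: each token is embed_dim // num_heads wide
style_tokens = torch.randn(num_tokens, embed_dim // num_heads)

def single_token_style(token_idx: int, scale: float = 0.3) -> torch.Tensor:
    """Build a style embedding from one token, scaled to control its strength."""
    token = torch.tanh(style_tokens[token_idx])  # GST applies tanh to tokens
    full = token.repeat(num_heads)               # tile to embed_dim
    return scale * full.unsqueeze(0)             # (1, embed_dim), batch of one

style_embedding = single_token_style(token_idx=5)
print(style_embedding.shape)  # torch.Size([1, 256])
```

The resulting `style_embedding` would then be fed into the decoder in place of the attention-weighted style vector, which is what makes per-token inference like the spectrograms above possible.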
Global Style Tokens are embeddings that capture prosodic styles across the training set. They allow the system to explicitly specify the desired prosody of a generated sequence, i.e. essentially how the sentence is spoken, e.g. with a certain emotion, whispering, etc. Additionally, they should help training, because the text of an example gives no hints about prosody; thus, the TTS system currently has to guess the prosody or factor it into the character/phoneme embeddings.
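In rough terms, the mechanism looks like the sketch below: a summary vector from a reference encoder attends over a small bank of learned style tokens, and the resulting style embedding conditions the text encoder outputs. This is a minimal single-head illustration (the paper uses multi-head attention), with made-up shapes and names, not Mozilla TTS internals.

```python
# Minimal sketch of the GST idea: a reference-encoder summary attends over a
# learned token bank; the attention-weighted style embedding is broadcast and
# concatenated onto the text encoder outputs. Shapes/names are illustrative.
import torch
import torch.nn.functional as F

batch, text_len, enc_dim = 2, 40, 256
num_tokens, token_dim = 10, 128

ref_summary = torch.randn(batch, token_dim)        # from a reference encoder
style_tokens = torch.randn(num_tokens, token_dim)  # learned style token bank

# single-head scaled dot-product attention over the token bank
scores = ref_summary @ torch.tanh(style_tokens).T  # (batch, num_tokens)
weights = F.softmax(scores / token_dim ** 0.5, dim=-1)
style_embedding = weights @ torch.tanh(style_tokens)  # (batch, token_dim)

# broadcast the style embedding along the text axis and concatenate
text_encoding = torch.randn(batch, text_len, enc_dim)
conditioned = torch.cat(
    [text_encoding, style_embedding.unsqueeze(1).expand(-1, text_len, -1)],
    dim=-1,
)
print(conditioned.shape)  # torch.Size([2, 40, 384])
```

At inference time the attention weights can be set manually instead of being computed from a reference clip, which is what makes explicit style control possible.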
The main papers for this line of work are
To implement this in Mozilla TTS I think the following steps are necessary:
Thoughts?