mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

Multi Speaker Embeddings #166

Closed twerkmeister closed 4 years ago

twerkmeister commented 5 years ago

Hi @erogol, I've been a bit off the radar for the past month because of vacation and other projects, but now I am back and ready for action! I am looking into how to do multi speaker embeddings, and here's my current plan of action:

  1. Have all preprocessors output items that also carry a speaker ID to be used down the line. Formats without explicit speaker IDs, i.e. all current preprocessors, would use a uniform ID. This speaker ID must then be passed down by the dataset through the collate function and into the forward pass of the model.

  2. Add speaker embeddings to the model: an additional embedding layer with a configurable number of speakers and embedding dimensionality. The embedding vector is retrieved by speaker ID, then replicated and concatenated to each encoder output (see the sketch at the end of this comment). The result is passed to the decoder as before. Here we could also easily ignore speaker embeddings if we only deal with a single speaker.

  3. It might make sense to let speaker embeddings put some constraints on the train/dev/test split, i.e. every speaker in the dev/test set should have at least some examples in the train set, otherwise their embeddings are never learned. I could implement a check for that and issue a warning if this isn't the case.

Any thoughts or additional hints on this?
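A minimal sketch of the lookup-replicate-concatenate step from item 2, assuming PyTorch and [batch, time, channels] encoder outputs; the module and argument names are illustrative, not the repo's actual API:

import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    # Look up a speaker embedding and concatenate it to every encoder timestep.
    def __init__(self, num_speakers: int, embedding_dim: int = 64):
        super().__init__()
        self.speaker_embedding = nn.Embedding(num_speakers, embedding_dim)

    def forward(self, encoder_outputs, speaker_ids):
        # encoder_outputs: [batch, time, channels], speaker_ids: [batch]
        emb = self.speaker_embedding(speaker_ids)                        # [batch, dim]
        emb = emb.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)   # replicate over time
        return torch.cat([encoder_outputs, emb], dim=-1)                 # [batch, time, channels + dim]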

erogol commented 5 years ago

@twerkmeister welcome back then!

  1. It's a good idea. I'd suggest being able to merge multiple preprocessors and fetch unique speaker IDs per preprocessor. So in config.json we could type dataset: ["datasetA", "datasetB"] and then easily merge two different datasets.

  2. Architecturally, I need to re-read the papers to see how successful implementations solve this problem, but it looks reasonable to me.

  3. Yes, in the case of multi-speaker embeddings it is important to cover all speaker IDs if validation is enabled. Otherwise, with eval: false in config.json, there is nothing to worry about.

I am also inclined to implement https://arxiv.org/abs/1803.09017 since it also enables one-shot learning for new speakers and creating new variants, in case it also interests you.

twerkmeister commented 5 years ago
  1. Ah yes, then we need to be able to also specify multiple data roots, meta files and so forth. And what you are getting at is that each dataset should have its own unique speaker ID(s), right?

  2. The speaker embedding described in https://arxiv.org/pdf/1803.09047.pdf is pretty straightforward; that's the way I would go for now, but I'm happy to consider alternatives, too!

  3. good point!

Yeah that is the paper I also referenced in the other issue on global style tokens :+1:

twerkmeister commented 5 years ago

Re 1. I am tempted to push the multi-dataset support into a separate issue. Thinking it through, it would probably need a lot of adjustment of the configuration options. With multiple datasets we'd have to deal with different dataset formats, languages, audio parameters, etc.

For multiple datasets I would suggest defining an array of dataset objects in the config, to keep everything related to a single dataset close together:

{
    "...",
    "datasets": [
        {
            "data_path": "/tmp/LJ",
            "meta_file_train": "metadata_train.csv",
            "meta_file_val": "metadata_val.csv",
            "dataset_type": "ljspeech",
            "language": "en-us",
            "text_cleaner": "phoneme_cleaners",
            "samplerate": 22050
        },
        { ... }
    ],
    ...
}

We could also make the sample rate and text cleaners uniform across datasets.
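To make the per-dataset speaker handling concrete, here is a rough sketch of how several dataset entries like the one above could be merged while keeping speaker IDs unique per dataset, plus the split check from item 3 earlier in the thread; the function and field names are illustrative assumptions, not the project's actual API:

def merge_datasets(dataset_configs, load_metadata):
    # load_metadata(cfg) is assumed to yield (text, wav_path, speaker) tuples
    # for a single dataset entry as defined in config.json.
    items = []
    for cfg in dataset_configs:
        for text, wav_path, speaker in load_metadata(cfg):
            # Qualify the speaker ID with the dataset name, e.g. "ljspeech_spk1",
            # so IDs stay unique across merged datasets.
            items.append((text, wav_path, f"{cfg['dataset_type']}_{speaker}"))
    return items

def check_speaker_coverage(train_items, val_items):
    # Warn if a validation speaker never appears in the training set,
    # since its embedding would otherwise stay untrained.
    train_speakers = {speaker for _, _, speaker in train_items}
    missing = {speaker for _, _, speaker in val_items} - train_speakers
    if missing:
        print(f"Warning: speakers {sorted(missing)} have no training examples.")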

erogol commented 5 years ago

{
    "...",
    "datasets": [
        {
            "data_path": "/tmp/LJ",
            "meta_file_train": "metadata_train.csv",
            "meta_file_val": "metadata_val.csv",
            "dataset": "ljspeech",
            "language": "en-us"
        },
        { ... }
    ],
    ...
}

I'd use a template like the one above. It requires the user to define a separate preprocessor per dataset, which is reasonable and more reliable. I don't really see the need for different sample rates, since for model consistency it is better to keep the sample rate the same. Also, I guess it is architecturally not really possible to train the network with different representations like phonemes vs. graphemes.

twerkmeister commented 5 years ago

Some thoughts after the first multi-speaker embedding training on German Common Voice:

  1. Quality can vary by speaker. For some, attention doesn't work as well as for others; for some there is a lot of noise in the generated audio. The amount of training material per speaker follows a power law, i.e. few speakers with a lot of material and many with little (at least 100 sentences each so far). I haven't done any work correlating the amount of training material with quality.

  2. I am wondering if global style tokens can replace speaker embeddings completely. They were mainly meant to capture prosody, like speaking speed or emotion, but I can't see why they couldn't also capture things more related to the voice of a person, such as pitch. One difficulty in dissecting the style tokens could be that some will end up being correlated, e.g. excitement and pitch.

erogol commented 5 years ago

There are some papers from Amazon on multi-speaker experiments. These might be useful for understanding the dataset part of the problem: https://developer.amazon.com/blogs/alexa/post/7ab9665a-0536-4be2-aaad-18281ec59af8/varying-speaking-styles-with-neural-text-to-speech

We might consider weighting loss values per speaker depending on their portion of the dataset.
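One possible way to realize that weighting, as a sketch: assume the trainer can produce an unreduced per-item loss, and use inverse-frequency weights (just one plausible scheme, not something the repo implements):

import torch

def speaker_loss_weights(speaker_counts):
    # Map each speaker ID to a weight inversely proportional to its share
    # of the dataset, normalized so the average weight is 1.
    total = sum(speaker_counts.values())
    raw = {spk: total / count for spk, count in speaker_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {spk: w / mean for spk, w in raw.items()}

def weighted_loss(per_item_loss, speaker_ids, weights):
    # per_item_loss: [batch] unreduced loss, speaker_ids: batch list of speaker IDs
    w = torch.tensor([weights[s] for s in speaker_ids], device=per_item_loss.device)
    return (per_item_loss * w).mean()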

Global Style Tokens are good but harder to implement, and if there is a problem with the embedding approach it would probably be there for GST as well. However, these are just my guesses. The good thing about GST is that it enables one-shot learning as in the paper, where you provide a sample spectrogram from a new speaker and the model generates their speech.

Also, as far as I can see, the Common Voice dataset is not really reliable for TTS since its recordings are not really clean. Maybe it is better to try the M-AILABS dataset. It has a handful of German speakers with comparable portions. It is more promising to test.

Sorry, my comments are scattered :)

twerkmeister commented 5 years ago

Cool stuff! Thanks for the pointers. I also just read the GST paper in more depth and am pretty excited about it. I think there's plenty of opportunity to experiment.

mrgloom commented 5 years ago

What datasets are suitable for the multi-speaker setting, i.e. how clean and large should they be? What about https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ ?

Is it feasible to train a separate Tacotron 2 model for each speaker in the M-AILABS dataset?

erogol commented 5 years ago

Putting in only the speaker embedding does not give good results. Even the embedding vectors show no meaningful separation.

[image: pca_multi_speaker]

erogol commented 5 years ago

I can also verify that the trained model works worse for speakers with less than 15 minutes of recordings in LibriTTS; the average is 23 minutes.

m-toman commented 5 years ago

I've tried generating embeddings using https://github.com/CorentinJ/Real-Time-Voice-Cloning (but with another Tacotron implementation). It generally works well, but I'm also getting bad results for some speakers, especially weird speaking rates.

twerkmeister commented 5 years ago

Doing some single-speaker trainings these days, I can confirm that those learn attention much faster than multi-speaker models (up to 15 speakers with 20h+ each). I'm seeing decent attention in training after 15k steps, whereas for most of my multi-speaker models it only emerged after around 100k-120k steps. That being said, I am not seeing proper attention at inference with a single speaker in my current experiments after 40k steps.

m-toman commented 5 years ago

I'm mostly running LibriTTS 100+360 at the moment, so more than 1,000 speakers and around 150k sentences. I was hoping that with the encoder this would result in a nice "coverage" of the speaker embedding space. And while learning overall speaker characteristics (say, vocal tract features) seems to work very well, prosodic aspects seem to be a problem.

Using the same embedding from a single, random sentence (as described in https://arxiv.org/abs/1806.04558), I get sentences that are really slow and others that are really fast.

I wonder if, instead of using per-sentence embeddings, it might be better to use per-speaker embeddings averaged over all sentences of a speaker, to reduce the variation introduced by single sentences, especially at inference time (see the sketch at the end of this comment). That way it's more similar to your current approach of using a single speaker ID for all sentences of a speaker (but with some encoded knowledge instead of "randomly" assigned IDs).

Also, perhaps try a few speakers with lots of sentences instead of lots of speakers with few sentences. Did you test something like that?
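A minimal sketch of the per-speaker averaging idea, assuming a pretrained encoder with an embed_utterance(wav) function similar to the Real-Time-Voice-Cloning toolbox (the interface is an assumption):

import numpy as np
from collections import defaultdict

def average_speaker_embeddings(items, embed_utterance):
    # items: iterable of (wav, speaker_id) pairs;
    # embed_utterance(wav) is assumed to return a 1-D embedding.
    per_speaker = defaultdict(list)
    for wav, speaker_id in items:
        per_speaker[speaker_id].append(embed_utterance(wav))
    # One averaged conditioning vector per speaker, used for every sentence
    # of that speaker at both training and inference time.
    return {spk: np.mean(np.stack(embs), axis=0) for spk, embs in per_speaker.items()}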

erogol commented 5 years ago

@m-toman if you could plot the speaker embedding vectors with t-SNE, that would be great for comparing with the vanilla speaker embedding we have. You could easily use the TensorBoard projector.
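For reference, a minimal way to get embeddings into the TensorBoard projector (a sketch; embeddings and speaker_names are placeholders for whatever matrix and labels you have):

from torch.utils.tensorboard import SummaryWriter

# embeddings: [num_speakers, embedding_dim] tensor, speaker_names: list of str
writer = SummaryWriter("logs/speaker_embeddings")
writer.add_embedding(embeddings, metadata=speaker_names, tag="speaker_embedding")
writer.close()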

@twerkmeister we could also try enabling GST and speaker embeddings at the same time. It would be architecturally similar to the paper referenced above.

erogol commented 5 years ago

There are a couple of notes from the paper above that we can also try.

As far as I can see from the results, simple speaker embedding works better on LibriTTS, so maybe it is not worth trying their method before giving the embedding approach a second chance.

Some more ideas to discover:

m-toman commented 5 years ago

This is a UMAP projection of VCTK (I picked 20 or so random sentences per speaker), as in https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/toolbox/ui.py#L98

[image: UMAP projection of VCTK speaker embeddings]

The vertical split is male/female, as also reported in https://arxiv.org/pdf/1806.04558.pdf, Figure 3.

My main issue with this approach is really that you select your target voice from a single-sentence embedding.

erogol commented 5 years ago

@m-toman did you apply any cleaning to VCTK to remove silences and so on?

erogol commented 5 years ago

A little update: I found that connecting the speaker embedding to multiple stages of Tacotron improves the results, and that concatenation works better than summation. I've connected the speaker embedding to the encoder outputs, prenet inputs and decoder inputs. This kind of structure improves the alignments as well.
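A rough sketch of what that multi-point conditioning can look like (illustrative shapes and names, not the actual TTS code):

import torch

def condition_on_speaker(encoder_outputs, prenet_inputs, decoder_inputs, speaker_emb):
    # All three inputs are [batch, time, channels]; speaker_emb is [batch, dim]
    # and is replicated along the time axis of each target before concatenation.
    def concat(x):
        emb = speaker_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        return torch.cat([x, emb], dim=-1)
    return concat(encoder_outputs), concat(prenet_inputs), concat(decoder_inputs)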

erogol commented 5 years ago

Now I have also implemented the speaker embedding proposed in https://arxiv.org/abs/1710.10467. Here are the embedding results.

[image: speaker encoder embedding projection]

Next, I'll try to combine it with TTS and see the results. The implementation will be released under TTS soon.
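For orientation, the loss behind that paper (GE2E) can be sketched roughly as below; this is a simplified take on the softmax variant, assuming batches shaped [speakers, utterances per speaker, dim], and it is not the repo's actual speaker encoder code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GE2ELoss(nn.Module):
    # Softmax variant of the GE2E loss (arXiv:1710.10467), simplified.
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(10.0))  # learned scale, kept positive
        self.b = nn.Parameter(torch.tensor(-5.0))  # learned bias

    def forward(self, embeddings):
        n_spk, n_utt, _ = embeddings.shape
        embeddings = F.normalize(embeddings, dim=-1)
        centroids = embeddings.mean(dim=1)                                  # [n_spk, dim]
        # Centroid of each speaker excluding the current utterance, used for
        # the "own speaker" similarity as in the paper.
        excl = (embeddings.sum(dim=1, keepdim=True) - embeddings) / (n_utt - 1)
        sim = torch.einsum("sud,kd->suk", embeddings,
                           F.normalize(centroids, dim=-1))                  # cosine similarities
        own = F.cosine_similarity(embeddings, excl, dim=-1)                 # [n_spk, n_utt]
        idx = torch.arange(n_spk, device=embeddings.device)
        sim[idx, :, idx] = own                                              # replace own-speaker column
        sim = torch.clamp(self.w, min=1e-6) * sim + self.b
        target = idx.unsqueeze(1).expand(-1, n_utt).reshape(-1)
        return F.cross_entropy(sim.reshape(n_spk * n_utt, n_spk), target)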

m-toman commented 5 years ago

I only ran raw VCTK through the encoder for the figure above. I used the pretrained encoder from https://github.com/CorentinJ/Real-Time-Voice-Cloning, which is trained on LibriSpeech. And I trained Taco on LibriTTS, which should have properly trimmed silence.

Meanwhile I also tried two different models with two speakers each: LJ and a male speaker, also with roughly 13k sentences (silences trimmed with our own solution). I tried training with embeddings per sentence, and also with calculating embeddings for all sentences and then averaging them, so one embedding vector per speaker.

For both models, LJ got the usual pretty good alignment while the male speaker struggled (training speaker-dependently just on his data definitely worked better).

The advantage and disadvantage of the per-sentence approach is that the embedding you choose for synthesis also strongly transfers the prosody of the chosen sentence. And I suspect choosing a low-quality sentence causes issues as well...

erogol commented 5 years ago

Yeah, it's a good trick to compute a single embedding per speaker. I was planning to use per-sentence embeddings, but your approach is simpler.

lorinczb commented 5 years ago

We are trying to train the TacotronGST model with 10 style tokens and the speaker embedding enabled. We use 10 Romanian voices (parallel data, correctly aligned, about 45 minutes of speech from each speaker).

So far we trained the model for about 500 epochs and 150000 steps with a batch size of 16. Unfortunately we don't have better resources to run with a larger batch size.

The speaker embeddings are learned: if we synthesize with a speaker index, the output has that speaker's identity, but the speech is not intelligible.

Have you tried training the GST model with speaker embeddings? Does it learn both the speaker identity and the style tokens? Can you also let us know, in your experience, how much data and how many epochs are needed for training?

We have previously trained the GST model without speaker embeddings and that works well, but we were wondering if you have tested with speaker embeddings and obtained good results.

erogol commented 5 years ago

@lorinczb so far the speaker embedding is quite experimental here. To make it work with just the embedding layer, you need a large, balanced, good-quality dataset. If that is not the case, you need more model changes, which I am currently investigating. If you'd like to try, I can push the branch with the updated version of the multi-speaker model.

How does the attention module behave? Does it align for all the speakers? Could you maybe share a TensorBoard screenshot here?

lorinczb commented 5 years ago

@erogol if we synthesize sentences that are included in the training set, the speech is somewhat intelligible; a few words are understandable. If we synthesize any unknown sentence, one cannot understand a word of it; it seems like the attention doesn't behave well for text outside the training data.

The dataset is of good quality; from each speaker we have the same 500 utterances recorded. The speaker identity is learned: we can use a speaker ID and the synthesized speech will sound like the speaker with the selected ID, but the speech is not intelligible.

I have attached some TensorBoard screenshots of our training. If you could push your updated branch of the multi-speaker model, we could test it on our end as well. Thanks!

[images: Tensorboard_TacotronGST_Images, Tensorboard_TacotronGST_Scalars]

erogol commented 5 years ago

@lorinczb do you enable forward_attention?

lorinczb commented 5 years ago

@erogol we do have it enabled in the config file:

"use_forward_attn": true, // if it uses forward attention. In general, it aligns faster.

It is set to true during training; we just verified that.

erogol commented 5 years ago

Try it with False during training and just enable it at inference. In general, it learns faster this way.

lorinczb commented 5 years ago

We tried training with it set to False, but got the same result: the speech is not intelligible. We will check your new branch once it is available.

erogol commented 5 years ago

I got it merged into the dev branch. You can try it if you like.

BoneGoat commented 5 years ago

I tested TacotronGST with multiple speakers on the current dev branch, and one of the threads keeps eating RAM until the system kills it. This happens during the first epoch. I'm not seeing the same thing on master. Anyone else seeing this?

To add to the conversation, I'm training on about 950 speakers and 300h of audio. I seem to get better alignment than lorinczb.

[screenshots: 2019-09-17 at 20 41 44, 2019-09-17 at 20 41 17]

lorinczb commented 5 years ago

We have trained on much less data, about 9 hours of speech from 10 speakers. We have a single GPU (a GTX 970 with 4 GB of memory) and can run the training only with a batch_size of 16. Not sure if this is the cause of the worse alignments on our side. We haven't had the chance to test the dev branch yet, but will let you know as soon as we have.

erogol commented 5 years ago

Just keep in mind that a multi-speaker model needs to be trained for a long time. The one I trained took almost 1M steps to reach strong alignment.

BoneGoat commented 5 years ago

1M steps? I stopped after 300k because it seemed like it had stopped learning and even got worse when I listened to the test audios. I have since added about 100h of audio and restarted the experiment. But as I mentioned, I cannot use the dev branch because it eats all the system RAM.

BoneGoat commented 5 years ago

@erogol Can you confirm or deny the memory leak (RAM) in the dev branch? I have tried to compare master against dev but cannot find any changes that would produce a leak.

[screenshot: 2019-09-22 at 22 46 38]

15min later...

[screenshot: 2019-09-22 at 23 01 40]

Memory usage has gone from 29.7 to 31.2 in 15 minutes on one thread.

8 hours later...

[screenshot: 2019-09-23 at 07 20 04]

This does not happen on master.

The new experiment with about 950 speakers and 400h of audio (on the master branch) hit an explosion in loss at about 300k steps, but it's falling now, so I'm keeping it running.

@lorinczb 4 GB of GPU memory seems very low. I still have problems fitting batch sizes into 11 GB, and I'm dreaming of a Titan RTX.

erogol commented 5 years ago

I don't see a memory leak on my machine.

lorinczb commented 5 years ago

We managed to train the model with speaker embeddings using the weights of a model pre-trained on a single speaker. This way the speech is intelligible, it converges after a few epochs, and it does learn the speaker identities.

The only issue we have is that our multi-speaker dataset does not contain expressive speech, and the output speech does not retain the prosody of the pre-trained model. Probably the network needs to be extended in order to learn both the prosody and the speaker identities. If you have any suggestions for this, please let us know :).

erogol commented 5 years ago

@BoneGoat how was the final performance of TacotronGST? I guess you trained on LibriTTS-360? I tried it, taking inspiration from you, but I could not manage to get it to converge.

@lorinczb good to see that fine-tuning works. Is it the network with the speaker embedding connected to the encoder outputs only?

erogol commented 5 years ago

With the dev branch Tacotron, I am able to achieve better results on LibriTTS. At least the embedding layer has a better separation between genders.

[image: speaker embedding projection on LibriTTS]

BoneGoat commented 5 years ago

I'm using this dataset in Swedish: https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-16&lang=en

I'm at 600k steps and the inference quality is much better than I expected from what I've seen on the TensorBoard. According to the test wavs on the TensorBoard, the quality started to decrease after about 250k steps.

It has learned the speaker IDs and most speech is intelligible. If I don't supply a style wav, the speech tempo is way too high, but a style wav slows it down.

erogol commented 5 years ago

@BoneGoat interesting. Thanks for the update.

erogol commented 5 years ago

Here are the first multi-speaker results with Tacotron and speaker embedding on LibriTTS 360:

https://soundcloud.com/user-565970875/sets/multi-speaker-examples-5093

The good thing is that this model has just 7M parameters, as opposed to its ~30M-parameter counterparts.

BoneGoat commented 5 years ago

Here are my results for TacotronGST + speaker embeddings on master at 600k steps:

https://soundcloud.com/user-839318192/sets/mozillatts-tacotrongst-swedish

erogol commented 5 years ago

@BoneGoat sounds quite good. Interesting that the same architecture does not work on LibriTTS 360 for English.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

khu834 commented 4 years ago

@BoneGoat @erogol I'm interested to know your thoughts on the current status and future direction of multi-speaker models. Looking through the resources and sample audio above, it appears that significant research and experimentation work is still needed. What do you think is the path towards getting multi-speaker quality close to single-speaker quality? Is it more / better-quality data? Better speaker embeddings? Or additional architectures that need to be implemented / tried? I'd be happy to put time and effort into this if you think it's promising and just needs some more cycles.

lorinczb commented 4 years ago

We have tried to train the multi-speaker model, but were unable to learn prosody and speaker embeddings separately. Using only the global style token layer, the model was able to learn either prosody (if trained on an expressive voice from a single speaker) or speaker identities (if trained on data from multiple speakers). Sorry, we have not followed up, as we trained the model for quite a long time without obtaining results. We would also be interested in any updates related to the topic; on our side, unfortunately, we did not find a solution.

erogol commented 4 years ago

@lorinczb thanks for the update. I don't have the resources to work on this issue for now. I'll share news once I start again.

Edresson commented 4 years ago

@erogol Could you update me on that? I would like to contribute. I plan to explore the use of my Speech2Phone embedding model for the task. I intend to explore changes in the synthesized voices of the speakers by performing operations between embeddings.

erogol commented 4 years ago

@Edresson Sure, it would be nice to have someone working on the multi-speaker model.

What exactly would you like to know? I'd be happy to help.

We already have another speaker encoder. Maybe you can start with that.

I should say that anything with the multi-speaker setup is experimental for now, but it should all work. I trained a couple of models on LibriTTS and they performed quite well, but since then I have focused on the vocoder and single-speaker cases.

Edresson commented 4 years ago

@erogol Let me see if I understand correctly. When we start training with args.restore_path=null/None, the speaker embeddings are initialized as {'id1': 0, 'id2': 1, 'id3': 2, ...} and saved in OUT_PATH as speakers.json.

I believe you then have to manually replace that speakers.json with one containing the embeddings extracted from the speaker_encoder model.

I think it would be more practical to pass the speakers.json file in config.json as a parameter (for example, speaker_embedding_file). If someone wants to run TTS with multiple speakers, they must first run the speaker_encoder and get the speakers.json file, or pass an args.restore_path that contains that file. After loading, we can check whether the file has all the speakers present in the dataset.

I also believe that the size of the speaker embedding should be passed as a parameter to setup_model, so that supporting other speaker encoders does not require code changes, just a different speakers.json (the user must generate the file in the appropriate format).

Sorry if I got it wrong, I quickly looked at the code.
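A sketch of what such a loader and coverage check might look like, under the assumptions above (the speaker_embedding_file key and file layout are illustrative, not existing options):

import json

def load_speaker_mapping(config, dataset_speakers):
    path = config.get("speaker_embedding_file")
    if path is None:
        # Fall back to plain integer IDs learned from scratch, as is done now.
        return {spk: idx for idx, spk in enumerate(sorted(dataset_speakers))}
    with open(path) as f:
        speakers = json.load(f)  # e.g. {"id1": [0.1, ...], "id2": [...], ...}
    missing = set(dataset_speakers) - set(speakers)
    if missing:
        raise ValueError(f"speakers.json is missing embeddings for: {sorted(missing)}")
    return speakers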