mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Train a better Speaker Encoder #512

Closed erogol closed 3 years ago

erogol commented 4 years ago

Our current speaker encoder is trained with only the LibriTTS (100, 360) datasets. However, we can improve its performance using other available datasets (VoxCeleb, LibriTTS-500, Common Voice etc.). It would also increase the performance of our multi-speaker model and make it easier to adapt to new voices.

I can't really work on this alone due to the recent changes and the amount of work needed, so I need a hand here to work on it together.

So I can list the TODO as follows; feel free to contribute to any part of it or suggest changes.

mueller91 commented 4 years ago

Hi Erogol, I'm up for it.

My code is not based on the TTS repo, but I'll try to integrate it and submit a PR in the upcoming days.

erogol commented 4 years ago

@mueller91 that is great. I can train the model if you can send me a script which does all the processing. I am not sure if there is enough space to allocate all the datasets, but I can try. BTW I also don't have an SSD to fit the whole dataset.

I think the latency is normal, since each batch loads a lot of data, and I don't think computing specs on the fly is the cause of the problem.

One option is to keep a number of batches in memory and sample from them to fill half of the next batch, loading the rest from disk. I think that would reduce the requirements quite a lot. Does it make sense?

mueller91 commented 4 years ago

Hi @erogol , okay, I'll integrate my solution in mozilla_TTS and report back.

To minimize the data loaded from disk, your suggestion makes sense; but for all utterances we'd reuse, we'd have identical pairs in the GE2E loss matrix as in the batch before. Not sure if that's desirable ...

I was thinking of re-using a given batch two or three times and just selecting a new random 1.6 s section of the MFCC for each utterance. What do you think?

erogol commented 4 years ago

To minimize the data loaded from disk, your suggestion makes sense; but for all utterances we'd reuse, we'd have identical pairs in the GE2E loss matrix as in the batch before. Not sure if that's desirable ...

Let's assume our batch size is B and we keep N batches in memory, replacing the oldest batch with the newest one. If we sample B/2 instances from the in-memory samples and the rest from the disk instances, it is very likely that every batch differs from the others. That matters more than having the same pairs appear in a batch a couple of times, since the average gradient would still be different. What do you think?
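For concreteness, here is a minimal sketch of that caching scheme (all names are hypothetical, not the actual TTS loader API): keep the last N batches in memory, fill half of each new batch from the cache, and load the rest from disk.

```python
import random
from collections import deque

class CachedBatchSampler:
    """Keep the last few batches in memory and build each new batch from half
    cached samples and half freshly loaded ones (hypothetical sketch)."""

    def __init__(self, load_sample_fn, batch_size=64, cache_size=10):
        self.load_sample = load_sample_fn      # slow path: read one sample from disk
        self.batch_size = batch_size
        self.cache = deque(maxlen=cache_size)  # oldest batch is dropped automatically

    def next_batch(self, all_sample_ids):
        half = self.batch_size // 2
        cached = [s for batch in self.cache for s in batch]
        # Re-use cached samples for half of the batch (cheap), load the rest from disk.
        reused = random.sample(cached, half) if len(cached) >= half else []
        fresh_ids = random.sample(all_sample_ids, self.batch_size - len(reused))
        batch = reused + [self.load_sample(i) for i in fresh_ids]
        self.cache.append(batch)
        return batch
```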

I was thinking of re-using a given batch two or three times and just selecting a new random 1.6 s section of the MFCC for each utterance. What do you think?

This also sounds like a good idea. Maybe we can combine these two ideas.

We can also add random noise to each speaker's samples in the batch, so even if we reuse the same speaker from the cache, the model sees a slightly different version of the speaker's voice.
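A rough sketch of those two augmentations for cached samples (a fresh random ~1.6 s crop plus light additive noise), assuming features are stored as [frames, n_mels] arrays; the frame count and noise level are illustrative values:

```python
import numpy as np

def augment_cached_sample(feats, seq_len=160, noise_std=0.01):
    """Take a fresh random window from a cached feature matrix and add a little
    noise, so a re-used sample never looks exactly the same twice (sketch)."""
    if feats.shape[0] > seq_len:                      # new random crop on each reuse
        start = np.random.randint(0, feats.shape[0] - seq_len)
        feats = feats[start:start + seq_len]
    return feats + np.random.normal(0.0, noise_std, size=feats.shape)
```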

mueller91 commented 4 years ago

Sounds good, I'll implement those three ideas.

Also, I've just added the LibriTTS, VoxCeleb1+2 and CommonVoice datasets: this yields 25k speakers (without skipping those with <10 utterances).

 > DataLoader initialization
 | > Number of instances : 2072676
 | > Sequence length: 25600
 | > Num speakers: 25514

Finally: is there a reason do_trim_silence is set to false by default? My intuition is that removing silence gives the SE more 'information' to work with.

erogol commented 4 years ago

I just assumed that the datasets are preprocessed, and I like to keep a bit of silence so the model is robust against it. But it might be set differently for different cases.
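For reference, trimming is typically just a librosa call in the loader; the threshold below is an illustrative value, not the project's default:

```python
import librosa

def load_trimmed(path, sample_rate=16000, top_db=45):
    """Load a wav and strip leading/trailing silence (what do_trim_silence turns on)."""
    wav, _ = librosa.load(path, sr=sample_rate)
    trimmed, _ = librosa.effects.trim(wav, top_db=top_db)
    return trimmed
```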

mueller91 commented 4 years ago

I implemented the three improvements we discussed above. The I/O overhead is reduced from about 3-5 seconds per batch to about 0-1 seconds; I can now train about 2000 steps in about 80 minutes (this is with 768 hidden neurons and 64 speakers per batch, as in the original paper).

Attached are the first TensorBoard plots. You can find the source in my fork. I'll keep training this for a bit and then publish the model + config. Let me know if you are interested in a different set of hyperparameters.

erogol commented 4 years ago

It is great!! Looks like the loss is smoothly going down.

How many samples do you have in total for this model? Have you done any particular changes to the model?

I was planning to remove the last ReLU layer that, in my opinion, skews the output distribution. Also with all these datasets, we could train a larger model.

You could also use AngleProtoLoss, with which @Edresson reported better results.

Are you planning to share the model and the code at the end? If you are, then I can work more on the universal vocoder, and @Edresson is working on the multi-speaker TTS model. After we merge all these, we would have the best model possible.

Edresson commented 4 years ago

@mueller91 this is very good, congratulations :).

As @erogol commented, I got better results with Angular Prototypical (AngleProtoLoss) in my training. I recommend you try it :). The paper In defense of metric learning for speaker recognition also shows the superiority of Angular Prototypical.
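For context, here is a compact sketch of Angular Prototypical from that paper (a learnable scale and bias applied to the cosine-similarity matrix between each speaker's query utterance and the centroid of its remaining utterances); it is an illustration, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngleProtoSketch(nn.Module):
    """Angular Prototypical loss (sketch): cosine similarity between each speaker's
    query embedding and every speaker centroid, scaled by learnable w and b."""

    def __init__(self, init_w=10.0, init_b=-5.0):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(init_w))
        self.b = nn.Parameter(torch.tensor(init_b))

    def forward(self, embeddings):
        # embeddings: [n_speakers, n_utter_per_speaker, emb_dim]
        query = embeddings[:, 0]               # one utterance per speaker as the query
        proto = embeddings[:, 1:].mean(dim=1)  # centroid of the remaining utterances
        cos = F.cosine_similarity(query.unsqueeze(1), proto.unsqueeze(0), dim=2)
        logits = torch.clamp(self.w, min=1e-6) * cos + self.b
        # Each query should be closest to its own speaker's centroid.
        labels = torch.arange(embeddings.size(0), device=embeddings.device)
        return F.cross_entropy(logits, labels)
```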

mueller91 commented 4 years ago

With the datasets mentioned above, I have 25.5k speakers and 2M utterances. I'm familiar with Angular Prototypical and have enabled it in the current model. Also, I enabled trimming silences (since a lot of the datasets are not preprocessed) and am using LSTMWithProjection, which has a linear output, not a ReLU - I agree that ReLU skews the output. Maybe sigmoid would also be appropriate...
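To illustrate the output-layer point, here is a simplified stand-in for the encoder (not the repo's actual LSTMWithProjection stack): the embedding comes out of a plain linear projection followed by L2 normalization, with no ReLU.

```python
import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    """Simplified LSTM speaker encoder with a linear (not ReLU) projection output."""

    def __init__(self, n_mels=80, lstm_dim=768, proj_dim=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, lstm_dim, num_layers=num_layers, batch_first=True)
        self.proj = nn.Linear(lstm_dim, proj_dim)  # linear output, no activation

    def forward(self, mels):
        # mels: [batch, frames, n_mels]
        out, _ = self.lstm(mels)
        emb = self.proj(out[:, -1])                  # last frame's hidden state
        return emb / emb.norm(dim=1, keepdim=True)   # unit-length embeddings
```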

You can see my config here.

I've submitted a PR, and will be happy to share the model once trained.

Edresson commented 4 years ago

@mueller91 Could you train the model with the audio settings from this config here?

This would allow us to use the model to calculate a loss in the TTS model and generate speakers with voices closer to the originals :)

mueller91 commented 4 years ago

@Edresson Are you planning on using the speaker encoder to create an additional similarity loss term for the multi-speaker Tacotron? I tried that for a while with my own implementation and it didn't improve anything, but my speaker encoder was bad back then. Google, in their original multi-speaker TTS paper, say they don't do that either, but there is another paper where the authors say it helped, so who knows. I'll give it a try with your parameters.
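For reference, the extra similarity term being discussed usually looks roughly like the sketch below (hypothetical helper names; it assumes a differentiable speaker encoder that accepts mel frames and is not the repo's code):

```python
import torch.nn.functional as F

def speaker_consistency_loss(speaker_encoder, generated_mel, reference_embedding):
    """Cosine-distance penalty between the embedding of the synthesized mel and the
    target speaker embedding (sketch of the extra loss term)."""
    for p in speaker_encoder.parameters():   # keep the encoder's weights frozen...
        p.requires_grad_(False)
    gen_embedding = speaker_encoder(generated_mel)   # ...but let gradients reach the mel
    return 1.0 - F.cosine_similarity(gen_embedding, reference_embedding, dim=1).mean()

# Illustrative use inside a Tacotron training step:
# loss = decoder_loss + postnet_loss + alpha * speaker_consistency_loss(se, mel_out, ref_emb)
```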

Most of the datasets are 16 kHz, so upsampling to 22050 Hz may slow the data loader down; I'll have to see how it turns out. Upsampling should not affect the MFCCs in a negative way, right?

erogol commented 4 years ago

I am not sure, but the sampling rate of the speaker encoder should not make an important difference. In the end, the TTS model would learn what it needs from the embedding regardless of the encoder's rate. But maybe I am wrong.

Edresson commented 4 years ago

@Edresson Are you planning on using the speaker encoder to create an additional similarity loss term for the multi-speaker Tacotron? I tried that for a while with my own implementation and it didn't improve anything, but my speaker encoder was bad back then. Google, in their original multi-speaker TTS paper, say they don't do that either, but there is another paper where the authors say it helped, so who knows. I'll give it a try with your parameters.

Most of the datasets are 16 kHz, so upsampling to 22050 Hz may slow the data loader down; I'll have to see how it turns out. Upsampling should not affect the MFCCs in a negative way, right?

Yes, exactly. I've tried this, and the results improve even when using a bad speaker encoder. Training with a better speaker encoder should improve them even more, especially for speakers not seen during training.

Resampling is really slow.

@erogol In some tests that I did, when I fed a 16 kHz audio upsampled to 22 kHz into a speaker encoder trained at 22 kHz, the performance dropped a lot. However, I didn't try it without the upsampling.

@mueller91 @erogol Do you think it is feasible and makes sense to train with audio at 22 kHz and 16 kHz at the same time?

mueller91 commented 3 years ago

Here is the current model: trained to 180k steps on LibriTTS Clean 100, 360 and 500, VCTK, VoxCeleb1+2 and Mozilla Common Voice; a total of >25k speakers with 64 speakers per batch. The loss is at about 0.25.

[loss curve plot]

You can download the model and config at: https://drive.google.com/file/d/1C8cXVEhra5WqEFArwTj-xFIgBn1GObxX/view?usp=sharing https://drive.google.com/file/d/1q-igIrHvtqoKj6rRNljE7ChNta8hJInA/view?usp=sharing

@Edresson I can't train at 22 kHz and 16 kHz at the same time because I have access to only a single GPU, and the current model (with 768 hidden units and 64 speakers per batch) does not fit on my GPU twice. Do you think Tacotron + vocoder could work with 16 kHz?

Edresson commented 3 years ago

@mueller91 It should work, but the quality may not be as good for real applications. If it is just for data generation, I believe it is a good option. Perhaps it would be interesting to test how the speaker encoder behaves when receiving 22 kHz audio instead of 16 kHz (my test was the opposite: a speaker encoder trained at 22 kHz received a 16 kHz sample that was upsampled).

If the performance loss is not too large, we can use the trained 16 kHz speaker encoder to calculate the distance between speakers during training (the extra speaker encoder loss) for a model trained at 22 kHz :)

erogol commented 3 years ago

@mueller91 it is a great contribution. Thanks!

I see that it was still converging. I guess you need the GPU, since you stopped training.

erogol commented 3 years ago

@Edresson I still don't think we need a different sampling rate for the encoder. You can always resample the audio before computing the embedding vector.

mueller91 commented 3 years ago

@erogol I'll keep training, this was only a snapshot.

@Edresson I have not forgotten your request. However, I have only a single GPU available, and I would like to train the current model a bit more before I start with your config. Upsampling to 22 kHz introduces significant overhead during data loading; would 16 kHz and 80 mel_channels be helpful to you? This paper reports SOTA with 16 kHz.

Edresson commented 3 years ago

@Edresson I still don't think we need a different sampling rate for the encoder. You can always resample the audio before computing the embedding vector.

@erogol The idea is to use the speaker encoder to calculate a loss during TTS training. And I don't know how to resample a spectrogram, so ideally we would have a speaker encoder trained at 22 kHz.

@mueller91 can focus on the 16 kHz speaker encoder :). As I said above, there may not be a big difference in performance, and we could use it on 22 kHz audio. A while ago I trained a 22 kHz model compatible with the TTS audio configuration on LibriTTS 360 and 100 clean; this model is not as good as yours, but it works :).

erogol commented 3 years ago

@Edresson you don't need to resample the spectrogram. You resample the audio and then compute the spectrogram. Basically, use separate AudioProcessors for the speaker encoder and the rest.
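A small sketch of "resample the audio, then compute the spec" (plain librosa here rather than the repo's AudioProcessor, to keep it self-contained; parameter values are illustrative):

```python
import numpy as np
import librosa

def encoder_mel(path, tts_sr=22050, encoder_sr=16000, n_mels=80):
    """Load audio, resample it to the speaker encoder's rate, and compute the
    encoder-side mel spectrogram from the resampled signal (illustrative sketch)."""
    wav, _ = librosa.load(path, sr=tts_sr)                      # TTS-side rate
    wav_enc = librosa.resample(wav, orig_sr=tts_sr, target_sr=encoder_sr)
    mel = librosa.feature.melspectrogram(y=wav_enc, sr=encoder_sr, n_mels=n_mels)
    return np.log(mel + 1e-6).T                                 # [frames, n_mels]
```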

mueller91 commented 3 years ago

I have further optimized the DataLoader, and now incur zero overhead when loading the data from disk (see LoaderTime); I train 1000 steps in about 15 minutes (around 1.25 steps per second).

| > Step:1140  Loss:0.93590  AvgLoss:1.20784  GradNorm:58.14209  StepTime:0.77  LoaderTime:0.00  AvGLoaderTime:0.01  LR:0.000100
| > Step:1160  Loss:1.16374  AvgLoss:1.20117  GradNorm:54.96458  StepTime:0.77  LoaderTime:0.00  AvGLoaderTime:0.01  LR:0.000100
| > Step:1180  Loss:0.92916  AvgLoss:1.19200  GradNorm:52.99776  StepTime:0.78  LoaderTime:0.01  AvGLoaderTime:0.01  LR:0.000100

@Edresson I have started training the 80-mel, 16 kHz speaker encoder; I'll keep you updated. Is the speaker-encoder-based similarity loss already implemented?

Edresson commented 3 years ago

@mueller91 Yes, on one of my branches. We intend to merge it into TTS in the future :).

Are you training with this audio config here, except for the sample rate, correct?

For the sample rate, @erogol had the idea of using interpolation, as discussed in issue #520; we can try this :).

mueller91 commented 3 years ago

@Edresson yes, I used your audio config, except for the sampling rate and do_trim_silence, which I set to true.

Edit: I noticed that changing the win_length from 400 to 1024 results in fewer frames given 1.6 s of audio. Do you think it makes sense to increase the length of the audio to maybe 2 or 3 s during training? As far as I remember, the original paper reported improvements for longer audio files (up to 5 s) during inference.
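As a quick sanity check on frame counts (standard centered-STFT arithmetic with illustrative parameters; the frame count is driven mainly by hop_length and the crop length):

```python
def n_frames(duration_s, sample_rate=16000, hop_length=256):
    """Approximate frame count of a centered STFT: one frame per hop, plus one."""
    return 1 + int(duration_s * sample_rate) // hop_length

print(n_frames(1.6))  # 101 frames at 16 kHz with hop 256
print(n_frames(3.0))  # 188 frames
```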

Edresson commented 3 years ago

@mueller91 I believe that increasing the number of frames can be good. The model only learns to separate the speakers in the embedding space, and if you have few frames, the probability that they contain a lot of noise or silence is high. Depending on the dataset, there may be a few seconds of silence at the beginning and end of each selection (VCTK is like that), and depending on the amount, this can be a problem. Recently I saw some models that even used silence to separate the speakers, and this is not a desired feature; the model must learn to ignore silence.

erogol commented 3 years ago

@mueller91 for this run I'd prefer to keep the parameters as they are and let it train; then we can use it as a baseline for anything we'd like to train.

Edresson commented 3 years ago

@mueller91 for this run I'd prefer to keep the parameters as they are and let it train; then we can use it as a baseline for anything we'd like to train.

This looks good. After that we can do fine tuning :).

mueller91 commented 3 years ago

Here is the current model + config: 80 mels, 16 kHz, trained for 320k steps (training progress plot; note that the plot's x axis is truncated to [120k, 320k]). Not quite sure what happened at 230k.

https://drive.google.com/file/d/1N-5uwhE87y1QWQMiNFOrc7WRwG9rHKC6/view?usp=sharing

Edresson commented 3 years ago

@mueller91 Very good, thank you :)

erogol commented 3 years ago

It is interesting that it is still converging. How long has it been training so far?

mueller91 commented 3 years ago

@Edresson You're welcome. Let me know how the training of your model goes; I'm especially interested in whether the additional speaker encoder loss really improves output similarity.

@erogol I've reached 320k steps after about 100 h of training. Considering that in their original paper Google trained their speaker encoder for 50M steps, I'm not surprised to see my model still converging at this point in time.

erogol commented 3 years ago

Can you also share the UMAP output? I guess it is plotted on TensorBoard.

Edresson commented 3 years ago

I created this notebook here to check the feasibility of using interpolation to decrease the spectrogram sample rate (as done in #520). In the notebook we can see that using interpolation is better than passing a 22 kHz spectrogram directly to the speaker encoder. However, with interpolation there is a loss of model performance (as expected): the distance between samples from the same speaker increases, and samples from different speakers move closer to each other. The difference is small; the distance increased from 0.5334044 to 0.57418275 for samples from the same speaker, and decreased from 1.3233587 to 1.2269596 for samples from different speakers.

Still, I think it is viable to use the speaker encoder with interpolation for training the multispeaker model. What do you think, @erogol @mueller91?
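A sketch of the kind of time-axis interpolation being tested, under the assumption that the mismatched spectrogram is resampled to the frame rate the encoder was trained on (the actual notebook may differ in the details):

```python
import torch
import torch.nn.functional as F

def interpolate_mel(mel, source_sr=22050, target_sr=16000):
    """Stretch/squeeze a mel spectrogram along the time axis so its frame rate matches
    the rate the encoder was trained with (sketch of the interpolation idea from #520)."""
    # mel: torch tensor [n_mels, frames] -> [1, n_mels, frames] for 1D interpolation.
    scale = target_sr / source_sr        # same hop_length assumed on both sides
    out = F.interpolate(mel.unsqueeze(0), scale_factor=scale,
                        mode="linear", align_corners=False)
    return out.squeeze(0)
```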

mueller91 commented 3 years ago

Here are the UMAP plots for 127k, 200k and 320k steps. You can see the clusters are getting tighter.

[UMAP plots at 127k, 197k and 320k steps]

However, I think the UMAP plot's usefulness is limited. Looking at the inter- and intra-similarity (pairwise cosine similarity of embeddings belonging to different / the same speakers), we see why we might like to continue training the model: the intra-similarity is still rather low (at around .71).

[inter-speaker and intra-speaker similarity plots]
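For reference, a minimal sketch of how such inter/intra speaker cosine similarities can be computed from a set of embeddings (a hypothetical helper, not the exact evaluation code behind the plots):

```python
import numpy as np

def intra_inter_similarity(embeddings, speaker_ids):
    """Mean pairwise cosine similarity within speakers (intra) and across speakers
    (inter), for an embedding matrix of shape [n_samples, emb_dim] (sketch)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = emb @ emb.T                                   # all pairwise cosine similarities
    ids = np.asarray(speaker_ids)
    same = ids[:, None] == ids[None, :]                 # True where speakers match
    off_diag = ~np.eye(len(ids), dtype=bool)            # drop self-similarities
    return cos[same & off_diag].mean(), cos[~same].mean()
```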

@erogol @Edresson What is your opinion on this? @Edresson Why not simply use 16khz for the TTS model? Other papers use 16khz and report good results.

erogol commented 3 years ago

@Edresson I think interpolation does not make too much of a difference, so we can use it for training a TTS model.

@mueller91 for writing a paper it is not a big difference, but 16 kHz does not have the same natural timbre as 20 or 22 kHz. Also, @Edresson has trained all the networks so far at 22 kHz or 20 kHz, so changing it means a ton of new training runs :)

WeberJulian commented 3 years ago

@mueller91 Those results are really interesting. Can you interpret the meaning of the first few PCA dimensions? That would be a good indicator that the model is not learning to only differentiate noise, but rather at least some useful features.

mueller91 commented 3 years ago

I haven't tried to interpret this, but feel free to take a look; both the model source and the checkpoint are available to download. I think the model should have learned more than just differentiating noise: we can see in the UMAP plot that it succeeds in grouping together recordings from the same speakers (and for a given speaker there are often recordings from different sources, and thus with different noise; see for example VoxCeleb 1+2).

nickjoodi commented 3 years ago

Hello @mueller91. Incredible work. I was trying to swap your model into @Edresson's multispeaker notebook, but I noticed that the output embedding dimension is 256, which is double the embedding dimension that @Edresson's Tacotron2 model expects. Do you have a Tacotron2 model that accepts 256-dimensional speaker embeddings?

Edresson commented 3 years ago

Hello @nickjoodi, you cannot just swap them; the model must be trained to adjust to the new embeddings. If you just swap them, the result will not be good (although the model does work with randomly generated speaker embeddings).

We have a model with 256-dimensional embeddings; I believe @erogol will release a multi-speaker model soon.

nickjoodi commented 3 years ago

Ok thanks @Edresson !

erogol commented 3 years ago

@mueller91 done training?

mueller91 commented 3 years ago

@erogol I have not continued training much further than checkpoint uploaded above. I have sent @Edresson the latest model (around 360k steps).

erogol commented 3 years ago

@erogol I have not continued training much further than checkpoint uploaded above. I have sent @Edresson the latest model (around 360k steps).

Great, thanks for training this model. I'll check its performance and share it soon.

erogol commented 3 years ago

@mueller91 how much space did the whole dataset take on your HDD? Is it possible to share it somewhere so I can continue the training?

mueller91 commented 3 years ago

@erogol It's about 500 GB. It's sitting on my personal machine; I'm not sure how much sense it makes to upload the data from there. It may be best to just download it from the original sources; it really was not that much of a hassle.

I understand that the model is not finished training (and probably never will be; Google reported 50M steps). Did you find the quality of the speaker embeddings unsatisfactory? How much further would you like to train the model?

erogol commented 3 years ago

500GB is too big. Is it compressed?

The embedding figures look better than what we had before (thanks to you), but I just want to train the model until full convergence to get the best out of it.

erogol commented 3 years ago

I've shared the model @mueller91 trained. Thanks again.

erogol commented 3 years ago

https://github.com/mozilla/TTS/wiki/Released-Models

WeberJulian commented 3 years ago

Hi @erogol, is the new Multi-Speaker-Tacotron2 DDC you released in the wiki using this new encoder? I haven't seen a mention of the encoder used in the Colab VCTK-Tacotron2_DDC-WaveGrad.ipynb. Thanks!

erogol commented 3 years ago

Hi @erogol, is the new Multi-Speaker-Tacotron2 DDC you released in the wiki using this new encoder? I haven't seen a mention of the encoder used in the Colab VCTK-Tacotron2_DDC-WaveGrad.ipynb. Thanks!

I should also mention that. Thanks for reminding me. Yes, I use the latest encoder.