Great! Huge thanks to you @erogol, and to you as well @mueller91, for the impressive work.
Hi, does compute_embeddings.py not work with the model trained here? I tried to grab it and plug it into compute_embeddings.py, but I get:
```
RuntimeError: Error(s) in loading state_dict for SpeakerEncoder:
    Missing key(s) in state_dict: "layers.0.lstm.weight_ih_l0", "layers.0.lstm.weight_hh_l0", "layers.0.lstm.bias_ih_l0", "layers.0.lstm.bias_hh_l0", "layers.0.linear.weight", "layers.1.lstm.weight_ih_l0", "layers.1.lstm.weight_hh_l0", "layers.1.lstm.bias_ih_l0", "layers.1.lstm.bias_hh_l0", "layers.1.linear.weight", "layers.2.lstm.weight_ih_l0", "layers.2.lstm.weight_hh_l0", "layers.2.lstm.bias_ih_l0", "layers.2.lstm.bias_hh_l0", "layers.2.linear.weight".
    Unexpected key(s) in state_dict: "model", "optimizer", "step", "loss", "date".
```
It works, I just used it. Are you sure you are using the right model?
Yeah! I am using master and the models from the drive link (I tried all the models on the link), with compute_embeddings.py from a slightly older commit since it is not there now. I also tried model.py from dev, but I got the same error.
Ah, I used compute_embeddings.py from dev, which worked for me.
Which commit are you using? compute_embeddings.py is not there anymore.
Current dev: https://github.com/mozilla/TTS/blob/dev/TTS/bin/compute_embeddings.py. I guess the file was moved.
Oh, that's where it was 🤦🏻♂️ thanks mate. Very strange, it is still not working even though I pulled the latest dev. It crashes at the model.load_state_dict line, and the only thing I changed was mapping the storage to CPU because I am trying to load it on my laptop.
I just added strict=False and it seems to be doing the trick. Weird. Thanks a lot for trying to help. 🤗
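In case it helps anyone later: the `Unexpected key(s): "model", "optimizer", "step", "loss", "date"` part of the error suggests the drive file is a full training checkpoint, so loading the nested "model" dict should also work without strict=False. A minimal sketch, assuming hypothetical constructor arguments (take the real values and import path from the config and branch you are using):

```python
import torch
from TTS.speaker_encoder.model import SpeakerEncoder  # import path may differ between branches

# Hyperparameters below are assumptions; read them from the config.json next to the checkpoint.
model = SpeakerEncoder(input_dim=40, proj_dim=256, lstm_dim=768, num_lstm_layers=3)

# map_location lets the GPU-trained checkpoint load on a CPU-only laptop.
checkpoint = torch.load("best_model.pth.tar", map_location=torch.device("cpu"))

# The checkpoint stores the weights under "model", next to "optimizer", "step", "loss" and "date".
model.load_state_dict(checkpoint["model"])
model.eval()
```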
Also @erogol, could you please tell us which of the checkpoints on the Google Drive you used to train the multispeaker model?
Is it the last one you added (320k), the one with the most steps (330k), or the best_model, which is a month older than the 320k one? Thanks.
I guess it is 320K. @Edresson has computed the embeddings.
@WeberJulian The best_model was trained to ~370k steps. So I would assume it should be better?
@SanjaESC Yeah, it's probably better, but I'm fine-tuning the VCTK multispeaker model in my language, so I need the exact checkpoint used to compute the embeddings, even if they are worse, or else my model won't work properly (I think).
Shouldn't a better speaker-encoder compute more accurate embeddings for your dataset and thus result in a more robust model?
I don't know, since the embeddings don't mean the same thing anymore. I don't have enough speakers in my dataset to make the model learn (slightly?) different embeddings, so I need a model that already knows how to interpret the embeddings. At least that's my intuition. But if you think the newer checkpoint might work better, I may try it after this training ends. Thanks for the advice.
Hi,
Thanks for the great effort! I'm experimenting with the various multi-speaker TTS recipes shared in this project. Has anyone tried training a Tacotron model with LibriSpeech/LibriTTS data, or any other large-scale US English dataset? I'm able to get decent results with the VCTK-based Tacotron model, but it's limited to UK English and the speaker variety is not sufficient for my application. I'm aware that we can create random speaker embeddings, or even random style tokens if it's a GST-based model, but I still think that when Tacotron sees only a limited number of speakers, as in VCTK, everything you can generate is limited to that speaker set in terms of speaker variation. If a larger-scale Tacotron model hasn't been done, I might be able to put some effort into it and share a pre-trained model if it goes well. Any thoughts?
Hi, I think the model based on VCTK is the latest and greatest on this repo but it shouldn't be too hard to fine-tune on a larger dataset.
The VCTK Tacotron model is based on the UK English phoneme set. I don't know exactly what espeak does when you switch dialects, but I'm guessing the phoneme sets will be different, so training from scratch would be inevitable. Otherwise, the Tacotron output will be based on UK English espeak pronunciations, which may not be as accurate as using US English, say if you are using LibriTTS for Tacotron training.
I think en-uk has more phonemes in common with en-us than it has different ones. I just tried transfer learning from this model to French and it works reasonably well, so you shouldn't have any trouble with your use case. Try the faster option first, and if it doesn't suit you, you can always take the longer path.
Yes, that makes sense. My naive guess is that it will perform better than using characters as input, but maybe a bit worse than the 'correct' phoneme set. The definition of a 'correct' phoneme set is also a bit fuzzy; it all depends on how well it represents the pronunciations of the speakers in your training database, which may contain accented speech etc. that you might be unaware of.
Hi, a couple of questions, especially for @mueller91. I am trying to recreate the experiment with the same config, the same datasets and a handful of private speakers (not more than 120, so definitely not a lot). However, I am having issues getting training started. It seems to freeze 15 minutes in: the RAM usage starts going up slowly (CPU allocation looks healthy), then fills up and the entire thing freezes. I have tried with both 4 and 8 workers and it did not work. My machine has 8 vCPUs, 32 GB RAM and a V100.
Thanks! And thanks for the model. 😀😅
Are you using my code, where part of the samples is kept in memory to reduce I/O? If yes, then it sounds like you're using up all your RAM to cache the audio files. Have you tried decreasing storage_size in the config?
I am using the dev branch, so I guess it has this, yes; I also tried your fork and got the same problem. I tried decreasing storage_size to 15, but it didn't really do anything, and if it is lower, the I/O increases a lot. How much RAM is needed to cache all the wavs without problems?
Try decreasing it to zero and see if the RAM problem persists. If you set storage_size to 15, it keeps 15 * num_loaders * batch_size utterances in memory (I think), which is quite a lot.
And yeah, the I/O really is a problem; you really need SSDs for it.
Actually, the SSD is not the problem, because I never run on an HDD, so the problems I am getting all happen on an SSD. How much RAM did you use? Setting storage_size to 1 (0 is not accepted) works, but then the loss jumps to 0.60, even though I use the same training sets as you. Did you only use the caching because you have an HDD?
I had to use caching because I use an HDD. It is expected that the loss is larger for a smaller storage size, since we reuse fewer of the training examples: every sample reused has been seen before, thus has already been trained on, thus gives a lower loss. I have 120 GB of RAM.
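For anyone sizing their machine, here is a rough back-of-envelope estimate of the cache footprint; all the numbers below are hypothetical placeholders (check num_loaders, batch size, clip length and sample rate in your own config, and note that caching mel spectrograms instead of raw wavs changes the result):

```python
# Rough, hypothetical estimate of the in-memory utterance cache (raw float32 wavs assumed).
storage_size = 15       # value from the config
num_loaders = 8         # data-loader workers
batch_size = 64         # e.g. speakers_per_batch * utterances_per_speaker
seconds_per_utt = 6     # average clip length
sample_rate = 16000     # Hz
bytes_per_sample = 4    # float32

cached_utts = storage_size * num_loaders * batch_size
total_bytes = cached_utts * seconds_per_utt * sample_rate * bytes_per_sample
print(f"{cached_utts} cached utterances ≈ {total_bytes / 1024 ** 3:.1f} GiB")
# With these placeholder numbers: 7680 utterances ≈ 2.7 GiB; longer clips,
# higher sample rates or per-worker copies grow this quickly.
```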
120 GB 🤯 No wonder my small 30 GB will not work! Thanks a lot for the clarification and for confirming it affects the loss.
You can use swap space as an easy workaround.
If you create it on an SSD, it should be fast enough.
Hi,
I was wondering if anybody has tried clustering the embeddings in order to get a better understanding of what the network learns. I extracted some embeddings for my speakers and tried clustering them with HDBSCAN, but it only gives a single label (0) and then -1, which is apparently noise. This is what I have tried:
```python
import glob

import hdbscan
import numpy as np
import pandas as pd
from joblib import Memory

# Load every per-utterance embedding into one (num_utterances, embedding_dim) frame.
embeddings = list()
for file in glob.glob("embeddings_o/*/*.npy"):
    embeddings.append(np.load(file))
dataframe = pd.DataFrame.from_records(np.vstack(embeddings))

clusterer = hdbscan.HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True,
                            gen_min_span_tree=False, leaf_size=40, memory=Memory(cachedir=None),
                            metric='euclidean', min_cluster_size=15, min_samples=None, p=None).fit(dataframe)
# The second fit overwrites the first; its labels are the ones printed below.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5).fit(dataframe)
print(clusterer.labels_)
```
and I get
```
[-1 0 -1 -1 -1 -1 1 -1 -1 0 0 -1 0 -1 -1 0 -1 -1 -1 0 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 1 -1
0 -1 1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 -1 -1 -1 0
0 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 -1 -1 -1 0 -1 0 -1 -1 -1 -1 -1 0
0 -1 -1 -1 1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1 1 -1 0 -1 -1 -1 0 -1 0
-1 0 0 0 -1 -1 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
```
I set the min_cluster_size to 5 because anything higher only gives back the noise label. Maybe it indeed only has one label (and it is the pitch), but isn't it a bit weird that it doesn't learn anything else?
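One thing that might be worth trying (just a suggestion, not something from this repo, and the paths and parameters below are hypothetical): GE2E-style speaker embeddings are usually compared with cosine similarity, so L2-normalizing them before clustering makes Euclidean distance behave like cosine distance, and setting min_cluster_size closer to the number of utterances you extracted per speaker gives HDBSCAN a better chance of recovering one cluster per speaker.

```python
import glob

import hdbscan
import numpy as np

# Hypothetical layout: one .npy embedding per utterance, grouped in per-speaker folders.
files = sorted(glob.glob("embeddings_o/*/*.npy"))
X = np.vstack([np.load(f) for f in files])

# L2-normalize: for unit vectors, ||a - b||^2 = 2 - 2*cos(a, b),
# so Euclidean clustering follows the cosine geometry the encoder was trained with.
X = X / np.linalg.norm(X, axis=1, keepdims=True)

# min_cluster_size ~ utterances per speaker (placeholder value here).
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=1).fit(X)
print("clusters found:", clusterer.labels_.max() + 1)
print("noise points:  ", int((clusterer.labels_ == -1).sum()))
```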
@mueller91 Do you have a branch where the inter- and intra-losses are implemented? They are in the screenshot you shared above, but they are not in dev or any other branch I tried, and I am not sure how to implement them.
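Not the implementation from that screenshot (I have not found it in any branch either), but a minimal sketch of what such intra-/inter-speaker metrics usually measure for GE2E-style batches, assuming embeddings shaped (num_speakers, utterances_per_speaker, dim):

```python
import torch
import torch.nn.functional as F

def intra_inter_similarity(embeddings: torch.Tensor):
    """embeddings: (num_speakers, utts_per_speaker, dim), ideally L2-normalized."""
    num_speakers, utts, _ = embeddings.shape
    centroids = F.normalize(embeddings.mean(dim=1), dim=-1)            # (S, D)

    # Intra: mean similarity of each utterance to its own speaker centroid (should rise).
    intra = torch.einsum("sud,sd->su", embeddings, centroids).mean()

    # Inter: mean similarity of each utterance to the other speakers' centroids (should fall).
    sims = torch.einsum("sud,td->sut", embeddings, centroids)          # (S, U, S)
    other = ~torch.eye(num_speakers, dtype=torch.bool).unsqueeze(1)    # mask out own speaker
    inter = sims.masked_select(other.expand(num_speakers, utts, num_speakers)).mean()
    return intra, inter
```

Logging these two numbers per step should reproduce the kind of curves shown in that screenshot: intra-speaker similarity climbing while inter-speaker similarity drops as the encoder trains.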
Our current speaker encoder is trained on only the LibriTTS (100, 360) datasets. However, we can improve its performance using other available datasets (VoxCeleb, LibriTTS-500, Common Voice, etc.). That will also increase the performance of our multi-speaker model and make it easier to adapt to new voices.
I can't really work on this alone due to the recent changes and the amount of work needed, so I need a hand here to work on it together.
So I will list the TODOs as follows; feel free to contribute to any part of it or suggest changes: