Multi Speaker Embeddings

Hi @erogol, I've been a bit off the radar for the past month because of vacation and other projects, but now I am back and ready for action! I am looking into how to do multi speaker embeddings, and here's my current plan of action:

Have all preprocessors output items that also have a speaker ID to be used down the line. Formats that do not have explicit speaker ids, i.e. all current preprocessors, would use a uniform ID. This speaker ID must then be passed down by the dataset through the collate function and into the forward pass of the model.
Add speaker embeddings to the model. An additional embedding with configurable number of speakers and embedding dimensionality. The embedding vector is retrieved based on speaker id and then replicated and concatenated to each encoder output. The result is passed to the decoder as before. Here we could also easily ignore speaker embeddings if we only deal with a single speaker.
It might make sense to let speaker embeddings put some constraints on the train/dev/test split, i.e. every speaker in the dev/test set should at least have some examples in the train set, otherwise their embeddings are never learned. I could implement a check for that and issue a warning if this isn't the case.

Any thoughts or additional hints on this?

mozilla / TTS

Multi Speaker Embeddings #166