resemble-ai / Resemblyzer

A python package to analyze and compare voices with deep learning
Apache License 2.0

Model path and hparams #9

Closed sberryman closed 5 years ago

sberryman commented 5 years ago

What are your thoughts on allowing the user to pass in the model path (defaulting to None) and override the hyperparameters? Or do you think the best route is sub-classing and overriding the init function?

https://github.com/resemble-ai/Resemblyzer/blob/cdd51df126dc5304a04ad6b01ca0811575c8809b/resemblyzer/voice_encoder.py#L12
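
For reference, a minimal sketch of the subclassing route, assuming the checkpoint stores its weights under a "model_state" key (as the packaged pretrained.pt appears to) and that the custom model uses the same layer sizes as resemblyzer/hparams.py:

```python
from pathlib import Path

import torch
from resemblyzer import VoiceEncoder


class CustomVoiceEncoder(VoiceEncoder):
    """Hypothetical subclass that reloads weights from a user-supplied checkpoint."""

    def __init__(self, weights_fpath: Path, device=None):
        # Let the parent build the default network first
        super().__init__(device)
        checkpoint = torch.load(weights_fpath, map_location="cpu")
        # Assumes the checkpoint layout matches the packaged model
        # (weights under "model_state", identical layer shapes)
        self.load_state_dict(checkpoint["model_state"])
        self.eval()


# encoder = CustomVoiceEncoder(Path("my_encoder.pt"))
```

Changing the hidden/embedding size (e.g. 768 instead of 256) is a different matter, since the layer shapes are read from resemblyzer/hparams.py at construction time, so that would need the hparams edited or the layers rebuilt rather than just a different checkpoint.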

sberryman commented 5 years ago

I've been playing around with the TED-LIUM dataset, and after some quick spot checks I noticed most of the talks have more than one person talking (the speaker plus Chris Anderson) or are panel discussions with multiple speakers. I figured I would take a look at the projections using your model and one I have trained to 1.2M steps so far.
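
For anyone wanting to reproduce these plots, roughly how the projections can be generated with Resemblyzer and umap-learn (the folder layout and names below are made up for illustration):

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import umap
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# One folder per "speaker" (here, per TED talk), wav files inside -- illustrative layout
dataset_root = Path("tedlium_wavs")
speaker_dirs = sorted(d for d in dataset_root.iterdir() if d.is_dir())

embeds, labels = [], []
for speaker_dir in speaker_dirs:
    for wav_fpath in sorted(speaker_dir.glob("*.wav")):
        wav = preprocess_wav(wav_fpath)              # resample, normalize, trim silence
        embeds.append(encoder.embed_utterance(wav))
        labels.append(speaker_dir.name)
embeds, labels = np.array(embeds), np.array(labels)

# 2D UMAP projection of the utterance embeddings, one color per folder
projection = umap.UMAP(metric="cosine").fit_transform(embeds)
for speaker in sorted(set(labels)):
    mask = labels == speaker
    plt.scatter(projection[mask, 0], projection[mask, 1], label=speaker, s=10)
plt.legend(fontsize="small")
plt.show()
```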

Elon Musk

2 talks (2013, 2017) [UMAP projection image]

Bono

2 talks (2005, 2013) [UMAP projection image]

Bill Gates

5 talks (2009, 2011, 2013, 2014, 2015) [UMAP projection image]

sberryman commented 5 years ago

I tested two languages the models weren't trained on; the colors used for each speaker are the same across both models. To be clear, there were 40 Swedish speakers among the 25,668 unique speakers used to train my model. It doesn't appear to generalize very well, and those 40 Swedish speakers didn't seem to make much of a difference.

Swedish

Speakers 0 - 40

[UMAP projection image]

Speakers 40 - 80

[UMAP projection image]

Norwegian

Speakers 0 - 40

[UMAP projection image]

Speakers 40 - 80

[UMAP projection image]

ViktorAlm commented 5 years ago

Interesting examples! They probably explain why I'm having such varied results with the 256 pretrained encoder and the synthesizer on the Swedish recordings.

I will do some tests later in the week! One thing I can think of is that the first sound in the Scandinavian datasets is background noise. Some of the utterances are just single words as well, which might have an effect. They're all recorded in a studio with good quality and little background noise, and in a lot of the examples different speakers are saying the exact same thing, so if you're picking the first 10 utterances there's a high chance they're saying the exact same phrase in the same studio environment.

I've been playing around a bit with word vectors with t-SNE and UMAP, and as soon as I start adding a lot of examples the results become very cluttered if just a few vectors are "off". Even on small samples it sometimes acts up, but that's more of a t-SNE issue; even when I can do man/king/woman/queen etc. it can look distorted reduced to 2D space. What are the results like if you limit the sample size? It does seem very odd, but I haven't gotten around to reading up on how the encoder really works, so I'm just speculating based on nothing.

My best guess, based on nothing, is that it doesn't generalize well and a lot of the differences it picks up are based more on recording quality and background noise. How does it compare if you add other speakers into the mix with Bill Gates? Does he group up, does he stay in separate clusters, or are the other clusters other people? The encoder seems to be based on the "OK Google" dataset, which probably has the same voice in a lot of different settings, compared to how these datasets look now: one voice in one setting, split into multiple files. If there are duplicates, it's the same voice as two sets, which forces the model to look for other differences.

Again based on nothing, and maybe the encoder already does this, but reducing noise and adding different noises, making the same speaker's recordings differ more while the voice stays the same, could increase accuracy? Or maybe it just needs a lot more data and a lot more steps.
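
If anyone wants to try that augmentation idea, a rough sketch of what it might look like, mixing white noise into a waveform at a random SNR before it is embedded (the numbers are arbitrary, and real augmentation would more likely mix in recorded background noises):

```python
import numpy as np


def add_noise(wav: np.ndarray, snr_db_range=(5, 20), rng=None):
    """Mix white noise into a waveform at a random signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    signal_power = np.mean(wav ** 2) + 1e-10
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=wav.shape)
    return (wav + noise).astype(wav.dtype)
```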

sberryman commented 5 years ago

@ViktorAlm the clusters look a lot better when you only project 10 utterances per speaker in all the models. What is surprising to me is that the English-only model trained to 260k steps is performing very well across the board, given it uses a 768 hidden/embedding size vs. the default of 256.

Bill Gates is actually multiple speakers, so the clusters are accurate; it shows a single name because the utterances came from a single folder. At TED talks there is usually a moderator who introduces the speaker, or it is a panel with a minimum of 2 speakers on stage. I need to verify that each cluster is truly a different speaker, but after scanning through the files quickly I remember hearing at least 4 different speakers across all of his talks.
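
The 10-per-speaker cap is just a random subsample before projecting; something like this, using the same illustrative folder layout as the earlier sketch:

```python
import random
from pathlib import Path

# Keep at most 10 utterance files per speaker folder before embedding/projecting
dataset_root = Path("tedlium_wavs")  # illustrative path
subset = {}
for speaker_dir in sorted(d for d in dataset_root.iterdir() if d.is_dir()):
    wav_fpaths = sorted(speaker_dir.glob("*.wav"))
    subset[speaker_dir.name] = random.sample(wav_fpaths, k=min(10, len(wav_fpaths)))
```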

Swedish

[UMAP projection image]

Norwegian

[UMAP projection image]

ViktorAlm commented 5 years ago

That seems like the way to go! Have you looked at how much the embedding differs? I mean, does it use all 768 values with clearly activated individual values? When I tested it on my small dataset (256 size, 150k steps) it only "activated" a few of the values, leaving the embedding looking very flat even though it was very good at separating in the UMAP plot.
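
For what it's worth, a quick way to check is to look at per-dimension statistics across a batch of utterance embeddings, or just plot them as a heatmap (hypothetical snippet; embeds.npy stands in for however the embeddings are collected):

```python
import matplotlib.pyplot as plt
import numpy as np

# embeds: (n_utterances, embedding_size) array of embed_utterance outputs
embeds = np.load("embeds.npy")  # illustrative placeholder

# Per-dimension mean activation -- a "flat" embedding shows only a few non-zero columns
mean_activation = np.abs(embeds).mean(axis=0)
threshold = 0.05  # arbitrary cutoff, just for a rough count
print("dimensions clearly used:", int((mean_activation > threshold).sum()), "of", embeds.shape[1])

# Heatmap of utterances (rows) vs embedding dimensions (columns)
plt.imshow(embeds, aspect="auto", interpolation="none", cmap="viridis")
plt.xlabel("embedding dimension")
plt.ylabel("utterance")
plt.colorbar()
plt.show()
```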

sberryman commented 5 years ago

Good question. Here are a few heatmaps for the embeddings.

English

TEDLIUM-3 - ElonMusk_2013_0019

[embedding heatmap image]

TEDLIUM-3 - BillGates_2009_0001

[embedding heatmap image]

Swedish

r4670017 - u0017005.wav

[embedding heatmap image]

r4670006 - u0006005.wav

[embedding heatmap image]

ViktorAlm commented 5 years ago

I think that's what needs to be monitored, more than the differentiation, unless you have very similar evaluation data. I guess the great results come when that heatmap glows. That's probably why it needs that many steps.

sberryman commented 5 years ago

Honestly, I'm not sure what makes more of a difference. I haven't trained the synthesizer or vocoder yet, and it's hard to tell from only 4 utterances how the embedding is utilized. Obviously there are more activations in the default model, which was trained to 1M steps. The mixed model is trained to 1.7M steps but with about 3x as many speakers, and it is trained using the Swedish and Norwegian speakers from nasjonal-bank. The English model is a true 768 hidden/embedding size, while the mixed model is still training with a 256 hidden size, which I'm sure is not ideal.

ViktorAlm commented 5 years ago

I tried running the synthesizer and vocoder for about 100k steps on an encoder that had about 150k steps done on 4k examples; it was separating the clusters great, but the embedding was very flat and it was not able to clone any voice. Training 100k steps on the synthesizer and vocoder with the pretrained encoder did produce similar voices, to the degree that my friends could see the similarity with the one I was trying to clone, but the result was far from great: asked to rate it from 1-10, most said 5-7. I have the synthesizer at 200k steps now and the vocoder at about 150k, and it has improved, but I need to run the synthesizer until I get a spectrogram that looks decent. Male voices are not similar. I still have not changed any params; I'm just trying to get a "feel" for how it works.

Astrid Lindgren (Swedish author, Pippi Longstocking) https://www.youtube.com/watch?v=GQIoSD_xvQE&t=108s https://vocaroo.com/i/s08PyR3L3KC7?fbclid=IwAR1ZxjCQJpeaXFeM3S1iveU-TJWTBSkR5XfmTLNuF-DWU00fOgOeWB3N4CQ

Annie Lööf (Swedish politician) https://www.youtube.com/watch?v=ikOe1WfM50Y https://vocaroo.com/i/s1gnv3fudPk1?fbclid=IwAR1-24xnSUpFpkTLZHkgKEySrZo61MHKo3-gRWQogxMotTPjU_jsVKZBBVM

CorentinJ commented 5 years ago

By the way, you will have different clusters if you project a single speaker with UMAP. It's going to try to cluster by the most distinguishing feature, so it's natural that it clusters by speakers with multiple speakers; and it's also natural that it clusters by different recording environments with a single speaker.

sberryman commented 5 years ago

Thanks @CorentinJ. That makes complete sense that it clusters by different recording environments, which would explain the Bill Gates TED talks perfectly: they were recorded across different years at different venues. It also explains the two large clusters for Elon Musk across two different years and, most likely, different recording environments.

What do you think would happen if I used all the utterances for Bill Gates across different environments and years as part of the training data for a single person? The multiple years and environments will be a somewhat common occurrence if I include TED-LIUM in the training. Or would you use each talk by the same person as a separate speaker for encoder training?

CorentinJ commented 5 years ago

What's your goal with this idea? Training on a single speaker makes little sense, as the speaker encoder is trained on a speaker verification task. You're free to make it an "environment verification" task but I don't see anything fruitful coming out of that.

CorentinJ commented 5 years ago

Anyway, to answer your original question: I have a good module for handling hyperparameters, but I certainly won't use it in this package because the package is meant for production. The same goes for changing the model path; most users won't have custom models to use it with. You're free to modify the source code to meet your needs, of course.

sberryman commented 5 years ago

"What is your goal with this idea?"

I guess I didn't explain it very well. I was wondering what would happen if the encoder is trained with data from mixed environments for the same speaker: would UMAP still cluster that speaker by recording environment? Would the clusters for a single speaker still be separated?

"I have a good module for handling hyperparameters"

I'm guessing this is a module you have written that is not open source? I have changed the code for my needs, so I'll close this issue, as it doesn't sound like you want to make the hparams and model path configurable in Resemblyzer.