taylorlu / Speaker-Diarization

speaker diarization by uis-rnn and speaker embedding by vgg-speaker-recognition
Apache License 2.0

Speaker diarization gives more than two speakers while I have only two speakers in the audio file. #14

Open alamnasim opened 5 years ago

alamnasim commented 5 years ago

I have Hindi-English mixed audio with almost 130 speakers, each having 200 utterances of 4-10 seconds. I made d-vectors using the VGG speaker recognition model (the pre-trained one provided in vgg-speaker-recognition). I trained the model using train.py, and when I tested it I got more than two speakers, even though each test file has only two speakers. In some cases it gives 3 or 4 speakers. What should I do? Why does it give me more speakers? Where did I go wrong? Do I need to train the VGG speaker recognition model on my own data?

Please help. Thanks.

Arroosh commented 5 years ago

Do you have the same sampling rate for both the Hindi and English data files?

alamnasim commented 5 years ago

yes

Arroosh commented 5 years ago

You had better use the same language for your training data and evaluation data, since different languages have different tempo and features, unless there is a more robust speaker recognition method. Discussed in issue #8 (Real time diarization).

giorgionanfa commented 5 years ago

Hi! About this topic: if we try to create a dataset in a specific language, can we set some parameters (a minimum number of speakers and utterances, for example) in order to guarantee good final results once those parameters are met? Thanks

alamnasim commented 5 years ago

> You had better use the same language for your training data and evaluation data, since different languages have different tempo and features, unless there is a more robust speaker recognition method. Discussed in issue #8 (Real time diarization).

Most of the audio (almost 90-92%) is Hindi only; the rest is English or mixed, spoken with an Indian accent. When I measured testing accuracy, I got about 55-60%.

alamnasim commented 5 years ago

As I checked, yes, we can set the number of speakers, and it randomly selects a range of utterances from each speaker. Check line 184 in generate_embeddings.py.
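
For illustration, the idea is roughly the sketch below; the names and numbers are hypothetical placeholders, not the actual code around line 184 of generate_embeddings.py.

import random

# Hypothetical sketch: keep a fixed number of speakers and sample a random
# number of utterances per speaker before computing d-vectors.
utterances_by_speaker = {                      # placeholder mapping: speaker -> wav paths
    "spk_%03d" % i: ["spk_%03d/utt_%03d.wav" % (i, j) for j in range(200)]
    for i in range(130)
}
NUM_SPEAKERS = 100                             # assumption: how many speakers to keep
MIN_UTTS, MAX_UTTS = 50, 200                   # assumption: per-speaker utterance range

kept_speakers = sorted(utterances_by_speaker)[:NUM_SPEAKERS]
train_set = {}
for spk in kept_speakers:
    utts = utterances_by_speaker[spk]
    count = min(len(utts), random.randint(MIN_UTTS, MAX_UTTS))
    train_set[spk] = random.sample(utts, count)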

giorgionanfa commented 5 years ago

> As I checked, yes, we can set the number of speakers, and it randomly selects a range of utterances from each speaker. Check line 184 in generate_embeddings.py.

Ok, but do you think we can establish these parameters at the beginning? For example, say I decide to create a dataset that contains 100 speakers and 2 utterances per speaker: can I be reasonably sure that the final performance will be acceptable? I hope I was clear. Thanks

Arroosh commented 5 years ago

I think the more utterances you have for each speaker, the better your results will be.

alamnasim commented 5 years ago

We have filtered our data to Hindi only and trained and tested again. We are still getting more than two speakers (3, 4, 5, or 6). In our case we now have Hindi-only audio and we are sure it contains only two speakers, so we want only two clusters. Is there any option in the code to hardcode the number of speakers to two?

Do I need to train the VGG speaker model on my own dataset?

@taylorlu Thanks

Fritskee commented 5 years ago

> We have filtered our data to Hindi only and trained and tested again. We are still getting more than two speakers (3, 4, 5, or 6). In our case we now have Hindi-only audio and we are sure it contains only two speakers, so we want only two clusters. Is there any option in the code to hardcode the number of speakers to two?
>
> Do I need to train the VGG speaker model on my own dataset?
>
> @taylorlu Thanks

I have the exact same question. Can somebody clarify this please?

alamnasim commented 5 years ago

I trained the VGG GhostVLAD model on my own dataset, used that model to create the dataset's d-vectors, and then trained uisrnn. I am getting only one speaker.

Can I fine-tune the existing pre-trained model (VGG on VoxCeleb2), retraining the whole model or only some layers on our own Hindi dataset?

taylorlu commented 5 years ago

1: uisrnn doesn't support clustering into a given number of speakers; this is a limitation of the method's design.
2: The model can certainly be fine-tuned on a new dataset, but unlike a common CNN network, I have no idea which layers should be frozen; you can try.
3: Before training uisrnn, you should test the GhostVLAD model and decide whether it behaves correctly (a rough check is sketched below).
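
For point 3, a rough sanity check might look like the sketch below. It assumes network_eval is the loaded GhostVLAD model (as used elsewhere in speakerDiarization.py), and spec_a1, spec_a2, spec_b are hypothetical log-mel spectrograms: two utterances from speaker A and one from speaker B.

import numpy as np

def embed(spec):
    # Shape a single spectrogram to [1, freq, time, 1] and get its d-vector,
    # the same way speakerDiarization.py calls network_eval.predict.
    x = np.expand_dims(np.expand_dims(spec, 0), -1)
    return network_eval.predict(x)[0]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

e_a1, e_a2, e_b = embed(spec_a1), embed(spec_a2), embed(spec_b)
print("same speaker:", cosine(e_a1, e_a2))   # should be clearly higher than...
print("diff speaker:", cosine(e_a1, e_b))    # ...this if the embeddings are usable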

alamnasim commented 5 years ago

Thanks for the prompt reply. Sure, I will check.

vickianand commented 5 years ago

@alamnasim, I would also suggest trying the spectral-clustering method instead of the uis-rnn method for clustering. My personal experience has been that spectral clustering is better at predicting the true number of speakers (it predicts the true count or fewer). In contrast, I have found that the uis-rnn method almost always over-clusters (predicts more speakers than the true count) for my use cases.

vickianand commented 5 years ago

@taylorlu, I was wondering if you have tried the spectral-clustering method instead of this uis-rnn method? I find that spectral clustering is better in terms of both accuracy and speed.

taylorlu commented 5 years ago

@vickianand Thanks for your idea; I haven't tried the spectral-clustering method. But one difference between spectral clustering and uisrnn is that the former doesn't support real-time clustering, which means you have to input the whole wav file at once. Another is that most clustering methods need a predefined cluster count; spectral clustering seems to support auto-detection, but I haven't tried it.
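
For the two-speaker files discussed earlier in this thread, one option (a sketch using the spectralcluster package; the exact constructor arguments vary between versions) is to pin both cluster bounds to 2, so the clusterer cannot over-cluster; leaving min_clusters below max_clusters lets it estimate the count automatically instead.

from spectralcluster import SpectralClusterer

# feats: [num_segments, embedding_dim] d-vector matrix, built as in speakerDiarization.py
clusterer = SpectralClusterer(
    min_clusters=2,          # force exactly two clusters for known two-speaker calls
    max_clusters=2,
    p_percentile=0.95,
    gaussian_blur_sigma=1)

predicted_label = clusterer.predict(feats)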

alamnasim commented 5 years ago

@vickianand Thanks, I will try spectral clustering as well.

Arroosh commented 5 years ago

@vickianand Can you please give me some guidelines on how to replace uisrnn with spectral clustering in the given code?

import numpy as np
from spectralcluster import SpectralClusterer

feats = []
for spec in specs:
    spec = np.expand_dims(np.expand_dims(spec, 0), -1)
    v = network_eval.predict(spec)
    feats += [v]

feats = np.array(feats)[:, 0, :].astype(float)  # [splits, embedding dim]

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=100,
    p_percentile=0.95,
    gaussian_blur_sigma=1)

predicted_label = clusterer.predict(feats)

# replaced: predicted_label = uisrnnModel.predict(feats, inference_args)
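
(Note: this assumes the spectralcluster package, installed with pip install spectralcluster. Newer releases of that package moved p_percentile and gaussian_blur_sigma into a separate RefinementOptions object, so the exact constructor arguments above depend on the installed version.)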

uisrnn uses feats and inference_args to find predicted_label, but I pass only the feats parameter to clusterer.predict to find the predicted labels.

I have made the above changes in speakerDiarization.py, but they do not give me accurate results compared to uisrnn.

Thanks in advance

vickianand commented 5 years ago

@Arroosh, the changes you have made look correct to me. They should give at least an accuracy comparable to the uisrnn method. How much difference in accuracy are you seeing?
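
For a quick, rough comparison (a sketch, not a proper DER computation), segment-level labels can be scored against a reference under the best one-to-one mapping between hypothesis clusters and reference speakers:

import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(ref_labels, hyp_labels):
    # Segment-level accuracy under the best assignment of hypothesis clusters
    # to reference speakers; ignores timing and overlap, unlike real DER.
    ref, hyp = np.asarray(ref_labels), np.asarray(hyp_labels)
    ref_ids, hyp_ids = np.unique(ref), np.unique(hyp)
    conf = np.zeros((len(ref_ids), len(hyp_ids)), dtype=int)
    for i, r in enumerate(ref_ids):
        for j, h in enumerate(hyp_ids):
            conf[i, j] = np.sum((ref == r) & (hyp == h))
    rows, cols = linear_sum_assignment(-conf)  # maximize matched segments
    return conf[rows, cols].sum() / len(ref)

# Example: two reference speakers, hypothesis over-clustered into three.
print(cluster_accuracy([0, 0, 1, 1, 1, 0], [2, 2, 0, 0, 1, 2]))  # ~0.83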

vickianand commented 5 years ago

@taylorlu, I see that you have provided a pretrained uisrnn model in this repo (pretrained/saved_model.uisrnn_benchmark). I observe that this model does a lot of over-clustering on my test data. Can I ask which dataset you used to train this model?

Arroosh commented 5 years ago

@vickianand Spectral clustering is good at finding the true number of speakers, but it is not accurate at separating the speakers in the audio file. Can I change the window size or any other parameter to improve the accuracy of the results?

taylorlu commented 5 years ago

@vickianand, I used openslr38 as the dataset for the pretrained model, since my purpose was to handle Chinese dialogue.

vickianand commented 5 years ago

@taylorlu, the given code for the uisrnn model is written so that it loads all the training data into memory at once and then runs training iterations by sampling mini-batches from the loaded data. So I am wondering how you trained it with a dataset like openslr38, which is reasonably big. Did you write a new training function that uses a data loader, or did you use the existing one? I also have some questions about the given training method, which I have asked here: https://github.com/google/uis-rnn/issues/53 If possible, please help me understand it.
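
As a sketch of the data-loader idea: iterate over per-file .npz archives lazily instead of concatenating everything up front (hypothetical file layout; this only keeps the raw arrays out of memory and leaves uisrnn's own training loop untouched).

import glob
import numpy as np

def iter_training_files(pattern="training_data/*.npz"):
    # Yield (sequence, cluster_ids) pairs one file at a time rather than
    # loading the whole dataset into memory at once. Hypothetical layout:
    # each .npz stores 'train_sequence' [frames, dim] and 'train_cluster_id'.
    for path in sorted(glob.glob(pattern)):
        with np.load(path, allow_pickle=True) as data:
            yield data["train_sequence"], data["train_cluster_id"]

# Accumulate small chunks of files, hand each chunk to the training routine,
# then drop it, instead of building one giant concatenated array.
chunk_seqs, chunk_ids = [], []
for seq, cid in iter_training_files():
    chunk_seqs.append(seq)
    chunk_ids.append(cid)
    if len(chunk_seqs) == 64:          # assumption: files per training chunk
        # train_on_chunk(chunk_seqs, chunk_ids)  # placeholder for the actual call
        chunk_seqs, chunk_ids = [], []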

taylorlu commented 5 years ago

@vickianand Yes, I also found that issue with the shuffled permutation, and uisrnn doesn't support dynamic batch input either; probably the only way is to modify the uisrnn code.

chrisspen commented 5 years ago

@taylorlu Did you train only with openslr38, or did you include the English VCTK and VoxCeleb datasets as well? The VoxCeleb datasets are huge, and your pretrained model is surprisingly small. I trained a uis-rnn model on the TIMIT dataset, which has only about 300 speakers, and it was 5 times as large. How did you get your model so small?

priyankagutte commented 4 years ago

> As I checked, yes, we can set the number of speakers, and it randomly selects a range of utterances from each speaker. Check line 184 in generate_embeddings.py.

Can you please tell me what you changed in line 184 to set the number of speakers?