alamnasim opened this issue 5 years ago
Do you have the same sampling rate for both the Hindi and English data files?
yes
It is better to use the same language for your training data and evaluation data, since different languages have different tempo and features, unless you have a more robust speaker recognition method. Discussed in issue (Real time diarization #8).
Hi! About this topic: if we try to create a dataset in a specific language, can we set some parameters (a minimum number of speakers and utterances, for example) in order to guarantee good final results once those parameters are met? Thanks
Most of the audio (almost 90-92%) is Hindi only, and the rest is English or mixed, spoken with an Indian accent. When I measured testing accuracy, I got about 55-60%.
As I checked, yes, we can set the number of speakers, and it randomly selects a range of utterances from each speaker. Check line number 184 in generate_embeddings.py.
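For illustration only (this is not the actual code at that line; the helper name and parameters below are hypothetical): a minimal sketch of how one might cap the number of speakers and sample a random range of utterances per speaker when building the embedding training set.

```python
# Hypothetical sketch -- not the repo's generate_embeddings.py code.
# Caps the speaker count and samples a random range of utterances per speaker.
import random

def select_training_utterances(speaker_to_wavs, num_speakers=100,
                               min_utts=2, max_utts=10, seed=42):
    """speaker_to_wavs: dict mapping speaker id -> list of wav paths."""
    rng = random.Random(seed)
    chosen_speakers = rng.sample(sorted(speaker_to_wavs), k=num_speakers)
    selection = {}
    for spk in chosen_speakers:
        wavs = speaker_to_wavs[spk]
        k = rng.randint(min_utts, min(max_utts, len(wavs)))
        selection[spk] = rng.sample(wavs, k=k)
    return selection
```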
OK, but do you think we can establish these parameters at the beginning? For example, suppose I decide to create a dataset that contains 100 speakers and 2 utterances per speaker. Can I be reasonably sure that the final performance will be acceptable? I hope I was clear. Thanks
I think the more utterances you have for each speaker, the better your results will be.
We have filtered our data to Hindi only and trained and tested again. We are still getting more than two speakers (3, 4, 5, or 6). In our case the audio is Hindi only and we are sure it has only two speakers, so we want only two clusters. Is there a hardcoded option in the code to specify the number of speakers as two?
Do I need to train the vgg-speaker model on my own dataset?
@taylorlu Thanks
I have the exact same question. Can somebody clarify this please?
I trained the vgg ghostvlad model on my own dataset, then used that model to create the d-vector dataset and trained uisrnn. I am getting only one speaker.
Can I fine-tune the existing pre-trained model (vgg on voxceleb2), retraining the whole network or some layers on our own Hindi dataset?
1. uisrnn doesn't support clustering into a given number of speakers; this is a limitation of the method's design.
2. The model can certainly be finetuned on a new dataset, but unlike a common CNN network, I have no idea which layers should be frozen; you can try.
3. Before training uisrnn, you should test the ghostvlad model and decide whether it is behaving correctly.
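A rough way to do the check in point 3 before spending time on uisrnn: compare cosine similarities of same-speaker and different-speaker embedding pairs. This is only a sketch; the `embed(wav_path)` helper is an assumption standing in for the spectrogram extraction plus `network_eval.predict(...)` used elsewhere in this thread.

```python
# Rough sanity check for the ghostvlad embeddings before training uis-rnn:
# same-speaker pairs should score clearly higher than different-speaker pairs.
# `embed(wav_path)` is assumed to return one d-vector per wav file.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def check_embeddings(embed, same_pairs, diff_pairs):
    same = [cosine(embed(x), embed(y)) for x, y in same_pairs]
    diff = [cosine(embed(x), embed(y)) for x, y in diff_pairs]
    print("same-speaker mean cosine:", np.mean(same))
    print("diff-speaker mean cosine:", np.mean(diff))
    # If the two means are not well separated, finetune the ghostvlad model
    # before moving on to uis-rnn.
```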
Thanks for the prompt reply. Sure, I will check.
@alamnasim, I would also suggest trying the spectral-clustering method instead of the uis-rnn method for clustering. My personal experience has been that spectral clustering is better at predicting the true count of speakers (it predicts the true count or fewer). In contrast, I have found that the uis-rnn method almost always over-clusters (more speakers predicted than the true count) for my use cases.
@taylorlu, I was wondering if you had tried the spectral-clustering method instead of this uis-rnn method? I find that spectral clustering is better both in terms of accuracy and speed.
@vickianand Thanks for your idea, I haven't tried the spectral-clustering method. But one difference between spectral clustering and uisrnn is that the former doesn't support realtime clustering, which means you have to input the whole wav file at once. Also, most clustering methods need a cluster count to be specified; spectral clustering seems to support auto-detection, but I haven't tried it.
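A minimal sketch, assuming the spectralcluster package's SpectralClusterer: when the number of speakers is known (e.g. the two-speaker case asked about above), the cluster count can be pinned by setting min_clusters equal to max_clusters; leaving them apart lets the method auto-detect the count.

```python
# Minimal sketch, assuming the spectralcluster package.
import numpy as np
from spectralcluster import SpectralClusterer

# Pin the cluster count to 2 when the speaker count is known;
# widen max_clusters to let the method auto-detect the count instead.
clusterer = SpectralClusterer(min_clusters=2, max_clusters=2)

feats = np.random.rand(50, 512).astype(float)  # placeholder d-vectors [segments, dim]
labels = clusterer.predict(feats)
print(labels)  # per-segment speaker ids in {0, 1}
```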
@vickianand Thanks, I will try the spectral-clustering as well.
@vickianand Can you please give me some guidelines on how to replace uisrnn with spectral clustering in the given code?
```python
# Changes in speakerDiarization.py: replace the uisrnn prediction step with
# spectral clustering over the same d-vector features.
import numpy as np
from spectralcluster import SpectralClusterer

feats = []
for spec in specs:
    spec = np.expand_dims(np.expand_dims(spec, 0), -1)
    v = network_eval.predict(spec)
    feats += [v]

feats = np.array(feats)[:, 0, :].astype(float)  # [splits, embedding dim]

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=100,
    p_percentile=0.95,
    gaussian_blur_sigma=1)

predicted_label = clusterer.predict(feats)
```
uisrnn uses feats and inference_args to find predicted_label, but I pass only feats to clusterer.predict to get the predicted labels. I have made the changes above in speakerDiarization.py, but it does not give me results as accurate as uisrnn's.
Thanks in advance
@Arroosh, the changes you have made look correct to me. They should give accuracy at least comparable to the uisrnn method. How much difference in accuracy are you seeing?
@taylorlu, I see that you have provided a pretrained model for uisrnn in this repo: pretrained/saved_model.uisrnn_benchmark. I observe that this model does a lot of over-clustering on my test data. Can I ask what dataset you used for training it?
@vickianand Spectral clustering is good at finding the true count of speakers, but it is not accurate at separating the speakers in my audio files. Can I change the window size or any other parameter to improve the accuracy of the results?
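One hedged way to explore that, besides trying a different segment/window length when cutting the audio into d-vectors, is to sweep the clusterer's p_percentile and gaussian_blur_sigma against labeled data. The score_fn and reference_labels below are assumptions for illustration (e.g. clustering purity, where higher is better), and the keyword arguments assume the same spectralcluster version as the snippet above.

```python
# Hedged sketch of a small parameter sweep for the spectral clustering step.
import itertools
from spectralcluster import SpectralClusterer

def sweep(feats, reference_labels, score_fn):
    """score_fn(reference, predicted) -> score, higher is better (e.g. purity)."""
    best = None
    for p, sigma in itertools.product([0.90, 0.93, 0.95, 0.97], [0.5, 1, 2]):
        clusterer = SpectralClusterer(min_clusters=2, max_clusters=2,
                                      p_percentile=p, gaussian_blur_sigma=sigma)
        pred = clusterer.predict(feats)
        score = score_fn(reference_labels, pred)
        if best is None or score > best[0]:
            best = (score, p, sigma)
    return best  # (best score, best p_percentile, best gaussian_blur_sigma)
```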
@vickianand, I used openslr38 as the dataset for the pretrained model, since my purpose was to deal with Chinese dialogue.
@taylorlu, the given code for the uisrnn model is written in such a way that it loads all the training data into memory at once and then runs training iterations by sampling mini-batches from that loaded data. So I am wondering how you trained it with a dataset like openslr38, which is reasonably big. Did you write a new training function that uses a data loader, or did you use the existing one? I also have some questions about the given training method, which I have asked here: https://github.com/google/uis-rnn/issues/53 If possible, please help me understand it.
@vickianand Yes, I also found that there was an issue with the shuffled permutation, and uisrnn doesn't support dynamic batch input either; perhaps the only way is to modify the uisrnn code.
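One possible (unverified) workaround along those lines: call uisrnn's fit() once per chunk of the training data instead of loading everything at once. The chunk file names and .npz field names below are assumptions, and note that transition_bias / crp_alpha are re-estimated from each chunk, so this only approximates training on the full set.

```python
# Hedged sketch: train uis-rnn on the data chunk by chunk to limit memory use.
import numpy as np
import uisrnn

model_args, training_args, _ = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)

for chunk_path in ["train_chunk_0.npz", "train_chunk_1.npz"]:  # hypothetical files
    data = np.load(chunk_path, allow_pickle=True)
    train_sequences = data["train_sequence"]      # assumed field names
    train_cluster_ids = data["train_cluster_id"]
    model.fit(train_sequences, train_cluster_ids, training_args)

model.save("saved_model.uisrnn")
```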
@taylorlu Did you train only on openslr38, or did you include the English VCTK and Voxceleb datasets as well? The Voxceleb datasets are huge, yet your pretrained model is surprisingly small. I trained a uis-rnn model on the TIMIT dataset, which has only about 300 speakers, and it was 5 times as large. How did you get your model so small?
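For what it's worth, the size of a saved uis-rnn checkpoint is governed by the architecture arguments (observation_dim, rnn_hidden_size, rnn_depth), not by the number of speakers or hours of audio it was trained on. A hedged sketch for checking the parameter count, assuming the rnn_model attribute of the current uis-rnn implementation:

```python
# Hedged sketch: inspect how large the uis-rnn network actually is.
import uisrnn

model_args, _, _ = uisrnn.parse_arguments()
# e.g. model_args.observation_dim must match the d-vector dimension.
model = uisrnn.UISRNN(model_args)

# rnn_model is assumed to be the underlying PyTorch module.
num_params = sum(p.numel() for p in model.rnn_model.parameters())
print("trainable parameters:", num_params)
```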
> As I checked, yes, we can set the number of speakers, and it randomly selects a range of utterances from each speaker. Check line number 184 in generate_embeddings.py.
Can you please tell me what you changed at line number 184 to set the number of speakers?
I have Hindi-English mixed audio with almost 130 speakers, each having 200 utterances of length between 4 and 10 seconds. I made d-vectors using the vgg-speaker-recognition model (the pre-trained one provided with vgg-speaker-recognition). I trained the model using train.py, but when I tested it I found more than two speakers, even though each test recording has only two speakers. In some cases it gives 3 or 4 speakers. What should I do? Why does it give me more speakers? Where did I go wrong? Do I need to train the vgg-speaker-recognition model on my own data?
Please help. Thanks.