taylorlu / Speaker-Diarization

speaker diarization by uis-rnn and speaker embedding by vgg-speaker-recognition
Apache License 2.0

Speaker-Diarization

This project contains:

Outline

1. Speaker recognition.

cd ghostvlad
python predict.py

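The confusion matrix for utterances from 4 speakers (3 utterances each) is shown below. The entries behave like pairwise distances between utterance embeddings: the diagonal is 0.00, and utterances from the same speaker score lower than utterances from different speakers.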

0.00  0.32  0.40  | 0.70  0.62  0.76  | 0.81  0.83  0.76  | 0.92  0.83  0.89  |

0.32  0.00  0.48  | 0.68  0.58  0.76  | 0.87  0.84  0.83  | 0.92  0.82  0.86  |

0.40  0.48  0.00  | 0.71  0.65  0.74  | 0.79  0.81  0.72  | 0.90  0.84  0.85  |

********************************************************************************

0.70  0.68  0.71  | 0.00  0.35  0.30  | 0.78  0.81  0.76  | 0.80  0.81  0.80  |

0.62  0.58  0.65  | 0.35  0.00  0.45  | 0.76  0.71  0.73  | 0.82  0.77  0.77  |

0.76  0.76  0.74  | 0.30  0.45  0.00  | 0.83  0.83  0.80  | 0.83  0.84  0.80  |

********************************************************************************

0.81  0.87  0.79  | 0.78  0.76  0.83  | 0.00  0.40  0.46  | 0.76  0.80  0.86  |

0.83  0.84  0.81  | 0.81  0.71  0.83  | 0.40  0.00  0.45  | 0.80  0.78  0.82  |

0.76  0.83  0.72  | 0.76  0.73  0.80  | 0.46  0.45  0.00  | 0.85  0.85  0.84  |

********************************************************************************

0.92  0.92  0.90  | 0.80  0.82  0.83  | 0.76  0.80  0.85  | 0.00  0.41  0.44  |

0.83  0.82  0.84  | 0.81  0.77  0.84  | 0.80  0.78  0.85  | 0.41  0.00  0.41  |

0.89  0.86  0.85  | 0.80  0.77  0.80  | 0.86  0.82  0.84  | 0.44  0.41  0.00  |

********************************************************************************
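
A matrix like the one above can be computed from a set of utterance embeddings. The following is a minimal sketch, assuming the entries are cosine distances (the source does not state the metric) and using random vectors as stand-ins for real embeddings. `pairwise_cosine_distance` is a hypothetical helper and is not part of this project.

```python
import numpy as np

def pairwise_cosine_distance(embeddings):
    """Return an (N, N) matrix of cosine distances between rows of `embeddings`."""
    # L2-normalize each row so the dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    # Cosine distance = 1 - cosine similarity; clip tiny round-off errors.
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

# Stand-in data (assumption): 4 speakers x 3 utterances, 512-dim non-negative vectors.
rng = np.random.default_rng(0)
embeddings = np.abs(rng.normal(size=(12, 512)))
dist = pairwise_cosine_distance(embeddings)
for row in dist:
    print("  ".join(f"{d:.2f}" for d in row))
```

With real embeddings, rows belonging to the same speaker should form low-valued blocks along the diagonal, as in the matrix above.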

Thanks to the authors of VGG, who kindly provided the code and pre-trained model. For details, see their paper, "Utterance-level Aggregation for Speaker Recognition in the Wild".
It is a novel idea to apply NetVLAD/GhostVLAD, which are popular in image recognition, to speaker recognition. The previous state of the art was i-vector based and depended on GMM and PLDA.

I have also re-implemented the VGG speaker model in TensorFlow; see ghostvlad-speaker and the corresponding pretrained model.

This project only shows how to generate speaker embeddings with the pre-trained model, for later use in uis-rnn training.
For training the speaker model itself, see VGG-Speaker-Recognition.

Dataset

  1. http://www.openslr.org/38 contains 855 speakers with 120 Mandarin Chinese utterances each, 102600 utterances in total.
  2. VCTK contains 109 speakers of English.
  3. VoxCeleb1 contains 1251 speakers.
  4. VoxCeleb2 contains 6112 speakers.

How to generate speaker embeddings for the next training stage:

python generate_embeddings.py

You may need to change the dataset path to your own.

2. Speaker diarization.


Training

python train.py

The speaker embeddings generated by VGG are all non-negative vectors and contain many zero elements. uis-rnn does not seem to handle such data well: the training loss becomes NaN, as shown below.

Iter: 0     Training Loss: nan    
Negative Log Likelihood: 7.3020 Sigma2 Prior: nan   Regularization: 0.0007
Iter: 10    Training Loss: nan    
Negative Log Likelihood: nan    Sigma2 Prior: nan   Regularization: nan
Iter: 20    Training Loss: nan    
Negative Log Likelihood: nan    Sigma2 Prior: nan   Regularization: nan

When I added an insignificant bias (e.g. 0.00001) to each element of the vectors, the error disappeared.

Iter: 0     Training Loss: -581.8732    
Negative Log Likelihood: 7.0125 Sigma2 Prior: -588.8864 Regularization: 0.0007
Iter: 10    Training Loss: -614.1193    
Negative Log Likelihood: 1.7536 Sigma2 Prior: -615.8737 Regularization: 0.0007
Iter: 20    Training Loss: -644.9244    
Negative Log Likelihood: 1.7123 Sigma2 Prior: -646.6375 Regularization: 0.0007
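
The workaround above amounts to one element-wise addition before the embeddings are passed to uis-rnn. A minimal sketch, assuming the embeddings are held in a NumPy array; `add_bias` and the sample array are illustrative and not the project's actual code.

```python
import numpy as np

def add_bias(embeddings, bias=1e-5):
    """Shift every element by a tiny constant so no entry is exactly zero."""
    return embeddings + bias

# Stand-in for non-negative embeddings with many exact zeros.
train_sequence = np.array([[0.0, 0.3, 0.0],
                           [0.7, 0.0, 0.0]])
train_sequence = add_bias(train_sequence)
print(train_sequence.min())  # strictly positive after the shift
```

The bias is small enough that distances between embeddings are essentially unchanged.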

Clustering

python speakerDiarization.py

The result is shown below (3 speakers):

========= 0 =========
0:00.288 ==> 0:04.406
0:07.699 ==> 0:16.461
0:33.921 ==> 0:35.8
========= 1 =========
0:04.406 ==> 0:07.699
0:16.461 ==> 0:19.594
0:30.371 ==> 0:33.921
0:41.19 ==> 0:44.185
========= 2 =========
0:19.594 ==> 0:30.371
0:35.8 ==> 0:41.19
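
The listing above groups time spans by speaker. As a rough sketch of how such a listing can be derived from per-window cluster labels, the code below merges consecutive windows that share a label. The labels, timestamps (in milliseconds), and helper names are made up for illustration and are not taken from `speakerDiarization.py`.

```python
from collections import defaultdict

def merge_segments(labels, starts_ms, ends_ms):
    """Merge consecutive windows with the same label into (start, end) spans per speaker."""
    spans = defaultdict(list)
    run_start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of input or when the label changes.
        if i == len(labels) or labels[i] != labels[run_start]:
            spans[labels[run_start]].append((starts_ms[run_start], ends_ms[i - 1]))
            run_start = i
    return dict(spans)

def fmt(ms):
    """Format milliseconds as m:ss.mmm with zero-padded fields."""
    minutes, rest = divmod(ms, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes}:{seconds:02d}.{millis:03d}"

# Hypothetical per-window labels and window boundaries.
labels = [0, 0, 1, 1, 0]
starts = [0, 1000, 2000, 3000, 4000]
ends = [1000, 2000, 3000, 4000, 5000]
for speaker, segs in sorted(merge_segments(labels, starts, ends).items()):
    print(f"========= {speaker} =========")
    for s, e in segs:
        print(f"{fmt(s)} ==> {fmt(e)}")
```

Zero-padding the millisecond field is a design choice of this sketch; it keeps every timestamp the same width and unambiguous.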

The final result is influenced by the window size and the overlap rate. When the overlap is too large, uis-rnn may produce fewer speakers because the speaker embeddings change smoothly from window to window; when it is too small, it may produce more speakers. The window also cannot be too short: it must contain enough information to yield discriminative speaker embeddings.
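
To make the trade-off concrete, the sketch below slices a sequence of frames into overlapping windows. The function name, parameters, and frame counts are assumptions for illustration; the project's actual windowing may differ.

```python
def sliding_windows(num_frames, window_size, overlap_rate):
    """Return (start, end) frame indices of windows with the given overlap.

    overlap_rate is the fraction of a window shared with the next one,
    so the hop between window starts is window_size * (1 - overlap_rate).
    """
    if not 0.0 <= overlap_rate < 1.0:
        raise ValueError("overlap_rate must be in [0, 1)")
    hop = max(1, int(round(window_size * (1.0 - overlap_rate))))
    windows = []
    start = 0
    # Only keep windows that fit entirely inside the sequence.
    while start + window_size <= num_frames:
        windows.append((start, start + window_size))
        start += hop
    return windows

# Higher overlap gives more windows, each differing less from its neighbor.
print(len(sliding_windows(1000, 100, 0.5)))
print(len(sliding_windows(1000, 100, 0.9)))
```

Each window would yield one speaker embedding, so a higher overlap rate produces a longer and smoother embedding sequence for uis-rnn to cluster.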