google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Clustering performance is influenced by overlap window size #41

Closed taylorlu closed 5 years ago

taylorlu commented 5 years ago

@wq2012 The overlap rate seems to strongly influence the number of detected speakers. When the overlap is larger, the speaker embedding changes more smoothly, the change points become harder to detect, and the system tends to produce fewer speakers. The size of the sliding window also matters a lot, although that part of the problem is caused by the speaker embedding algorithm. This is my project integrating with the vgg-speaker-recognition algorithm: Speaker-Diarization. Thanks a lot.

wq2012 commented 5 years ago

For the window size, please make sure you train the speaker embedding model using variable-length windows.

In section 4.5 of our paper, we stated that training the speaker embedding on variable-length windows is essential for good diarization performance.

About the overlap: if you use exactly the same overlap for UIS-RNN training and evaluation, it won't be a problem, since the smoothness information will be learned into the UIS-RNN model.
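To make the window/overlap terms concrete, here is a minimal sketch of sliding-window embedding extraction. It is not code from this repository; `embed_fn` is a placeholder for whatever speaker embedding model you use. The point is that the same `window_ms` and `overlap_ratio` should be used when building the sequences you pass to UIS-RNN training and to prediction.

```python
import numpy as np

def sliding_window_embeddings(signal, embed_fn, sample_rate=16000,
                              window_ms=240, overlap_ratio=0.5):
    """Split a 1-D waveform into overlapping windows and embed each one.

    embed_fn: placeholder for your speaker embedding model (e.g. a d-vector
    network); it maps a waveform chunk to an embedding vector.
    """
    window = int(sample_rate * window_ms / 1000)
    hop = max(1, int(window * (1.0 - overlap_ratio)))  # larger overlap -> smaller hop
    embeddings = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window]
        embeddings.append(embed_fn(chunk))
    return np.stack(embeddings)
```

If you keep `window_ms` and `overlap_ratio` identical when producing the sequences for `UISRNN.fit()` and `UISRNN.predict()`, the smoothness the model learns during training matches what it sees at test time.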

ldeseynes commented 5 years ago

Hi @wq2012,

In section 4.1 of your paper you say you retrained a new d-vector model V3 using variable-length windows with sizes within [240ms, 1600ms]. Since the window size for the diarization system is 240ms, why didn't you choose the same size for the d-vector model? Is it because training was inefficient with fixed-length windows of such a small size?

Thanks in advance

wq2012 commented 5 years ago

@ldeseynes Hi, it's simply because we use the same model for different applications. For diarization, the window size is 240ms. For other applications like recognition, the window size can be as large as 1.6s.

You can of course only train the model with 240ms windows, but that limits the usefulness of the model.
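As a hypothetical illustration of the variable-length training described above (the helper name and data layout are assumptions, not code from the paper or this repository), one way to do it is to sample a random window length in [240ms, 1600ms] for each training example:

```python
import random

SAMPLE_RATE = 16000
MIN_WINDOW_MS, MAX_WINDOW_MS = 240, 1600

def sample_training_window(waveform):
    """Crop a random-length window from one training utterance."""
    window_ms = random.randint(MIN_WINDOW_MS, MAX_WINDOW_MS)
    window = int(SAMPLE_RATE * window_ms / 1000)
    if len(waveform) <= window:
        return waveform  # utterance shorter than the sampled window
    start = random.randint(0, len(waveform) - window)
    return waveform[start:start + window]
```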

hbredin commented 5 years ago

I believe @ldeseynes has a point here.

It is not clear whether the performance improvement comes from training on variable length windows or on shorter windows. The paper seems to suggest the former, while you seem to suggest the latter.

Would be nice to see whether a model trained on fixed-length 240ms windows would reach the same performance.

wq2012 commented 5 years ago

@hbredin I see. Thanks for pointing it out.

What we really want to suggest is training with a window length that is consistent with the inference logic, not necessarily variable-length. Indeed, we don't have experiments showing diarization performance when the speaker embedding model is trained only with 240ms windows. I will see if the team has some extra cycles to add such experiments.

taylorlu commented 5 years ago

What confuses me is whether I could generate speaker embeddings from the existing (fixed) model using variable-length windows, instead of retraining the speaker embedding module for more robustness, since the former is cheaper.

wq2012 commented 5 years ago

@taylorlu You can do both, since both are consistent with inference.

What we did was develop one model with variable-length windows and use it for both diarization (240ms windows) and other applications (larger windows). The reason is exactly what you said: it's cheaper.

The only thing you should avoid is using a window size for inference that was never used in training.
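A small illustrative guard for that rule, assuming the embedding model was trained with window lengths in a known range (the range below is an assumption taken from the [240ms, 1600ms] example earlier in this thread):

```python
TRAIN_WINDOW_RANGE_MS = (240, 1600)  # assumed training window range

def check_inference_window(window_ms):
    """Reject inference window sizes the embedding model has never seen."""
    lo, hi = TRAIN_WINDOW_RANGE_MS
    if not lo <= window_ms <= hi:
        raise ValueError(
            f"Inference window {window_ms}ms is outside the training range "
            f"[{lo}ms, {hi}ms].")

check_inference_window(240)   # OK: diarization window
check_inference_window(1600)  # OK: recognition-style window
```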

Sorry for the confusion. Let me know if it's clear to you now.