google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Handling unknown speakers? #65

Closed · chrisspen closed this 4 years ago

chrisspen commented 4 years ago

This is a follow-up to my earlier question.

Your answer was a little confusing, and you closed the question before I could ask for clarification. The PyTorch embedding tool doesn't look like it can handle an arbitrary number of speakers, so I don't see how uis-rnn can either. The tool's config.yaml even has a setting to specify the number of speakers it's trained on, which defaults to 4. How can uis-rnn handle more than 4 speakers if the embedding tool can't?

By UBM, I mean: can uis-rnn classify something as "ubm" instead of a specific speaker, to indicate it has likely never seen the speaker before in any of its training data? For example, say I have trained uis-rnn on data for Bob, Joe, and Sue, but then I test it on data for Larry, who sounds nothing like the other three speakers. Will uis-rnn try to fit the label for Bob, Joe, or Sue onto the Larry data, or will it return an "unknown" label to indicate it has never seen this pattern before?

For some tools and algorithms, this can be accomplished by simply passing all training data in again with a "ubm" label, so that a new speaker matches against that label instead of one of the known speakers. However, that doesn't work with some algorithms.
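To make the idea concrete, here is a minimal sketch of that catch-all approach with made-up data; the speaker names, embedding dimensions, and choice of classifier are all assumptions for illustration, not anything from uis-rnn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder d-vector embeddings; in practice these would come from
# a real speaker encoder, not a random generator.
known = {"bob": rng.normal(size=(50, 256)),
         "joe": rng.normal(size=(50, 256)),
         "sue": rng.normal(size=(50, 256))}
# Embeddings from many *other* speakers, all collapsed into a single
# catch-all "ubm" class.
ubm_pool = rng.normal(size=(200, 256))

X = np.vstack(list(known.values()) + [ubm_pool])
y = ["bob"] * 50 + ["joe"] * 50 + ["sue"] * 50 + ["ubm"] * 200

clf = LogisticRegression(max_iter=1000).fit(X, y)

# An unseen speaker like Larry should then tend to land in the broad
# "ubm" class rather than being forced onto Bob, Joe, or Sue.
larry = rng.normal(size=(1, 256))
print(clf.predict(larry))
```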

wq2012 commented 4 years ago

I thought my answers were clear, so I closed it. Of course, you can re-open it if you need more clarification.

The PyTorch embedding tool doesn't look like it can handle an arbitrary number of speakers, so I don't see how uis-rnn can either. The tool's config.yaml even has a setting to specify the number of speakers it's trained on, which defaults to 4. How can uis-rnn handle more than 4 speakers if the embedding tool can't?

We don't own the PyTorch embedding tool, and we are not responsible for its correctness. Our speaker embedding technique is very clearly explained in the paper "Generalized End-to-End Loss for Speaker Verification": the text-independent speaker encoder takes audio as input and outputs an embedding. The audio does not have to be from a speaker in the training set.
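For intuition, here is a minimal sketch of what such an encoder looks like; the architecture, feature type, and dimensions below are assumptions in the spirit of the GE2E paper, not the actual implementation:

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Hypothetical text-independent speaker encoder: a stack of LSTMs
    maps log-mel frames to one L2-normalized embedding (a d-vector)."""
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, frames):  # frames: (batch, time, n_mels)
        _, (h, _) = self.lstm(frames)
        emb = self.proj(h[-1])  # final hidden state of the last layer
        return emb / emb.norm(dim=1, keepdim=True)

encoder = SpeakerEncoder()
# Any audio works, from seen or unseen speakers: the output is just a
# point in embedding space, not a class from a fixed speaker inventory.
frames = torch.randn(1, 120, 40)  # placeholder log-mel frames
d_vector = encoder(frames)        # shape (1, 256)
```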

By UBM, I mean: can uis-rnn classify something as "ubm" instead of a specific speaker, to indicate it has likely never seen the speaker before in any of its training data? For example, say I have trained uis-rnn on data for Bob, Joe, and Sue, but then I test it on data for Larry, who sounds nothing like the other three speakers. Will uis-rnn try to fit the label for Bob, Joe, or Sue onto the Larry data, or will it return an "unknown" label to indicate it has never seen this pattern before?

Think about using k-means to cluster the speaker embeddings: the prediction API of UIS-RNN is similar to that of a k-means algorithm. Every utterance is totally independent of the others, and the speaker labels are only meaningful within an utterance. "Speaker A" only means it is different from "Speaker B" in that one utterance; it has nothing to do with "Speaker A" in another utterance.
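Concretely, following the usage shown in this repo's README (uisrnn.parse_arguments, UISRNN.fit, UISRNN.predict), each predict call is independent and there is no enrolled speaker inventory; the random arrays below are placeholders for real d-vector sequences, so the model won't learn anything meaningful from them:

```python
import numpy as np
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
training_args.train_iteration = 100  # keep this toy run short
model = uisrnn.UISRNN(model_args)

# Placeholder training data: one observation sequence of d-vectors plus
# its per-frame speaker labels (real data comes from diarized audio).
train_sequence = np.random.randn(1000, model_args.observation_dim)
train_cluster_id = np.array(['A'] * 500 + ['B'] * 500)
model.fit(train_sequence, train_cluster_id, training_args)

# Predict on two separate utterances. Each call starts from scratch:
labels_1 = model.predict(np.random.randn(100, model_args.observation_dim), inference_args)
labels_2 = model.predict(np.random.randn(100, model_args.observation_dim), inference_args)
# Label 0 in labels_1 and label 0 in labels_2 are unrelated, exactly
# like cluster IDs from two independent k-means runs.
```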

If you are still very confused, I highly suggest you read our previous papers and watch this video first: https://www.youtube.com/watch?v=pjxGPZQeeO4