google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Confusion about predicted labels #68

Closed clabornd closed 5 years ago

clabornd commented 5 years ago

My background

Have I read the README.md file?

Have I searched for similar questions from closed issues?

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

Describe the question

I just wanted to confirm that the output of model.predict(test_sequences, inference_args) should be a completely arbitrary list of labels. My worry is that I expected the labels to look like [0, 0, 1, 1, 2, 3, ..., 7, 1, 2, 10]: specifically, that the starting speaker is always the arbitrary speaker '0', and that we then encounter arbitrary speakers '1', '2', etc. However, the output always looks something like [2, 7, 2, 2, 2, 2, 5, 7, 4, 4]. Is it okay for me to interpret this as transitions between the 4 arbitrary speakers '2', '4', '5', and '7', and ignore that '0', '1', '3', and '6' are missing?

I tried crawling through the beam search implementation, but it's a bit too dense.

wq2012 commented 5 years ago

@clabornd What's your --test_iteration argument?

If it is larger than 1, labels 0, 1, 3, and 6 may have appeared only in your first iteration.

clabornd commented 5 years ago

Ah yes, thank you, I had --test_iteration=2. Setting it to 1 produces the 'expected' output. So multiple iterations just produce a more stable result? And is it appropriate to interpret the output of [2, 7, 2, 2, 2, 2, 5, 7, 4, 4] as 'this utterance contained 4 speakers that spoke in order X'? Thanks for the quick response.

wq2012 commented 5 years ago

@clabornd

> is it appropriate to interpret the output of [2, 7, 2, 2, 2, 2, 5, 7, 4, 4] as 'this utterance contained 4 speakers that spoke in order X'?

Yes, it is correct.
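Since the label values themselves carry no meaning, they can be renumbered in order of first appearance if output starting at 0 is easier to read. The following is a minimal sketch, not part of uis-rnn: `relabel_by_first_appearance` is a hypothetical helper name, and it assumes the predicted labels are a plain Python list of integers.

```python
def relabel_by_first_appearance(labels):
    """Map arbitrary cluster IDs to 0, 1, 2, ... in order of first appearance.

    Hypothetical helper for illustration; not part of the uis-rnn API.
    """
    mapping = {}  # original label -> new label
    relabeled = []
    for label in labels:
        if label not in mapping:
            # The next unused integer becomes the new ID for this speaker.
            mapping[label] = len(mapping)
        relabeled.append(mapping[label])
    return relabeled


predicted = [2, 7, 2, 2, 2, 2, 5, 7, 4, 4]
print(relabel_by_first_appearance(predicted))  # [0, 1, 0, 0, 0, 0, 2, 1, 3, 3]
print(len(set(predicted)))  # 4 distinct speakers
```

The renumbering changes only the names of the speakers; the number of distinct speakers and the order of the transitions between them stay the same.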