google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Understanding diarization labels #51

Closed zyc1310517843 closed 5 years ago

zyc1310517843 commented 5 years ago

Describe the question


My background

- Have I read the README.md file?
- Have I searched for similar questions from closed issues?
- Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
- Have I tried to find the answers in the reference Speaker Diarization with LSTM?
- Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

Hello, we used third-party tools to generate train_sequence and train_cluster_id, and completed the training. We trained on 46 speakers and tested on one of them; the model's prediction accuracy was 98%. However, we can't understand the relationship between the real labels and the predicted labels. Although the accuracy is high, we cannot tell who the speaker is. We also don't understand the labels in the demo you provided. Thank you for your guidance.

wq2012 commented 5 years ago

I really don't understand your questions. Please clarify.

> We can't understand the relationship between real tags and predictive tags.

Which part don't you understand?

> it makes it impossible to find out who the speaker is.

What do you mean?

zyc1310517843 commented 5 years ago

For example, I trained the model on 46 speakers, where train_cluster_id is [0, 0, 0, ..., 45, 45, 45]. Then I used the forty-sixth speaker for prediction, where test_cluster_id is [0, 0, 0, 0, 0, ...]. The predicted result is [0, 0, 0, 0, 0, ...]. My question is: shouldn't the predicted labels be [45, 45, 45, ...]? I hope you can understand what I said.

wq2012 commented 5 years ago

In diarization, the labels are not absolute labels, but relative labels. They are identity-agnostic.

Labels are meaningless across utterances.

For example, if the labels in an utterance are [0, 0, 1], it means the first two segments are from one speaker, while the last segment is from a different speaker. The labels do NOT refer to any specific speakers.

If another utterance has labels [0, 1, 1], the two speakers in this utterance have no connection with the speakers in the previous utterance.
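A minimal sketch (not part of the uis-rnn API) of what "relative labels" means: two label sequences describe the same diarization result if one is a relabeling of the other. Normalizing labels to first-appearance order makes this easy to check.

```python
def canonical(labels):
    """Relabel speakers in order of first appearance, so that any two
    equivalent (relabeled) sequences map to the same canonical form."""
    mapping = {}
    out = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        out.append(mapping[label])
    return out

# [0, 0, 1] and [45, 45, 3] describe the same segmentation:
# two segments from one speaker, then one segment from another.
print(canonical([0, 0, 1]) == canonical([45, 45, 3]))  # True
```

So a prediction of [0, 0, 0, ...] for a single-speaker test utterance is exactly correct; it is equivalent to [45, 45, 45, ...] under relabeling.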

zyc1310517843 commented 5 years ago

I understand exactly what you said. Can I get the absolute labels? I want to know who the speaker is. Thank you.


wq2012 commented 5 years ago

If you want absolute labels, you are looking at the wrong technique and the wrong repo. That is not the problem diarization is trying to solve. What you want is speaker recognition, which is much easier than diarization: you can simply compute cosine similarity between embeddings.
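To illustrate the suggestion above, here is a hedged sketch of speaker recognition via cosine similarity. It assumes you already have speaker embeddings (e.g. d-vectors from a speaker-verification model); the function names and the toy 3-dimensional embeddings below are purely illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_embedding, enrolled):
    """Return the enrolled speaker whose embedding is most similar
    to the test embedding. `enrolled` maps name -> embedding."""
    return max(enrolled,
               key=lambda name: cosine_similarity(test_embedding, enrolled[name]))

# Toy example with made-up embeddings; real d-vectors are much higher-dimensional.
enrolled = {
    "speaker_a": np.array([1.0, 0.0, 0.0]),
    "speaker_b": np.array([0.0, 1.0, 0.0]),
}
print(identify(np.array([0.9, 0.1, 0.0]), enrolled))  # speaker_a
```

In practice you would enroll each of your 46 speakers with an averaged embedding, then match a test utterance's embedding against the enrolled set.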