google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

[Question] How to prepare embedding data for training UIS-RNN? #37

Closed OpenCVnoob closed 5 years ago

OpenCVnoob commented 5 years ago

Describe the question

Hi, thank you for open-sourcing this! I have read the README.md file and almost all the issues in this repo, but I'm still puzzled about data pre-processing.

My understanding is that before training the UIS-RNN, a speaker embedding network should first be trained on single-speaker utterance-level features, as described in the GE2E loss paper. After that, frame-level features generated from the raw audio are fed into the embedding network to produce frame-level embeddings, which I can then use to train my UIS-RNN. Am I right about that? I'm wondering whether these frame-level embeddings are the 'continuous d-vector embeddings (as sequences)' you mentioned here.

I am a newcomer to speaker diarization and this question really confuses me, so I'd be very grateful if you could help me. Thanks :)

My background

Have I read the README.md file?

Have I searched for similar questions from closed issues?

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

wq2012 commented 5 years ago

Hi,

Short answer

You should NOT use frame-level embeddings. You should use segment-level embeddings, and the corresponding segment-level speaker labels.
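
For concreteness, here is a minimal sketch of the expected input format, following the `fit()` usage in this repo's README; the embedding values, speaker labels, and dimension below are hypothetical placeholders:

```python
import numpy as np
import uisrnn

# Hypothetical data: 4 single-speaker segments from one utterance, each
# already aggregated into a single 256-dim segment-level embedding.
train_sequence = np.random.rand(4, 256)  # shape: (num_segments, observation_dim)
train_cluster_id = ['spk_A', 'spk_A', 'spk_B', 'spk_A']  # one speaker label per segment

model_args, training_args, _ = uisrnn.parse_arguments()
model_args.observation_dim = 256  # must match the embedding dimension

model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
```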

Why?

  1. Frame-level embeddings are too many, making the sequence too long, thus:
     a. Too expensive to train.
     b. Too much information for the GRU to memorize.
  2. GE2E training is window-based: only the last-frame output of the speaker encoder is used during training, so at inference time you should also only use window-level embeddings. We use aggregated segment embeddings instead of window embeddings for UIS-RNN mostly for speed; technically, you could also use window embeddings directly (see the sketch below).
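
To make point 2 concrete, here is a minimal sketch of how window-level d-vectors might be aggregated into one segment-level embedding. The `encoder` callable, window size, and hop size are all assumptions for illustration, not this repo's code:

```python
import numpy as np

def segment_embedding(frame_features, encoder, win=160, hop=80):
    """Average window-level d-vectors over one single-speaker segment.

    `encoder` is a hypothetical callable mapping a (win, feature_dim)
    window of frame-level features to one d-vector, i.e. the last-frame
    output of a GE2E-trained speaker encoder. The window and hop sizes
    (1.6 s / 0.8 s at a 10 ms frame step) are illustrative only.
    """
    dvectors = []
    for start in range(0, len(frame_features) - win + 1, hop):
        dvectors.append(encoder(frame_features[start:start + win]))
    if not dvectors:  # segment shorter than one window: encode it whole
        dvectors.append(encoder(frame_features))
    segment = np.mean(dvectors, axis=0)
    return segment / np.linalg.norm(segment)  # L2-normalize the average
```

Stacking one such embedding per segment (with its speaker label) yields the training inputs shown in the short answer above.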
OpenCVnoob commented 5 years ago

Thanks for your reply! I got it.

Aurora11111 commented 5 years ago

@OpenCVnoob Hello, I've run into the same problem as you. Have you solved it? Can you tell me how to deal with this issue?

OpenCVnoob commented 5 years ago

Oh, sorry, I didn't notice this until now. I am still trying to find a good way to segment audio into single-speaker segments, and besides, there is no suitable dataset available to me, so I'm not sure when I will solve this issue.

Aurora11111 commented 5 years ago

@OpenCVnoob I ran the project with my own datasets, and the printed results are bad.