google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

[Question] About UIS-RNN d-vector #67

Closed · BarCodeReader closed this issue 4 years ago

BarCodeReader commented 4 years ago

Describe the question

Hi, I have been working on this issue for almost a month. I finally managed to get a good EER when training the LSTM, and I am now training the UIS-RNN.

I have a question about the d-vector.

So the specification for your training is:

- sampling rate: 16 kHz
- mel-transform: 25ms window, 10ms hop length
- LSTM training: 140-180 frames, let's say we fix it to 160 frames
- UIS-RNN training: 400ms segment-level d-vectors

So 25ms of audio becomes 1 frame of the mel-spectrum, and 160 frames is roughly 1.6s (with the hop), which means that for training the LSTM we are actually feeding 1.6s of audio into the LSTM.

However, for UIS-RNN, you mention we need to use a VAD to split the audio into segments of at most 400ms.

So a 400ms segment, after the [25ms window, 10ms hop length] mel-transform, only gives you around 40 frames. How can we generate multiple d-vectors from these 40 frames in order to get a segment-level d-vector? (You mention we need to L2-normalize them and average them.)
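For reference, a quick back-of-the-envelope check of the frame counts above (the 25ms/10ms values are just the ones quoted in this thread):

```python
# Frame arithmetic for a 25 ms window with a 10 ms hop (values from this thread).
WINDOW_MS = 25
HOP_MS = 10

def num_frames(duration_ms):
    """How many mel frames fit into an audio chunk of `duration_ms`."""
    return 0 if duration_ms < WINDOW_MS else 1 + (duration_ms - WINDOW_MS) // HOP_MS

def span_ms(frames):
    """Approximate audio span covered by `frames` consecutive mel frames."""
    return WINDOW_MS + (frames - 1) * HOP_MS

print(span_ms(160))     # 1615 -> roughly 1.6 s per 160-frame LSTM training window
print(num_frames(400))  # 38   -> "around 40 frames" in a 400 ms segment
```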

I would really appreciate some guidance on this point.

My background

Have I read the README.md file?

Have I searched for similar questions from closed issues?

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

divyeshrajpura4114 commented 4 years ago

@BarCodeReader Have you resolved this confusion? I am also confused about this.

The flowchart of our diarization system is provided in Fig. 1. In this system, audio signals are first transformed into frames of width 25ms and step 10ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame as the network input. We build sliding windows of a fixed length on these frames, and run the LSTM network on each window. The last-frame output of the LSTM is then used as the d-vector representation of this sliding window.

We use a Voice Activity Detector (VAD) to determine speech segments from the audio, which are further divided into smaller nonoverlapping segments using a maximal segment-length limit (e.g. 400ms in our experiments), which determines the temporal resolution of the diarization results. For each segment, the corresponding d-vectors are first L2 normalized, then averaged to form an embedding of the segment.
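As far as I can tell, the aggregation in the second paragraph looks roughly like the sketch below. This is my own reading of the quoted text, not code from this repo: the window-level d-vectors come from sliding windows run over all speech frames, and each segment averages the d-vectors of the windows assigned to it.

```python
import numpy as np

def segment_dvector(window_dvectors):
    """Turn the window-level d-vectors belonging to one <=400ms segment into a
    single segment-level embedding: L2-normalize each, then average.

    window_dvectors: (num_windows, dvector_dim) array of last-frame LSTM outputs
    for the sliding windows assigned to this segment.
    """
    normed = window_dvectors / np.linalg.norm(window_dvectors, axis=1, keepdims=True)
    return normed.mean(axis=0)
```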

What I understand is that the first quoted paragraph refers to training the network for d-vectors, while the second one describes the procedure for inference, i.e. testing. During training, we give roughly 1.6s of audio (160 frames) as input to the network, i.e. we train with window-level embeddings. However, during testing, we work with 400ms segments of audio (about 40 frames), i.e. with segment-level embeddings.

Am I correct?

BarCodeReader commented 4 years ago

@divyeshrajpura4114 Yes, you are right. On my side, I gave up on this method because I found some other approaches. Training on such a huge dataset is really time consuming, and Google has their own dataset to boost the accuracy. I just feel it is almost impossible for me to reach the same level, so I changed direction.

But yes, your understanding is correct. Good luck.

divyeshrajpura4114 commented 4 years ago

Thank you for your reply and for clearing up the doubt. It could be that we cannot achieve what they have achieved with a huge amount of data, because of limited resources. But maybe we can still do well with a smaller amount of data. Let's hope for the best.

008karan commented 4 years ago

@divyeshrajpura4114 Have you gotten good results with this implementation? I am starting with UIS-RNN. Is only timestamped audio required for training, or can we use unlabeled data to generate d-vectors, as in the case of SincNet? Any other suggestions that can help with the implementation are welcome... Thanks!

divyeshrajpura4114 commented 4 years ago

@008karan I am trying this, but other work came up in between, so right now my implementation is on hold. In the UIS-RNN paper, they use two architectures:

  1. A d-vector extractor, from Generalized End-to-End Loss for Speaker Verification.
  2. A model that generates speaker-homogeneous clusters (the actual UIS-RNN model).

Complete process

  1. Implement a model for extracting window-level d-vectors, which finds the embeddings with the same process as in Generalized End-to-End Loss for Speaker Verification (in that paper they extract utterance-level embeddings, whereas in UIS-RNN we need window-level embeddings, because now we are also interested in finding where the speaker changes).

  2. Now you have to implement UIS-RNN, which takes the d-vectors generated by the above architecture to automate the clustering process. This UIS-RNN architecture requires labels (in terms of speaker changes). A rough sketch of how the two stages fit together is shown below.
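Roughly, the two stages connect like this. The d-vector extraction in stage 1 is a placeholder you have to provide yourself; only the uisrnn calls follow this repo's README, and the .npy file names are made up for illustration:

```python
import numpy as np
import uisrnn

# Stage 1 (not part of this repo): your GE2E-trained extractor produces one
# segment-level d-vector per <=400ms segment, plus a speaker label per segment.
# The .npy files below are hypothetical placeholders for that output.
train_sequence = np.load('train_sequence.npy')               # (num_segments, dvector_dim)
train_cluster_id = np.load('train_cluster_id.npy').tolist()  # one speaker label per segment

# Stage 2: the UIS-RNN clustering model from this repo.
model_args, training_args, inference_args = uisrnn.parse_arguments()
model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)

test_sequence = np.load('test_sequence.npy')
predicted_cluster_ids = model.predict(test_sequence, inference_args)
print(predicted_cluster_ids)
```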

Is only timestamped audio required for training, or can we use unlabeled data to generate d-vectors, as in the case of SincNet?

I am not very familiar with SincNet, but no, you do not need labels (in terms of speaker changes) to generate d-vectors. You just need a dataset with some minimum number of utterances for each speaker, such as TIMIT, VoxCeleb, etc.
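To illustrate why plain speaker-identity labels are enough: a GE2E-style training batch just groups utterances by speaker, something like the toy sketch below (the N and M values are my own examples, not prescribed anywhere in this thread):

```python
import random

def sample_ge2e_batch(utterances_by_speaker, n_speakers=8, m_utterances=5):
    """Toy GE2E-style batch sampler: pick N speakers and M utterances per speaker.
    Only speaker identity is needed here; no diarization timestamps or
    speaker-change labels are required at this stage. Each speaker must have
    at least `m_utterances` utterances in the dataset.
    """
    speakers = random.sample(list(utterances_by_speaker), n_speakers)
    return {spk: random.sample(utterances_by_speaker[spk], m_utterances)
            for spk in speakers}
```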

008karan commented 4 years ago

Thanks for the brief explanation. So is the code for both d-vector generation and UIS-RNN in this repo, or are they separate?

divyeshrajpura4114 commented 4 years ago

The original paper for the d-vector is Generalized End-to-End Loss for Speaker Verification. They haven't released code for the d-vector extractor, but you can follow a third-party implementation, which is linked above.