google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Embedding Extraction Procedure #71

Closed: divyeshrajpura4114 closed this issue 4 years ago

divyeshrajpura4114 commented 4 years ago

The flowchart of our diarization system is provided in Fig. 1. In this system, audio signals are first transformed into frames of width 25ms and step 10ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame as the network input. We build sliding windows of a fixed length on these frames, and run the LSTM network on each window. The last-frame output of the LSTM is then used as the d-vector representation of this sliding window.
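For reference, the framing and windowing arithmetic in the quoted paragraph can be sketched roughly as below. This is only an illustration, not code from the uis-rnn repository: the 16 kHz sample rate and the window length and step used in the usage note (24 and 12 frames) are assumptions, since the quoted text only says the windows have "a fixed length". The helper names `frame_count` and `sliding_windows` are hypothetical.

```python
def frame_count(num_samples, sample_rate=16000, width_ms=25, step_ms=10):
    """Number of full frames of width_ms taken every step_ms.

    sample_rate=16000 is an assumption; the quoted text does not state it.
    """
    width = sample_rate * width_ms // 1000  # samples per frame
    step = sample_rate * step_ms // 1000    # samples between frame starts
    if num_samples < width:
        return 0
    return 1 + (num_samples - width) // step


def sliding_windows(n_frames, window_len, window_step):
    """Return (start, end) frame-index pairs, end exclusive, for each full window.

    Each window would be fed to the LSTM, and the LSTM output at the
    window's last frame taken as that window's d-vector.
    """
    return [(start, start + window_len)
            for start in range(0, n_frames - window_len + 1, window_step)]
```

With these assumptions, `frame_count(16000)` gives the number of frames in one second of audio, and `sliding_windows(n_frames, 24, 12)` lists the frame ranges of half-overlapping windows, each of which yields one window-level d-vector.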

We use a Voice Activity Detector (VAD) to determine speech segments from the audio, which are further divided into smaller nonoverlapping segments using a maximal segment-length limit (e.g. 400ms in our experiments), which determines the temporal resolution of the diarization results. For each segment, the corresponding d-vectors are first L2 normalized, then averaged to form an embedding of the segment.
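A minimal sketch of the segment-level step described in the quoted paragraph: split each VAD speech segment into non-overlapping pieces of at most 400 ms, then L2-normalize the d-vectors that fall in a piece and average them. The function names are hypothetical, and how window-level d-vectors are assigned to a segment is not specified in the quoted text, so the caller is assumed to pass in the d-vectors belonging to one segment.

```python
import math


def split_segment(start_ms, end_ms, max_len_ms=400):
    """Split one VAD speech segment into non-overlapping pieces of at most max_len_ms."""
    pieces = []
    t = start_ms
    while t < end_ms:
        pieces.append((t, min(t + max_len_ms, end_ms)))
        t += max_len_ms
    return pieces


def l2_normalize(vec):
    """Scale vec to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def segment_embedding(dvectors):
    """L2-normalize each d-vector first, then average them element-wise.

    dvectors: the window-level d-vectors assigned to one segment
    (the assignment rule itself is an assumption left to the caller).
    """
    if not dvectors:
        raise ValueError("a segment needs at least one d-vector")
    normed = [l2_normalize(v) for v in dvectors]
    dim = len(normed[0])
    return [sum(v[i] for v in normed) / len(normed) for i in range(dim)]
```

For example, `split_segment(0, 1000)` yields three pieces (two of 400 ms and one of 200 ms), and `segment_embedding` applied to the d-vectors of each piece gives one embedding per piece. Note that the averaged vector is not itself unit-norm, since the normalization happens before the averaging, as in the quoted text.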

I am not able to understand the procedure for extracting embeddings. Are the embeddings used during training window-level, segment-level, or utterance-level embeddings? Does the second paragraph above correspond to the procedure used during testing?

If you could give some detailed explanation on extracting embeddings, it would be very helpful.

Thank You.


My background

Have I read the README.md file?

Have I searched for similar questions from closed issues?

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

wq2012 commented 4 years ago

The embeddings used for training uis-rnn are segment-level embeddings.

The embedding network itself, however, is trained on variable-length windows; see "Generalized End-to-End Loss for Speaker Verification".