Embedding Extraction Procedure

The flowchart of our diarization system is provided in Fig. 1. In this system, audio signals are first transformed into frames of width 25ms and step 10ms, and log-mel-filterbank energies of dimension 40 are extracted from each frame as the network input. We build sliding windows of a fixed length on these frames, and run the LSTM network on each window. The last-frame output of the LSTM is then used as the d-vector representation of this sliding window.
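A minimal sketch of the framing and windowing described in the paragraph above, to make the question concrete. The 25ms width and 10ms step come from the quoted text; the window length and window step (24 and 12 frames) are placeholder values chosen only for illustration, since the quote says only "a fixed length". The function names are hypothetical and are not taken from the paper or from this repository.

```python
def frame_starts_ms(duration_ms, width_ms=25, step_ms=10):
    """Start times (in ms) of every full frame that fits in the signal."""
    if duration_ms < width_ms:
        return []
    return list(range(0, duration_ms - width_ms + 1, step_ms))


def sliding_windows(num_frames, window_len, window_step):
    """(start, end) frame-index pairs, end exclusive, one per full window.

    window_len and window_step are assumptions; the quoted text does not
    give them. Each window would be fed to the LSTM, and the output at
    its last frame would be taken as that window's d-vector.
    """
    if num_frames < window_len:
        return []
    last_start = num_frames - window_len
    return [(s, s + window_len) for s in range(0, last_start + 1, window_step)]


# One second of audio, framed at 25ms width / 10ms step.
frames = frame_starts_ms(1000)
# Placeholder window settings: 24 frames long, shifted by 12 frames.
windows = sliding_windows(len(frames), window_len=24, window_step=12)
print(len(frames), len(windows))
```

Under this reading, each window yields exactly one d-vector, so the number of d-vectors per utterance depends on the window length and step.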

We use a Voice Activity Detector (VAD) to determine speech segments from the audio, which are further divided into smaller nonoverlapping segments using a maximal segment-length limit (e.g. 400ms in our experiments), which determines the temporal resolution of the diarization results. For each segment, the corresponding d-vectors are first L2 normalized, then averaged to form an embedding of the segment.
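The segment-level step in the paragraph above can be sketched as follows. This is only one reading of the quoted text: the 400ms limit comes from the quote, while the function names are hypothetical, and how d-vectors are assigned to a segment is left out because the quote does not specify it.

```python
import math


def split_segment(start_ms, end_ms, max_len_ms=400):
    """Cut one VAD speech segment into nonoverlapping pieces of at most max_len_ms."""
    return [(s, min(s + max_len_ms, end_ms))
            for s in range(start_ms, end_ms, max_len_ms)]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def segment_embedding(d_vectors):
    """L2-normalize each d-vector first, then average them dimension by dimension."""
    if not d_vectors:
        raise ValueError("a segment needs at least one d-vector")
    normalized = [l2_normalize(v) for v in d_vectors]
    dim = len(normalized[0])
    count = len(normalized)
    return [sum(v[i] for v in normalized) / count for i in range(dim)]


# A 1000ms speech segment becomes three pieces: 400ms, 400ms, 200ms.
print(split_segment(0, 1000))
# Two toy 2-dimensional d-vectors that fall inside one segment.
print(segment_embedding([[3.0, 4.0], [0.0, 2.0]]))
```

Note that the averaged vector is not re-normalized here, because the quoted paragraph mentions normalization only before averaging.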

I do not understand the procedure for extracting embeddings. Are the embeddings used during training window-level, segment-level, or utterance-level? Does the second paragraph above correspond to the procedure used during testing?

A more detailed explanation of how the embeddings are extracted would be very helpful.

Thank you.

Describe the question

A clear and concise description of what the question is.

My background

Have I read the README.md file?

yes/no - if you answered no, please stop filing the issue, and read it first

Have I searched for similar questions from closed issues?

yes/no - if you answered no, please do it first

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

yes/no

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

yes/no

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

yes/no

google / uis-rnn

Embedding Extraction Procedure #71
