google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

[Question] Are input d-vectors for training assumed L2-normalized? #80

Closed vadimkantorov closed 3 years ago

vadimkantorov commented 3 years ago

Are input d-vectors for training assumed L2-normalized?

In Generalized End-to-End Loss for Speaker Verification they are defined as L2-normalized in eq. 4.

In sample toy_training_data.npz, they are also L2-normalized:

cd uis-rnn
python3 -c 'import numpy; train_data = numpy.load("./data/toy_training_data.npz", allow_pickle=True); print((train_data["train_sequence"] ** 2).sum(axis=1))'
# [1. 1. 1. ... 1. 1. 1.]

But eq. 11 from Fully Supervised Speaker Diarization models segment speaker embeddings as normally-distributed vectors and does not explicitly assume unit length (if it did, maybe a von Mises–Fisher distribution would be a better choice).

Thank you!

wq2012 commented 3 years ago

Yes, d-vectors are all assumed to have been L2 normalized.

And yes, you are correct: the von Mises–Fisher distribution might be a better choice for UIS-RNN here.

We just used the Normal distribution as an approximation due to its simplicity.
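Since training d-vectors are assumed to be L2-normalized, a minimal sketch of normalizing a batch of embeddings before passing them to the library (a generic NumPy helper, not part of the uis-rnn API):

```python
import numpy as np

def l2_normalize(embeddings, eps=1e-12):
    """Scale each row (one d-vector) to unit L2 norm.

    `eps` guards against division by zero for all-zero rows.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

# Example: 5 random 256-dim "d-vectors" (dimensions are illustrative).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 256))
normalized = l2_normalize(vectors)
print((normalized ** 2).sum(axis=1))  # ~[1. 1. 1. 1. 1.]
```

This reproduces the unit squared-norm check shown above for the toy training data.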

vadimkantorov commented 3 years ago

thanks for replying so fast! :)

vadimkantorov commented 3 years ago

I was struggling to overfit the segmentation of a single utterance without prior L2 normalization. I will let you know whether overfitting now works as expected.

vadimkantorov commented 3 years ago

With L2-normalized speaker embeddings, given a single sequence to overfit (with only two speakers plus a silence "speaker"), uis-rnn improves from the initial accuracy (~35%) to ~55%, but does not reach higher accuracies. I'm using the default transition_bias estimation and the default sigma2 initialization and adjustment.

vadimkantorov commented 3 years ago

One more question: what are the semantics of "segments"? Are they "non-overlapping segments with a max length of 400ms"?

wq2012 commented 3 years ago

One more question: what are the semantics of "segments"? Are they "non-overlapping segments with a max length of 400ms"?

Yes. Also, it really doesn't have to be 400ms. The length 400ms is what we found works well on our dev / eval datasets.
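A sketch of that segmentation step, assuming frame-level embeddings at a hypothetical 10 ms hop (so 400 ms corresponds to 40 frames; the helper and numbers are illustrative, not the library's API):

```python
import numpy as np

def segment_frames(frames, frames_per_segment):
    """Split a [num_frames, dim] array into consecutive, non-overlapping
    chunks of at most `frames_per_segment` frames; the last chunk may be
    shorter if num_frames is not a multiple of the segment length."""
    return [frames[i:i + frames_per_segment]
            for i in range(0, len(frames), frames_per_segment)]

# 105 frames at a 10 ms hop, 40 frames per 400 ms segment (assumed values).
frames = np.zeros((105, 256))
segments = segment_frames(frames, 40)
print([len(s) for s in segments])  # [40, 40, 25]
```

Each chunk would then be averaged and L2-normalized into one segment-level d-vector.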

vadimkantorov commented 3 years ago

One more question about speaker embeddings. Are their values non-negative (i.e. is a ReLU applied prior to averaging / L2 normalization)?

I'm reimplementing a spectral clustering baseline from https://arxiv.org/abs/1710.10468 and https://github.com/wq2012/SpectralCluster, and depending on whether the values can be negative, the diffusion step may or may not be interpreted as a one-step random-walk posterior probability.

wq2012 commented 3 years ago

They can be negative. We don't have ReLU after the last 256-dim linear layer.
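The practical consequence for the spectral clustering baseline: with unit-norm d-vectors the cosine affinity is a plain dot product, and since embedding values can be negative, the affinity matrix can contain negative entries and is therefore not directly a row-stochastic random-walk matrix. A small check (random embeddings stand in for real d-vectors):

```python
import numpy as np

# Random stand-ins for 256-dim d-vectors, L2-normalized row-wise.
rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 256))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Cosine affinity reduces to a dot product for unit-norm rows.
affinity = emb @ emb.T
print(np.round(np.diag(affinity), 6))  # [1. 1. 1. 1.]
print(affinity.min())  # off-diagonal entries can be negative
```

This is why affinity refinements such as thresholding or shifting are typically applied before interpreting the matrix as transition probabilities.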