Yes, d-vectors are all assumed to have been L2-normalized.
And yes, you are correct: a von Mises–Fisher distribution might be a better fit here for UIS-RNN.
We just used the Normal distribution as an approximation due to its simplicity.
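In practice this just means rescaling each d-vector to unit length before feeding it to the model; a minimal sketch in plain NumPy (variable names are illustrative, not part of the uis-rnn API):

```python
import numpy as np

def l2_normalize(d_vectors, eps=1e-12):
    """L2-normalize each row (one d-vector per row) to unit length."""
    norms = np.linalg.norm(d_vectors, axis=1, keepdims=True)
    return d_vectors / np.maximum(norms, eps)

# d_vectors: (num_observations, 256) raw embeddings from the speaker encoder.
d_vectors = np.random.randn(10, 256)
normalized = l2_normalize(d_vectors)
assert np.allclose(np.linalg.norm(normalized, axis=1), 1.0)
```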
Thanks for replying so fast! :)
I was struggling to overfit a single-utterance segmentation without prior L2-normalization. I will let you know whether overfitting now works as expected.
With L2-normalized speaker embeddings, given a single sequence to overfit (with only two speakers plus a silence speaker), uis-rnn improves from the initial accuracy (~35%) to ~55%, but does not reach higher accuracies. I'm using the default `transition_bias` estimation and `sigma2` initialization and adjustment.
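For reference, this is roughly my setup, with the two parameters pinned explicitly instead of left at their defaults (a sketch based on my reading of the uisrnn README and demo; exact argument names and default behavior may differ across versions, and the values below are illustrative):

```python
import numpy as np
import uisrnn

# parse_arguments() returns (model_args, training_args, inference_args).
# By default, transition_bias is estimated from the training data, and
# sigma2 is initialized by the library and then adjusted during training.
model_args, training_args, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256

# Pin both parameters explicitly to take the defaults out of the equation
# (the values here are illustrative, not recommendations):
model_args.transition_bias = 0.5
model_args.sigma2 = 0.1

# train_sequence: (num_observations, 256) L2-normalized d-vectors of one
# utterance; train_cluster_id: one speaker label string per observation.
train_sequence = np.random.randn(100, 256)
train_sequence /= np.linalg.norm(train_sequence, axis=1, keepdims=True)
train_cluster_id = ['0_spk%d' % (i // 50) for i in range(100)]

model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
```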
One more question: what exactly is meant by "segments"? Are they the "nonoverlapping segments with max length of 400ms"?
Yes. Also, it really doesn't have to be 400ms; that length is simply what we found works well on our dev / eval datasets.
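To make that concrete, the segmentation amounts to something like this (a sketch; the 100ms window hop and the helper name are illustrative):

```python
import numpy as np

def segment_embeddings(window_dvectors, window_hop_ms=100, max_segment_ms=400):
    """Group sliding-window d-vectors into non-overlapping segments of at
    most max_segment_ms, averaging and re-normalizing within each segment."""
    windows_per_segment = max_segment_ms // window_hop_ms
    segments = []
    for start in range(0, len(window_dvectors), windows_per_segment):
        seg = window_dvectors[start:start + windows_per_segment].mean(axis=0)
        segments.append(seg / np.linalg.norm(seg))
    return np.stack(segments)  # (num_segments, dim), one row per <=400ms
```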
One more question about speaker embeddings: are their values non-negative (i.e. is a ReLU applied prior to averaging / L2-normalization)?
I'm reimplementing the spectral clustering baseline from https://arxiv.org/abs/1710.10468 and https://github.com/wq2012/SpectralCluster, and depending on whether the values can be negative, the diffusion step may or may not be interpreted as a random-walk posterior probability after one step.
They can be negative. We don't have a ReLU after the last 256-dim linear layer.
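So for the random-walk reading mentioned above: the diffusion refinement Y = A @ A.T only corresponds to random-walk transition probabilities when the affinity matrix can be row-normalized into a stochastic matrix, which requires nonnegative entries. A small illustration (my own, not code from either repo):

```python
import numpy as np

def diffuse(affinity):
    """Diffusion refinement from the spectral clustering paper: Y = A A^T."""
    return affinity @ affinity.T

def to_transition_matrix(affinity):
    """Row-normalize an affinity matrix into random-walk transition
    probabilities P[i, j]. Only meaningful if all entries are nonnegative;
    with negative similarities the rows are not probability distributions."""
    return affinity / affinity.sum(axis=1, keepdims=True)
```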
Are input d-vectors for training assumed L2-normalized?
In "Generalized End-to-End Loss for Speaker Verification" they are defined as L2-normalized in eq. 4.
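(For reference, eq. 4 there defines the d-vector as e_ji = f(x_ji; w) / ||f(x_ji; w)||_2, i.e. the encoder output rescaled to unit length.)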
In the sample `toy_training_data.npz`, they are also L2-normalized (see the check sketched below). But eq. 11 from "Fully Supervised Speaker Diarization" models segment speaker embeddings as normally-distributed vectors and does not explicitly assume unit length (if it did, maybe a von Mises–Fisher distribution would be a better fit).
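The check itself (key names follow the uis-rnn demo; they may differ in other versions of the data file):

```python
import numpy as np

data = np.load('toy_training_data.npz', allow_pickle=True)
train_sequence = data['train_sequence']  # (num_observations, 256)
norms = np.linalg.norm(train_sequence, axis=1)
print(norms.min(), norms.max())  # both ~1.0, i.e. already unit-length
```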
Thank you!