Closed: PiotrTa closed this issue 5 years ago.
You are correct: one can either use speaker change detection directly [1] or rely on the distance between speaker embeddings of two adjacent sliding windows [2]. Both approaches need a thresholding step at the end (see the sketch after the references).
As long as your dataset comes with "who speaks when" annotations, you do not need to train a speech activity detection model first.
[1] R. Yin, H. Bredin, and C. Barras. "Speaker Change Detection in Broadcast TV using Bidirectional Long Short-Term Memory Networks." Interspeech 2017.
[2] H. Bredin. "TristouNet: Triplet Loss for Speaker Turn Embedding." ICASSP 2017.
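As an illustration of approach [2] (not pyannote's actual API), here is a minimal sketch: compute one embedding per sliding window with any speaker embedding model, then hypothesize a change wherever the cosine distance between adjacent windows exceeds a threshold. The embedding matrix and the threshold value below are stand-ins; the threshold should be tuned on a development set.

```python
import numpy as np
from scipy.spatial.distance import cosine

def change_points(embeddings, threshold=0.5):
    # embeddings: (n_windows, dimension) array, one row per sliding window
    # score each pair of adjacent windows by cosine distance ...
    scores = np.array([cosine(embeddings[i], embeddings[i + 1])
                       for i in range(len(embeddings) - 1)])
    # ... and hypothesize a speaker change between windows i and i + 1
    # wherever the distance exceeds the threshold
    return np.nonzero(scores > threshold)[0]

# stand-in for real embeddings (e.g. TristouNet [2] outputs)
embeddings = np.random.randn(100, 16)
print(change_points(embeddings, threshold=0.8))
```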
So if I now have a new test file without any annotations and would like to apply change detection to it, do I first have to run speech activity detection to label the speech regions?
Yes.
At some point, though, I would like to train a single network that does both at once.
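Concretely, the ordering looks like this; a minimal sketch assuming the SAD segments and candidate change points have already been computed (the values below are placeholders, not pyannote's pipeline API):

```python
from pyannote.core import Segment, Timeline

# hypothetical SAD output: speech regions of the test file (in seconds)
speech_regions = Timeline([Segment(0.0, 4.2), Segment(5.0, 9.8)])

# hypothetical change detection output: candidate change points (in seconds)
candidate_changes = [1.3, 4.6, 7.1]

# keep only candidates that fall inside a detected speech region
changes = [t for t in candidate_changes
           if any(s.start <= t <= s.end for s in speech_regions)]
print(changes)  # 4.6 lies in non-speech (4.2-5.0) and is discarded
```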
Hi, I am trying to build a speaker change detection algorithm based on pyannote and my own dataset. Do I understand correctly that I can either use speaker change detection directly, or use speaker embeddings (+ distance between two sliding windows + threshold + ...)?
Is there any dependency between those two methods during training? Do they use a previously trained speech activity detection model while training, or do they rely on the annotations?
Thank you for the answers.