pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Theory background for pyannote-speech-detection #67

Closed: glynpu closed this issue 7 years ago

glynpu commented 7 years ago

Hi, Bredin,

I am new to LSTMs. After reading your paper (in the citation) about TristouNet, I think I have got the basic idea of how it works for speaker change detection. However, I am still confused about the theoretical background of the pyannote-speech-detection command.

Can I say that speech detection is similar to speaker turn detection if non-speech segments are considered to come from one special 'speaker' while speech parts come from another 'speaker'? In that case, speech boundary detection would be the same as speaker change detection.

However, I am still wondering why the n_classes parameter in config.yml is set to 2 for speech activity detection but to 1 for speaker change detection. To be honest, I don't know the meaning of this parameter (n_classes). Is there any other introduction or tutorial about the theoretical background of the pyannote-speech-detection command?

Thank you for your time and patience.

Liyong Guo

hbredin commented 7 years ago

pyannote-speech-detection can be used to do "voice activity detection" (VAD).
pyannote-change-detection can be used to do "speaker change detection" (SCD).

Both tasks can be seen as 2-class classification problems and are addressed using the same LSTM approach described in https://github.com/yinruiqing/change_detection/blob/master/doc/change-detection.pdf

The only difference is in how the groundtruth sequence is defined:

VAD: class 0 = non-speech, class 1 = speech
SCD: class 0 = no speaker change, class 1 = speaker change
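
For illustration, here is a minimal sketch (not pyannote's actual code) of how the two groundtruth label sequences could be built from the same toy annotation. The segment values, frame step, and the collar around change points are all assumptions:

```python
import numpy as np

# Toy reference annotation: (start, end, speaker) in seconds (assumption).
reference = [(0.0, 3.0, "A"), (4.0, 7.0, "B"), (7.0, 9.0, "A")]

duration, step = 10.0, 0.1          # file length and frame step (assumptions)
times = np.arange(0.0, duration, step)

# VAD groundtruth: class 1 wherever any speaker is active, class 0 elsewhere.
vad = np.zeros(len(times), dtype=int)
for start, end, _ in reference:
    vad[(times >= start) & (times < end)] = 1

# SCD groundtruth: class 1 in a small neighborhood of each segment boundary,
# class 0 elsewhere (the 0.2 s collar is an assumption, not pyannote's value).
scd = np.zeros(len(times), dtype=int)
boundaries = {t for start, end, _ in reference for t in (start, end)}
for b in boundaries:
    scd[np.abs(times - b) < 0.2] = 1
```

So the model and training procedure are shared; only the frame-level labels change.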

However, in practice, SCD is implemented as a 1-dimensional regression task with class 1 only -- hence the n_classes = 1.
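
As a rough sketch of what that difference means for the model head (an assumption about the architecture for illustration, not the library's actual implementation): with n_classes = 2 the last layer outputs two scores passed through a softmax, while with n_classes = 1 it outputs a single sigmoid score interpreted as the probability of a change at each frame.

```python
import torch
import torch.nn as nn

class FrameLabeler(nn.Module):
    """Bi-LSTM that labels every frame of a feature sequence (illustrative only)."""
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 16, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(32, n_classes)   # 2 directions x 16 hidden units

    def forward(self, x):                    # x: (batch, n_frames, n_features)
        out, _ = self.lstm(x)
        return self.fc(out)                  # (batch, n_frames, n_classes)

x = torch.randn(1, 100, 35)                  # 35 MFCC-like features (assumption)

vad_scores = FrameLabeler(35, 2)(x).softmax(dim=-1)  # VAD: 2-class softmax
scd_scores = FrameLabeler(35, 1)(x).sigmoid()        # SCD: 1-d sigmoid regression
```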

Sorry I don't have time to provide more details...