Feasibility of the implementation of a Speaker Enrollment pipeline.

pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

http://pyannote.github.io

MIT License

5.88k stars 752 forks

Feasibility of the implementation of a Speaker Enrollment pipeline. #391

Closed hadware closed 4 years ago

hadware commented 4 years ago

Is your feature request related to a problem? Please describe.

The title is pretty self-explanatory. I'd just like to know how much work would be needed to implement a pipeline for a speaker unrolling task: are all the required building blocks already here, in your opinion? If there isn't too much digging involved, I'd probably be willing to do it myself :)

hbredin commented 4 years ago

Can you please define "speaker unrolling"? I am not familiar with this wording.

Did you mean "speaker enrollment"? If so, what do you have in mind exactly? Speaker identification?

Rachine commented 4 years ago

Hello,

Yes, it would be speaker enrollment. Given a variable number of target speakers, find all the segments in the audio that come from these speakers. We were thinking of a pipeline that looks like this: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41939.pdf

hbredin commented 4 years ago

I didn't go through the paper but I think the SpeechTurnClosestAssignment pipeline might get you started.

Enrollment

Basic idea: gather all speaker embeddings for each target and take their average.

https://github.com/pyannote/pyannote-audio/blob/06f76a2c5a37c79cf42710167c7b7404658879d3/pyannote/audio/pipeline/speech_turn_assignment.py#L94-L113
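As a rough illustration of the averaging idea (not the linked pipeline code), here is a minimal NumPy sketch. The `enroll` helper, the dictionary layout, and the toy 2-dimensional vectors are all assumptions made for this example; real embeddings would come from a speaker embedding model.

```python
import numpy as np


def enroll(embeddings_by_target):
    """Return one average embedding per target speaker.

    Hypothetical helper: `embeddings_by_target` maps a speaker label to a
    list of 1-D embedding vectors that all share the same dimension.
    """
    models = {}
    for target, embeddings in embeddings_by_target.items():
        stacked = np.vstack(embeddings)        # shape: (num_embeddings, dimension)
        models[target] = stacked.mean(axis=0)  # average over the embeddings
    return models


# Toy data standing in for embeddings extracted from enrollment audio.
toy_embeddings = {
    "alice": [np.array([1.0, 0.0]), np.array([0.8, 0.2])],
    "bob": [np.array([0.0, 1.0])],
}
models = enroll(toy_embeddings)
```

Each entry of `models` is then a single vector per target, which is what the recognition step compares against.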

Recognition

Basic idea: for each test speech turn (or, here, speaker cluster), find the closest target speaker by comparing their average embeddings. You might also want to consider a reject option if even the closest target speaker is too far.

https://github.com/pyannote/pyannote-audio/blob/06f76a2c5a37c79cf42710167c7b7404658879d3/pyannote/audio/pipeline/speech_turn_assignment.py#L115-L144
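The closest-target rule with a reject option could be sketched as follows. This is only an illustration under stated assumptions, not the linked pipeline code: the `assign` helper, the choice of cosine distance, and the `threshold` value of 0.5 are all made up for the example and would need tuning on real data.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two 1-D vectors (0 means same direction)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def assign(turn_embedding, target_models, threshold=0.5):
    """Return the closest target label, or None when it is too far.

    Hypothetical helper: `target_models` maps a speaker label to its
    average embedding; `threshold` is an assumed rejection distance.
    """
    distances = {
        target: cosine_distance(turn_embedding, model)
        for target, model in target_models.items()
    }
    closest = min(distances, key=distances.get)
    if distances[closest] > threshold:
        return None  # reject: even the closest target is too far
    return closest


# Toy average embeddings for two enrolled targets.
target_models = {"alice": np.array([0.9, 0.1]), "bob": np.array([0.0, 1.0])}

near_alice = assign(np.array([1.0, 0.05]), target_models)
unknown = assign(np.array([-1.0, -1.0]), target_models)
```

With these toy vectors, the first turn is assigned to "alice", while the second is rejected because its distance to every target exceeds the threshold.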

Rachine commented 4 years ago

Amazing! Thank you! We will let you know how this goes and if it works on our 'special' data.

hbredin commented 4 years ago

Closing this issue as I believe the original question has been answered. I'd still be interested in knowing how it went 👀

Rachine commented 4 years ago

Hey! This is still ongoing. We have the pipeline up and running, but we are having a hard time correctly fine-tuning a speaker embedding model on our smallish dataset 😿