Feature: Extracting speaker embeddings during diarization

k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker diarization, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust

https://k2-fsa.github.io/sherpa/onnx/index.html

Apache License 2.0


Feature: Extracting speaker embeddings during diarization #1460

Status: Open (WilliamVenner opened 1 week ago)

WilliamVenner commented 1 week ago

My task combines both speaker diarization and speaker identification.

Since speaker embeddings are extracted during diarization anyway, it would be fantastic if the user could extract speaker embeddings from the speaker diarization segments/labels as well.

This would allow users to perform speaker identification against an existing speaker diarization result, applying their own identified labels to the speakers and thereby simplifying the pipeline for this task.
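To make the identification step concrete, here is a minimal sketch of matching per-speaker embeddings from a diarization result against a set of enrolled speakers using cosine similarity. It is not sherpa-onnx code: the names `cosine_similarity`, `identify_speakers`, `diarized`, `enrolled`, and the `0.5` threshold are all assumptions made for illustration, and a real threshold would need tuning for the embedding model in use.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_speakers(
    diarized: Dict[int, List[float]],   # diarization label -> embedding
    enrolled: Dict[str, List[float]],   # known speaker name -> embedding
    threshold: float = 0.5,             # assumed value, tune per model
) -> Dict[int, Optional[str]]:
    """Map each diarization label to the best-matching enrolled name.

    A label maps to None when no enrolled speaker reaches the threshold.
    """
    names: Dict[int, Optional[str]] = {}
    for label, embedding in diarized.items():
        best_name: Optional[str] = None
        best_score = threshold
        for name, reference in enrolled.items():
            score = cosine_similarity(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        names[label] = best_name
    return names
```

With this shape, the anonymous labels in the diarization output can be swapped for the returned names, and any label that maps to `None` stays an unknown speaker.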

csukuangfj commented 1 week ago

You can either return the embeddings from the two lines below: https://github.com/k2-fsa/sherpa-onnx/blob/a5295aad10ea932279b415cd573e57273926a69b/sherpa-onnx/csrc/offline-speaker-diarization-pyannote-impl.h#L146-L147

or you can use the diarization results to re-compute the embeddings.
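The second option (re-computing embeddings from the diarization results) could look roughly like the sketch below. It assumes the diarization output is available as `(start_sec, end_sec, speaker_label)` tuples and that `embed_fn` is a hypothetical callable wrapping whatever speaker embedding extractor you use; neither is an actual sherpa-onnx signature. The `min_duration` cutoff is also an assumption, added because very short segments tend to give unreliable embeddings.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed shapes, not sherpa-onnx types:
Segment = Tuple[float, float, int]  # (start_sec, end_sec, speaker_label)
EmbedFn = Callable[[Sequence[float], int], List[float]]  # (samples, sample_rate) -> embedding


def embeddings_per_speaker(
    samples: Sequence[float],
    sample_rate: int,
    segments: Sequence[Segment],
    embed_fn: EmbedFn,
    min_duration: float = 0.5,  # assumed cutoff in seconds
) -> Dict[int, List[float]]:
    """Compute one embedding per diarized speaker.

    Audio from all sufficiently long segments of a speaker is concatenated
    and passed to embed_fn once.
    """
    audio_by_speaker: Dict[int, List[float]] = defaultdict(list)
    for start, end, speaker in segments:
        if end - start < min_duration:
            continue  # skip segments too short to embed reliably
        first = max(0, int(start * sample_rate))
        last = min(len(samples), int(end * sample_rate))
        audio_by_speaker[speaker].extend(samples[first:last])

    return {
        speaker: embed_fn(audio, sample_rate)
        for speaker, audio in audio_by_speaker.items()
        if audio
    }
```

Concatenating each speaker's audio keeps the sketch short; embedding each segment separately and averaging the results is an equally plausible design. Note that this approach runs the embedding model a second time, which is the cost the first option avoids.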

WilliamVenner commented 1 week ago

I managed to DIY it. Probably not the best implementation, so I won't PR, but here it is: https://github.com/WilliamVenner/sherpa-onnx/commit/0d533de5451b9ba1f204428b8d154580b707d835#diff-dabb58cf56f7c8b62cb621374dc40f77696e653c14af9bb62ef1790d66d4b174