juanmc2005 / diart

A Python package to build AI-powered real-time audio applications
https://diart.readthedocs.io
MIT License

Speaker Identity Resolution #230

Closed jfernandrezj closed 4 months ago

jfernandrezj commented 6 months ago

Thank you very much @juanmc2005 for this library, much appreciated. One question I have for speaker-aware transcription is whether a custom plugin / observer / sink could be implemented for speaker identity resolution, and what the best pattern would be to achieve this. Ideally, on each buffer iteration or speaker change, a speaker resolution prediction based on a model (probably something like faiss / weaviate) could be added either to the RTTM or to another file.

Any input would be much appreciated, thank you!

juanmc2005 commented 5 months ago

Hi @jfernandrezj, you could try recovering the internal speaker centroids of OnlineSpeakerClustering (the centers attribute) and matching them against other speakers as you mentioned. For this to work, you'd need to use the same embedding model that diart uses.
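
A minimal sketch of what that matching could look like, assuming you can reach the pipeline's OnlineSpeakerClustering block (how it is exposed, e.g. a `clustering` attribute, and the `known_speakers` store are assumptions for illustration, not documented diart API) and that the reference embeddings were produced with the same embedding model:

```python
import numpy as np

def match_centroids(clustering, known_speakers: dict, threshold: float = 0.5) -> dict:
    """Map each internal centroid label (speaker0, speaker1, ...) to a known
    identity by cosine similarity, keeping the anonymous label otherwise."""
    mapping = {}
    centers = clustering.centers  # one centroid vector per detected speaker
    for i, center in enumerate(centers):
        best_name, best_score = None, threshold
        for name, reference in known_speakers.items():
            score = float(np.dot(center, reference) / (
                np.linalg.norm(center) * np.linalg.norm(reference) + 1e-8
            ))
            if score > best_score:
                best_name, best_score = name, score
        mapping[f"speaker{i}"] = best_name or f"speaker{i}"
    return mapping
```

Instead of a plain dict of reference embeddings you could query faiss / weaviate here; the cosine threshold would need tuning for your embedding model.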

If you want to use a different speaker matching method/model, you can always incorporate it into the pipeline to either replace or complement diart's speaker embedding block, but this could be quite expensive in terms of latency. I would suggest sending audio to a separate speaker matching service and listening to its output to label each speaker centroid at display time (e.g. speaker0 -> John).
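
For the display-time relabeling, a small sketch, assuming each prediction is a pyannote.core Annotation (as diart produces) and that `mapping` comes from your external matching service or from centroid matching like above:

```python
from pyannote.core import Annotation

def relabel(prediction: Annotation, mapping: dict) -> Annotation:
    # Replace anonymous labels (speaker0, speaker1, ...) with resolved identities
    # before writing the RTTM or rendering the transcript.
    return prediction.rename_labels(mapping=mapping)
```

The relabeled annotation can then be written out as usual, e.g. `relabel(prediction, mapping).write_rttm(file)`.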

jfernandrezj commented 4 months ago

Thank you very much @juanmc2005