Feature request: speaker detection?

Hi, I really appreciate your tool. It's a great way to make recordings more accessible for further investigation :smiley:

I read that Vosk also supports speaker identification / detection, and I'm wondering whether you could add this to mp4grep as well. For me there are a lot of nice use cases for tracking and analysing discussions (TV shows, movies, phone recordings, podcasts, web conferences, ...). That would enable research such as NLP or knowledge bases and make multimedia content more accessible to users with disabilities, all done with privacy in mind and without contributing to major tech companies' algorithms.

My understanding so far is that Vosk needs a fingerprint for each speaker, and maybe multiple fingerprints per person. So we would need a way to assign lines within a transcription to fingerprinted speakers and to give these fingerprints human-readable labels. In a second step, a final processing pass could assign these labels to every transcription line. Maybe we also need an extended transcription format like WebVTT to share these assigned lines and timecodes?
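
To make the idea more concrete, here is a minimal Python sketch of the assignment and export steps. It is not mp4grep or Vosk code. It assumes each transcribed line already comes with a speaker vector (the "fingerprint") plus start and end times in seconds. The segment keys `spk`, `start`, `end`, and `text`, the helper names, and the `0.5` distance threshold are all hypothetical and chosen only for illustration. Matching uses plain cosine distance against every labelled fingerprint, and the output uses WebVTT voice tags (`<v Name>`) to carry the speaker label.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def assign_speaker(vector, fingerprints, max_distance=0.5):
    """Pick the label whose closest fingerprint is nearest to `vector`.

    `fingerprints` maps a human-readable label to a list of reference
    vectors, so one person can have several fingerprints.
    `max_distance` is an arbitrary illustrative threshold.
    """
    best_label, best_dist = "Unknown", max_distance
    for label, refs in fingerprints.items():
        for ref in refs:
            dist = cosine_distance(vector, ref)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label


def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(segments, fingerprints):
    """Render labelled segments as WebVTT cues with voice tags."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        label = assign_speaker(seg["spk"], fingerprints)
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"{start} --> {end}")
        lines.append(f"<v {label}>{seg['text']}")
        lines.append("")
    return "\n".join(lines)
```

With fingerprints such as `{"Alice": [[1.0, 0.0], [0.9, 0.1]], "Bob": [[0.0, 1.0]]}`, each segment is labelled with the nearest speaker, or `Unknown` if no fingerprint is within the threshold. Real speaker vectors would have many more dimensions, and the threshold would need tuning on actual recordings.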

o-oconnell / mp4grep, issue #16