o-oconnell / mp4grep

mp4grep is a CLI for transcribing and searching audio/video files
GNU General Public License v3.0

Feature speaker detection? #16

Open Matthias84 opened 2 years ago

Matthias84 commented 2 years ago

Hi, I really appreciate your tool. It's such a great solution for making recordings more accessible for further investigation :smiley:

I read that Vosk also has speaker identification / detection, and I'm wondering if you could add this to mp4grep as well. For me there are a lot of nice use cases for tracking and analysing discussions (TV shows, movies, phone recordings, podcasts, web conferences, ...). That would enable research such as NLP or building knowledge bases, and make multimedia content more accessible to users with disabilities, all done with privacy in mind and without contributing to the algorithms of major tech companies.

My understanding so far is that Vosk needs a fingerprint for each speaker, and maybe multiple fingerprints per person. So we will need a way to assign lines within a transcription to fingerprinted speakers and to give these fingerprints human-readable labels. In a second step, there might be a final processing pass that assigns these labels to every transcription line. Maybe we also need an extended transcription format like WebVTT to share these assigned lines and timecodes?
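
To make the idea more concrete, here is a rough Python sketch of the two steps described above: matching each transcription line to the closest labelled fingerprint, then writing the result as WebVTT with voice tags. Everything in it is an assumption for illustration and not mp4grep's or Vosk's actual API: the function names, the `max_distance` threshold of 0.5, and the input layout (each line carrying a `vector` field, for example a speaker vector taken from the recognizer output) are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as "no similarity"
    return 1.0 - dot / (norm_a * norm_b)


def assign_speaker(vector, fingerprints, max_distance=0.5):
    """Return the label whose fingerprint is closest to `vector`.

    `fingerprints` maps a human-readable label to a list of reference
    vectors, so one person can have several fingerprints. The threshold
    is a made-up default; lines that match nobody become "Unknown".
    """
    best_label, best_dist = "Unknown", max_distance
    for label, refs in fingerprints.items():
        for ref in refs:
            dist = cosine_distance(vector, ref)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label


def vtt_timestamp(seconds):
    """Format seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(lines, fingerprints):
    """Render labelled transcription lines as WebVTT cues with <v> voice tags.

    Assumes the text contains no '<' or '&', which WebVTT would need escaped.
    """
    out = ["WEBVTT", ""]
    for line in lines:
        speaker = assign_speaker(line["vector"], fingerprints)
        out.append(f"{vtt_timestamp(line['start'])} --> {vtt_timestamp(line['end'])}")
        out.append(f"<v {speaker}>{line['text']}")
        out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    # Toy 2-dimensional fingerprints; real speaker vectors are much longer.
    fingerprints = {"Alice": [[1.0, 0.0], [0.9, 0.1]], "Bob": [[0.0, 1.0]]}
    lines = [
        {"start": 0.0, "end": 2.5, "text": "Hello there", "vector": [0.95, 0.05]},
        {"start": 2.5, "end": 4.0, "text": "Hi", "vector": [0.1, 0.9]},
    ]
    print(to_webvtt(lines, fingerprints))
```

Keeping the labels in a separate mapping means a fingerprint can be renamed later without re-running the transcription, and the WebVTT `<v ...>` tag keeps speaker, text, and timecodes together in one shareable file.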

o-oconnell commented 2 years ago

This sounds like a great idea and I'm definitely interested in implementing it for the next version. We have been doing a lot of experimentation with languages and environments, and hope to add several improvements. It is on my todo list. Thanks!