octimot / StoryToolkitAI

An editing tool that uses AI to transcribe, understand content and search for anything in your footage, integrated with ChatGPT and other AI models

Speaker ID training #164

Open clpStephen opened 4 months ago

clpStephen commented 4 months ago

**Is your feature request related to a problem? Please describe.**
I work with a lot of reality television, and I'm deep in the South for a couple of my shows. The speakers are often mumbling in a deep vernacular. I would like to be able to submit some clips to expand the tool's ability to identify these individual speakers. The model actually does much better than I expected at figuring out what they're saying, but I haven't had much luck with it differentiating people; it often just breaks off into a new speaker mid-sentence. I'm not actually a programmer, although I'm not bad at pretending to be one. I would like to improve the tool so it can recognize the people who appear in the show the most.

**Describe the solution you'd like**
It looks like pyannote is what you're using for identification, so I'm asking more for clarity for us laymen. Is it possible to train the tool to recognize certain people better in order to increase transcription accuracy? If so, is that something I can do directly via StoryToolkit, or would I need to work directly through pyannote? Based on your answer I can research how to actually accomplish this; I'm mostly just looking for a direction to go in.

**Describe alternatives you've considered**
Cry.


octimot commented 4 months ago

This is a good question!

Right now we're not technically using pyannote in the tool, only some functionality that embeds what speakers "sound like".

This is how the algorithm works behind the scenes:

  1. OpenAI Whisper does the transcription. The result is saved in the segments dictionary inside the transcription.json file. Each transcript line is saved in this dictionary with its start and end times, along with the actual text. It's also kept as a transcription object in memory, so the tool can fetch things faster and more efficiently.

  2. Once we have all the segments from Whisper, we take each segment and embed it into vector space using the speechbrain/spkrec-ecapa-voxceleb model. What this means is that each audio snippet, from its start to its end time, is assigned a vector representation.

  3. While "translating" each audio snippet to its vector, we also compare it to the previous segment's vector. If their representations are close to each other, we assume it's the same speaker; if not, we mark that a "speaker change" happened and basically add it as a "meta speaker" segment in the transcription.

  4. While doing these comparisons, the tool also checks whether similar vectors (or audio snippets) appeared earlier, which is why you might sometimes see the same speaker reappear in the transcription, although the speechbrain model isn't particularly well tuned for that. (There's a rough sketch of this whole flow right after this list.)
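For the curious, here's a minimal sketch of that flow, not the tool's actual code. It assumes the Whisper segments live in a transcription.json next to an illustrative footage_audio.wav that's already mono and 16 kHz; the file names and similarity threshold are placeholders:

```python
# Minimal sketch of the flow above, NOT StoryToolkitAI's actual code.
# Assumptions: transcription.json holds the Whisper "segments" (with
# "start"/"end"/"text"), and footage_audio.wav is the matching audio,
# already mono and 16 kHz (the model's expected input).
import json

import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Step 2: the model that embeds audio snippets into vector space
classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb"
)

# Step 1: load the segments Whisper saved in transcription.json
with open("transcription.json") as f:
    segments = json.load(f)["segments"]

waveform, sample_rate = torchaudio.load("footage_audio.wav")

def embed(start: float, end: float) -> torch.Tensor:
    """Embed the audio between start and end (in seconds)."""
    snippet = waveform[:, int(start * sample_rate):int(end * sample_rate)]
    return classifier.encode_batch(snippet).squeeze()

# Step 3: compare each segment's vector to the previous one and mark a
# "speaker change" when they're too far apart (threshold is illustrative)
SIMILARITY_THRESHOLD = 0.7
prev_vec = None
for seg in segments:
    vec = embed(seg["start"], seg["end"])
    if prev_vec is not None:
        similarity = torch.nn.functional.cosine_similarity(vec, prev_vec, dim=0)
        if similarity < SIMILARITY_THRESHOLD:
            print(f"speaker change before: {seg['text']!r}")
    prev_vec = vec
```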

I hope this makes sense!

Now, pyannote is quite powerful at actually identifying speakers. The plan is to go further with this, so that we're able to:

  1. Correctly identify each transcript speaker and mark them accordingly on the transcription. This should result in only a few speakers, which will be easy to rename in a few steps using CMD/CTRL+F, or even directly in the exported file...

  2. And then, for phase 2: save the speaker vectors for further identification in each StoryToolkitAI project, so that if you ingest new footage, the tool will be able to name the speakers correctly for you in the new transcriptions. (There's a rough sketch of phase 1 right after this list.)
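To make this a bit more concrete, here's a hedged sketch of roughly what phase 1 could look like using pyannote's off-the-shelf diarization pipeline. Again, this isn't StoryToolkitAI code: the model name, file name, and token are placeholders, and the gated download is exactly the licensing hurdle mentioned below:

```python
# Hedged sketch of what phase 1 could look like with pyannote's
# diarization pipeline, NOT StoryToolkitAI's actual code. The model is
# gated on Hugging Face: you need an account, must accept the author's
# terms, and then pass an access token (placeholder below).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder Hugging Face token
)

diarization = pipeline("footage_audio.wav")  # illustrative file name

# pyannote assigns generic labels like "SPEAKER_00", so a whole show
# collapses into a handful of speakers you can rename in one pass
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s - {turn.end:7.1f}s  {speaker}")
```

Phase 2 would then mostly be a matter of storing each named speaker's embedding per project and matching new footage against those stored vectors.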

We've been focused on integrating other stuff until now, but hopefully we'll be able to get to that soon.

The main reason why we chose to code speaker change detection first instead of speaker identification is that the pyannote models have a few special licensing needs (e.g. you have to have a Hugging Face account and accept the author's terms before being able to download them), and I wanted to dig a bit further into what that would mean for the end user who has no idea how these things work. Nevertheless, it's probably doable... it just needs a bit of attention to make it work.

After writing all this down, I'm not sure that I answered your question, so feel free to ping me back!!

Cheers

clpStephen commented 3 months ago

I haven't looked much at the indexing side of your tool, but I do know that tens of thousands of dollars are spent per season on transcription for many of these unscripted shows. The limited speaker ID you're doing now is already what separates you from similar tools, for me. If you could make that work first, I think you would get some attention.