m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Keeping the same speaker in different files #777

Open wallaceblaia opened 3 months ago

wallaceblaia commented 3 months ago

First off, thank you for your fantastic work here. I am working on a project where I aim to translate and dub YouTube live streams almost in real-time. I've managed to achieve a delay of 3 minutes, but I'm looking to reduce this even further.

In my implementation, I capture the live stream and create segments of approximately 1 minute each, using a technique that cuts the audio only between words. After processing this audio with Demucs, I send it to the WhisperX pipeline. However, the speaker labels vary across each audio file. I would like to know if there is a way to preserve speaker embedding data across multiple audio files containing the same speakers, because I use the labels returned from diarization to dub into another language; currently, each audio file assigns different labels to the same speaker.
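WhisperX itself does not expose a built-in way to persist speaker identities across separate diarization runs, so one common workaround is to keep a registry of speaker embeddings outside the pipeline. The sketch below assumes a hypothetical upstream step that produces one embedding vector per local speaker label (e.g. from a pyannote or SpeechBrain speaker-embedding model); the registry then remaps each file's local labels to stable global IDs by cosine similarity against running-mean centroids. The `0.7` threshold is an assumption and would need tuning for a real embedding model.

```python
import numpy as np


class SpeakerRegistry:
    """Maps per-file speaker embeddings to stable global speaker IDs.

    Assumes embeddings come from some speaker-embedding model applied to
    each diarized speaker's audio (hypothetical upstream step, not part
    of WhisperX itself).
    """

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # min cosine similarity to reuse an existing ID
        self.centroids = []         # one running-mean embedding per global speaker
        self.counts = []            # how many embeddings each centroid averages

    def assign(self, embedding):
        """Return a stable global label like 'SPEAKER_00' for this embedding."""
        v = np.asarray(embedding, dtype=float)
        v = v / np.linalg.norm(v)

        # Find the most similar known speaker centroid.
        best_id, best_sim = -1, -1.0
        for i, c in enumerate(self.centroids):
            sim = float(np.dot(v, c / np.linalg.norm(c)))
            if sim > best_sim:
                best_id, best_sim = i, sim

        if best_id >= 0 and best_sim >= self.threshold:
            # Known speaker: update the running-mean centroid and reuse the ID.
            self.counts[best_id] += 1
            self.centroids[best_id] += (v - self.centroids[best_id]) / self.counts[best_id]
            return f"SPEAKER_{best_id:02d}"

        # New speaker: register a fresh centroid and ID.
        self.centroids.append(v)
        self.counts.append(1)
        return f"SPEAKER_{len(self.centroids) - 1:02d}"
```

In a segment-by-segment loop, you would extract one embedding per local diarization label, call `assign()` on each, and rewrite that file's labels before generating the dub, so the same voice keeps the same ID across all one-minute chunks.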

SeeknnDestroy commented 2 months ago

Hey @wallaceblaia, were you able to find a solution for this?

rosebbb commented 1 month ago

Hi @wallaceblaia, are you using Demucs to make cuts between words?