First off, thank you for your fantastic work here. I am working on a project where I aim to translate and dub YouTube live streams almost in real-time. I've managed to achieve a delay of 3 minutes, but I'm looking to reduce this even further.
In my implementation, I capture the live stream and split it into segments of roughly one minute each, using a technique that cuts speech only between words. After processing the audio with Demucs, I send it to the WhisperX pipeline. However, the speaker labels vary across audio files: diarization assigns labels independently for each file, so the same speaker ends up with a different label in each segment. Since I use these labels to decide which voice to dub in the target language, I'd like to know whether there is a way to preserve speaker embedding data across multiple audio files containing the same speakers, so that labels stay consistent.
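For context, the workaround I've been considering could be sketched roughly like this: keep a running registry of speaker embeddings across segments and map each segment's local labels to stable global IDs by cosine similarity. This is a pure-Python sketch with a hypothetical `SpeakerRegistry` helper and an assumed 0.7 similarity threshold; it presumes one embedding per diarized speaker can be obtained somehow (e.g. from an embedding model), which is exactly the part I'm unsure WhisperX exposes.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SpeakerRegistry:
    """Maps per-segment speaker embeddings to stable global speaker IDs.

    Hypothetical helper: centroids are running means of all embeddings
    assigned to each global speaker so far. The threshold is a guess and
    would need tuning against real embeddings.
    """

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.centroids = []  # list of (centroid_embedding, sample_count)

    def assign(self, embedding):
        """Return a global speaker ID for this embedding.

        Reuses the best-matching existing speaker if similarity clears
        the threshold; otherwise registers a new global speaker.
        """
        best_id, best_sim = None, self.threshold
        for sid, (centroid, _) in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim >= best_sim:
                best_id, best_sim = sid, sim
        if best_id is None:
            # Unseen voice: register a new global speaker.
            self.centroids.append((list(embedding), 1))
            return len(self.centroids) - 1
        # Known voice: update its centroid with the new embedding.
        centroid, count = self.centroids[best_id]
        updated = [(c * count + e) / (count + 1)
                   for c, e in zip(centroid, embedding)]
        self.centroids[best_id] = (updated, count + 1)
        return best_id

# Toy usage with made-up embeddings: the third vector is close to the
# first, so it maps back to global speaker 0 across "segments".
reg = SpeakerRegistry(threshold=0.7)
print(reg.assign([1.0, 0.0, 0.0]))  # 0 (new speaker)
print(reg.assign([0.0, 1.0, 0.0]))  # 1 (new speaker)
print(reg.assign([0.9, 0.1, 0.0]))  # 0 (matches first speaker)
```

The open question for me is whether the diarization step can return (or be made to return) the per-speaker embeddings needed to feed something like this, rather than only the segment labels.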