thewh1teagle closed this issue 3 months ago
I don't know if it helps, but here is the most successful notebook I know of for this task; maybe it's adaptable to Rust?
https://github.com/MahmoudAshraf97/whisper-diarization/tree/main
Thanks! Most of the Python implementations use pyannote. Since we use Rust in Vibe, it's more challenging, as I can't find any OSS project that does it in Rust.
We'll probably use the ONNX Runtime through https://github.com/pykeio/ort for segmentation and https://github.com/nkeenan38/voice_activity_detector for VAD.
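As a very rough sketch of that direction (assuming the builder-style APIs of ort 2.x and the Silero-based voice_activity_detector crate; exact method names may differ between versions, so treat the calls below as assumptions rather than a verified integration):

```rust
use ort::session::Session;
use voice_activity_detector::VoiceActivityDetector;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Silero VAD at 16 kHz; 512-sample chunks are the usual size for this model.
    // (Builder/method names assumed from the crate's README.)
    let mut vad = VoiceActivityDetector::builder()
        .sample_rate(16000)
        .chunk_size(512usize)
        .build()?;

    // pyannote segmentation model loaded through ONNX Runtime via the ort crate.
    // (`Session::builder` / `commit_from_file` assume the ort 2.x API.)
    let _segmentation = Session::builder()?.commit_from_file("segmentation-3.0.onnx")?;

    // Placeholder audio: one second of 16 kHz silence.
    let samples = vec![0i16; 16_000];
    for chunk in samples.chunks_exact(512) {
        let speech_probability = vad.predict(chunk.to_vec());
        if speech_probability > 0.5 {
            // Speech detected: this is where the chunk (or a larger window around it)
            // would be fed to the segmentation model to get per-speaker activations.
        }
    }
    Ok(())
}
```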
Or cheat a bit? https://rustpython.github.io (never used it myself though)
Looks like a useful crate! But I hope we can keep avoiding Python for as long as possible, to maintain top-notch performance and quality.
I'm not an expert in this area, so this might be very naive, but I tried to create a minimal Python script using an ONNX model (heavily based on https://github.com/pengzhendong/pyannote-onnx) to get more insight into the process of converting this to Rust (using ort).
Converting this script to Rust doesn't seem like a big deal to me, but of course I might be missing something critical here (especially in the segmentation part) haha
Warning: I tested this only with this file: https://github.com/pengzhendong/pyannote-onnx/blob/master/data/test_16k.wav, so it may not work as a general solution...
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from itertools import permutations


class MinimalSpeakerDiarization:
    def __init__(self, model_path):
        self.num_classes = 4
        self.vad_sr = 16000
        self.duration = 10 * self.vad_sr
        self.session = ort.InferenceSession(model_path)

    def sample2frame(self, x):
        return (x - 721) // 270

    def frame2sample(self, x):
        return (x * 270) + 721

    def sliding_window(self, waveform, window_size, step_size):
        start = 0
        num_samples = len(waveform)
        while start <= num_samples - window_size:
            yield waveform[start : start + window_size]
            start += step_size
        if start < num_samples:
            last_window = np.pad(waveform[start:], (0, window_size - (num_samples - start)))
            yield last_window

    def reorder(self, x, y):
        perms = [np.array(perm).T for perm in permutations(y.T)]
        diffs = np.sum(np.abs(np.sum(np.array(perms)[:, : x.shape[0], :] - x, axis=1)), axis=1)
        return perms[np.argmin(diffs)]

    def process_audio(self, audio_path):
        wav, sr = sf.read(audio_path)
        if sr != self.vad_sr:
            raise ValueError(f"Audio sample rate {sr} does not match required {self.vad_sr}")
        wav = wav.astype(np.float32)
        step = 5 * self.vad_sr
        step = max(min(step, int(0.9 * self.duration)), self.duration // 2)
        overlap = self.sample2frame(self.duration - step)
        overlap_chunk = np.zeros((overlap, self.num_classes), dtype=np.float32)
        results = []
        for window in self.sliding_window(wav, self.duration, step):
            window = window.astype(np.float32)
            ort_outs = np.exp(self.session.run(None, {"input": window[None, None, :]})[0][0])
            ort_outs = np.concatenate(
                (
                    1 - ort_outs[:, :1],  # speech probabilities
                    self.reorder(
                        overlap_chunk[:, 1 : self.num_classes],
                        ort_outs[:, 1 : self.num_classes],
                    ),  # speaker probabilities
                ),
                axis=1,
            )
            if len(results) > 0:
                ort_outs[:overlap, :] = (ort_outs[:overlap, :] + overlap_chunk) / 2
            overlap_chunk = ort_outs[-overlap:, :]
            results.extend(ort_outs[:-overlap])
        return np.array(results)

    def get_speech_segments_with_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speech_prob = results[:, 0]
        speaker_probs = results[:, 1:]
        segments = []
        in_speech = False
        start = 0
        # First, determine active speakers
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        active_speakers = np.where(speech_duration_ms > min_speech_duration_ms)[0]
        for i, (speech, speakers) in enumerate(zip(speech_prob, speaker_probs)):
            if not in_speech and speech >= threshold:
                start = i
                in_speech = True
            elif in_speech and speech < threshold:
                speaker_index = np.argmax(np.mean(speaker_probs[start:i], axis=0))
                if speaker_index in active_speakers:
                    speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                    segments.append({
                        'start': self.frame2sample(start) / self.vad_sr,
                        'end': self.frame2sample(i) / self.vad_sr,
                        'speaker': speaker
                    })
                in_speech = False
        if in_speech:
            speaker_index = np.argmax(np.mean(speaker_probs[start:], axis=0))
            if speaker_index in active_speakers:
                speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                segments.append({
                    'start': self.frame2sample(start) / self.vad_sr,
                    'end': self.frame2sample(len(speech_prob)) / self.vad_sr,
                    'speaker': speaker
                })
        return segments, len(active_speakers)

    def get_num_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speaker_probs = results[:, 1:]
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        return np.sum(speech_duration_ms > min_speech_duration_ms)


if __name__ == "__main__":
    model_path = "segmentation-3.0.onnx"
    audio_path = "test_16k.wav"
    diarizer = MinimalSpeakerDiarization(model_path)
    results = diarizer.process_audio(audio_path)
    speech_segments, num_speakers = diarizer.get_speech_segments_with_speakers(results)
    print("Speech segments with speakers:")
    for segment in speech_segments:
        print(f"Start: {segment['start']:.2f}s, End: {segment['end']:.2f}s, Speaker: {segment['speaker']}")
    print(f"Number of speakers detected: {num_speakers}")
```
Note: the model is https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx
@altunenes
Thanks for helping :)
I got another awesome idea that may be easy to start with.
We can get word timestamps from whisper for each word (using `max_len=1` and `split_on_word=true`).
After getting the segments, we can go through them; each one has a start and stop timestamp. Then we can run a speaker embedding model such as spkrec-ecapa-voxceleb with ONNX Runtime (using pykeio/ort), and this way we'll have a speaker label for each word segment.
Then we can easily reconstruct the sentences from it.
We don't even need VAD (voice activity detection) or segmenting the audio; whisper does the heavy lifting already.
The downside is that we'll run the model on each word instead of an entire segment, which is less efficient.
What do you think?
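To make the idea concrete, here is a minimal Rust sketch of the per-word assignment step. The `speaker_embedding` function is a hypothetical placeholder (in practice it would run the ONNX embedding model, e.g. spkrec-ecapa-voxceleb, through ort), and the types and threshold are illustrative, not Vibe's actual code:

```rust
/// A word-level segment as produced by whisper with `max_len=1` / `split_on_word=true`.
struct Word {
    text: String,
    start: f32, // seconds
    end: f32,   // seconds
}

/// Hypothetical placeholder: in practice this would run a speaker-embedding ONNX
/// model on the samples and return its output vector.
fn speaker_embedding(_samples: &[f32]) -> Vec<f32> {
    vec![0.0; 192] // ECAPA-TDNN embeddings are typically 192-dimensional
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Greedily assign each word to the closest known speaker, or start a new speaker
/// when no existing one is similar enough. Consecutive words with the same label
/// can then be merged back into sentences.
fn assign_speakers(audio: &[f32], sample_rate: usize, words: &[Word], threshold: f32) -> Vec<usize> {
    let mut centroids: Vec<Vec<f32>> = Vec::new();
    let mut labels = Vec::with_capacity(words.len());
    for word in words {
        let start = (word.start * sample_rate as f32) as usize;
        let end = ((word.end * sample_rate as f32) as usize).min(audio.len());
        let embedding = speaker_embedding(&audio[start..end]);
        // Compare against the centroids of speakers seen so far.
        let best = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine_similarity(&embedding, c)))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, similarity)) if similarity >= threshold => labels.push(i),
            _ => {
                centroids.push(embedding);
                labels.push(centroids.len() - 1);
            }
        }
    }
    labels
}
```

As noted above, the cost is one embedding-model invocation per word; batching consecutive words before embedding would reduce that overhead.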
Very creative!! This probably provides more accurate diarization across various languages and word lengths.
Great :) I started working on https://github.com/thewh1teagle/sherpa-rs to replicate https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speaker-identification.py
I implemented diarization in sherpa-rs/examples/diarize.rs and added it to the Vibe source code. It works pretty well and fast. The only missing thing is that sherpa has a small issue with the VAD model (https://github.com/k2-fsa/sherpa-onnx/issues/1084) where speech is sometimes not detected. Once it's fixed, I'll change the whisper logic to transcribe only the diarized parts, and it will work in Vibe.
Nice! And thank you for your contributions. Maybe we should continue the discussion in sherpa-rs...
It is fixed in https://github.com/k2-fsa/sherpa-onnx/pull/1099
@thewh1teagle any chance this will be added soon? hoping to use this app for a project im working on instead of my current workflow. thanks!
Thanks for the interest :) It's a very hard feature to add. But in short, it works like this:
I think the best chance is to add pyannote to https://github.com/k2-fsa/sherpa-onnx/issues/1197
ah i see. thank you! appreciate the quick response
Some updates: I created a simple diarization solution in pyannote-rs and even added it to Vibe in another branch. It's accurate and also makes the transcription much more accurate. The only issue is that it makes the transcription slower, since whisper is optimized for chunks of 30s but speech segments are often shorter.
On macOS with the medium model, a 40s audio file takes 7s normally and 15s with diarization. We could also just feed whisper big chunks as usual and take the timestamps from it, but its timestamps aren't accurate.
The diarization itself is fast: about 30s for 1 hour of audio.
Todo: download the models instead of embedding them into the exe, to keep the exe lightweight.
Speaker diarization released! (Beta) You can try it here https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0-beta.0
Few things about it
exciting news!
Interesting, but FYI I can't run it on an Ubuntu 22.04-based OS (#207)
I just released a stable release, including for 22.04: https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0
By the way, on Linux I strongly recommend using the tiny model for speed. See the pre-built section in the latest release.
The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy I found the sherpa version models nicer, in my tests at least. :)
Note: maybe I should play with the params more...
> I just released a stable release, including for 22.04
Thanks now it works.
> The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy
Same here, the tiny model gives results that are too inaccurate to be actually practical, even with a higher temperature than the default.
> By the way, on Linux I strongly recommend using the tiny model for speed
Fortunately, the medium model also works on Linux with diarization, even if it's slow.
Diarization works well! However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it behaves like this (using the text format for the output):
Speaker 1:
blablablabla
Speaker 1:
again blablablabla
Speaker 1:
blablablabla ?
Speaker 2:
Yes blablalbalba
Speaker 2:
And blablablalba
To my mind, it should ideally not split the successive content of the same speaker into several labels, but rather into several paragraphs under one unique label, i.e.
Speaker 1:
blablablabla
again blablablabla
blablablabla ?
Speaker 2:
Yes blablalbalba
And blablablalba
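For what it's worth, a minimal sketch of such grouping in Rust (since that's what Vibe uses; the `Segment` type and formatting here are illustrative, not Vibe's actual output code): collapse consecutive segments that share a speaker before printing.

```rust
struct Segment {
    speaker: String,
    text: String,
}

/// Collapse consecutive segments with the same speaker into one block,
/// so each speaker label is printed once per turn.
fn group_by_speaker(segments: &[Segment]) -> Vec<Segment> {
    let mut grouped: Vec<Segment> = Vec::new();
    for seg in segments {
        match grouped.last_mut() {
            Some(last) if last.speaker == seg.speaker => {
                last.text.push('\n');
                last.text.push_str(&seg.text);
            }
            _ => grouped.push(Segment {
                speaker: seg.speaker.clone(),
                text: seg.text.clone(),
            }),
        }
    }
    grouped
}

fn main() {
    let segments = vec![
        Segment { speaker: "Speaker 1".into(), text: "blablablabla".into() },
        Segment { speaker: "Speaker 1".into(), text: "again blablablabla".into() },
        Segment { speaker: "Speaker 2".into(), text: "Yes blablalbalba".into() },
    ];
    for seg in group_by_speaker(&segments) {
        println!("{}:\n{}\n", seg.speaker, seg.text);
    }
}
```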
> However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it behaves like this (using the text format for the output):
Could you please open a separate issue for this?
Puhhee, that was a huge feature! Now that Vibe supports diarization, we can close this issue :)
And also in Rust!
congrats!!!!
Goal
Provide speaker labels along with the transcriptions (e.g. Speaker1: ..., Speaker2: ...). Do it at the same time as transcribing, and keep it efficient and lightweight.

Research
https://github.com/wq2012/awesome-diarization
Possible ways:
- Use C/C++ diarization libs in Rust using bindgen
- Replicate pyannote-audio in Rust with tch-rs
- Use the ONNX Runtime with ort

https://github.com/pykeio/ort/discussions/208
https://github.com/pyannote/pyannote-audio/issues/1322
Best combination: pyannote segmentation-3.0 + wespeaker-voxceleb-resnet34-LM
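A rough outline of how that combination could be wired up in Rust. Every function here is a hypothetical stub, just to show the shape of the pipeline: segmentation finds speaker-homogeneous speech regions, the embedding model fingerprints them, and clustering merges regions that belong to the same speaker.

```rust
/// A region of speech found by the segmentation model (e.g. pyannote segmentation-3.0).
struct SpeechRegion {
    start: f32, // seconds
    end: f32,   // seconds
}

/// Hypothetical stub: run the segmentation ONNX model and turn its frame-level
/// activations into speech regions.
fn segment(_audio: &[f32]) -> Vec<SpeechRegion> {
    Vec::new()
}

/// Hypothetical stub: run the speaker-embedding ONNX model
/// (e.g. wespeaker-voxceleb-resnet34-LM) on one region.
fn embed(_audio: &[f32], _region: &SpeechRegion) -> Vec<f32> {
    Vec::new()
}

/// Hypothetical stub: cluster the embeddings (agglomerative clustering with a
/// cosine-distance threshold is a common choice) and return a speaker id per region.
fn cluster(embeddings: &[Vec<f32>]) -> Vec<usize> {
    vec![0; embeddings.len()]
}

/// The overall pipeline: each speech region ends up with a speaker id, which can
/// then be attached to the transcription of that region.
fn diarize(audio: &[f32]) -> Vec<(SpeechRegion, usize)> {
    let regions = segment(audio);
    let embeddings: Vec<Vec<f32>> = regions.iter().map(|r| embed(audio, r)).collect();
    let speakers = cluster(&embeddings);
    regions.into_iter().zip(speakers).collect()
}
```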