thewh1teagle closed this issue 3 months ago
I don't know if it helps, but here is the most successful notebook I know of for this task; maybe it's adaptable to Rust?
https://github.com/MahmoudAshraf97/whisper-diarization/tree/main
Thanks! Most of the Python implementations use pyannote. Since we use Rust in Vibe, it's more challenging, as I can't find any OSS project that does it in Rust.
We'll probably use the ONNX Runtime through https://github.com/pykeio/ort for segmentation and https://github.com/nkeenan38/voice_activity_detector for VAD.
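As a very rough sketch of that direction (assuming the builder-style APIs of ort 2.x and the Silero-based voice_activity_detector crate; exact method names may differ between versions, so treat the calls below as assumptions rather than a verified integration):

```rust
use ort::session::Session;
use voice_activity_detector::VoiceActivityDetector;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Silero VAD at 16 kHz; 512-sample chunks are the usual size for this model.
    // (Builder/method names assumed from the crate's README.)
    let mut vad = VoiceActivityDetector::builder()
        .sample_rate(16000)
        .chunk_size(512usize)
        .build()?;

    // pyannote segmentation model loaded through ONNX Runtime via the ort crate.
    // (`Session::builder` / `commit_from_file` assume the ort 2.x API.)
    let _segmentation = Session::builder()?.commit_from_file("segmentation-3.0.onnx")?;

    // Placeholder audio: one second of 16 kHz silence.
    let samples = vec![0i16; 16_000];
    for chunk in samples.chunks_exact(512) {
        let speech_probability = vad.predict(chunk.to_vec());
        if speech_probability > 0.5 {
            // Speech detected: this is where the chunk (or a larger window around it)
            // would be fed to the segmentation model to get per-speaker activations.
        }
    }
    Ok(())
}
```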
Or cheat a bit? https://rustpython.github.io (never used it myself though)
Looks like a useful crate! But I hope we can keep avoiding Python for as long as possible, to maintain top-notch performance and quality.
I'm not an expert in this area, so this might be very naive, but I tried to create a minimal Python script using an ONNX model (heavily based on https://github.com/pengzhendong/pyannote-onnx) to get more insight into the process of converting this to Rust (using ort).
Converting this script to Rust doesn't seem like a big deal to me, but of course I might be missing something critical here (especially in the segmentation part) haha
Warning: I tested this only with this file: https://github.com/pengzhendong/pyannote-onnx/blob/master/data/test_16k.wav, so it may not work as a general solution...
```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from itertools import permutations


class MinimalSpeakerDiarization:
    def __init__(self, model_path):
        self.num_classes = 4
        self.vad_sr = 16000
        self.duration = 10 * self.vad_sr
        self.session = ort.InferenceSession(model_path)

    def sample2frame(self, x):
        return (x - 721) // 270

    def frame2sample(self, x):
        return (x * 270) + 721

    def sliding_window(self, waveform, window_size, step_size):
        start = 0
        num_samples = len(waveform)
        while start <= num_samples - window_size:
            yield waveform[start : start + window_size]
            start += step_size
        if start < num_samples:
            last_window = np.pad(waveform[start:], (0, window_size - (num_samples - start)))
            yield last_window

    def reorder(self, x, y):
        perms = [np.array(perm).T for perm in permutations(y.T)]
        diffs = np.sum(np.abs(np.sum(np.array(perms)[:, : x.shape[0], :] - x, axis=1)), axis=1)
        return perms[np.argmin(diffs)]

    def process_audio(self, audio_path):
        wav, sr = sf.read(audio_path)
        if sr != self.vad_sr:
            raise ValueError(f"Audio sample rate {sr} does not match required {self.vad_sr}")
        wav = wav.astype(np.float32)
        step = 5 * self.vad_sr
        step = max(min(step, int(0.9 * self.duration)), self.duration // 2)
        overlap = self.sample2frame(self.duration - step)
        overlap_chunk = np.zeros((overlap, self.num_classes), dtype=np.float32)
        results = []
        for window in self.sliding_window(wav, self.duration, step):
            window = window.astype(np.float32)
            ort_outs = np.exp(self.session.run(None, {"input": window[None, None, :]})[0][0])
            ort_outs = np.concatenate(
                (
                    1 - ort_outs[:, :1],  # speech probabilities
                    self.reorder(
                        overlap_chunk[:, 1 : self.num_classes],
                        ort_outs[:, 1 : self.num_classes],
                    ),  # speaker probabilities
                ),
                axis=1,
            )
            if len(results) > 0:
                ort_outs[:overlap, :] = (ort_outs[:overlap, :] + overlap_chunk) / 2
            overlap_chunk = ort_outs[-overlap:, :]
            results.extend(ort_outs[:-overlap])
        return np.array(results)

    def get_speech_segments_with_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speech_prob = results[:, 0]
        speaker_probs = results[:, 1:]
        segments = []
        in_speech = False
        start = 0
        # First, determine active speakers
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        active_speakers = np.where(speech_duration_ms > min_speech_duration_ms)[0]
        for i, (speech, speakers) in enumerate(zip(speech_prob, speaker_probs)):
            if not in_speech and speech >= threshold:
                start = i
                in_speech = True
            elif in_speech and speech < threshold:
                speaker_index = np.argmax(np.mean(speaker_probs[start:i], axis=0))
                if speaker_index in active_speakers:
                    speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                    segments.append({
                        'start': self.frame2sample(start) / self.vad_sr,
                        'end': self.frame2sample(i) / self.vad_sr,
                        'speaker': speaker
                    })
                in_speech = False
        if in_speech:
            speaker_index = np.argmax(np.mean(speaker_probs[start:], axis=0))
            if speaker_index in active_speakers:
                speaker = f'speaker{np.where(active_speakers == speaker_index)[0][0] + 1}'
                segments.append({
                    'start': self.frame2sample(start) / self.vad_sr,
                    'end': self.frame2sample(len(speech_prob)) / self.vad_sr,
                    'speaker': speaker
                })
        return segments, len(active_speakers)

    def get_num_speakers(self, results, threshold=0.5, min_speech_duration_ms=100):
        speaker_probs = results[:, 1:]
        speech_duration = np.sum(speaker_probs > threshold, axis=0)
        speech_duration_ms = self.frame2sample(speech_duration) * 1000 / self.vad_sr
        return np.sum(speech_duration_ms > min_speech_duration_ms)


if __name__ == "__main__":
    model_path = "segmentation-3.0.onnx"
    audio_path = "test_16k.wav"
    diarizer = MinimalSpeakerDiarization(model_path)
    results = diarizer.process_audio(audio_path)
    speech_segments, num_speakers = diarizer.get_speech_segments_with_speakers(results)
    print("Speech segments with speakers:")
    for segment in speech_segments:
        print(f"Start: {segment['start']:.2f}s, End: {segment['end']:.2f}s, Speaker: {segment['speaker']}")
    print(f"Number of speakers detected: {num_speakers}")
```
Note: the model is https://github.com/pengzhendong/pyannote-onnx/blob/master/pyannote_onnx/segmentation-3.0.onnx
@altunenes
Thanks for helping :)
I got another awesome idea that may be easy to start with.
We can get word timestamps from whisper for each word (using `max_len=1` and `split_on_word=true`).
After getting the segments, we can go through them; each one has a start and stop timestamp. Then we can run a speaker embedding model such as spkrec-ecapa-voxceleb with ONNX Runtime (using pykeio/ort), and this way we'll have a speaker label for each word segment.
Then we can easily reconstruct the sentences from it.
We don't even need VAD (voice activity detection) or segmenting the audio; whisper does the heavy lifting already.
The downside is that we'll run the model on each word instead of an entire segment, which is less efficient.
What do you think?
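To make the idea concrete, here is a minimal Rust sketch of the per-word assignment step. The `speaker_embedding` function is a hypothetical placeholder (in practice it would run the ONNX embedding model, e.g. spkrec-ecapa-voxceleb, through ort), and the types and threshold are illustrative, not Vibe's actual code:

```rust
/// A word-level segment as produced by whisper with `max_len=1` / `split_on_word=true`.
struct Word {
    text: String,
    start: f32, // seconds
    end: f32,   // seconds
}

/// Hypothetical placeholder: in practice this would run a speaker-embedding ONNX
/// model on the samples and return its output vector.
fn speaker_embedding(_samples: &[f32]) -> Vec<f32> {
    vec![0.0; 192] // ECAPA-TDNN embeddings are typically 192-dimensional
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Greedily assign each word to the closest known speaker, or start a new speaker
/// when no existing one is similar enough. Consecutive words with the same label
/// can then be merged back into sentences.
fn assign_speakers(audio: &[f32], sample_rate: usize, words: &[Word], threshold: f32) -> Vec<usize> {
    let mut centroids: Vec<Vec<f32>> = Vec::new();
    let mut labels = Vec::with_capacity(words.len());
    for word in words {
        let start = (word.start * sample_rate as f32) as usize;
        let end = ((word.end * sample_rate as f32) as usize).min(audio.len());
        let embedding = speaker_embedding(&audio[start..end]);
        // Compare against the centroids of speakers seen so far.
        let best = centroids
            .iter()
            .enumerate()
            .map(|(i, c)| (i, cosine_similarity(&embedding, c)))
            .max_by(|a, b| a.1.total_cmp(&b.1));
        match best {
            Some((i, similarity)) if similarity >= threshold => labels.push(i),
            _ => {
                centroids.push(embedding);
                labels.push(centroids.len() - 1);
            }
        }
    }
    labels
}
```

As noted above, the cost is one embedding-model invocation per word; batching consecutive words before embedding would reduce that overhead.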
Very creative!! This probably provides more accurate diarization across various languages and word lengths.
Great :) I started working on https://github.com/thewh1teagle/sherpa-rs to replicate https://github.com/k2-fsa/sherpa-onnx/blob/master/python-api-examples/speaker-identification.py
I implemented diarization in sherpa-rs/examples/diarize.rs and added it to the Vibe source code. It works pretty well and fast. The only missing thing is that sherpa has a small issue with the VAD model (https://github.com/k2-fsa/sherpa-onnx/issues/1084) where speech is sometimes not detected. Once it's fixed, I'll change the whisper logic to transcribe only the diarized parts, and it will work in Vibe.
Nice! And thank you for your contributions. Maybe we should continue the discussion in sherpa-rs...
It is fixed in https://github.com/k2-fsa/sherpa-onnx/pull/1099
@thewh1teagle any chance this will be added soon? hoping to use this app for a project im working on instead of my current workflow. thanks!
Thanks for the interest :) It's a very hard feature to add. But in short, it works like this:
I think the best chance is to add pyannote to https://github.com/k2-fsa/sherpa-onnx/issues/1197
ah i see. thank you! appreciate the quick response
Some updates: I created a simple diarization solution in pyannote-rs and even added it to Vibe in another branch. It's accurate and also makes the transcription much more accurate. The only issue is that it makes the transcription slower, since whisper is optimized for chunks of 30s but speech segments are often shorter.
On macOS with the medium model, a 40s audio file takes 7s normally and 15s with diarization. We could also just feed whisper big chunks as usual and take the timestamps from it, but its timestamps aren't accurate.
The diarization itself is fast: about 30s for 1 hour of audio.
Todo: download the models instead of embedding them into the exe, to keep the exe lightweight.
Speaker diarization released! (Beta) You can try it here https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0-beta.0
Few things about it
exciting news!
Interesting, but FYI I can't run it on an Ubuntu 22.04-based OS (#207)
I just released a stable release, including for 22.04: https://github.com/thewh1teagle/vibe/releases/tag/v2.4.0
By the way, on Linux I strongly recommend using the tiny model for speed. See the pre-built section in the latest release.
The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy I found the sherpa version models nicer, in my tests at least. :)
Note: maybe I should play with the params more...
> I just released a stable release, including for 22.04
Thanks now it works.
> The tiny model is nice, especially in terms of speed, but in terms of transcription accuracy
Same here, the tiny model gives results that are too inaccurate to be actually practical, even with a higher temperature than the default.
> By the way, on Linux I strongly recommend using the tiny model for speed
Fortunately, the medium model also works on Linux with diarization, even if it's slow.
Diarization works well! However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it behaves like this (using the text format for the output):
Speaker 1:
blablablabla
Speaker 1:
again blablablabla
Speaker 1:
blablablabla ?
Speaker 2:
Yes blablalbalba
Speaker 2:
And blablablalba
To my mind, it should ideally not split the successive content of the same speaker into several labels, but rather into several paragraphs under one unique label, i.e.
Speaker 1:
blablablabla
again blablablabla
blablablabla ?
Speaker 2:
Yes blablalbalba
And blablablalba
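For what it's worth, a minimal sketch of such grouping in Rust (since that's what Vibe uses; the `Segment` type and formatting here are illustrative, not Vibe's actual output code): collapse consecutive segments that share a speaker before printing.

```rust
struct Segment {
    speaker: String,
    text: String,
}

/// Collapse consecutive segments with the same speaker into one block,
/// so each speaker label is printed once per turn.
fn group_by_speaker(segments: &[Segment]) -> Vec<Segment> {
    let mut grouped: Vec<Segment> = Vec::new();
    for seg in segments {
        match grouped.last_mut() {
            Some(last) if last.speaker == seg.speaker => {
                last.text.push('\n');
                last.text.push_str(&seg.text);
            }
            _ => grouped.push(Segment {
                speaker: seg.speaker.clone(),
                text: seg.text.clone(),
            }),
        }
    }
    grouped
}

fn main() {
    let segments = vec![
        Segment { speaker: "Speaker 1".into(), text: "blablablabla".into() },
        Segment { speaker: "Speaker 1".into(), text: "again blablablabla".into() },
        Segment { speaker: "Speaker 2".into(), text: "Yes blablalbalba".into() },
    ];
    for seg in group_by_speaker(&segments) {
        println!("{}:\n{}\n", seg.speaker, seg.text);
    }
}
```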
> However, in the context of a podcast where the host talks for a while and then interviews other people (I tested with only 2 speakers in total), it behaves like this (using the text format for the output):
Could you please open a separate issue for this?
Puhhee, that was a huge feature! Now that Vibe supports diarization, we can close this issue :)
And also in Rust!
congrats!!!!
Goal
Provide speaker labels along with the transcriptions (e.g. Speaker1: ..., Speaker2: ...). Do it at the same time as transcribing, and keep it efficient and lightweight.

Research
https://github.com/wq2012/awesome-diarization
Possible ways:
- Use C/C++ diarization libs in Rust using bindgen
- Replicate pyannote-audio in Rust with tch-rs
- Use the ONNX Runtime with ort

https://github.com/pykeio/ort/discussions/208
https://github.com/pyannote/pyannote-audio/issues/1322
Best combination: pyannote segmentation-3.0 + wespeaker-voxceleb-resnet34-LM
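A rough outline of how that combination could be wired up in Rust. Every function here is a hypothetical stub, just to show the shape of the pipeline: segmentation finds speaker-homogeneous speech regions, the embedding model fingerprints them, and clustering merges regions that belong to the same speaker.

```rust
/// A region of speech found by the segmentation model (e.g. pyannote segmentation-3.0).
struct SpeechRegion {
    start: f32, // seconds
    end: f32,   // seconds
}

/// Hypothetical stub: run the segmentation ONNX model and turn its frame-level
/// activations into speech regions.
fn segment(_audio: &[f32]) -> Vec<SpeechRegion> {
    Vec::new()
}

/// Hypothetical stub: run the speaker-embedding ONNX model
/// (e.g. wespeaker-voxceleb-resnet34-LM) on one region.
fn embed(_audio: &[f32], _region: &SpeechRegion) -> Vec<f32> {
    Vec::new()
}

/// Hypothetical stub: cluster the embeddings (agglomerative clustering with a
/// cosine-distance threshold is a common choice) and return a speaker id per region.
fn cluster(embeddings: &[Vec<f32>]) -> Vec<usize> {
    vec![0; embeddings.len()]
}

/// The overall pipeline: each speech region ends up with a speaker id, which can
/// then be attached to the transcription of that region.
fn diarize(audio: &[f32]) -> Vec<(SpeechRegion, usize)> {
    let regions = segment(audio);
    let embeddings: Vec<Vec<f32>> = regions.iter().map(|r| embed(audio, r)).collect();
    let speakers = cluster(&embeddings);
    regions.into_iter().zip(speakers).collect()
}
```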