1. Speaker diarization (SD) is performed using pyannote's pre-trained diarization pipeline, which returns timestamps for 'who spoke when'
2. Automatic speech recognition (ASR) is performed using the Hugging Face Transformers pipeline, which returns utterance-level transcriptions and their corresponding timestamps. Any pre-trained ASR checkpoint on the Hub can be specified for inference (e.g. Whisper tiny)
3. The SD and ASR timestamps are aligned to give speaker-segmented transcriptions
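The alignment in step 3 can be sketched as follows. This is an illustrative stand-in, not speechbox's actual implementation: the `align` helper and the segment dict format are assumptions made for the example. Each ASR chunk is assigned the speaker whose diarization segment overlaps it the most.

```python
def align(asr_chunks, diarization_segments):
    """Assign each ASR chunk the speaker whose diarization segment
    overlaps it the most (a simplified sketch of the alignment step)."""
    aligned = []
    for chunk in asr_chunks:
        start, end = chunk["timestamp"]
        best_speaker, best_overlap = None, 0.0
        for seg in diarization_segments:
            # temporal overlap between the ASR chunk and this speaker segment
            overlap = min(end, seg["end"]) - max(start, seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
        aligned.append({**chunk, "speaker": best_speaker})
    return aligned
```

In practice the diarization and transcription boundaries rarely coincide exactly, so an overlap-based (or nearest-boundary) heuristic like this is what reconciles the two sets of timestamps.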
Example:
import torch
from speechbox import ASRDiarizationPipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-tiny", device=device)
# load dataset of concatenated LibriSpeech samples
concatenated_librispeech = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train", streaming=True)
# get first sample
sample = next(iter(concatenated_librispeech))
out = pipeline(sample["audio"])
# format the transcriptions nicely for printout:
# speaker label, rounded (start, end) timestamps, then the text
def tuple_to_string(start_end_tuple, ndigits=1):
    return str(tuple(round(t, ndigits) for t in start_end_tuple))

print("\n\n".join(chunk["speaker"] + " " + tuple_to_string(chunk["timestamp"]) + chunk["text"] for chunk in out))
Print Output:
SPEAKER_01 (0.0, 15.0) Chapter 16 I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I to agree to whatever Mark Reid wished.
SPEAKER_00 (15.0, 22.0) He was in a fevered state of mind, owing to the blight his wife's action threatened to cast upon his entire future.
This PR adds a pipeline for automatic speech recognition (ASR) + speaker diarization (SD).